Grouping Similar Alerts


#1

We often get a number of associated alerts for what amounts to a single issue. For example, we have a service called UPG that is exposed to the internet through four HAProxy servers (two external in the DMZ and two internal in a CDZ), arranged as primary and failover pairs. If the UPG service goes down, I get 8 separate alerts that look like the following:

  1. hap01.dmz -> Service upg is down on upg_cluster
  2. hap01.dmz -> BACKEND is down on upg_cluster
  3. hap02.dmz -> Service upg is down on upg_cluster
  4. hap02.dmz -> BACKEND is down on upg_cluster
  5. hap01.cdz -> Server upg01 is down on upg_cluster
  6. hap01.cdz -> BACKEND is down on upg_cluster
  7. hap02.cdz -> Server upg01 is down on upg_cluster
  8. hap02.cdz -> BACKEND is down on upg_cluster

NOTE: The BACKEND alerts fire because there is currently only one active server in the cluster. If there were more than one and the other servers took over, BACKEND would not be down and would not be included in the alerts. We would still get four alerts, though.

In reality, all 8 of these alerts occur because one service went down on one server.

Question: Is there a way to intelligently group these alerts so that OpsGenie only sends a single alert?

How these alerts are sent
We use the TICK stack from InfluxData, so we define the alert criteria in Kapacitor, which actually generates TICKscripts. The logic in the TICKscript simply says that if the difference of the HAProxy DOWNTIME metric is greater than zero for the last 30 seconds, then send an alert to OpsGenie. HAProxy maintains a DOWNTIME counter that increments each time it determines a service is down, so we are really saying that if the counter rises over a 30-second period, the service is down and needs to be addressed.
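
To make that rule concrete, here is a rough Python sketch of the same check. It is not the TICKscript itself: the function name, the sample format, and the counter values are made up for illustration, and it assumes the samples arrive sorted by time.

```python
from typing import Sequence, Tuple

# Hypothetical sample format: (timestamp in seconds, DOWNTIME counter value).
Sample = Tuple[float, int]


def downtime_increased(samples: Sequence[Sample], now: float,
                       window_s: float = 30.0) -> bool:
    """Return True if the DOWNTIME counter rose within the last window_s seconds.

    Assumes samples are sorted by timestamp (oldest first).
    """
    # Keep only the counter values that fall inside the window.
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(recent) < 2:
        # Not enough points to compute a difference.
        return False
    # Difference between the newest and oldest value in the window.
    return recent[-1] - recent[0] > 0


# Illustrative values only: one point every 10 seconds, as Telegraf collects them.
rising = [(0, 120), (10, 120), (20, 125), (30, 130)]
flat = [(0, 120), (10, 120), (20, 120), (30, 120)]

alert_on_rising = downtime_increased(rising, now=30)
alert_on_flat = downtime_increased(flat, now=30)
```

Because every HAProxy server keeps its own DOWNTIME counter, a check like this fires once per server and per proxy entry, which is where the eight alerts come from.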

Even though two of the four HAProxy servers are for failover, they still actively monitor the services for downtime (Telegraf collects those metrics every 10 seconds and saves them to InfluxDB).

NOTE: If one of those UPG servers is actually rebooted then we would get 9 alerts because we would also get a DEADMAN alert for the server itself going down.

Any advice would be appreciated.


#2

Gonna reply to my own post because I just found a resource at https://www.opsgenie.com/incident-response-orchestration that uses the term "Alert Clustering". I did not find specific details about how that works or what “plan” I need to be on to use it, but it does sound like the correct term to be searching for.


#3

Hey! I think you found the right feature, but this document can give you a clearer idea of how it works: https://docs.opsgenie.com/v1.0/docs/correlate-alerts-with-incident-1 . Also, for your case, there is a chance that alert de-duplication might solve your problem: https://docs.opsgenie.com/docs/alert-deduplication.
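
As a rough illustration of how de-duplication could collapse the eight alerts from the first post: OpsGenie de-duplicates on the alias field, so an alert that arrives with the same alias as an open alert is counted against that alert instead of opening a new one. The Python sketch below only derives such an alias locally and does not call the OpsGenie API. The regex, the function names, and the `<cluster>-down` alias format are assumptions for illustration, not anything OpsGenie prescribes.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed message shape, taken from the alerts in the first post:
#   "<host> -> <what> is down on <cluster>"
ALERT_RE = re.compile(r"^(?P<host>\S+) -> (?P<what>.+) is down on (?P<cluster>\S+)$")


def alias_for(message: str) -> str:
    """Derive a shared alias from the cluster name (hypothetical scheme)."""
    match = ALERT_RE.match(message)
    if match is None:
        # Unknown format: fall back to the message so it stays a separate alert.
        return message
    return f"{match.group('cluster')}-down"


def group_by_alias(messages: List[str]) -> Dict[str, List[str]]:
    """Group messages by alias, mimicking what alias-based de-duplication would merge."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for message in messages:
        groups[alias_for(message)].append(message)
    return dict(groups)


messages = [
    "hap01.dmz -> Service upg is down on upg_cluster",
    "hap01.dmz -> BACKEND is down on upg_cluster",
    "hap02.dmz -> Service upg is down on upg_cluster",
    "hap02.dmz -> BACKEND is down on upg_cluster",
    "hap01.cdz -> Server upg01 is down on upg_cluster",
    "hap01.cdz -> BACKEND is down on upg_cluster",
    "hap02.cdz -> Server upg01 is down on upg_cluster",
    "hap02.cdz -> BACKEND is down on upg_cluster",
]

groups = group_by_alias(messages)
```

Keying the alias on the cluster alone also folds the Server and BACKEND alerts together, which may or may not be what you want. Adding more fields to the alias would keep them apart at the cost of more alerts.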


#4

Thanks @serhat. Both articles are relevant and look like what I need. Aliases would help me on the creation of the alerts and Incidents help after getting the alerts. I’ll definitely look at how I tag my measurements and see if I can make better use of the alias option during alert creation! Appreciate this info!