We often get several related alerts for what amounts to a single issue. For example, we have a service called UPG that is exposed to the internet through four HAProxy servers (two external in the DMZ and two internal in a CDZ). They are configured as primary and failover HAProxy servers. If the UPG service goes down, I get 8 separate alerts that look like the following:
- hap01.dmz -> Service upg is down on upg_cluster
- hap01.dmz -> BACKEND is down on upg_cluster
- hap02.dmz -> Service upg is down on upg_cluster
- hap02.dmz -> BACKEND is down on upg_cluster
- hap01.cdz -> Server upg01 is down on upg_cluster
- hap01.cdz -> BACKEND is down on upg_cluster
- hap02.cdz -> Server upg01 is down on upg_cluster
- hap02.cdz -> BACKEND is down on upg_cluster
NOTE: The BACKEND alerts fire because there is currently only one active server in the cluster. If there were more than one and another server took over, BACKEND would not be down and those alerts would not be sent. We would still get four alerts, though.
In reality, all 8 of these alerts occur because one service went down on one server.
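To make the duplication concrete, here is a minimal Python sketch that groups the eight messages above by the cluster name at the end of each message. The helper names (`cluster_of`, `group_by_cluster`) are hypothetical and only illustrate that every alert reduces to the same key. Whether that key can be passed from Kapacitor to OpsGenie as a grouping or deduplication value is an assumption, not something established here.

```python
from collections import defaultdict

# The eight alert messages listed above, copied verbatim.
ALERTS = [
    "hap01.dmz -> Service upg is down on upg_cluster",
    "hap01.dmz -> BACKEND is down on upg_cluster",
    "hap02.dmz -> Service upg is down on upg_cluster",
    "hap02.dmz -> BACKEND is down on upg_cluster",
    "hap01.cdz -> Server upg01 is down on upg_cluster",
    "hap01.cdz -> BACKEND is down on upg_cluster",
    "hap02.cdz -> Server upg01 is down on upg_cluster",
    "hap02.cdz -> BACKEND is down on upg_cluster",
]


def cluster_of(message: str) -> str:
    """Return the cluster name, i.e. the text after the final ' on '."""
    return message.rsplit(" on ", 1)[1]


def group_by_cluster(messages):
    """Bucket alert messages by cluster name (hypothetical grouping key)."""
    groups = defaultdict(list)
    for message in messages:
        groups[cluster_of(message)].append(message)
    return dict(groups)


groups = group_by_cluster(ALERTS)
for cluster, members in groups.items():
    print(f"{cluster}: {len(members)} alerts")
```

All eight messages land in a single bucket, which is the "one alert per incident" behaviour I am hoping to get.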
Question: Is there a way to intelligently group these alerts so that OpsGenie only sends a single alert?
How these alerts are sent
We use the TICK stack from InfluxData, so we define the alert criteria in Kapacitor, which generates the TICKscripts. The logic in the TICKscript simply says: if the difference in the HAProxy DOWNTIME metric over the last 30 seconds is greater than zero, send an alert to OpsGenie. HAProxy maintains a DOWNTIME counter (total downtime in seconds) that increases while it considers a service down, so we are really saying that if the downtime rises over a 30-second period, the service is down and needs to be addressed.
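The alert condition described above can be sketched in plain Python. This is not the generated TICKscript, only an illustration of the "difference greater than zero" check. The function name `downtime_increased` and the sample values are hypothetical.

```python
from typing import Sequence


def downtime_increased(samples: Sequence[int]) -> bool:
    """Return True if the DOWNTIME counter rose across the window.

    `samples` holds the counter values collected inside one 30-second
    window, oldest first.
    """
    if len(samples) < 2:
        # A single sample gives no difference to compare.
        return False
    return samples[-1] - samples[0] > 0


# Made-up counter values for one 30-second window.
healthy = [120, 120, 120]   # counter flat: no alert
failing = [120, 125, 135]   # counter rising: alert

print(downtime_increased(healthy))
print(downtime_increased(failing))
```

Because each HAProxy host reports its own counter for each server and for BACKEND, this check fires independently for every host and proxy entry, which is why one outage produces so many alerts.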
Even though two of the four HAProxy servers are for failover, they still actively monitor the services for downtime (Telegraf collects those metrics every 10 seconds and saves them to InfluxDB).
NOTE: If one of those UPG servers is actually rebooted, we would get 9 alerts, because we would also get a DEADMAN alert for the server itself going down.
Any advice would be appreciated.