Alert Management (xMatters)

Field engineers do not watch the event list all the time to monitor faults. A delay between the appearance of an event in the fault management system (or the creation of a ticket in the trouble ticketing system) and the start of work on it can increase network downtime or degrade service quality.

Some faults do not affect services when they first appear, but if they stay unresolved, the situation can lead to a severe network or service outage.

Manual alerting by phone, SMS, or email is labor-intensive and is effective only for a limited number of critical alerts.

When the number of cases that require alerting increases, customers start to use simple scripts. This approach does help to manage a larger number of ongoing alerts. However, how can one be sure that an engineer has taken responsibility for resolving the fault?

A second issue with simple scripting is that it does not take into account duty hours or the fine-grained division of responsibility. As the number of alerts continues to grow, engineers start ignoring them as spam, which leads to missed critical situations.

How does the Alert Management solution work? An example is shown below.

1. Events arrive in the Fault Management System (e.g. Netcool).
2. The Correlation Engine selects the events that have an impact on services.
3. The Correlation Engine identifies the responsible group of engineers (using the location and technology domain information from the event) and generates an event-specific message subject and body.
4. xMatters sends two-way (response required) or one-way notifications according to the shift schedule via an SMS centre or email server. Other communication channels (voice, Jabber, …) are also available.
5. xMatters delivers the engineer's feedback (“acknowledge” or “reject”) back into the fault management or trouble ticketing system. As a result, NOC engineers can see in the EventList the name of the engineer who has taken responsibility for the alert.
6. If xMatters receives no timely response from the engineers, it escalates by sending alerts to managers.
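
The routing and escalation logic in the steps above can be illustrated with a minimal Python sketch. It does not use the xMatters or Netcool APIs. Every name in it (Event, Shift, GROUPS, SCHEDULE, MANAGERS, ACK_TIMEOUT, route, escalation_target) is hypothetical, and the 15-minute timeout, the sample groups, and the fallback to the manager when nobody is on shift are assumptions made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class Event:
    summary: str
    location: str
    domain: str              # technology domain, e.g. "radio"
    service_impacting: bool  # set by the correlation step


@dataclass
class Shift:
    engineer: str
    start: time
    end: time

    def covers(self, t: time) -> bool:
        if self.start <= self.end:
            return self.start <= t < self.end
        # Overnight shift, e.g. 20:00 to 08:00
        return t >= self.start or t < self.end


# Hypothetical sample data: (location, domain) -> responsible group
GROUPS = {("north", "radio"): "north-radio"}
SCHEDULE = {
    "north-radio": [
        Shift("alice", time(8), time(20)),
        Shift("bob", time(20), time(8)),
    ]
}
MANAGERS = {"north-radio": "carol"}
ACK_TIMEOUT = timedelta(minutes=15)  # assumed value


def on_call(group: str, now: datetime) -> Optional[str]:
    """Return the engineer whose shift covers the given moment."""
    for shift in SCHEDULE.get(group, []):
        if shift.covers(now.time()):
            return shift.engineer
    return None


def route(event: Event, now: datetime) -> Optional[dict]:
    """Steps 2 to 4: filter, pick the group, build the notification."""
    if not event.service_impacting:
        return None  # no notification, so no spam
    group = GROUPS.get((event.location, event.domain))
    if group is None:
        return None
    # Assumption: fall back to the manager if nobody is on shift.
    recipient = on_call(group, now) or MANAGERS.get(group)
    return {
        "group": group,
        "recipient": recipient,
        "subject": f"[{event.domain}/{event.location}] {event.summary}",
        "body": f"Service-impacting event: {event.summary}",
        "sent_at": now,
    }


def escalation_target(notification: dict, response: Optional[str],
                      now: datetime) -> Optional[str]:
    """Step 6: return the manager to alert if no response arrived in time.

    `response` is None while the engineer has not replied. How a "reject"
    reply is handled is not specified in the text, so this sketch only
    escalates on silence.
    """
    if response is not None:
        return None
    if now - notification["sent_at"] >= ACK_TIMEOUT:
        return MANAGERS.get(notification["group"])
    return None
```

In this sketch, calling route() for a service-impacting event returns a notification addressed to whoever is on shift at that moment, and calling escalation_target() periodically with the latest response tells the caller when a manager should be alerted. A real deployment would take the schedule and escalation rules from the alerting platform rather than from hard-coded tables.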

As a result:
1. Each important network event has a responsible engineer.
2. No spam alerts.
3. Fast and efficient delivery of alerts.