Service Disruption - Network monitoring
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Network Monitoring Interruption to Services

Root Cause Analysis

Duration of incident

Discovered: Mar 14, 2024 18:41 - UTC
Resolved: Mar 15, 2024 00:15 - UTC

Cause

After testing in a stage environment, a change was deployed to the Auvik production environment.

Effect

The modifications in the scaled-up production environment caused the system to overload and stop processing the streaming monitoring data.

Action taken

All times in UTC

03/14/2024

13:55 - Changes to production were introduced.

14:35 - Auvik engineering was alerted as the systems for TrafficInsights, Syslog, and integrations began to crash loop.

14:40 - It was determined that the issues raised in the initial report were more widespread than first thought. The system as a whole was being overloaded with data.

14:45 - An incident was raised, and resources were called in to address it.

15:03 - Replication topic services were taken offline to reduce system load.

15:20-16:20 - An engineering team attempts to remove the additional data created by the overload.

16:20 - Engineering attempts to run commands to bulk remove the extraneous data from the system.

16:25 - 17:30 - Engineering waits for the system to process the commands. Additional resources are added to the system to help process the load.

17:30 - The first cluster has now recovered.

17:50 - Engineering is seeing improvement across the system, with several other clusters starting to come back online.

18:35 - The bulk changes have the desired effect, and services are starting to recover. Engineering begins going through each service to validate health and functionality.

03/14/2024 - 03/15/2024

18:35 - 00:15 - Engineering continues working through all affected services and troubleshoots any unresolved issues. Temporary resources are added to speed things along.

03/15/2024

00:15 - The incident is declared closed.

Future consideration(s)

  • Auvik will better understand the effect that new software used in the system may have on performance and system load.
  • Auvik has scheduled the replacement of soon-to-be end-of-support software that the system currently relies on, to address a bug discovered during the incident post-mortem.
  • Auvik will better define its internal alerting so that alerts are relevant and emphasize actual issues rather than “noise.”
Posted Mar 25, 2024 - 10:57 EDT

Resolved
The service disruption affecting the clusters and services in production has been resolved. Services and the production environment are currently working as expected.

A Root Cause Analysis (RCA) will follow after a full review.
Posted Mar 14, 2024 - 20:13 EDT
Update
We’ve identified the source of the service disruption and most clusters and services have recovered. We are continuing to monitor the remaining data processing jobs and will provide an update in approximately 1 hour.
Posted Mar 14, 2024 - 18:55 EDT
Monitoring
We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI across all clusters. We are continuing to monitor the system as clusters and services continue to recover. We’ll keep you posted on a resolution.
Posted Mar 14, 2024 - 17:49 EDT
Update
We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI across all clusters. We are continuing to work to restore service as quickly as possible.
Posted Mar 14, 2024 - 17:00 EDT
Identified
We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI. We are working to restore service as quickly as possible.
Posted Mar 14, 2024 - 15:56 EDT
Update
We are continuing to investigate this issue.
Posted Mar 14, 2024 - 14:50 EDT
Investigating
We are continuing to investigate. We will continue to provide updates as they become available.
Posted Mar 14, 2024 - 14:50 EDT
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com, us5.my.auvik.com) and Auvik TrafficInsights.