Discovered: Mar 14, 2024 18:41 - UTC
Resolved: Mar 15, 2024 00:15 - UTC
After testing in a staging environment, a change was deployed to the Auvik production environment.
At production scale, the change overloaded the system, which stopped processing the streaming monitoring data.
All times in UTC
03/14/2024
13:55 - Changes to production were introduced.
14:35 - Auvik engineering was alerted as the systems for TrafficInsights, Syslog, and integrations began to crash loop.
14:40 - It was determined that the issues in the initial report were more widespread than first thought. The system as a whole was being overloaded with data.
14:45 - An incident was raised, and resources were called in to address it.
15:03 - Replication topic services were taken offline to reduce system load.
15:20-16:20 - An engineering team attempted to remove the additional data created by the overload.
16:20 - Engineering ran commands to bulk remove the extraneous data from the system.
16:25-17:30 - Engineering waited for the system to process the commands. Additional resources were added to the system to help it process the load.
17:30 - The first cluster recovered.
17:50 - Engineering saw improvement across the system, with several other clusters starting to come back online.
18:35 - The bulk changes had the desired effect, and services started to recover. Engineering began going through each service to validate health and functionality.
03/14/2024 - 03/15/2024
18:35-00:15 - Engineering continued working through all affected services and troubleshooting any unresolved issues. Temporary resources were added to speed up recovery.
03/15/2024
00:15 - The incident was declared closed.