Duration of incident
Discovered: Jun 01, 2023, 14:30 UTC
Resolved: Jun 03, 2023, 13:00 UTC
Cause
Auvik’s account system reloaded all account data rather than only the records requiring updates, triggering an extremely large data update across Auvik’s platform.
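As a minimal sketch of the failure mode described above (hypothetical names and data, not Auvik’s actual code): an incremental sync should filter for records modified since the last run, whereas a full reload pushes every record back through the pipeline.

```python
from datetime import datetime, timezone

def records_to_sync(records, last_sync):
    """Return only records modified since the last sync.

    A full reload (the failure mode in this incident) would instead
    return `records` unchanged, reprocessing every row.
    """
    return [r for r in records if r["updated_at"] > last_sync]

# Hypothetical illustration: three records, one changed since last sync.
last_sync = datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc)
records = [
    {"id": 1, "updated_at": datetime(2023, 5, 30, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2023, 6, 1, 12, 0, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2023, 5, 15, tzinfo=timezone.utc)},
]
delta = records_to_sync(records, last_sync)  # only record 2 needs processing
```

At the scale involved here, the difference between processing the delta and reprocessing every record is what produced the multi-billion-transaction backlog.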
Effect
This reload created a backlog of over 13 billion transactions across Auvik’s databases, which in turn delayed other processes: multiple services stopped responding, preventing customers from accessing their Auvik sites.
Action taken
All times in UTC
06/01/2023
14:30 - Auvik Engineering notices a large CPU spike on one of Auvik’s production clusters.
14:45 - A memory spike occurs. Auvik begins to receive complaints from customers who are unable to access the system.
15:30 - Engineering restarts some backend services to relieve the backpressure on the systems. The cause of the issue has not yet been determined.
15:30 - 20:30 - Engineering continues monitoring the system while investigating the root cause.
20:30 - US Cluster 4 (US4) remains problematic. Due to the continued issues, engineering decides to restart the cluster’s frontend services to stabilize access.
20:34 - Engineering begins restarting frontend services on the US4 cluster.
21:34 - Engineering completes the restart of frontend services on the US4 cluster. The system appears stable enough to monitor for the evening.
06/02/2023
10:00 - Auvik Engineering begins reviewing overnight system performance.
13:00 - The system is up overall, and the incident on the status page is closed. Auvik Engineering continues to investigate the cause of the slowdown. The resource issues are no longer occurring, but sites on the US4 cluster periodically see access and performance issues.
13:00 - 17:00 - Investigation continues into why some sites on the US4 cluster are still experiencing lingering issues. Auvik believes a full stop and restart of the US4 cluster will resolve them. Engineering continues combing logs and systems for the initial cause of the instability.
17:00 - To avoid further disruption, the decision is made to wait for the scheduled Saturday maintenance window to restart the US4 cluster. Auvik Support continues to monitor the situation.
06/03/2023
11:00 - 13:00 - Auvik performs its scheduled maintenance, including a full stop and start of services on the US4 cluster. Services restart without issue, and the system appears stable.
06/03/2023 - 06/14/2023
Engineering continues to investigate the cause of the initial spike in resource usage on June 1.
06/14/2023
The cause of the incident is determined and validated: data processing in the accounts system is the root cause. A plan is implemented to prevent this issue from recurring.