Duration of incident
Discovered: Jun 1, 2024, 20:30 UTC
Resolved: Jun 7, 2024, 13:05 UTC
Updates performed during scheduled maintenance on June 1, 2024, caused an improper assertion on data in the Auvik application’s data stream.
The disruption significantly increased the volume of data in the streaming queue, which caused noticeable delays in data processing for our customers. The delays were most evident in map rendering and updating, which reduced the real-time visibility our services provide to stakeholders.
All times in UTC
06/01/2024
20:30 - Auvik support alerts the on-call engineering team to abnormal CPU spikes during data processing.
20:59 - The engineering team begins its initial investigation.
21:11 - Engineering confirms that the system is receiving increased data input.
21:15 - The team works to identify the cause of the increased input.
23:00 - The team identifies the specific data flows behind the increased input and turns off the change presumed to have caused the issue.
06/02/2024
02:00 - The team applies the change to one cluster and waits to validate that it resolves the issue.
11:28 - The change is reported not to have resolved the issue, and the incident is ongoing. The engineering team assembles to determine the root cause.
11:45-17:00 - Engineering continues investigating the issue to determine a fix.
17:00 - The root cause of the issues is determined, and the next steps to resolve the incident are formulated.
17:00-21:30 - A fix for the issues is written and tested successfully.
22:45 - A plan for deploying the fix to production is formulated.
06/03/2024
01:00-02:45 - The proposed fix is deployed to one cluster to test and validate its correctness in the production environment.
13:30 - The team validates the desired results in the test cluster and formulates a plan for the remaining clusters.
16:00-21:30 - The fix is deployed to the remaining clusters. The team then waits for backlog processing to catch up.
06/04/2024
05:00-18:55 - Engineering makes several changes to increase the resources and speed of backlog processing. During this time, all non-US clusters recover from their data delay.
23:15 - The US4 cluster recovers from its data delay.
06/06/2024
09:00 - The US3 and US5 clusters recover from their data delay.
06/07/2024
08:35 - The US1 cluster recovers from its data delay.
13:05 - The US2 cluster recovers from its data delay. The incident is closed on the status page.
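
The recovery times in the timeline above can be compared by measuring each one against the discovery time (Jun 1, 2024, 20:30 UTC). The following is a minimal sketch of that arithmetic, not part of any Auvik tooling: the helper names `utc` and `elapsed` are hypothetical, the year is taken from the dates in the header, and the non-US clusters are left out because the timeline gives only a time window for their recovery.

```python
from datetime import datetime, timezone


def utc(month, day, hour, minute):
    """Build a 2024 UTC timestamp (hypothetical helper for this sketch)."""
    return datetime(2024, month, day, hour, minute, tzinfo=timezone.utc)


# Discovery time, from the "Discovered" line of the report.
DISCOVERED = utc(6, 1, 20, 30)

# Recovery times for the US clusters, copied from the timeline entries.
RECOVERED = {
    "US4": utc(6, 4, 23, 15),
    "US3": utc(6, 6, 9, 0),
    "US5": utc(6, 6, 9, 0),
    "US1": utc(6, 7, 8, 35),
    "US2": utc(6, 7, 13, 5),
}


def elapsed(start, end):
    """Return the gap between two timestamps as (days, hours, minutes)."""
    total_minutes = int((end - start).total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


if __name__ == "__main__":
    for cluster, recovered_at in RECOVERED.items():
        d, h, m = elapsed(DISCOVERED, recovered_at)
        print(f"{cluster}: {d}d {h}h {m}m after discovery")
```

Because US2 was the last cluster to recover and its recovery coincides with the incident being closed, its figure of 5 days, 16 hours, 35 minutes is also the total duration of the incident.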