Discovered: Mar 11, 2024 13:35 - UTC
Resolved: Mar 11, 2024 16:30 - UTC
Issues with the March 9th system update.
The internal alerting engine for the US1 cluster entered a degraded state in which alerts to clients on US1 were delayed from March 9 at 11:00 UTC until March 11 at 16:30 UTC.
All times in UTC
03/09/2024
11:00 - 14:00 - Auvik performs scheduled maintenance on the system. The planned maintenance is extended due to issues with the system during the restart. At the end of the window, the system is believed to have recovered successfully.
03/11/2024
08:00 - Auvik Engineering identifies from internal alerting that alerts on the US1 cluster are lagging when displayed in the system.
08:55 - Engineering restarts the alerting service and attempts to create a new checkpoint for alerting.
10:05 - The new checkpoint fails, and the responsible engineering team is notified about the delay in US1 cluster alerts.
13:18 - An external incident is posted to alert clients about the delay. Engineering continues to work on the issue.
14:00 - Engineering makes progress by saving checkpoints within the system for the alerting assembler.
15:00 - Alert lag begins to fall.
16:00 - The backlog of lagging alerts is processed successfully.
16:20 - Engineering declares the incident closed.