Discovered: Oct 07, 2024 13:00 UTC
Resolved: Oct 07, 2024 19:00 UTC
During maintenance on October 05, 2024, at 10:00 UTC, a change was made on the US2 cluster to address a bug in protocol handling. The change produced excessive load on the system, which caused the cluster's start-up to fail. The cluster was then restarted successfully.
During the restart, the information for shared collector sites was not processed properly, and the associations between those collectors and their sites were removed. As a result, the affected sites were not monitored until the associations were restored on October 07.
All times in UTC
10/05/2024
11:00 - Scheduled maintenance upgrade of the system begins.
11:44 - The US2 cluster is started.
12:20 - The code change is enabled for clients on the US2 cluster to address the protocol handling.
12:22 - Tenants are started on the US2 cluster.
12:55 - The US2 cluster is found to be disconnected. Auvik prepares to restart the US2 cluster.
13:10 - Tenants are started on the US2 cluster for the second time, and the maintenance banner is removed from the Auvik site.
10/07/2024
09:00-12:00 - Auvik support receives multiple reports of client issues. Data is gathered from tenants on several clusters, covering several distinct problems, and sent to Engineering.
12:00-13:00 - Engineering determines that multiple issues in the product occurred during the scheduled maintenance. These issues are unrelated to each other and need separate teams to address them.
13:00-15:30 - Engineering determines that the shared collectors of clients on the US2 cluster have lost their association with their sites.
15:30-18:00 - Engineering investigates which clients were specifically affected by the loss of the collector associations and what state those collectors were in before the maintenance on October 05.
18:00-18:20 - The decision is made to reset the shared collector states to what they were before the maintenance, and Engineering is given the go-ahead to proceed with the reset.
18:20 - The restore process is executed.
19:00 - Sites with shared collectors are validated as restored to their state before the October 05 maintenance.