Service Disruption - Shared Collectors lost association with shared sites after maintenance
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Customers with Shared Collectors on the US2 Cluster Lost the Association with the Sites Monitored by the Shared Collectors

Root Cause Analysis

Duration of incident

Discovered: Oct 07, 2024 13:00 UTC
Resolved: Oct 07, 2024 19:00 UTC

Cause

During maintenance on October 05, 2024, at 10:00 UTC, a code change was enabled to address a bug with protocol handling on the US2 cluster. The change produced an excessive load on the system, which caused the cluster's start-up to fail. The cluster was subsequently restarted successfully.

Effect

The information for shared collector sites was not processed properly on the restart, and the association between those collectors and their sites was removed. As a result, the affected sites were not monitored until the associations were restored.

Action taken

All times in UTC

10/05/2024

11:00 - Scheduled maintenance upgrade of the system begins.

11:44 - The US2 cluster is started.

12:20 - The code change is enabled for clients on the US2 cluster to address the protocol handling.

12:22 - Tenants are started on the US2 cluster.

12:55 - The US2 cluster is found to be disconnected. Auvik takes steps to restart the US2 cluster.

13:10 - Tenants are started on the US2 cluster for the second time, and the maintenance banner is removed from the Auvik site.

10/07/2024

9:00-12:00 - Auvik support receives multiple reports of client issues. Data is gathered from tenants on several different clusters for different problems. This data is collected and sent to engineering.

12:00-13:00 - Engineering determines that multiple issues in the product occurred during the scheduled maintenance. These issues are not associated with each other and will need separate teams to address them.

13:00-15:30 - Engineering determines that the shared collectors of clients on the US2 cluster have lost their association with their sites.

15:30-18:00 - Engineering investigates which clients were specifically affected by the loss of the collector associations and what their collector states were before the maintenance on October 05.

18:00-18:20 - The decision is made to reset the shared collector states to what they were before maintenance, and Engineering is given the go-ahead to proceed with the reset.

18:20 - The restore process is executed.

19:00 - Sites with shared collectors are validated to be restored to their state before the October 05 maintenance.
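
The investigation and validation steps above amount to comparing collector-to-site associations before and after the maintenance. A minimal sketch of such a comparison follows, assuming the associations can be exported as a mapping from collector ID to a set of site IDs. The function name, the identifiers, and the data shape are hypothetical illustrations and do not describe Auvik's actual tooling.

```python
def find_lost_associations(before, after):
    """Return {collector_id: sorted site IDs} that existed before but are missing after.

    Both arguments are hypothetical snapshots: dicts mapping a collector ID
    to the set of site IDs that collector monitors.
    """
    lost = {}
    for collector_id, sites in before.items():
        # Sites this collector monitored before that it no longer monitors.
        missing = set(sites) - set(after.get(collector_id, ()))
        if missing:
            lost[collector_id] = sorted(missing)
    return lost


# Illustrative data only: collector-a lost both of its sites, collector-b is intact.
before = {"collector-a": {"site-1", "site-2"}, "collector-b": {"site-3"}}
after = {"collector-a": set(), "collector-b": {"site-3"}}

lost = find_lost_associations(before, after)
```

An empty result from the same comparison, run after a restore, would indicate that every pre-maintenance association is present again.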

Future consideration(s)

  • The process for checking system health after maintenance will be improved to detect significant changes in the collectors connected to the system.
  • Engineering is investigating safeguards that would prevent the system from processing changes to sites with shared collectors when data is missing after maintenance.
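
One way to picture the kind of safeguard described in the considerations above is a check that refuses to apply an association update automatically when the update is empty or would remove a large share of existing associations. The sketch below is only an illustration under those assumptions. The function names, the 50% threshold, and the data shape are hypothetical and are not taken from Auvik's system.

```python
def safe_to_apply(current, incoming, max_removed_fraction=0.5):
    """Decide whether an association update may be applied automatically.

    current and incoming are sets of site IDs for one shared collector.
    The threshold is an assumed, illustrative value.
    """
    if not current:
        return True  # nothing can be lost
    if not incoming:
        return False  # an empty update more likely means missing data
    removed = set(current) - set(incoming)
    return len(removed) / len(current) <= max_removed_fraction


def apply_update(current, incoming):
    """Return (resulting associations, needs_review flag)."""
    if safe_to_apply(current, incoming):
        return set(incoming), False
    # Keep the existing associations and flag the update for manual review.
    return set(current), True


# An empty update after a restart is held back instead of wiping the sites.
result, needs_review = apply_update({"site-1", "site-2"}, set())
```

In this sketch a suspicious update is never discarded silently. The existing associations are kept and the update is flagged so an operator can confirm whether the removal was intended.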
Posted Oct 10, 2024 - 11:09 EDT

Resolved
We experienced a disruption in which Shared Collectors lost their association with shared sites after maintenance on Saturday at 13:00 UTC. The current fix is to re-associate the Shared Collector with the site.
We are investigating to determine a root cause.
Posted Oct 07, 2024 - 05:00 EDT