Auvik Reporting Sites Down After Maintenance

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Sites are not available after maintenance

Root Cause Analysis

Duration of incident

Discovered: May 10, 2025 13:04 UTC
Resolved: May 11, 2025 01:00 UTC

Cause

A scheduled upgrade of the system failed to complete successfully.

Effect

Auvik functionality was impacted after the upgrade was applied. The failed upgrade set off a cascade of product functionality failures, and recovery required reapplying the upgraded version through a stepped restart of Auvik.

Action taken

All times are in UTC

05/10/2025

11:00 Upgrade process begins on core components.
12:45 An issue is detected affecting data replication, and some clusters experience connectivity problems.
13:05 Engineering begins active investigation into the connectivity issue.
13:24 Recovery actions initiated for affected clusters.
13:49 Maintenance window extended to address ongoing issues.
14:00-14:05 Impacted clusters begin recovering.
14:21 Post-upgrade validation reveals a new issue affecting dashboard display in most regions.
14:35 Further analysis confirms the issue affects multiple clusters.
15:00 Deeper technical investigation begins to isolate the root cause, which is suspected to involve backend services.
17:04 Root cause identified as an issue with a core data processing component.
17:20 Mitigation strategies explored; decision made to re-attempt the upgrade with a modified approach.
18:30-20:17 Second upgrade process begins; similar issues surface in specific regions.
21:00-21:25 Recovery actions for affected clusters show positive results; services begin to stabilize.
21:30-21:40 Core services successfully rolled out to additional clusters with improved configuration.
23:47 One final cluster exhibits recovery issues, addressed through targeted intervention.

05/11/2025

00:00-01:00 Final recovery actions completed; all services return to normal.
01:00 Complete system restoration is confirmed.

Future consideration(s)

  • Implement additional alerting that monitors bandwidth on the backend systems, so that bottlenecks are detected proactively and prevented.
  • Complete the improvements that are already in progress:
    • Reduce the load placed on all backend systems simultaneously after a maintenance window.
    • Replace several single-point-of-failure configurations with more scalable configurations.
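
As an illustration of the first in-progress improvement, the sketch below shows one generic way to avoid loading every backend system at once after a maintenance window: restart clusters in small batches, pause between batches, and halt as soon as a batch fails. It is a minimal sketch under stated assumptions, not a description of Auvik's tooling. The names `staggered_restart`, `batches`, and the `restart` callback, along with the batch size and pause values, are all hypothetical.

```python
import random
import time
from typing import Callable, Iterable, List


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def staggered_restart(
    clusters: List[str],
    restart: Callable[[str], bool],      # hypothetical hook: True on success
    batch_size: int = 2,                 # assumed value, tune per environment
    pause_seconds: float = 300.0,        # assumed settle time between batches
    jitter_seconds: float = 30.0,        # random extra delay to spread load
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart clusters batch by batch and stop after the first failing batch.

    Returns the names of the clusters that restarted successfully.
    """
    done: List[str] = []
    groups = list(batches(clusters, batch_size))
    for index, group in enumerate(groups):
        results = [(name, restart(name)) for name in group]
        done.extend(name for name, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # halt so later batches are not started on top of a failure
        if index < len(groups) - 1:
            # Pause before the next batch so backend load can settle.
            sleep(pause_seconds + random.uniform(0, jitter_seconds))
    return done
```

Passing `sleep` as a parameter keeps the pause logic testable without real delays. In practice the `restart` callback would wrap whatever health check confirms that a cluster has recovered.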

Posted May 16, 2025 - 10:45 EDT

Resolved

Towards the end of Auvik's scheduled maintenance window on 5/10/2025, Engineering noticed loading issues with sites on several clusters. Investigation determined that there was an issue with the data flow between systems. This interruption required Auvik to extend its maintenance window. Auvik brought each cluster's tenants back up progressively throughout the process. This work was considered completed at 21:00 EDT.

Auvik will furnish an RCA after an internal review has been completed.
Posted May 10, 2025 - 10:00 EDT