Service Disruption - US3 is down

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Clients on the US3 Cluster Unreachable

Root Cause Analysis

Duration of incident

Discovered: Feb 14, 2025, 22:42 UTC
Resolved: Feb 17, 2025, 16:30 UTC

Cause

Auvik changed its system to fix a third-party integration that was not processing information as expected for a client on US3.

Effect

This change exposed a bug in the code that overloaded the backend systems. The result was data corruption in the hierarchical tables, which led to further instability for clients on the US3 cluster.

Action taken

All times in UTC

02/14/2025

22:42 - First signs of increased backend pressure on the systems on the US3 cluster.

02/15/2025

15:40 - Backend pressure on the US3 cluster increases. Engineering begins to monitor its systems for performance issues.

16:00-23:00 - Engineering attempts several interventions to reduce backend pressure. Success is intermittent. Ultimately, the root cause is identified as an abnormally growing dataset due to a bug.

23:00 - The tenant associated with the data table is disabled. However, several backends in the cluster are in severe distress, requiring a complete reboot of the cluster. A reboot is initiated.

02/16/2025

00:00 - Most tenants are observed to be functional.

00:23 - Steps are initiated to re-enable the offending tenant. Unforeseen issues during this step create a cascading failure that results in another cluster reboot.

00:42 - The offending tenant is disabled again, and US3 is rebooted.

03:15 - Cluster is deemed stable.

16:00 - Engineering diagnoses a further root cause of the instability arising from the offending tenant.

17:00-21:30 - This tenant’s data is cleaned up manually. Finally, the tenant is restarted successfully.

02/17/2025

16:30 - Engineering notes that some hierarchical datasets have been corrupted, causing some tenants' alert notifications to be set to default values. Engineering initiates a cleanup of all such occurrences.

03/01/2025 - A fix is applied to prevent further occurrences of such issues. All clusters are upgraded with this bug fix.

Future consideration(s)

  • The bug that caused the instability has been addressed.
  • How Auvik processes data in its hierarchical tables is under review.
  • Improved internal processes have been implemented to diagnose the cause of similar issues more quickly should they occur.
Posted Mar 11, 2025 - 09:48 EDT

Resolved

Affected clusters: US3

Description:
The issue affecting US3 has been fully resolved. Normal service has been restored, and all systems are now operating as expected.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 15, 2025 - 22:21 EST

Update

Affected clusters: US3

Description:
We encountered an issue while monitoring the cluster, and the cluster is non-operational at this time. Our team is actively working to restore sites on this cluster as quickly as possible.

Impact:
Tenants hosted on US3 are not accessible at this time.

Next Steps:
We will provide updates as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Feb 15, 2025 - 19:51 EST

Update

Affected clusters: US3

Description:
The majority of tenants have been restored and we are performing final cluster checks.

Impact:
N/A

Next Steps:
We will provide updates as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Feb 15, 2025 - 19:25 EST

Monitoring

Affected clusters: US3

Description:
Our team has restarted US3. Tenants are starting and we are continuing to monitor recovery of all tenants.

Impact:
Some tenants hosted on US3 may still be starting and unreachable at this time.

Next Steps:
We will provide a final update once we confirm the issue is fully resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 15, 2025 - 18:30 EST

Identified

Affected clusters: US3

Description:
We are currently experiencing an outage on US3. Our team is actively working to restore sites on this cluster as quickly as possible.

Impact:
Sites hosted on US3 are currently unreachable.

Next Steps:
We will provide updates as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Feb 15, 2025 - 18:02 EST
This incident affected: Network Mgmt (us3.my.auvik.com).