Service Disruption - US4

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Clients on the US4 Cluster Unreachable

Root Cause Analysis

Duration of incident

Discovered: Feb 28, 2025, 16:32 UTC
Resolved: Feb 28, 2025, 19:30 UTC

Cause

Overload of backend resources for services on the US4 cluster.

Effect

Tenants on the US4 cluster became inaccessible.

Action taken

All times in UTC

02/28/2025

16:32 - Auvik Engineering discovers several non-responsive backends on the US4 cluster, which causes some tenants to be unresponsive. Engineering begins investigating.

17:00 - Attempts are made to revive the non-responsive backends.

17:28 - Cluster is in distress, with more backends starting to fail.

17:45 - Engineering restarts the entire cluster.

18:10-19:30 - The cluster is monitored as it restarts and returns to full functionality. The incident is declared resolved.

Future consideration(s)

  • Auvik is currently improving backend monitoring and stability within the product and infrastructure. These improvements aim to help mitigate potential issues proactively in the future.
Posted Mar 11, 2025 - 10:04 EDT

Resolved

Affected Services: Tenants on US4
Services not impacted: Tenants on all other clusters

Description:
The issue that made tenants on the US4 cluster inaccessible has been fully resolved. Regular service has been restored, and all systems are now operating as expected.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 24, 2025 - 14:32 EST

Monitoring

Affected Services: Clients on the US4 cluster
Services not impacted: Clients on other clusters

Description:
Our team has fixed the issue that made tenants on the US4 cluster inaccessible. The remaining tenants are recovering. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Service should operate normally, although some tenant sites are still becoming accessible.
Services: Sites on other clusters are not affected.

Next Steps:
We will provide a final update once the issue is resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 24, 2025 - 14:00 EST

Update

Affected Services: Clients on the US4 cluster
Services not impacted: Clients on other clusters

Description:
Our team has identified the root cause of the degraded performance affecting tenants on the US4 cluster. Tenants are becoming available again and returning to normal service levels.

Impact:
While we work on the resolution, users will start to see their tenants become responsive.
Services: Other clusters are not impacted.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:30 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 24, 2025 - 13:29 EST

Identified

Affected Services: Clients on the US4 cluster
Services not impacted: Clients on other clusters

Description:
Our team has identified the root cause of the degraded performance affecting tenants on the US4 cluster and is currently investigating a solution to restore normal service levels.

Impact:
Users will experience issues with connectivity to their tenants.
Services: Other clusters are not experiencing issues.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or by 18:30 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 24, 2025 - 12:49 EST

Investigating

Affected Services: Clients on the US4 cluster
Services not impacted: Clients on other clusters

Description:
We are experiencing degraded performance affecting tenants on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users will experience issues with connectivity to their tenants.
Services: Other clusters are not experiencing issues.

Next Steps:
We will provide updates as more information becomes available or by 18:30 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 24, 2025 - 12:45 EST
This incident affected: Network Mgmt (us4.my.auvik.com).