Service Disruption - US4 cluster is unreachable

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Cluster US4 is unreachable for customers

Root Cause Analysis

Duration of incident

Discovered: Dec 13, 2024 - 17:03 UTC
Resolved: Dec 13, 2024 - 18:23 UTC

Cause

Routine maintenance tasks performed in preparation for the upcoming weekend's maintenance caused an unexpected load on the system.

Effect

The backend systems overwhelmed the systems on the US4 cluster, which interrupted communication with the tenants.

Action taken

All times in UTC
12/13/2024

16:57 - Steps to prepare the system for the next day’s maintenance are performed.

17:03 - Tenants on the US4 cluster become unreachable.

17:09 - The Auvik engineering team assembles stakeholders to investigate the service interruption.

17:25 - The backend systems on the US4 cluster begin to recover independently.

17:39 - Tenants begin to become reachable internally.

17:40 - Tenants become visible in the UI.

17:57 - Engineering addresses tenants that are not coming back up gracefully.

18:23 - Tenants on US4 have recovered.

Future consideration(s)

  • Auvik has altered its preparation for scheduled maintenance, eliminating processes that could affect system performance in the future.
Posted Jan 10, 2025 - 10:34 EST

Resolved

Affected Services: US4 Cluster

Description:
The issue affecting US4 has been addressed and the system has recovered.

Impact:
Users should now be able to access their tenants on US4.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Dec 13, 2024 - 13:23 EST

Monitoring

Affected Services: US4 Cluster

Description:
Our team has implemented a fix for the issue affecting the US4 cluster. Tenants are being restored and we are continuing to monitor the recovery progress.

Impact:
Any unreachable tenant is queued to be started and will be reachable within approximately 1 hour.

Next Steps:
We will provide a final update once we confirm the issue is fully resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Dec 13, 2024 - 12:59 EST

Update

We are continuing to investigate this issue.
Posted Dec 13, 2024 - 12:51 EST

Investigating

Affected Services: US4 Cluster

Description:
We are currently experiencing an outage on tenants hosted on our US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users will not be able to reach their tenants hosted in US4.

Next Steps:
We will provide updates as more information becomes available, or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Dec 13, 2024 - 12:18 EST
This incident affected: Network Mgmt (us4.my.auvik.com).