Service Degraded - Some Clients on the US4 cluster are offline.

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Over 50% of clients on the US4 cluster experienced service interruptions.

Root Cause Analysis

Duration of incident

Discovered: Apr 14, 2025 19:45 UTC
Resolved: Apr 15, 2025 04:05 UTC

Cause

A configuration change related to Meraki devices.

Effect

About 55% of tenants in US4 became inaccessible due to increased traffic and system load.

Action taken

All times are in UTC

04/14/2025

19:45 - Auvik receives internal alerts for abnormal CPU usage on its backend systems for the US4 cluster.

19:50 - Engineering begins an investigation into the issue, actively taking measures to stabilize the system.

20:42 - A large number of sites become inaccessible, and Auvik implements its incident response.

20:42-21:45 - Engineering continues to investigate.

21:45 - A possible root cause of the issue is identified, and Engineering begins recovering sites.

04/14/2025-04/15/2025

21:45-00:10 - Engineering continues to bring most of the affected sites back online.

04/15/2025

00:10 - All sites, except those of one client, are back up and accessible.

00:10-01:00 - Auvik continues working to bring the last client's tenants back online.

01:00 - The root cause of the incident is determined. Engineering creates mitigation steps.

01:00-03:05 - Mitigation steps are implemented, and the remaining sites of the last client are brought online and made accessible.

Future consideration(s)

  • Auvik has implemented safeguards to prevent a recurrence.
Posted Apr 17, 2025 - 11:58 EDT

Resolved

Affected Services: Site availability
Cluster(s): US4

Description:
The issue affecting site availability has been fully resolved. Regular service has been restored, and all systems are now operating as expected.

Impact:
Users should no longer experience issues related to this incident, except for select clients with whom we have communicated.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 14, 2025 - 22:25 EDT

Update

Affected Services: Site availability
Cluster(s): US4

Description:
Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to come back online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should operate normally, except for the remaining sites, which we continue working to make fully available.
Services: None of the other clusters and services are affected.

Next Steps:
We will provide a final update once all issues are resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 14, 2025 - 21:40 EDT

Update

Affected Services: Site availability
Cluster(s): US4

Description:
Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to come back online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should operate normally, except for the remaining sites, which we continue working to make fully available.
Services: None of the other clusters and services are affected.

Next Steps:
We will provide a final update once all issues are resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 14, 2025 - 20:39 EDT

Monitoring

Affected Services: Site availability
Cluster(s): US4

Description:
Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to come back online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should be operating normally, except for the remaining sites, which are not yet fully available.
Services: None of the other clusters and services are affected.

Next Steps:
We will provide a final update once all issues are resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 14, 2025 - 19:40 EDT

Identified

Affected Services: Site availability
Cluster(s): US4

Description:
Our team has identified the root cause of the degraded performance affecting client site availability in the US4 cluster. We are currently investigating a solution to restore normal service levels.

Impact:
While we work on the resolution, users may experience connectivity issues as sites become available again.
Services: None of the other clusters and services are affected.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or by 23:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Apr 14, 2025 - 18:55 EDT

Investigating

Affected Services: Site availability
Cluster(s): US4

Description:
We are currently experiencing degraded performance with sites running on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience connectivity issues with their tenants.
Services: None of the other clusters and services are affected.

Next Steps:
We will provide updates as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Apr 14, 2025 - 18:25 EDT
This incident affected: Network Mgmt (us4.my.auvik.com).