Service Disruption - The US4 cluster is down

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Clients on US4 are not accessible

Root Cause Analysis

Duration of incident

Discovered: Feb 03, 2025 - 09:27 UTC

Resolved: Feb 03, 2025 - 11:55 UTC

Cause

Overload of backend resources for services on the US4 cluster.

Effect

Tenants on the US4 cluster became inaccessible.

Action taken

All times in UTC

Feb 03, 2025

09:27 - Engineering receives alerts that tenants on the US4 cluster are not accessible.

09:33 - Engineering responds to the outage and begins its investigation.

09:53 - Engineering restarts the US4 cluster backends to address their non-responsiveness.

09:53 - 11:55 - The cluster is observed as it restarts and monitored as it returns to full functionality. The incident is declared resolved.

Future consideration(s)

  • Auvik is currently improving backend monitoring and stability within the product and infrastructure. These improvements are intended to help proactively mitigate potential issues in the future.
Posted Feb 18, 2025 - 11:38 EST

Resolved

Affected Services: Clients on US4 are now accessible.

Description:
The issue affecting US4 tenants has been resolved. Regular service has been restored, and all systems are operating as expected.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 03, 2025 - 06:30 EST

Monitoring

Affected Services: Clients on US4 Cluster


Description:
Our team has implemented a fix for the issue, and tenants are in the process of becoming fully accessible. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should be operating normally, with a few client sites still in the process of starting up. We continue to monitor for any irregularities.

Next Steps:
We will provide a final update once we confirm the issue is fully resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 03, 2025 - 06:01 EST

Identified

Affected Services: All clients are currently not accessible

Description:
Our team has identified the root cause of the outage. We are currently working on a solution to restore normal service levels.

Impact:
While we work on the resolution, users may experience slower load times and intermittent connectivity issues.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or by 11:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 03, 2025 - 05:32 EST

Investigating

Affected Services: All clients are currently not accessible
Services not impacted: N/A

Description:
Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users are unable to access their tenants.

Next Steps:
We will provide updates as more information becomes available or by 11:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 03, 2025 - 05:23 EST
This incident affected: Network Mgmt (us4.my.auvik.com).