Service Disruption - 401 and 404 errors connecting to the Auvik US4 cluster
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Access to Auvik Sporadic or Completely Unavailable

Root Cause Analysis

Duration of incident

Discovered: Jun 01, 2023 14:30 - UTC
Resolved: Jun 03, 2023 13:00 - UTC

Cause

Auvik’s account software reloaded all account data rather than only the records that required updates, which triggered an extremely large, data-intensive update across Auvik’s platform.
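To illustrate the difference between reloading everything and updating only what changed, here is a minimal sketch. Auvik has not published its implementation, so the `AccountRecord` type, the field names, and both functions are hypothetical; the sketch only assumes that each account record can be compared against its stored copy.

```python
from dataclasses import dataclass


# Hypothetical record shape; the real account schema is not public.
@dataclass(frozen=True)
class AccountRecord:
    account_id: str
    plan: str
    seats: int


def full_reload(source: dict[str, AccountRecord]) -> list[AccountRecord]:
    """Treat every source record as a write, whether or not it changed."""
    return list(source.values())


def delta_update(
    source: dict[str, AccountRecord],
    stored: dict[str, AccountRecord],
) -> list[AccountRecord]:
    """Return only the records that are new or differ from the stored copy."""
    return [
        record
        for account_id, record in source.items()
        if stored.get(account_id) != record
    ]


stored = {
    "a1": AccountRecord("a1", "basic", 5),
    "a2": AccountRecord("a2", "pro", 20),
}
source = {
    "a1": AccountRecord("a1", "basic", 5),  # unchanged
    "a2": AccountRecord("a2", "pro", 25),   # seat count changed
    "a3": AccountRecord("a3", "basic", 1),  # new account
}

reload_writes = full_reload(source)
delta_writes = delta_update(source, stored)
print(len(reload_writes), len(delta_writes))  # prints "3 2"
```

In this toy case the full reload issues a write for every account, while the delta issues writes only for the changed and the new record. At platform scale that gap is what separates a routine update from a very large one.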

Effect

This update caused a backlog of over 13 billion transactions across Auvik’s databases. The backlog delayed other processes: multiple services stopped responding, and customers were unable to access their Auvik sites.

Action taken

All times in UTC

06/01/2023

14:30 - Auvik Engineering notices a large spike in CPU usage on one of Auvik’s production clusters.

14:45 - A memory spike occurs. Auvik begins to receive complaints from customers who cannot access the system.

15:30 - Engineering restarts some backend services to relieve backpressure on the systems. No cause for the issue has been determined yet.

15:30 - 20:30 - Engineering continues monitoring the system while investigating the root cause.

20:30 - The US4 cluster is still unstable. Because of the continued issues, engineering opts to restart the cluster’s frontend services to try to stabilize access.

20:34 - Engineering begins to restart its frontend services on the US4 cluster.

21:34 - Engineering completes the restart of its frontend services on the US4 cluster. The system appears stable enough to monitor for the evening.

06/02/2023

10:00 - Auvik staff begin investigating overnight system performance.

13:00 - Overall, the system is up, and the incident on the status page is closed. Auvik Engineering continues to investigate the cause of the slowdown. The resource issues are no longer occurring, but sites on the US4 cluster still see intermittent access and performance issues.

13:00 - 17:00 - Investigation continues into why some sites on the US4 cluster are still experiencing lingering issues. Auvik believes that a full shutdown and restart of the US4 cluster will resolve them. Engineering is still examining logs and systems to find the initial cause of the instability.

17:00 - The decision is made to avoid further disturbances and wait for our scheduled Saturday maintenance window to restart the US4 cluster. Auvik support continues to monitor the situation.

06/03/2023

11:00 - 13:00 - Auvik performs its scheduled maintenance, which includes a full stop and start of services on the US4 cluster. Services restart without issue, and the system appears stable.

06/03/2023 - 06/14/2023

Engineering continues to investigate the cause of the initial spike in resource usage on June 1.

06/14/2023

The cause of the incident is determined and validated. Data processing in the accounts system is identified as the root cause. A plan is implemented to prevent this issue from occurring again.

Future consideration(s)

  • Auvik will investigate the processes its account software uses to see if a less resource-intensive method can be used.
  • Auvik will change how it writes data during account reconciliation to reduce the impact a large call can have on its system.
  • Auvik will fix a bug discovered in its code to prevent this process from taking down services.
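
One common way to limit the impact of a large call is to split the work into bounded batches with a short pause between them, so downstream systems are never handed the whole load at once. The sketch below is an assumption about what such a change could look like, not Auvik’s actual fix; `apply_in_batches`, `write_batch`, and the default sizes are hypothetical.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(records: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def apply_in_batches(
    records: Iterable,
    write_batch: Callable[[list], None],
    batch_size: int = 500,
    pause_seconds: float = 0.0,
) -> int:
    """Write records in bounded batches, optionally pausing between them.

    `write_batch` is a hypothetical callback standing in for whatever
    persists one batch. Returns the number of records written.
    """
    total = 0
    for batch in chunked(records, batch_size):
        write_batch(batch)
        total += len(batch)
        if pause_seconds > 0:
            time.sleep(pause_seconds)  # give downstream consumers time to drain
    return total


written_batches: list[list] = []
count = apply_in_batches(range(1050), written_batches.append, batch_size=500)
print(count, [len(batch) for batch in written_batches])  # prints "1050 [500, 500, 50]"
```

A fixed pause is the simplest form of throttling. A production system would more likely adjust the rate based on queue depth or database load.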
Posted Jun 26, 2023 - 10:20 EDT

Resolved
The resolution for the service disruption to the US4 cluster is in place. The source of the disruption has been resolved, and services have been fully restored.
Posted Jun 02, 2023 - 09:39 EDT
Investigating
We are continuing to investigate a service disruption to the US4 cluster. Access to client sites has been restored, but we continue to receive reports of sites intermittently not loading properly. We will continue to provide updates as they become available.
Posted Jun 02, 2023 - 06:48 EDT
Monitoring
We’ve addressed the source of the service disruption with the US4 cluster and continue to monitor the situation. The current status has stabilized. We will monitor it over the evening to be sure. We will post an update by 13:00 UTC unless the circumstances change.
Posted Jun 01, 2023 - 16:33 EDT
Update
We are continuing to investigate the service disruption to the US4 cluster. We will continue to provide updates as they become available.
Posted Jun 01, 2023 - 15:52 EDT
Investigating
We’re experiencing a service disruption to the US4 cluster. Clients attempting to access sites are receiving 401/404 errors. We will continue to provide updates as they become available.
Posted Jun 01, 2023 - 14:46 EDT
This incident affected: Network Mgmt (us4.my.auvik.com).