Performance Disruption - Internal Data Requests to Auvik’s Systems Delayed to Customers on the US2 and US5 Clusters
Incident Report for Auvik Networks Inc.
Postmortem

Root Cause Analysis

Duration of incident

Discovered: Mar 28, 2024, 17:11 UTC
Resolved: Mar 28, 2024, 20:20 UTC

Cause

The delay in Auvik’s ability to process internal data requests was caused by additional processing overhead introduced by the Auvik Alerting Beta deployment.

Effect

All requests were delayed, including mapping, UI updates, data retrieval, monitoring, and alerting. The impact was limited to customers with tenants on the US2 and US5 clusters.

Action taken

All times in UTC

03/18/2024

14:00 - Auvik Alerting Beta was deployed to the US2 and US5 clusters.

03/19/2024

13:00 - Additional resources were added to the US2 and US5 clusters to address processing lag introduced by the Auvik Alerting Beta deployment.

03/19/2024 - 03/28/2024

The Auvik Alerting Beta continues to run on the US2 and US5 clusters.

03/28/2024

17:11 - Engineering investigates data processing lag reported by customers on the US2 and US5 clusters and discovers considerable lag in several Auvik processes.

17:20 - Internal Auvik resources meet to determine the root cause of the performance issues.

18:05 - 18:15 - Auvik increases processing resources on the affected clusters, which locks customers out of the system for approximately 10 minutes. Soon after, an update reporting the interruption is posted on the Auvik Status page.

18:15 - 22:20 - The engineering teams work with the hosting company to adjust the resources on the US2 and US5 clusters to handle the new processing requirements created by the Auvik Alerting Beta.

03/29/2024 - 03/30/2024

Non-optimized data and unused space are cleaned from the system to improve system efficiency and performance.

Future consideration(s)

  • Better understand the differences in database instances and implement the proper builds within the product.
  • Implement internal alerting that detects growing processing lag, such as the lag discovered in this incident, so it can be addressed early.
  • Create an internal performance insight metric to better understand the effects of implementing significant scale changes to the system.
  • Evaluate engineering team permissions to the system and address blockers to resolve issues where appropriate.
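
As an illustration of the internal lag alerting listed above, the following is a minimal sketch only, not Auvik's implementation. It assumes processing lag is sampled periodically per cluster and flags a window in which lag grows faster than a chosen rate. The names `LagSample` and `lag_is_growing`, and the default thresholds, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LagSample:
    """One hypothetical measurement of processing lag for a cluster."""
    minute: int          # minutes since the start of the observation window
    lag_seconds: float   # how far processing is behind real time


def lag_is_growing(samples: Sequence[LagSample],
                   min_samples: int = 5,
                   max_slope: float = 1.0) -> bool:
    """Return True when lag grows faster than max_slope seconds per minute.

    The growth rate is the ordinary least-squares slope of lag over time,
    so a single noisy sample does not trigger the alert on its own.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    n = len(samples)
    if n < min_samples:
        return False  # too little data to judge a trend
    mean_x = sum(s.minute for s in samples) / n
    mean_y = sum(s.lag_seconds for s in samples) / n
    var_x = sum((s.minute - mean_x) ** 2 for s in samples)
    if var_x == 0:
        return False  # all samples share one timestamp; no slope defined
    cov = sum((s.minute - mean_x) * (s.lag_seconds - mean_y) for s in samples)
    return cov / var_x > max_slope


# Usage: constant lag does not alert, steadily rising lag does.
steady = [LagSample(m, 30.0) for m in range(10)]
rising = [LagSample(m, 30.0 + 5.0 * m) for m in range(10)]
print(lag_is_growing(steady), lag_is_growing(rising))
```

Alerting on the trend rather than on a fixed lag value is one possible design choice; it is meant to surface slow growth over days, as in the period between the beta deployment and the incident.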
Posted Apr 18, 2024 - 19:20 EDT

Resolved
Auvik’s systems experienced delays with data requests for customers on the US2 and US5 clusters on March 28th, 2024. This impact on performance occurred between 17:11 and 20:20 UTC. There was no data loss or downtime.
Posted Mar 28, 2024 - 19:10 EDT