Service Disruption for several tenants on the CA1 and US3 clusters

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption

Backend Resource Strain and Service Disruption over a multiple-day period

Root Cause Analysis

Duration of incident

Discovered: Oct 07, 2024 09:56 - UTC
Resolved: Oct 07, 2024 19:00 - UTC
Discovered: Oct 14, 2024 10:55 - UTC
Resolved: Oct 14, 2024 14:00 - UTC
Discovered: Oct 16, 2024 05:42 - UTC
Resolved: Oct 17, 2024 13:37 - UTC

Cause

The primary cause of this multi-day incident was a combination of backend instability and resource management challenges triggered by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration led to excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests through the Web Application Firewall (WAF) and misconfigurations further strained backend resources, resulting in widespread service disruptions and extended recovery time.
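The integration code itself is not public, so as a rough illustration of the failure mode described above, the sketch below (in Python, with hypothetical names throughout) shows how an unsynchronized read-modify-write on a shared map can lose updates under concurrency, and how serializing access with a lock restores correctness. In some runtimes, concurrent mutation of a non-thread-safe map can additionally corrupt its internal structure and leave threads spinning at 100% CPU, which matches the symptom in this incident.

```python
import threading

# Hypothetical illustration: a shared per-tenant counter map updated by
# many worker threads, as an integration sync loop might do.
counts = {}
lock = threading.Lock()

def record_sync(tenant, n, use_lock):
    for _ in range(n):
        if use_lock:
            with lock:  # serialize the read-modify-write
                counts[tenant] = counts.get(tenant, 0) + 1
        else:
            # Unsafe: get() and the store are separate steps, so two
            # threads can read the same value and one update is lost.
            counts[tenant] = counts.get(tenant, 0) + 1

def run(use_lock, workers=8, n=10_000):
    counts.clear()
    threads = [threading.Thread(target=record_sync,
                                args=("acme", n, use_lock))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["acme"]

print(run(use_lock=True))  # → 80000 (workers * n, every update kept)
```

With `use_lock=False` the final count can fall short of `workers * n` because concurrent threads read the same value before either writes back; the locked version is deterministic.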

Effect

The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.

Action taken

All times in UTC

10/07/2024

Initial Detection and Escalation

09:56 - 10:02 Key symptoms identified:

  • High heap usage across multiple backends.
  • Communication failures between nodes in clusters CA1 and US1, causing tenant access issues.
  • Multiple tenants stuck in a verifying state.

10:20 - 11:30 Escalated mitigations:

  • Decided to restart CA1, followed by US1, to address node communication issues.
  • Updated the status page to notify users of ongoing disruptions.

12:19 - 13:06 Status recap and monitoring of ongoing issues, including:

  • Continued high heap usage.
  • Tenant availability errors (504s) due to lost seed nodes.
  • Investigation of tenant verification issues.

14:00 - 19:00 Work continued on the model instability investigation and backend performance issues, with some partial fixes applied.
19:00 Temporary workaround applied to stabilize model flapping.

10/14/2024

Continued Investigation and Remediation

11:30 Focused mitigation for US4 clients to stabilize tenant access and service performance.
14:00 Affected sites and tenants restarted, resolving some availability issues.

10/16/2024

Addressing WAF and High CPU Issues

17:15 WAF mitigation steps taken, blocking excessive requests from specific IPs.
18:31 WAF issues confirmed resolved after blocking IPs responsible for high traffic.
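The actual WAF rules are not included in this report; as a hedged sketch of the mitigation idea (rejecting requests from IPs that exceed a budget), the following fixed-window counter uses hypothetical names and thresholds throughout.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Hypothetical sketch: allow at most `limit` requests per `window_s`
    seconds from each client IP; excess requests are rejected, as a WAF
    block rule would do."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)   # ip -> count in current window
        self.window_start = time.monotonic()

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        if now - self.window_start >= self.window_s:
            self.counts.clear()          # new window: reset all counters
            self.window_start = now
        self.counts[ip] += 1
        return self.counts[ip] <= self.limit

limiter = FixedWindowLimiter(limit=3, window_s=60)
print([limiter.allow("203.0.113.7", now=0.0) for _ in range(5)])
# → [True, True, True, False, False]
```

A production WAF would typically use a sliding window or token bucket to avoid burst effects at window boundaries, but the blocking decision (per-IP count versus a threshold) is the same in spirit.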

10/16/2024

High CPU Issues and Tenant Rebalancing

10:55 - 11:25 High CPU usage detected on multiple backends:

  • Affected backends capped, restarted, and drained to mitigate load.

12:12 - 12:29 Specific problematic tenants were identified that triggered frequent backend moves and further resource strain.
15:00 - 18:00 Troubleshooting and tenant isolation continued; problematic tenants were isolated, and partial recovery was achieved.

10/17/2024

Root Cause Fixes and Final Resolution

10:35 Further diagnosis identified the root cause: a non-thread-safe map in the Autotask integration driving high CPU usage.
13:27 A short-term fix was applied to stabilize the problematic tenant and manage resource allocation.
13:37 Confirmed complete restoration of affected tenants and systems.

Future consideration(s)

  • Auvik has deployed a fix for the model identification instability.
  • Auvik has implemented a fix for tenants stuck in a verifying state because they cannot locate their tenant manager.
  • Auvik has implemented a fix to prevent the identified third-party integration from locking CPU processes, which caused backends to fail due to high resource consumption.
  • Auvik has deployed a fix to prevent long device names from causing continual tenant failures across backends.
  • Auvik has added enhanced monitoring for excessive backend tenant failures.
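Auvik's monitoring stack is not described in this report; as a minimal sketch of the last item above (alerting on excessive backend tenant failures), a sliding-window failure counter could look like the following, with all names and thresholds hypothetical.

```python
from collections import deque

class FailureAlert:
    """Hypothetical sketch: fire an alert when more than `threshold`
    tenant failures occur within `window_s` seconds on a backend."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # timestamps of recent failures

    def record_failure(self, ts):
        self.events.append(ts)
        # Drop failures that have aged out of the sliding window.
        while self.events and ts - self.events[0] > self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold  # True => fire alert

alert = FailureAlert(threshold=3, window_s=300)
print([alert.record_failure(t) for t in (0, 60, 120, 180)])
# → [False, False, False, True]
```

The point of the window is to distinguish a sustained failure pattern (as in this incident, where tenants repeatedly moved between backends) from an isolated one-off failure.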
Posted Nov 01, 2024 - 09:49 EDT

Resolved

We experienced disruption with several tenants on clusters CA1 and US3. Sites were unavailable. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full internal review.
Posted Oct 07, 2024 - 15:45 EDT

Update

The disruption to tenants on cluster CA1 appears to have been addressed. We are monitoring the situation to validate all sites are responding as expected. We will update this page when the validation is complete.
Posted Oct 07, 2024 - 15:26 EDT

Update

We’re experiencing disruption with several tenants on cluster CA1. Some sites are responding slowly in the UI. We are also receiving reports of 401 errors when accessing sites. We will continue to provide updates as they become available.
Posted Oct 07, 2024 - 14:45 EDT

Identified

We’re experiencing disruption with several tenants on cluster CA1. Some sites are responding slowly in the UI. We will continue to provide updates as they become available.
Posted Oct 07, 2024 - 13:38 EDT

Update

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

We are receiving reports of UI responsiveness issues for clients on CA1 and are investigating.

Clients on US3 are continuing to start up. We will continue to monitor this process throughout the action.
Posted Oct 07, 2024 - 12:48 EDT

Update

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Clients on the CA1 cluster have recovered.

Clients on US3 are continuing to start up. We will continue to monitor this process throughout the action.
Posted Oct 07, 2024 - 12:11 EDT

Monitoring

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

The downtime for CA1 proceeds as expected, with 90% of sites reporting up. The remaining 10% are being monitored for completion.

Clients on US3 have begun their downtime window. We will continue to monitor this process throughout the action.
Posted Oct 07, 2024 - 11:30 EDT

Update

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Auvik is required to restart all tenants on the US3 cluster at 15:30 UTC (11:30 EDT), a delay from the previously posted 14:50 UTC restart. This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.
Posted Oct 07, 2024 - 10:50 EDT

Update

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Auvik has begun restarting the CA1 cluster. This will take up to 1.5 hours, but most sites will recover before that.
Posted Oct 07, 2024 - 10:42 EDT

Update

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Auvik is required to restart all tenants on the US3 cluster at 14:50 UTC (10:50 EDT). This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.
Posted Oct 07, 2024 - 10:25 EDT

Identified

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Auvik is required to restart all tenants on the CA1 cluster at 14:35 UTC (10:35 EDT). This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.
Posted Oct 07, 2024 - 10:24 EDT

Investigating

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.
Posted Oct 07, 2024 - 10:10 EDT
This incident affected: Network Mgmt (us3.my.auvik.com, ca1.my.auvik.com).