Service Disruption - Clients on the US3 cluster are receiving 500 errors when trying to access their sites
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption

Backend Resource Strain and Service Disruption over a multi-day period

Root Cause Analysis

Duration of incident

Discovered: Oct 07, 2024 - 09:56 UTC
Resolved: Oct 07, 2024 - 19:00 UTC
Discovered: Oct 14, 2024 - 10:55 UTC
Resolved: Oct 14, 2024 - 14:00 UTC
Discovered: Oct 16, 2024 - 05:42 UTC
Resolved: Oct 17, 2024 - 13:37 UTC

Cause

The primary cause of this multi-day incident was a combination of backend instability and resource management challenges triggered by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration led to excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests through the Web Application Firewall (WAF) and misconfigurations further strained backend resources, resulting in widespread service disruptions and extended recovery time.
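
For illustration only: the failure mode described above is a classic JVM concurrency hazard, assuming the Autotask integration runs on a JVM-based backend. A plain java.util.HashMap written by multiple threads has undefined behavior, and under contention its internal structure can become corrupted so that operations spin and pin a CPU core. The sketch below is not Auvik's code; the class name, keys, and thread counts are hypothetical. It contrasts the unsafe pattern with the standard ConcurrentHashMap remedy.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MapSafetyDemo {

        // HAZARD (do not do this): HashMap is not thread-safe. Concurrent writes
        // can corrupt its internal buckets; a lookup on the corrupted structure
        // may then loop indefinitely, pinning a core at 100% CPU.
        static final Map<String, Long> unsafeCounts = new HashMap<>();

        // REMEDY: ConcurrentHashMap supports concurrent reads and writes without
        // external locking, avoiding the corruption and CPU-spin failure mode.
        static final Map<String, Long> safeCounts = new ConcurrentHashMap<>();

        public static void main(String[] args) throws InterruptedException {
            Runnable writer = () -> {
                for (int i = 0; i < 100_000; i++) {
                    // merge() performs an atomic read-modify-write on ConcurrentHashMap.
                    safeCounts.merge("device-" + (i % 64), 1L, Long::sum);
                }
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            // Total should be exactly 200,000; with the unsafe map, updates could
            // be lost or the process could hang instead.
            long total = safeCounts.values().stream().mapToLong(Long::longValue).sum();
            System.out.println("recorded updates: " + total);
        }
    }

Swapping in a concurrent map is the usual short-term fix when this class of bug is found in production, consistent with the remediation listed under Future consideration(s) below.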

Effect

The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.

Action taken

All times in UTC

10/07/2024

Initial Detection and Escalation

09:56 - 10:02 Key symptoms identified:

  • High heap usage across multiple backends.
  • Communication failures between nodes in clusters CA1 and US1, causing tenant access issues.
  • Multiple tenants stuck in a verifying state.

10:20 - 11:30 Escalated mitigations:

  • Decided to restart CA1, followed by US1, to address the node communication issues.
  • Updated the status page to notify users of the ongoing disruptions.

12:19 - 13:06 Status recap and monitoring of ongoing issues, including:

  • Continued high heap usage.
  • Tenant availability errors (504s) due to lost seed nodes.
  • Investigation of tenant verification issues.

14:00 - 19:00 Work continued on the model instability investigation and backend performance issues, with some partial fixes applied.
19:00 Temporary workaround applied to stabilize model flapping.

10/14/2024

Continued Investigation and Remediation

11:30 Focused mitigation for US4 clients to stabilize tenant access and service performance.
14:00 Affected sites and tenants restarted, resolving some availability issues.

10/16/2024

Addressing WAF and High CPU Issues

17:15 WAF mitigation steps taken, blocking excessive requests from specific IPs.
18:31 WAF issues confirmed resolved after blocking IPs responsible for high traffic.
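
Auvik has not published the WAF rules used here, but the mitigation amounts to rate limiting and then blocking specific source IPs at the edge. As a purely illustrative sketch (not Auvik's WAF configuration), the snippet below expresses the same idea in application code: count requests per source IP within a time window and block any IP that exceeds a hypothetical threshold.

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: a per-source-IP request limiter of the kind a WAF rule
    // expresses declaratively. Names and the threshold are hypothetical.
    public class IpThrottle {
        private static final int MAX_REQUESTS_PER_WINDOW = 300; // hypothetical threshold
        private final Set<String> blockedIps = ConcurrentHashMap.newKeySet();
        private final Map<String, AtomicInteger> windowCounts = new ConcurrentHashMap<>();

        /** Returns true if the request should be allowed through to the backend. */
        public boolean allow(String sourceIp) {
            if (blockedIps.contains(sourceIp)) {
                return false; // already blocked, as was done for the offending IPs
            }
            int count = windowCounts
                    .computeIfAbsent(sourceIp, ip -> new AtomicInteger())
                    .incrementAndGet();
            if (count > MAX_REQUESTS_PER_WINDOW) {
                blockedIps.add(sourceIp); // escalate to a block once the threshold is exceeded
                return false;
            }
            return true;
        }

        /** Called periodically (e.g., by a scheduler) to start a new counting window. */
        public void resetWindow() {
            windowCounts.clear();
        }

        public static void main(String[] args) {
            IpThrottle throttle = new IpThrottle();
            for (int i = 0; i < 305; i++) {
                if (!throttle.allow("203.0.113.7")) { // documentation-range IP
                    System.out.println("blocked after " + i + " requests in this window");
                    break;
                }
            }
        }
    }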

10/16/2024

High CPU Issues and Tenant Rebalancing

10:55 - 11:25 High CPU usage detected on multiple backends:

  • Affected backends capped, restarted, and drained to mitigate load.

12:12 - 12:29 Specific problematic tenants identified whose issues were triggering frequent backend moves and further resource strain.
15:00 - 18:00 Troubleshooting and tenant isolation continued; problematic tenants were isolated, and partial recovery was achieved.

10/17/2024

Root Cause Fixes and Final Resolution

10:35 Further diagnosis identified the root cause: the non-thread-safe map driving the high CPU usage.
13:27 A short-term fix was applied to stabilize the problematic tenant and manage resource allocation.
13:37 Confirmed complete restoration of affected tenants and systems.

Future consideration(s)

  • Auvik has deployed a fix for the model identification instability.
  • Auvik has implemented a fix to address tenants stuck in a verifying state that cannot locate their tenant manager.
  • Auvik has implemented a fix to prevent the identified third-party integration from locking CPU processes, which caused backends to fail due to high resource consumption.
  • Auvik has deployed a fix to prevent long device names from causing continual tenant failures across backends.
  • Auvik has added enhanced monitoring for excessive backend tenant failures.
Posted Nov 01, 2024 - 09:50 EDT

Resolved
The fix has been implemented for the sites that were returning 500 errors or were inaccessible. The source of the disruption has been resolved, and services have been fully restored.
Posted Oct 16, 2024 - 06:58 EDT
Monitoring
We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are implementing the fix and will keep you posted on a resolution.
Posted Oct 16, 2024 - 06:51 EDT
Identified
We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are working to restore service as quickly as possible.
Posted Oct 16, 2024 - 06:38 EDT
Investigating
We’re experiencing disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We will continue to provide updates as they become available.
Posted Oct 16, 2024 - 05:59 EDT
This incident affected: Network Mgmt (us3.my.auvik.com).