Discovered: Oct 07, 2024 09:56 - UTC
Resolved: Oct 07, 2024 19:00 - UTC
Discovered: Oct 14, 2024 10:55 - UTC
Resolved: Oct 14, 2024 14:00 - UTC
Discovered: Oct 16, 2024 05:42 - UTC
Resolved: Oct 17, 2024 13:37 - UTC
The primary cause of this multi-day incident was backend instability driven by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration caused excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests passing through the Web Application Firewall (WAF), together with misconfigurations, further strained backend resources, resulting in widespread service disruptions and an extended recovery time.
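The non-thread-safe map failure mode is worth spelling out. As a minimal sketch, assuming a JVM-based backend (the report does not name the implementation language), concurrent writes to a java.util.HashMap can corrupt its internal bucket table and, on some JVM versions, leave threads spinning at 100% CPU with no exception thrown; the standard remedy is ConcurrentHashMap. The class and field names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TenantCache {
    // BUG: HashMap is not thread-safe. Concurrent put() calls during an
    // internal resize can corrupt the bucket table; a thread may then spin
    // forever traversing a cyclic chain, pinning a core at 100% CPU.
    private static final Map<String, String> unsafe = new HashMap<>();

    // FIX: ConcurrentHashMap provides thread-safe get/put with no
    // external synchronization required.
    private static final Map<String, String> safe = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                // Swap `safe` for `unsafe` to risk reproducing the corruption.
                safe.put(Thread.currentThread().getName() + "-" + i, "tenant-state");
            }
        };
        Thread t1 = new Thread(writer, "writer-1");
        Thread t2 = new Thread(writer, "writer-2");
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("entries: " + safe.size()); // deterministic with ConcurrentHashMap
    }
}
```

Depending on timing and JVM version, the corruption can instead surface as lost updates rather than a CPU spin, which is why this class of bug often escapes load testing.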
The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.
All times in UTC
10/07/2024
Initial Detection and Escalation
09:56 - 10:02 Key symptoms identified: elevated 500/504 error rates, high heap usage, and slow UI loading.
10:20 - 11:30 Escalated mitigations:
Decided to restart CA1, followed by US1, to address node communication issues.
Status page updated to notify users of ongoing disruptions.
12:19 - 13:06 Status recap and monitoring of ongoing issues, including:
Continued high heap usage.
Tenant availability errors (504s) due to lost seed nodes.
Investigation of tenant verification issues.
14:00 - 19:00 Work continues on investigating model instability and backend performance issues, with some partial fixes applied.
19:00 Temporary workaround applied to stabilize model flapping.
10/14/2024
Continued Investigation and Remediation
11:30 Focused mitigation for US4 clients to stabilize tenant access and service performance.
14:00 Affected sites and tenants restarted, resolving some availability issues.
10/16/2024
High CPU Issues and Tenant Rebalancing
10:55 - 11:25 High CPU usage detected on multiple backends:
Affected backends are capped, restarted, and drained to mitigate load.
12:12 - 12:29 Problematic tenants identified whose activity triggered frequent backend moves and further resource strain.
15:00 - 18:00 Troubleshooting and tenant isolation continue; problematic tenants are isolated, and partial recovery is achieved.
10/16/2024
Addressing WAF and High CPU Issues
17:15 WAF mitigation steps taken, blocking excessive requests from specific IPs (the sketch after this section illustrates the pattern).
18:31 WAF issues confirmed resolved after blocking the IPs responsible for the high traffic.
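For context on the WAF mitigation above, the following is a hedged sketch of the general pattern such a rule enforces: deny-listing the abusive source IPs and capping per-IP request rates. It is an illustration only, not the actual WAF configuration; the class name, IPs (documentation-range addresses), and limits are all hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IpThrottle {
    // Hypothetical deny-list of abusive source IPs (placeholder addresses).
    private static final Set<String> BLOCKED = Set.of("203.0.113.7", "198.51.100.23");
    private static final int MAX_REQ_PER_WINDOW = 100; // hypothetical per-IP cap
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    /** Returns true if the request should be allowed through to the backend. */
    public boolean allow(String sourceIp) {
        if (BLOCKED.contains(sourceIp)) {
            return false; // outright block, mirroring the WAF IP deny rule
        }
        // Count requests in the current window; window reset is left to a scheduled task.
        int n = counts.computeIfAbsent(sourceIp, ip -> new AtomicInteger()).incrementAndGet();
        return n <= MAX_REQ_PER_WINDOW;
    }

    public static void main(String[] args) {
        IpThrottle throttle = new IpThrottle();
        System.out.println(throttle.allow("203.0.113.7")); // false: deny-listed
        System.out.println(throttle.allow("192.0.2.10"));  // true: under the cap
    }
}
```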
10/17/2024
Root Cause Fixes and Final Resolution
10:35 Further diagnosis confirms the root cause: the non-thread-safe map in the Autotask integration driving the high CPU usage.
13:27 A short-term fix applied to stabilize the problematic tenant and manage resource allocation (a sketch of one such pattern follows the timeline).
13:37 Confirmed complete restoration of affected tenants and systems.
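The report does not describe the short-term fix itself. As an assumption-laden sketch of one common pattern for stabilizing a problematic tenant, the code below caps a single tenant's concurrent work with a per-tenant semaphore, so a noisy tenant sheds load instead of starving the backend; the class, method, and limit are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class TenantLimiter {
    private static final int PER_TENANT_LIMIT = 8; // hypothetical concurrency cap
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** Runs work for a tenant, rejecting it if the tenant is at its cap. */
    public boolean tryRun(String tenantId, Runnable work) {
        Semaphore s = permits.computeIfAbsent(tenantId, id -> new Semaphore(PER_TENANT_LIMIT));
        if (!s.tryAcquire()) {
            return false; // shed load for the noisy tenant rather than destabilize the backend
        }
        try {
            work.run();
            return true;
        } finally {
            s.release();
        }
    }

    public static void main(String[] args) {
        TenantLimiter limiter = new TenantLimiter();
        boolean ran = limiter.tryRun("tenant-42", () -> System.out.println("handled request"));
        System.out.println("accepted: " + ran);
    }
}
```

The design choice here is load shedding over queuing: rejected requests fail fast and can be retried, whereas unbounded queuing for a misbehaving tenant would reproduce the resource exhaustion the fix is meant to contain.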