Discovered: Feb 14, 2025, 22:42 UTC
Resolved: Feb 17, 2025, 16:30 UTC
Auvik changed its system to fix a third-party integration that was not processing information as expected for a client on the US3 cluster.
This change exposed a bug in the code that overloaded the backend systems. The overload corrupted data in the hierarchical tables, which led to further instability for clients on the US3 cluster.
All times in UTC
02/14/2025
22:42 - First signs of increased backend pressure on the systems on the US3 cluster.
02/15/2025
15:40 - Backend pressure on the US3 cluster increases. Engineering begins to monitor its systems for performance issues.
16:00-23:00 - Engineering attempts several interventions to reduce backend pressure. Success is intermittent. Ultimately, the root cause is identified as an abnormally growing dataset due to a bug.
23:00 - The tenant associated with the growing dataset is disabled. However, several backends in the cluster are in severe distress, requiring a complete reboot of the cluster. A reboot is initiated.
02/16/2025
00:00 - Most tenants are observed to be functional.
00:23 - Steps are initiated to re-enable the offending tenant. Unforeseen issues during this step create a cascading failure that results in another cluster reboot.
00:42 - The offending tenant is disabled again, and US3 is rebooted.
03:15 - Cluster is deemed stable.
16:00 - Engineering diagnoses a further root cause of the instability arising from the offending tenant.
17:00-21:30 - This tenant’s data is cleaned up manually. Finally, the tenant is restarted successfully.
02/17/2025
16:30 - Engineering notes that some hierarchical datasets have been corrupted, causing some tenants' alert notifications to be set to default values. Engineering initiates a cleanup of all such occurrences.
03/01/2025 - A fix is applied to prevent further occurrences of such issues. All clusters are upgraded with this bug fix.