Discovered: Dec 19, 2023 15:58 - UTC
Resolved: Dec 19, 2023 18:15 - UTC
Action taken to clean up residual issues from the incident "Service Disruption - Devices Deleted from Auvik UI when seen as down in health check".
Backend resources on the US3 cluster were overloaded. Collectors then disconnected from their tenants on the backends, and logins failed until the tenants were restarted. Engineering then rebooted the US3 cluster to restore stability; the reboot behaved like an Auvik biweekly maintenance window.
During the cluster reboot, all collectors disconnected, and tenants could not log in for anywhere from a few minutes to a few hours, depending on the order in which they restarted.
No previously collected data was lost between the start of this incident and the recovery.
All times in UTC
12/19/2023
15:58 - Auvik Engineering notices issues with US3 customer tenants.
16:20 - Initial investigation into metrics on the US3 cluster.
17:00 - Engineering decides to reboot the US3 cluster.
17:00 - 18:15 - Engineering monitors the tenants after the reboot, much like after a maintenance window. Engineering manually brings up the larger tenants.
18:15 - The incident is deemed closed.