Discovered: Mar 31, 2025, 13:15 - UTC
Resolved: Apr 01, 2025, 15:52 - UTC
The performance of the ping server service on the US3 cluster degraded and produced invalid data.
The ping server service sent incorrect internet connection check data to the alerting service, which generated large batches of false alerts that were sent to customers on the US3 cluster.
All times are in UTC
03/31/2025
17:10 - The ping server service starts showing symptoms of degradation.
17:15 - Internet connections are marked offline. Customers on the US3 cluster receive excessive false alerts from the cloud ping check service.
17:20 - The Auvik engineering team begins its investigation.
17:20-20:00 - Auvik continues its investigation and disables the cloud ping service for several large customers on the US3 cluster to prevent excessive alerting once the service is restored.
20:00 - Auvik resets the ping server service on the US3 cluster. Ping services fail over to the backup primary ping server service.
22:25 - The primary ping server service load rises to a level that begins impacting customers on other clusters.
04/01/2025
00:00 - The US3 cluster is restarted to revert cloud ping checks to the US3 cluster ping server services. Auvik notifies the customers whose cloud ping checks were disabled that the service will remain down until engineering can confirm the checks can be re-enabled without causing excessive alerting.
01:00-01:25 - The US3 cluster fully restarts successfully. Functionality is restored for most clients on the US3 cluster.
12:00-15:30 - Engineering reviews the disabled configurations and disables the responses to the cloud ping check-based alerts.
15:30-15:52 - Auvik validates that all cloud ping check services and alerts are enabled for all customers on the US3 cluster. Additional clean-up commences. The incident is concluded.
Our error handling in the service that processes the cloud ping server data has been improved to identify and ignore invalid data.
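As an illustration only, the idea of identifying and ignoring invalid data before it reaches alerting might be sketched as below. This is not Auvik's implementation: the record shape (PingResult), its field names, and the validity rules are all assumptions made for the example, since this report does not describe the actual data format.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PingResult:
    """Hypothetical record shape; the real payload is not described in this report."""
    target: str
    reachable: Optional[bool]
    rtt_ms: Optional[float]


def is_valid(result: PingResult) -> bool:
    """Return True only for records that are internally consistent (assumed rules)."""
    if not result.target:
        return False
    if not isinstance(result.reachable, bool):
        return False
    rtt = result.rtt_ms
    if result.reachable:
        # A successful check must carry a finite, non-negative round-trip time.
        if rtt is None or isinstance(rtt, bool) or not isinstance(rtt, (int, float)):
            return False
        return math.isfinite(rtt) and rtt >= 0
    # A failed check should not report a round-trip time.
    return rtt is None


def filter_valid(results: Iterable[PingResult]) -> List[PingResult]:
    """Drop invalid records so they are never turned into alerts."""
    return [r for r in results if is_valid(r)]


if __name__ == "__main__":
    batch = [
        PingResult("203.0.113.10", True, 12.5),
        PingResult("203.0.113.12", True, float("nan")),  # invalid: non-finite RTT
        PingResult("203.0.113.13", None, None),          # invalid: no reachability state
    ]
    for record in filter_valid(batch):
        print(record.target)
```

In a sketch like this, a record that fails validation is skipped rather than interpreted as "offline", so malformed data from a degraded ping server would not by itself mark a connection as down.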