Discovered: Feb 19, 2025 14:18 - UTC
Resolved: Mar 01, 2025 15:00 - UTC
The Cloud Ping service became unstable due to a large number of clients running ping checks at a 5-second interval, leading to widespread ping check failures.
Clients received an excessive number of Cloud Ping check alerts corresponding to the failed pings.
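As a rough illustration of why a short check interval matters, the sketch below estimates the average ping rate produced by a given number of checks. The function name, the check count, and the comparison intervals are all hypothetical and chosen only for illustration; they are not figures from this incident or from the Cloud Ping service itself.

```python
def pings_per_second(check_count: int, interval_seconds: float) -> float:
    """Average ping rate when each of check_count checks fires once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return check_count / interval_seconds


# Hypothetical numbers for illustration only; not measurements from this incident.
checks = 1_000
for interval in (5, 30, 60):
    rate = pings_per_second(checks, interval)
    print(f"{interval:>2}s interval: {rate:.1f} pings/s")
```

Under these assumptions, the same set of checks generates twelve times as many pings per second at a 5-second interval as at a 60-second interval, which is the kind of load growth described above.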
All times in UTC
02/13/2025-02/19/2025
Auvik starts receiving complaints about an unusually high number of internet connection failures. A general investigation begins with the customers reporting these issues.
02/19/2025
14:18 - Auvik Engineering determines that several clients in the US3 cluster have a high number of internet connection checks configured at the 5-second interval. An internal investigation begins.
17:42 - Auvik disables Cloud Ping alerts in the US3 cluster for the affected clients.
17:53-18:44 - Auvik Engineering decides to restart the ping service to help clear the lag and re-stabilize it. A maintenance window is required to perform this action.
19:00 - A one-hour maintenance window is started.
19:21 - The work required under the maintenance window concludes early, and the services are back up and running. Cloud Ping alerts are restored for all clients.
02/24/2025
It’s noted that while the ping service is behaving normally for most clients, intermittent problems continue. It is determined that a complete cluster restart is required. To minimize the impact on customers, the maintenance is scheduled for 03/01/2025.
03/01/2025
12:00-15:00 - Auvik undergoes maintenance, during which US3 is safely restarted to restore the health of all services.