Service Degraded - Cloud Ping check not working for some tenants

Incident Report for Auvik Networks Inc.

Postmortem

Service Degraded - Cloud Ping Services Check Failing Intermittently on the US3 Cluster

Root Cause Analysis

Duration of incident

Discovered: Feb 19, 2025 14:18 UTC
Resolved: Mar 01, 2025 15:00 UTC

Cause

The Cloud Ping service became unstable due to a large number of clients running ping checks at a 5-second interval, leading to widespread ping check failures.
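To illustrate why the check interval matters, the arithmetic can be sketched as follows. This is a minimal illustration, not Auvik's implementation: the function name and the figure of 1,000 configured checks are hypothetical, since the report does not state how many checks were involved. It compares the 5-second interval with the one-minute default described under Future consideration(s).

```python
def checks_per_minute(num_checks: int, interval_seconds: int) -> float:
    """Pings issued per minute by num_checks checks, each run every interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_checks * 60 / interval_seconds


# Hypothetical number of configured checks; the report gives no actual figure.
HYPOTHETICAL_CHECKS = 1000

at_5s = checks_per_minute(HYPOTHETICAL_CHECKS, 5)
at_60s = checks_per_minute(HYPOTHETICAL_CHECKS, 60)
ratio = at_5s / at_60s

print(f"5-second interval:  {at_5s:.0f} pings/min")
print(f"60-second interval: {at_60s:.0f} pings/min")
print(f"ratio: {ratio:.0f}x")
```

Whatever the real number of checks, the ratio does not depend on it: a 5-second interval generates 12 times as many pings as a one-minute interval.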

Effect

Clients received an excessive number of Cloud Ping check alerts for the failed pings.

Action taken

All times in UTC

02/13/2025-02/19/2025

Auvik starts receiving complaints about an unusually high number of internet connection failures. A general investigation begins with the customers reporting these issues.

02/19/2025

14:18 - Auvik Engineering ascertains that the US3 cluster has several clients with a high number of internet connection checks set to the 5-second setting. An internal investigation then begins.

17:42 - Auvik disables Cloud Ping alerts in the US3 cluster for the affected clients.

17:53-18:44 - Auvik Engineering decides to restart the ping service to help clear the lag and re-stabilize it. A maintenance window is required to perform this action.

19:00 - A one-hour maintenance window is started.

19:21 - The work required under the maintenance window concludes early, and the services are back up and running. Cloud Ping alerts are restored for all clients.

02/24/2025

It’s noted that while the ping service is behaving normally for most clients, intermittent problems continue. It is determined that a complete cluster restart is required. To minimize the impact on all customers, a decision is made to perform the maintenance on 03/01/2025.

03/01/2025

12:00-15:00 - Auvik undergoes maintenance, during which US3 is safely restarted to restore the health of all services.

Future consideration(s)

  • Auvik has worked with the several clients who had configured 5-second ping checks, in order to regulate the flow of checks and prevent system overload.
  • Auvik will remove clients' ability to run a 5-second Cloud Ping check and will default the check frequency to one minute.
    The timing of this change will follow in future Auvik release notes.
Posted Mar 11, 2025 - 09:56 EDT

Resolved

Affected Services: Cloud Ping Check
Cluster(s): All Clusters

Description:
We experienced degraded performance with the Cloud Ping Service check. The issue has now been resolved.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 19, 2025 - 15:27 EST

Monitoring

Affected Services: Cloud Ping Check
Cluster(s): All Clusters

Description:
We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Service should operate normally; however, we continue monitoring for any irregularities.
Services: All other monitoring, alerting, maps and integrations are not impacted.


Next Steps:
We will provide a final update once we confirm the issue is fully resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 19, 2025 - 14:28 EST

Update

Affected Services: All alerts
Cluster(s): All Clusters

The alerting maintenance window has ended. Alerts will now flow as intended.

Thank you for your patience as we work to restore full functionality.
Posted Feb 19, 2025 - 14:21 EST

Update

Affected Services: All alerts
Cluster(s): All Clusters

Auvik is posting an emergency maintenance window to disable alerts starting at 19:00 UTC. Alerts are scheduled to be re-enabled by 20:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 19, 2025 - 13:44 EST

Identified

Affected Services: Cloud Ping Check
Cluster(s): All Clusters

Description:
We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience excessive false alerts. Resources failing over from US3 may affect alerting in other clusters.
Services: All other monitoring, alerting, maps and integrations are not impacted.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 19, 2025 - 13:00 EST

Update

Affected Services: Cloud Ping Check
Cluster(s): US3

Description:
We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience excessive false alerts.
Services: All other monitoring, alerting, maps and integrations are not impacted.

Auvik recommends that you disable your Cloud Ping Check and any customized Cloud Ping Check alerts until the problem is resolved.

Next Steps:
We will provide updates as more information becomes available or by 18:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 19, 2025 - 12:28 EST

Investigating

Affected Services: Cloud Ping Check
Cluster(s): US3

Description:
We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience excessive false alerts.
Services: All other monitoring, alerting, maps and integrations are not impacted.

Next Steps:
We will provide updates as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Feb 19, 2025 - 12:14 EST
This incident affected: Network Mgmt (us3.my.auvik.com).