Service Degraded - Internet Connection Checks are creating false alerts on the US3 cluster.

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Cloud Ping Checks create false alerts on the US3 cluster.

Root Cause Analysis

Duration of incident

Discovered: Mar 31, 2025 - 13:15 UTC
Resolved: Apr 01, 2025 - 15:52 UTC

Cause

The performance of the ping server service on the US3 cluster degraded and produced invalid data.

Effect

The ping server service sent incorrect internet connection check data to the alerting service, which generated large batches of false alerts for customers on the US3 cluster.

Action taken

All times are in UTC

03/31/2025

17:10 - Ping Server started showing symptoms of degradation.

17:15 - Internet Connections are marked offline. Customers experience excessive false alert reports based on the cloud ping check service on the US3 cluster.

17:20 - The Auvik engineering team begins its investigation.

17:20-20:00 - Auvik continues its investigation and disables the cloud ping service for several large customers on the US3 cluster to prevent excessive alerting once the service is restored.

20:00 - Auvik resets the ping server service on the US3 cluster. Ping services fail over to the backup primary ping server service.

22:25 - The primary ping server service load rises to a level that begins impacting customers on other clusters.

04/01/2025

00:00 - The US3 cluster is restarted to revert cloud ping checks to the US3 cluster ping server services. Auvik notifies the customers whose cloud ping checks were disabled that the checks will remain off until engineering can confirm they can be enabled without causing excessive alerting.

01:00-01:25 - The US3 cluster fully restarts successfully. Functionality is restored for most clients on the US3 cluster.

12:00-15:30 - Engineering reviews the disabled configurations and disables the responses to the cloud ping check-based alerts.

15:30-15:52 - Auvik validates that all cloud ping check services and alerts are enabled for all customers on the US3 cluster. Additional clean-up commences. The incident is concluded.

Future consideration(s)

  • Auvik is building a new cloud ping check server service for the product. This new server service will be rolled out gradually and is expected to be fully deployed into production over the next month.
  • Our error handling in the service that processes the cloud ping server data has been improved to identify and ignore invalid data.
    • Addresses will no longer be considered offline when invalid data is received.
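
The error-handling change described above can be illustrated with a minimal sketch. This is not Auvik's implementation: the `PingResult` record, its fields, the validity rules, and the `Status` values are all hypothetical and chosen only to show the idea that an invalid sample leaves an address in its previous state instead of marking it offline.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    UNKNOWN = "unknown"  # sample could not be trusted; no state change


@dataclass
class PingResult:
    # Hypothetical record shape, assumed for illustration only.
    address: str
    sent: int                # probes sent in this check
    received: int            # replies received
    rtt_ms: Optional[float]  # average round-trip time; None if no replies


def is_valid(r: PingResult) -> bool:
    """Reject samples that cannot describe a real ping check."""
    if r.sent <= 0:
        return False
    if r.received < 0 or r.received > r.sent:
        return False
    if r.received > 0:
        # A sample with replies must carry a finite, non-negative RTT.
        if r.rtt_ms is None or not math.isfinite(r.rtt_ms) or r.rtt_ms < 0:
            return False
    return True


def classify(r: PingResult) -> Status:
    """Map one sample to a status; invalid data is never treated as offline."""
    if not is_valid(r):
        return Status.UNKNOWN
    return Status.ONLINE if r.received > 0 else Status.OFFLINE


def next_state(previous: Status, r: PingResult) -> Status:
    """Keep the previous state when the new sample is invalid."""
    status = classify(r)
    return previous if status is Status.UNKNOWN else status
```

In this sketch only a valid sample with zero replies moves an address to offline, so a degraded ping server emitting malformed records would not by itself trigger alerts.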
Posted Apr 09, 2025 - 13:34 EDT

Resolved

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
The issue affecting Internet Connection Ping Checks has been fully resolved. Regular service has been restored, and all systems are operating as expected.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 01, 2025 - 10:52 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services are operating normally for most sites.
We continue to monitor for irregularities at a few sites, which have been contacted.

Next Steps:
Tenants on the US3 cluster are still recovering and look healthy.
We are attending to a few sites to regain full functionality.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Apr 01, 2025 - 09:54 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should be operating normally; however, we continue monitoring for irregularities.

Next Steps:
Tenants on the US3 cluster are still recovering and look healthy.
We will continue to monitor the status of the tenants on US3 overnight and report back in the morning.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Mar 31, 2025 - 21:25 EDT

Monitoring

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are currently monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should be operating normally; however, we continue monitoring for irregularities.

Next Steps:
Tenants on the US3 cluster are still recovering and look healthy.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Mar 31, 2025 - 21:11 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
The 20-minute maintenance window for the internet connection service for all clusters has been completed.
Services, including other monitoring services, are not impacted.

Next Steps:
The US3 cluster is still going through its restart process.

We sincerely apologize for the extended window for this action.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 20:22 EDT

Update

We are continuing to work on a fix for this issue.
Posted Mar 31, 2025 - 19:49 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):All Clusters

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
Auvik will perform an emergency cluster restart for US3 tenants at 00:00 UTC, which will take approximately 1.5 hours to complete.

At the same time, Auvik will also perform a 20-minute maintenance window to restart the Internet connection service for all clusters.

We sincerely apologize for the extended window for this action.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 19:48 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
Auvik has disabled alerts for clients on the US3 cluster. This action will continue for an additional hour until 23:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. Any UI slowness should be very short, if noticeable at all.

We apologize for the extended window for this action.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 18:59 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. Any UI slowness should be very brief, if noticeable at all.

We apologize for the late notice.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 18:19 EDT

Identified

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks.

We apologize for the late notice.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 17:57 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
We will update you as more information becomes available or by 22:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 17:02 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
We will update you as more information becomes available or by 21:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 16:00 EDT

Update

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
We will update you as more information becomes available or by 20:00 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 14:54 EDT

Investigating

Affected Services: Internet Connection Service
Cluster(s):US3

Description:
We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience false internet connection disconnects.
Services, including other monitoring services, are not impacted.

Next Steps:
We will update you as more information becomes available or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Mar 31, 2025 - 14:15 EDT
This incident affected: Network Mgmt (us3.my.auvik.com).