Service Disruption - network disconnection alerts
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Cloud Ping Check Not Responding

Root Cause Analysis

Duration of incident

Discovered: Jan 16, 2024, 21:40 - UTC
Resolved: Jan 16, 2024, 22:33 - UTC

Cause

The ping services in the product experienced a significant spike in CPU and memory usage.

Effect

Auvik clients with Internet connection checks enabled received a large volume of erroneous connection failure alerts.

Action taken

All times in UTC

01/16/2024

21:17 - Auvik Support alerted Auvik Engineering to a sudden influx of tickets concerning failed Internet connection checks.

21:27 - Engineering confirmed that the number of connected agents was not disrupted.

21:31 - Engineering confirmed an escalation in CPU and memory usage on the ping server.

21:48 - A broken backend connection was deleted and recreated.

21:56 - Engineering confirmed that resource demands had started to decrease and manually verified that clients that had reported connection alerts were now responding.

22:08 - Engineering confirmed with its alerting team that no manual intervention was needed for the alerts that had fired; they would resolve themselves.

22:33 - Incident resolved: the alerts resolved themselves, and resource usage decreased to expected values for the affected service.

Future consideration(s)

  • Auvik will create internal alerting for the Ping services.
  • Auvik will create a failover instance of the Ping service so that it is no longer a single point of failure.
Posted Jan 27, 2024 - 07:51 EST

Resolved
The source of the disruption has been resolved, and services have been fully restored.
Posted Jan 16, 2024 - 17:32 EST
Monitoring
We’ve identified the source of the service disruption with network disconnection alerts and implemented a fix. We are monitoring the situation.
Posted Jan 16, 2024 - 17:11 EST
Identified
We’ve identified a service disruption with alerts for network disconnection. Some customers may receive erroneous network disconnection alerts. We are working to restore service as quickly as possible.
Posted Jan 16, 2024 - 16:48 EST
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com, us5.my.auvik.com).