Service Disruption - Customers with sites on the US1 cluster are not receiving.
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Clients on the US1 cluster experienced a delay with alerts.

Root Cause Analysis

Duration of incident

Discovered: Mar 11, 2024 13:35 - UTC
Resolved: Mar 11, 2024 16:30 - UTC

Cause

Issues with the March 9th system update.

Effect

The internal alerting engine for the US1 cluster entered a state in which alerts to clients on US1 were delayed from March 9 at 11:00 UTC until March 11 at 16:30 UTC.
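For reference, the length of the delay window stated above can be worked out directly from the two timestamps. This is only arithmetic on the dates given in this report; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Window boundaries taken from the Effect statement above (UTC).
delay_start = datetime(2024, 3, 9, 11, 0, tzinfo=timezone.utc)
delay_end = datetime(2024, 3, 11, 16, 30, tzinfo=timezone.utc)

# Split the elapsed time into whole hours and remaining minutes.
window = delay_end - delay_start
hours, remainder = divmod(int(window.total_seconds()), 3600)
minutes = remainder // 60

print(f"Alert delay window: {hours} h {minutes} min")
# prints "Alert delay window: 53 h 30 min"
```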

Action taken

All times in UTC

03/09/2024

11:00 - 14:00 - Auvik performs scheduled maintenance on the system. The planned maintenance is extended due to issues with the system during the restart. At the end of the window, the system is believed to have recovered successfully.

03/11/2024

08:00 - Auvik Engineering identifies from internal alerting that the alerts on the US1 cluster are lagging when displayed in the system.

08:55 - Engineering restarts the alerting service and attempts to create a new checkpoint for alerting.

10:05 - The new checkpoint fails, and the responsible engineering team is notified about the delay in US1 cluster alerts.

13:18 - An external incident is posted to notify clients about the delay. Engineering continues to work on the issue.

14:00 - Engineering makes progress by saving checkpoints within the system for the alerting assembler.

15:00 - Alert lag begins to fall.

16:00 - The backlog of delayed alerts is fully processed.

16:20 - Engineering declares the incident closed.

Future consideration(s)

  • Auvik will review its post-maintenance checklist of resources and systems to ensure that its systems have fully recovered.
  • Auvik will review and update the systems used in its streaming architecture to guard against possible performance-related issues.
  • Auvik will update how its cloud-related data systems handle data lag in the future.
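
As an illustration of the first consideration, a post-maintenance check could include an explicit test of alert lag before the maintenance window is closed. The sketch below is a minimal example under stated assumptions and does not describe Auvik's actual tooling: `LagSample`, `post_maintenance_check`, and the five-minute threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LagSample:
    """One observation of alert lag (hypothetical structure)."""
    taken_at: datetime
    lag: timedelta  # how far alert processing trails real time


def post_maintenance_check(samples, max_lag=timedelta(minutes=5)):
    """Return (ok, reason) for lag samples ordered oldest first.

    The 5-minute default threshold is an assumption for illustration.
    """
    if not samples:
        return False, "no lag samples collected"
    latest = samples[-1]
    if latest.lag > max_lag:
        return False, f"lag {latest.lag} exceeds threshold {max_lag}"
    if len(samples) >= 2 and latest.lag > samples[0].lag:
        return False, "lag is growing across the observation window"
    return True, "lag within threshold and not growing"


# Example: lag shrinking from 2 minutes to 1 minute passes the check.
t0 = datetime(2024, 3, 9, 14, 0, tzinfo=timezone.utc)
shrinking = [
    LagSample(t0, timedelta(minutes=2)),
    LagSample(t0 + timedelta(minutes=10), timedelta(minutes=1)),
]
ok, reason = post_maintenance_check(shrinking)
print(ok, reason)
```

Checking the trend as well as the latest value means that a small but steadily growing lag is flagged before it crosses the threshold.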
Posted Mar 25, 2024 - 10:44 EDT

Resolved
Delays for alerts for devices and services for customers on the US1 cluster have been resolved. The lag has been removed for alerts, and they are now current. The source of the disruption has been resolved, and services have been fully restored.
Posted Mar 11, 2024 - 12:33 EDT
Update
We’ve identified the source of the service disruption with alerts for devices and services for customers on the US1 cluster and continue to monitor the situation. Alert lag is steadily decreasing. We expect a resolution in the near term. We’ll keep you posted when resolved.
Posted Mar 11, 2024 - 11:36 EDT
Monitoring
We’ve identified the source of the service disruption with alerts for devices and services for customers on the US1 cluster and are monitoring the situation. We’ll keep you posted on a resolution.
Posted Mar 11, 2024 - 10:23 EDT
Identified
We’ve identified the source of the service disruption with alerts for devices and services for customers on the US1 cluster. We are working to restore service as quickly as possible.
Posted Mar 11, 2024 - 10:14 EDT
Investigating
We’re experiencing disruption to alerts for devices and services for customers on the US1 cluster. We will continue to provide updates as they become available.
Posted Mar 11, 2024 - 09:41 EDT
This incident affected: Network Mgmt (us1.my.auvik.com).