Service Disruption - Disruption of IP associations for approximately 11% of devices on the US5 cluster.
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Disruption of IP associations with devices for some US5 and US1 cluster clients.

Root Cause Analysis

Duration of incident

Discovered: Mar 14, 2024 09:15 - UTC
Resolved: Mar 16, 2024 01:20 - UTC

Cause

After the incident on March 14th, approximately 2,000 tenants who had previously been migrated reprocessed steps in the initial migration.

Effect

Clients affected by the incident on the US5 cluster lost 134,727 IP addresses (approximately 10% of devices across the affected tenants). Five tenants on the US1 cluster experienced similar issues.

Action taken

All times in UTC
03/14/2024

21:00 - Cluster recovery from the March 14th incident leads to unexpected tenant migrations.

03/15/2024

13:23 - The relevant Auvik engineering team is informed of the issue with a specific client.

13:30 - The cause is misdiagnosed, and the tenant is restarted to address the issue.

14:00 - The restart does not resolve the issue, and a deeper investigation into the reason for the problem is begun.

16:45 - The engineering team discovers that the cause of the issue is an unexpected rerun of tenant migrations that was kicked off by the previous day’s incident.

16:55 - A plan is developed to reattach lost IPs to the affected devices. This action will only reattach IPs to the proper device. Previous configuration customizations, backups, or alerting will be lost with the reconsolidation of the devices.

17:07 - The Auvik engineering team kicks off a systematic reattachment of deleted IPs.

03/16/2024

01:18 - The engineering team finishes the reattachment of the removed IPs.

02:20 - The incident is declared closed.

Future consideration(s)

  • Auvik will develop improved safeguards around tenant restarts and migrations.
  • Auvik will deploy an improved safety configuration to restore lost configuration data from IP reassignments that cause device reconsolidation.
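
As an illustration of the kind of safeguard the first item describes, the sketch below shows one common pattern: recording completed migration steps per tenant so that a rerun is skipped rather than reprocessed. This is a minimal sketch under stated assumptions, not Auvik's implementation. The names MigrationLedger, run_once, and the tenant and step identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MigrationLedger:
    """Hypothetical record of which migration steps each tenant has completed."""

    completed: set[tuple[str, str]] = field(default_factory=set)

    def run_once(self, tenant_id: str, step: str, action: Callable[[], None]) -> bool:
        """Run `action` only if this (tenant, step) pair has not completed before.

        Returns True if the step ran, False if it was skipped.
        """
        key = (tenant_id, step)
        if key in self.completed:
            return False  # already migrated: skip instead of reprocessing
        action()
        self.completed.add(key)  # recorded only after the step succeeds
        return True


# Usage: the second call for the same tenant and step is a no-op.
ledger = MigrationLedger()
runs: list[str] = []
first = ledger.run_once("tenant-a", "initial-migration", lambda: runs.append("ran"))
second = ledger.run_once("tenant-a", "initial-migration", lambda: runs.append("ran"))
```

The ledger here lives in memory only. For a guard like this to hold across a cluster recovery, the completion record would have to be stored durably, since an in-memory record is lost on restart.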
Posted Mar 25, 2024 - 11:05 EDT

Resolved
We’ve identified the source of the service disruption to IPs associated with devices. The disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review.
Posted Mar 15, 2024 - 20:20 EDT
Update
We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and five sites on US1 are not appearing. The tenants affected on the US1 cluster have had services restored. The tenants on the US5 cluster are having their services systematically restored. The ETA for recovering services to all affected tenants is within the next three hours. We will post when it is completed or if there is any change in outlook.
Posted Mar 15, 2024 - 16:41 EDT
Update
We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and five sites on US1 are not appearing. We are continuing to restore service as quickly as possible.
Posted Mar 15, 2024 - 15:09 EDT
Identified
We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and devices on five sites on US1 are not appearing. We are working to restore service as quickly as possible.
Posted Mar 15, 2024 - 14:07 EDT
Investigating
We’re experiencing disruption to IPs associated with devices on the US5 cluster. We will continue to provide updates as they become available.
Posted Mar 15, 2024 - 13:23 EDT
This incident affected: Network Mgmt (us1.my.auvik.com, us5.my.auvik.com).