Service Disruption - Devices Deleted from Auvik UI when seen as down in health check
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Auvik Deleting Devices After Appearing Offline In A Health Check

Root Cause Analysis

Duration of incident

Discovered: Nov 22, 2023, 18:30 - UTC
Resolved: Nov 27, 2023, 18:00 - UTC

Cause

Incorrect message data for networks and IPs was delivered to the device consolidation tables.

Effect

The incorrect message data instructed the platform to delete the network and IP information of active devices that Auvik incorrectly believed had been removed from the product. This, in turn, deleted those devices from clients' tenants, along with their backups and historical data.

Action taken

All times in UTC

11/22/2023

17:30 - The first ticket reporting a significant device loss is sent to engineering. Two more follow over the next hour.

18:30 - Auvik declares an incident for the loss of networks and devices.

18:30 - 22:30 - Engineering begins investigating the cause of the incident and how to arrest the deletion of client networks and IPs.

22:30 - Auvik Engineering determines a way to turn off the deletion of client networks and IPs from the platform. A change is deployed to production, which stops further device deletions.

11/22/2023-11/23/2023

22:30 - 00:30 - Auvik Engineering triggers a platform discovery of the lost networks and IPs to recreate the devices lost on 11/22. This action does not restore the historical data, backups, and customized alerting for the recreated devices.

11/23/2023

00:30 - Auvik declares the deletion part of the incident closed.

11/23/2023-11/27/2023

00:30 - 18:00 - The Auvik consolidation team continues its analysis of the network and IP deletions to trace any other devices that may have been deleted before 11/22. It periodically runs scripts to replace lost devices at tenants' sites. While the devices are rediscovered, historical data, backups, and customized alerting are not recovered. Measures are put in place to prevent the system from deleting devices when it receives incorrect data.

11/27/2023

18:00 - The device replacement part of the incident is closed.

Future consideration(s)

  • Currently in development with engineering: new tooling that retains device data so that restored devices keep the history of the original devices.
  • Auvik reviewed its backup frequency to validate that it can perform a per-day restore if required. This was validated to work as expected.
  • Auvik will improve internal alerting for mass device, IP, or network removal to gain earlier insight into similar incidents in the future.
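
As an illustration of the last item, alerting on mass removal can be sketched as a sliding-window counter. This is a minimal sketch and not Auvik's implementation: the class name MassRemovalMonitor, the threshold, and the window length are all hypothetical, and it assumes removal events arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class MassRemovalMonitor:
    """Hypothetical sliding-window counter that flags a burst of removals."""

    def __init__(self, threshold: int, window: timedelta):
        self.threshold = threshold  # removals within the window that trigger an alert
        self.window = window        # length of the look-back window
        self._events = deque()      # timestamps of recent removals, oldest first

    def record_removal(self, when: datetime) -> bool:
        """Record one removal; return True if the threshold is reached."""
        self._events.append(when)
        cutoff = when - self.window
        # Drop removals that are older than the look-back window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


# Illustrative values only: alert on 3 removals within 10 minutes.
monitor = MassRemovalMonitor(threshold=3, window=timedelta(minutes=10))
start = datetime(2023, 11, 22, 17, 30)
alerts = [
    monitor.record_removal(start + timedelta(minutes=m))
    for m in (0, 2, 4, 30)
]
print(alerts)
```

In this sketch, the third removal inside the ten-minute window raises the alert, and the isolated removal half an hour later does not. A real threshold would likely need to scale with tenant size so that routine decommissioning does not trigger it.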
Posted Dec 13, 2023 - 09:14 EST

Resolved
The source of the disruption has been resolved, and services have been fully restored.
Posted Nov 22, 2023 - 19:30 EST
Monitoring
We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are monitoring the situation. Devices are being brought back into the UI. We’ll keep you posted on a resolution.
Posted Nov 22, 2023 - 18:20 EST
Update
We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are continuing to work to restore service as quickly as possible.
Posted Nov 22, 2023 - 17:26 EST
Update
We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are continuing to work to restore service as quickly as possible.
Posted Nov 22, 2023 - 16:20 EST
Identified
We’ve identified the source of the service disruption that deleted devices from the UI. We are working to restore service as quickly as possible.
Posted Nov 22, 2023 - 15:23 EST
Investigating
We’re experiencing disruption with devices being deleted from the UI when registered as offline in the health check. We will continue to provide updates as they become available.
Posted Nov 22, 2023 - 14:21 EST
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com, us5.my.auvik.com).