Service Disruption - Orphaned IPs on cluster US4
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Devices with Orphaned IPs in Cluster US4

Root Cause Analysis

Duration of incident

Discovered: Dec 6, 2023, 15:45 UTC
Resolved: Dec 7, 2023, 01:30 UTC

Cause

An internal service that injects network and IP data into the product (the Repartioner service) was in a crash loop.

Effect

While the service was crash looping, the Consolidation services, which attach IPs to devices, interpreted the affected IPs as deleted. This data mismatch caused devices to lose their association with their actual IPs, resulting in devices with orphaned IPs.
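
The failure mode described above, and one possible guard against it, can be pictured with the minimal sketch below. It is an illustration only and not Auvik's implementation: the function `reconcile_device_ips`, the threshold `FEED_STALE_AFTER`, and the idea of tracking when the upstream feed last delivered data are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long the upstream feed may be silent before
# missing data is no longer trusted as evidence that an IP was deleted.
FEED_STALE_AFTER = timedelta(minutes=15)


def reconcile_device_ips(attached_ips, reported_ips, last_feed_update, now):
    """Return the set of IPs that should stay attached to a device.

    attached_ips: IPs currently attached to the device.
    reported_ips: IPs present in the latest upstream data.
    last_feed_update: when the upstream feed last delivered data.
    now: the current time.

    This sketch only decides which existing attachments to keep; attaching
    newly reported IPs is out of scope.
    """
    if now - last_feed_update > FEED_STALE_AFTER:
        # The feed is stale: a missing IP means "unknown", not "deleted",
        # so every existing attachment is kept.
        return set(attached_ips)
    # The feed is fresh: IPs absent from the report are treated as deleted.
    return set(attached_ips) & set(reported_ips)
```

With a guard of this kind, a feed that has been silent for a day would leave existing attachments untouched, while a fresh feed would still allow genuine deletions to go through.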

Action taken

All times in UTC

12/05/2023

16:15 - Backend services related to the Juniper Mist release to GA on the US4 cluster begin to report errors. The backend Repartioner service falls into a crash loop.

12/06/2023

15:45 - Auvik Support reports a client has devices with what appear to be deletions of attached IPs. Several more tickets follow in quick succession. Engineering is alerted to the issue and begins its investigation.

16:30 - An incident is declared and posted to the Auvik status page. Engineering continues to investigate the cause. Engineering turns off the Consolidation engine on the US4 cluster to prevent any further deletions.

16:30 - 17:00 - Engineering identifies that the Repartioner service is in a crash loop and restarts the service successfully. It is determined that the Repartioner service needs more resources to process the data lag accumulated over the previous day. Additional resources are provisioned.

17:00 - The lag is processed through the Repartioner service. The processed data is now attempting to catch up with the production environment.

17:30 - Injecting the delayed data back into the product on the US4 cluster is expected to take some time. Adjustments are made to the US4 cluster processing services so that the lagged data can catch up more quickly. Devices with orphaned IPs are observed to be recovering.

12/06/2023 - 12/07/2023

17:30 - 01:30 - Engineering monitors the decreasing data lag and validates that the data can catch up.

12/07/2023

01:30 - Data lag for the IP and network data on cluster US4 has caught up.

09:41 - The Auvik status page posts that the incident has been closed.

Future consideration(s)

  • Improved monitoring of legacy services (Repartioner) will be implemented to prevent long-duration issues from occurring without action being taken.
  • Auvik will add greater resilience to the Consolidation services to prevent orphaning large amounts of device networking data.
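
As a rough illustration of the monitoring consideration above, the sketch below flags a service that restarts repeatedly within a short window or whose data lag exceeds a limit. It is a hypothetical example under stated assumptions, not Auvik's monitoring: the function names, the 10-minute window, the restart count, and the 30-minute lag limit are all invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_crash_looping(restart_times, now,
                     window=timedelta(minutes=10), max_restarts=3):
    """Return True if at least max_restarts restarts fall within window.

    restart_times: datetimes at which the service was observed restarting.
    The window and max_restarts defaults are assumed values.
    """
    recent = [t for t in restart_times if timedelta(0) <= now - t <= window]
    return len(recent) >= max_restarts


def should_alert(restart_times, lag, now, max_lag=timedelta(minutes=30)):
    """Alert when the service is crash looping or its data lag is too large."""
    return is_crash_looping(restart_times, now) or lag > max_lag
```

Checking both conditions means an alert would fire shortly after a crash loop begins, and also when a service stays up but falls far behind.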
Posted Dec 17, 2023 - 20:14 EST

Resolved
The issue with devices with orphaned IPs on Cluster US4 has been resolved. The source of the disruption has been addressed, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Dec 07, 2023 - 04:41 EST
Update
We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and continue to monitor the situation. There will continue to be a delay for data to catch up in the UI. The lag catch-up has proceeded more slowly than anticipated. The lag is now expected to clear early in the morning of December 7th EST. We apologize for this delay. We’ll keep you posted on a resolution.
Posted Dec 06, 2023 - 17:00 EST
Update
We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and continue to monitor the situation. There will continue to be a delay for data to catch up in the UI. The estimated time for the lag to become current is 23:00 UTC or 6:00 PM EST. We’ll keep you posted on a resolution.
Posted Dec 06, 2023 - 13:41 EST
Monitoring
We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and are monitoring the situation. There will continue to be a delay for data to catch up in the UI. We’ll keep you posted on a resolution.
Posted Dec 06, 2023 - 13:21 EST
Identified
We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4. We are working to restore service as quickly as possible.
Posted Dec 06, 2023 - 12:32 EST
Investigating
We’re experiencing disruption with devices with orphaned IPs on Cluster US4. We will continue to provide updates as they become available.
Posted Dec 06, 2023 - 11:54 EST
This incident affected: Network Mgmt (us4.my.auvik.com).