Performance Issue - Device Discovery Delayed
Incident Report for Auvik Networks Inc.
Postmortem

Performance Disruption - Delays with New Device Discovery

Root Cause Analysis

Duration of incident

Discovered: Apr 25, 2024 14:00 UTC
Resolved: Apr 26, 2024 01:30 UTC

Cause

Changes were placed into production to address findings from the Auvik March 15, 2024, incident. The changes were not placed behind a feature flag, which would have prevented them from affecting production data.
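As an illustration of the safeguard named above, here is a minimal sketch of gating a change behind a feature flag. It is not Auvik's implementation: the environment-variable lookup, the flag name `new_discovery_path`, and the `legacy`/`updated` handlers are all hypothetical and chosen only to show the pattern.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a hypothetical feature flag from the environment; off unless set."""
    value = os.environ.get(f"FEATURE_{name.upper()}")
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def process_device(record, *, legacy, updated, flag="new_discovery_path"):
    """Route a record to the updated code path only when the flag is on."""
    handler = updated if flag_enabled(flag) else legacy
    return handler(record)
```

With the flag off by default, merged code stays inactive in production until it is switched on deliberately, and it can be switched off again without a new deployment.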

Effect

The changes were not granted the permissions they required, which caused a crash loop in data processing. This delayed the appearance of newly discovered devices in the product.

Action taken

All times in UTC

4/24/2024

14:00-17:30 - Updated code is merged into production to address the bug discovered in the Auvik March 15, 2024, incident.

4/25/2024

14:35 - An approved tenant migration triggers a crash loop that affects data for newly discovered devices.

18:04 - The Auvik engineering team responsible for the implemented change is made aware of the crash loop and delay in rendering new devices in the product.

18:17 - Engineering determines the cause of the crash loop and adjusts permissions for the implemented changes.

18:23 - The permission changes have the desired effect, and consumer lag begins to improve. Data remains delayed while processing catches up to live production data.

4/26/2024

01:30 - Consumer lag fully recovers, and all data is current. The incident is closed.
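The gap between the fix at 18:23 and full recovery at 01:30 reflects the time needed to work through the backlog. A rough sketch of that arithmetic follows, assuming a queue-style pipeline in which lag is the difference between the newest available offset and the last processed offset. The function names and the offset model are assumptions made for illustration, not details taken from this report.

```python
from typing import Optional


def consumer_lag(latest_offset: int, committed_offset: int) -> int:
    """Number of messages produced but not yet processed (never negative)."""
    return max(latest_offset - committed_offset, 0)


def estimated_catch_up_seconds(
    lag: int, consume_rate: float, produce_rate: float
) -> Optional[float]:
    """Estimate time to clear the backlog, given rates in messages per second.

    Returns None when the consumer is not outpacing the producer,
    because the backlog would then never shrink.
    """
    if lag == 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return lag / net_rate
```

Under this model the backlog only shrinks at the difference between the two rates, which is why recovery can take hours even after the underlying fault is corrected.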

Future consideration(s)

  • Changes have been implemented to adjust service account permissions automatically when code improvements are deployed.
  • An internal review was performed of the code review and approval processes for production changes.
  • Adjustments to internal alerting are under review so that changes affecting production are prioritized.
Posted May 08, 2024 - 09:43 EDT

Resolved
The delay for device discovery has been resolved. The source of the performance impact has been addressed, and performance should again be optimal.

A Root Cause Analysis (RCA) will follow after completing a full review.
Posted Apr 25, 2024 - 21:36 EDT
Monitoring
We’ve identified the source of the performance issue with delays in new device discovery and are monitoring the situation. We’ve implemented the fix and are waiting for device information to catch up in the system. As the lag catches up, we expect to be back to optimal performance in a few hours. We’ll keep you posted on a resolution.
Posted Apr 25, 2024 - 14:49 EDT
Identified
We’ve identified the source of the performance issue in the discovery of new devices. We are working to restore optimal service as quickly as possible.
Posted Apr 25, 2024 - 14:30 EDT
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com, us5.my.auvik.com).