Service Disruption - Auvik Network Management Web UI Intermittent Errors
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Unable to Connect to Auvik Tenants

Root Cause Analysis

Duration of incident

Discovered: Jan 17, 2024 15:27 - UTC
Resolved: Jan 18, 2024 04:30 - UTC

Cause

A CORE settings change was implemented on 50% of Auvik clients after a successful initial rollout to 5% of Auvik clients the day before.
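
For illustration only: a percentage-based rollout like the one described above is commonly done by hashing each tenant into a stable bucket, so that tenants included at the 5% stage stay included when the stage widens to 50%. The sketch below is a hypothetical example, not Auvik's implementation; `rollout_bucket`, `in_rollout`, and the tenant IDs are assumed names.

```python
import hashlib


def rollout_bucket(tenant_id: str) -> int:
    """Map a tenant ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(tenant_id: str, percent: int) -> bool:
    """Return True if the tenant falls inside the current rollout percentage."""
    return rollout_bucket(tenant_id) < percent


# Hypothetical tenant IDs, used only to demonstrate the bucketing.
tenants = [f"tenant-{i}" for i in range(1000)]
stage_5 = {t for t in tenants if in_rollout(t, 5)}
stage_50 = {t for t in tenants if in_rollout(t, 50)}
```

Because the bucket depends only on the tenant ID, every tenant in `stage_5` is also in `stage_50`. A clean result at a small stage does not guarantee the same at a larger one, which is why each widening is usually paired with a health check before it proceeds.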

Effect

Clients that were part of the 50% that received the settings change became inaccessible. Some of these clients were then automatically disabled because of the number of restart attempts that accompanied the disconnection.
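
The report does not describe how the automatic disabling works. One common pattern that produces this behaviour is a restart limit: once a service has been restarted too many times, it is disabled until someone re-enables it manually. The sketch below illustrates that pattern under assumed names and an assumed threshold; it is not Auvik's actual mechanism.

```python
from dataclasses import dataclass


@dataclass
class RestartGuard:
    """Disables a tenant after too many restart attempts (hypothetical)."""

    max_restarts: int = 5  # assumed threshold, not taken from the report
    restarts: int = 0
    disabled: bool = False

    def record_restart(self) -> bool:
        """Record one restart attempt; return True if restarts are still allowed."""
        if self.disabled:
            return False
        self.restarts += 1
        if self.restarts >= self.max_restarts:
            self.disabled = True
        return not self.disabled

    def re_enable(self) -> None:
        """Manually clear the disabled state and reset the counter."""
        self.restarts = 0
        self.disabled = False
```

A guard like this stays disabled after the underlying fault is fixed, so recovery needs an explicit `re_enable()` step in addition to reverting the change.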

Action taken

All times in UTC

2024-01-17

15:27 – Auvik Engineering enables the same CoreSettings change for 50% of tenants, following a successful initial rollout to 5% of its clients the previous day.

15:35 – Internal Auvik alerts notify Engineering of a significant service disruption.

15:36 – Engineering begins its investigation.

15:39 – The backend services of the clients where changes were implemented stop reporting metrics.

15:53 – Engineering reverts the change that was implemented.

16:30 – Engineering manually begins restarting clusters of the affected clients.

18:40 – Engineering begins manually repairing the connections to back-end services of clients that are not starting or reporting metrics properly.

21:00 – All clusters are recovered. Engineering is seeing successful reporting of services and believes the incident to be over. The incident is marked as resolved on the Status page.

21:51 – Auvik Support receives notice that one of the affected client’s tenants has been unexpectedly disabled.

2024-01-18

01:31 – Auvik continues to receive more client reports of unexpectedly disabled tenants.

02:30 – The Auvik Engineering On-Call team is engaged.

03:37 – Engineering determines the number of tenants unexpectedly disabled to be just over 1000.

03:50 – Engineering re-enables the disabled tenants.

04:30 – The number of running tenants is back to its pre-incident level. This incident is officially closed.
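
The final steps of the timeline amount to comparing the set of tenants running before the incident with the set running afterwards. A minimal sketch of that comparison follows, assuming a snapshot of enabled tenant IDs is available from before the change; the function name and sample IDs are hypothetical.

```python
from typing import AbstractSet, Set


def unexpectedly_disabled(
    enabled_before: AbstractSet[str],
    enabled_now: AbstractSet[str],
    intentionally_disabled: AbstractSet[str] = frozenset(),
) -> Set[str]:
    """Return tenants that were enabled before the incident, are not enabled
    now, and were not disabled on purpose in the meantime."""
    return set(enabled_before) - set(enabled_now) - set(intentionally_disabled)


# Hypothetical snapshots, for illustration only.
before = {"tenant-a", "tenant-b", "tenant-c"}
now = {"tenant-a"}
missing = unexpectedly_disabled(before, now, intentionally_disabled={"tenant-c"})
```

Running a check like this before marking an incident resolved would flag disabled tenants without waiting for client reports.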

Future consideration(s)

  • Auvik will review and adjust its rollout process & guidelines. Enforcement and training on the updated process will be implemented.
  • Recovery procedures for resolving an incident have been updated to check for unexpected deactivation of clients.
Posted Jan 27, 2024 - 08:06 EST

Resolved
The fix for intermittent errors and disconnections has been applied. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Jan 17, 2024 - 16:34 EST
Update
We are continuing to monitor for any further issues on the remaining clusters.
Posted Jan 17, 2024 - 15:45 EST
Update
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI or during data processing, including collectors disconnecting and reconnecting intermittently. We are monitoring the situation and are currently implementing a fix for the involved issues.
We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.
Posted Jan 17, 2024 - 15:44 EST
Update
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI or during data processing, including collectors disconnecting and reconnecting intermittently. We are monitoring the situation and are currently implementing a fix for the involved issues.
We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.
Posted Jan 17, 2024 - 15:04 EST
Update
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI or during data processing, including collectors disconnecting and reconnecting intermittently. We are monitoring the situation and are currently implementing a fix for the involved issues.
We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.
Posted Jan 17, 2024 - 14:03 EST
Update
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI or during data processing, including collectors disconnecting and reconnecting intermittently. We are monitoring the situation and are currently implementing a fix for the involved issues. We’ll keep you posted on a resolution.
Posted Jan 17, 2024 - 13:13 EST
Monitoring
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or during data processing. We are monitoring the situation and are currently implementing a fix for the involved issues. We’ll keep you posted on a resolution.
Posted Jan 17, 2024 - 11:50 EST
Update
We’ve identified the source of the service disruption to Auvik Network Management. We are working to restore service as quickly as possible.
Posted Jan 17, 2024 - 11:42 EST
Identified
We’ve identified the source of the service disruption with Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or during data processing. We are working to restore service as quickly as possible.
Posted Jan 17, 2024 - 11:42 EST
Investigating
We’re experiencing a disruption to Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or during data processing. We will continue to provide updates as they become available.
Posted Jan 17, 2024 - 11:00 EST
This incident affected: Network Mgmt (my.auvik.com, us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, eu1.my.auvik.com, eu2.my.auvik.com, au1.my.auvik.com, ca1.my.auvik.com, us5.my.auvik.com).