Service Degraded - Discovery Consolidation on US6 Cluster

Incident Report for Auvik Networks Inc.

Postmortem

Service Degraded - Newly discovered devices and consolidation are not working for clients on the US6 cluster.

Root Cause Analysis

Duration of incident

Discovered: Feb 03, 2025 17:00 UTC
Resolved: Feb 03, 2025 21:30 UTC

Cause

A reorganization of engineering caused a permission change for tenant migrations.

Effect

This change caused permission issues when a tenant was migrated off the US6 cluster, which in turn caused problems with consolidation on US6.

Action taken

All times in UTC

Feb 03, 2025

16:58 – A tenant is migrated off of the US6 cluster.

18:00 – Engineering is aware of consolidation issues for clients on the US6 cluster and begins investigating.
20:28 – The initial cause for the interruption is determined. Engineering disables the migration service.
20:34 – The tenant migration that caused the issues is identified.
20:47 – The root cause of the interruption of services is identified.
21:15 – The underlying issues that caused the service interruption are fixed.
21:45 – Tenant migration is re-enabled in the consolidation service.
22:01 – The problematic tenant is successfully migrated.
22:39 – All services are confirmed to be running as intended.

Future consideration(s)

  • Auvik is reviewing the permission changes that have occurred and testing to validate the blast radius of those changes.
    • Any further changes will be fully commented and documented so they are easier to follow.
  • Auvik will run a test migration regularly to validate tenant migration functionality.
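
As an illustration only, a regular test migration of the kind described above could be structured as a permission pre-check followed by a migration of a throwaway tenant. The sketch below is not Auvik's implementation: the function name `check_test_migration`, the `MigrationCheckResult` type, and the permission strings are all hypothetical, and the actual migration call is passed in as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationCheckResult:
    ok: bool
    missing_permissions: list = field(default_factory=list)
    detail: str = ""


def check_test_migration(granted, required, migrate):
    """Run a permission pre-check, then a test migration (hypothetical sketch).

    granted  -- permissions the migration service currently holds
    required -- permissions a tenant migration is assumed to need
    migrate  -- callable that performs the test migration, returning True on success
    """
    missing = sorted(set(required) - set(granted))
    if missing:
        # Stop before touching any cluster if a permission has been removed.
        return MigrationCheckResult(False, missing, "permission pre-check failed")
    try:
        succeeded = bool(migrate())
    except Exception as exc:
        # Report the failure so it can raise an alert instead of crashing the check.
        return MigrationCheckResult(False, [], f"test migration raised: {exc}")
    if not succeeded:
        return MigrationCheckResult(False, [], "test migration reported failure")
    return MigrationCheckResult(True, [], "test migration succeeded")
```

Run on a schedule against a dedicated test tenant, a check like this would report a removed permission before a customer migration is attempted.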

Posted Feb 10, 2025 - 10:34 EST

Resolved

Affected Services: Discovery Consolidation
Cluster(s): US6

Description:
The issue affecting Discovery Consolidation has been fully resolved. Normal service has been restored, and all systems are now operating as expected.

Impact:
Users should no longer experience any issues related to this incident.

Next Steps:
We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 03, 2025 - 16:56 EST

Monitoring

Affected Services: Discovery Consolidation
Cluster(s): US6

Description:
Our team has implemented a fix for the issue affecting the consolidation of devices, and consolidation performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Service is returning to normal; however, we continue monitoring for irregularities.
Services: Alerting is not impacted.

Next Steps:
We will provide a final update once we confirm the issue is fully resolved.

Thank you for your patience, and we apologize for any inconvenience caused.
Posted Feb 03, 2025 - 15:42 EST

Identified

Affected Services: Discovery Consolidation
Cluster(s): US6

Description:
Our team has identified the root cause of the degraded performance affecting new device discovery and consolidation on the US6 cluster. We are currently working on a solution to restore normal service levels.

Impact:
While we work on the resolution, users will continue to experience issues with new device discovery and consolidation.
Services: Alerting is not impacted.

Next Steps:
Our team is actively working to resolve the issue and will provide updates as progress is made or within the next hour.

Thank you for your patience as we work to restore full functionality.
Posted Feb 03, 2025 - 15:31 EST

Update

Affected Services: Discovery Consolidation
Cluster(s): US6

Description:
We are currently experiencing degraded performance with the consolidation of devices. Our team is still actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience issues with new device discovery and consolidation.
Services: Alerting is not impacted.

Next Steps:
We will provide updates as more information becomes available or by 20:30 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 03, 2025 - 14:42 EST

Investigating

Affected Services: Discovery Consolidation
Cluster(s): US6

Description:
We are currently experiencing degraded performance with the consolidation of devices. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience issues with new device discovery and consolidation.
Services: Alerting is not impacted.

Next Steps:
We will provide updates as more information becomes available or by 19:30 UTC.

Thank you for your patience as we work to restore full functionality.
Posted Feb 03, 2025 - 14:23 EST
This incident affected: Network Mgmt (us6.my.auvik.com).