Service Disruption - Auvik Dashboard in us1, us4 and au1

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Hierarchical Data Display Issues

Root Cause Analysis

Duration of incident

Discovered: July 26, 2025 14:00 UTC
Resolved: July 31, 2025 12:45 UTC

Cause

Following a core system upgrade on July 26, 2025, multiple clusters experienced degraded performance due to unexpected issues during service initialization. This led to corruption in certain hierarchical data structures, which in turn caused the customer-facing problems described under Effect.

A core infrastructure failure caused an overload in internal system processes, preventing certain backend services from initializing properly. As a result, critical hierarchical user-role data and related settings failed to load or loaded incorrectly across several environments.

Effect

The disruption impacted customers across multiple regions. Effects included:

  • Missing custom settings such as alert preferences and interface configurations.
  • Alert notifications being sent to incorrect recipients.
  • Inability to access dashboards or view accurate site data.
  • Site maps failing to render correctly.
  • Login issues for end-users in some environments.
  • Inconsistent or missing hierarchical relationships in site selectors.
  • Temporary loss of monitoring due to disassociated shared collectors.

These issues collectively degraded service functionality, limited access for internal support teams, and disrupted monitoring workflows.

Action taken

All times are in UTC

07/26/2025

11:00 – Core upgrade initiated.

13:27 – Initial service failures observed. Engineering begins recovery processes.

14:19 – Backend query failures reported. Engineering continues to recover and stabilize services.

16:24 – Incident response team mobilized.

07/27/2025
(Services restored throughout the day)

Hierarchy services replayed and clusters rebooted to stabilize services.

Alerting system functionality restored.

07/28/2025
(Services restored throughout the day)

Shared monitoring agents reassociated.

Back-end service migrations initiated for affected tenants.

07/29/2025
(Services restored throughout the day)

Repair scripts run to reset affected data and restore processing pipelines.

07/30/2025 – 07/31/2025
(Services restored throughout the day)

Corrupted tenants identified and corrected.
All affected clusters verified for service integrity.

07/31/2025
12:45 – Incident considered resolved.

Future consideration(s)

  • Improve validation checks during post-upgrade procedures to avoid cascading service impacts.
  • Temporarily pause specific background services (e.g., data cleaners and processors) during upgrades until core services are stable.
  • Implement automated detection for corrupted tenant hierarchies or missing role-based configurations.
  • Revisit default alert notification behavior to avoid unintended mass-notifications.
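
As an illustration of the automated detection item above, the following is a minimal sketch of a tenant hierarchy integrity check. It assumes a simplified, hypothetical record shape (`Tenant` with `tenant_id`, `parent_id`, and `roles`) and a hypothetical function name (`find_hierarchy_problems`); none of these reflect Auvik's actual data model or tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Tenant:
    # Hypothetical record shape, used only for illustration.
    tenant_id: str
    parent_id: Optional[str]  # None marks a root tenant
    roles: List[str]


def find_hierarchy_problems(tenants: List[Tenant]) -> Dict[str, List[str]]:
    """Return tenant IDs grouped by the kind of problem detected."""
    by_id = {t.tenant_id: t for t in tenants}
    problems: Dict[str, List[str]] = {
        "orphaned": [],
        "cyclic": [],
        "missing_roles": [],
    }

    for t in tenants:
        # A parent reference that matches no known tenant is an orphan.
        if t.parent_id is not None and t.parent_id not in by_id:
            problems["orphaned"].append(t.tenant_id)

        # Walk up the parent chain; revisiting a node means the chain loops.
        seen = {t.tenant_id}
        current = t
        while current.parent_id is not None and current.parent_id in by_id:
            if current.parent_id in seen:
                problems["cyclic"].append(t.tenant_id)
                break
            seen.add(current.parent_id)
            current = by_id[current.parent_id]

        # An empty role list stands in for a missing role-based configuration.
        if not t.roles:
            problems["missing_roles"].append(t.tenant_id)

    return problems


if __name__ == "__main__":
    sample = [
        Tenant("root", None, ["admin"]),
        Tenant("site-a", "root", ["viewer"]),
        Tenant("site-b", "deleted-parent", ["viewer"]),
        Tenant("site-c", "root", []),
    ]
    print(find_hierarchy_problems(sample))
```

A check of this kind could run as a post-upgrade validation step, with any non-empty result blocking background services from resuming until the flagged tenants are reviewed.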
Posted Aug 11, 2025 - 11:40 EDT

Resolved

The incident has been fully resolved. Regular service has been restored, and all systems are operating as expected.

Impact:
Users should no longer experience any issues related to this service disruption.

[RCA]
We will post an RCA after an internal investigation.

We thank you for your understanding. If you continue to experience issues, please don't hesitate to contact our support team.
Posted Jul 27, 2025 - 07:51 EDT

Monitoring

Our team has implemented a fix for the disruption in us4, and the services are returning to normal. We continue to monitor the situation to ensure stability and confirm that the service remains fully functional.
Posted Jul 26, 2025 - 23:36 EDT

Update

au1 is now operational. Recovery efforts remain ongoing in region us4, where a maintenance window was initiated at 21:38 UTC to support remediation. We’ll continue to share updates as we make progress.
Posted Jul 26, 2025 - 18:09 EDT

Update

us1 is now operational. We are continuing to work through partial issues in au1 and us4, where some sites are still experiencing service disruptions.
Posted Jul 26, 2025 - 16:12 EDT

Identified

We’ve identified a potential cause of the issue and are actively working on a fix. Some sites are beginning to load again, but intermittent issues may still persist for some users.
Posted Jul 26, 2025 - 15:42 EDT

Update

We are continuing to investigate. We are rebooting all services in us4. Services are partially available in us1 and au1.
Posted Jul 26, 2025 - 13:20 EDT

Investigating

We are currently experiencing a service disruption. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may find that sites do not load properly and may experience intermittent disruptions in monitoring.

Next Steps:
We will update this information as more details become available.

We appreciate your patience as we work to restore full functionality.
Posted Jul 26, 2025 - 12:43 EDT
This incident affected: Network Mgmt (us1.my.auvik.com, us4.my.auvik.com, au1.my.auvik.com).