Discovered: Aug 25, 2025, 18:00 UTC
Resolved: Aug 29, 2025, 14:00 UTC
A configuration rollout unexpectedly generated a large number of configuration entries, which then propagated across tenants. Processing these entries created excessive background work and memory pressure in core services, leading to degraded performance, instability, and, in some cases, brief service crashes across clusters.
Customers experienced:
- Degraded performance and elevated error rates
- Intermittent instability and, in some cases, brief service interruptions
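The failure mode described above is a fan-out problem: one rollout multiplies into per-tenant configuration entries, and an unexpectedly large batch overwhelms downstream processing. The sketch below is a minimal illustration of that shape with a simple per-tenant cap; the names, structure, and threshold are assumptions for illustration only, not the actual rollout system.

```python
# Hypothetical sketch of the fan-out described above: a single rollout
# expands into per-tenant configuration entries, and an explicit cap
# halts the rollout before an unexpectedly large batch propagates.
# All names and limits here are illustrative, not the real system.
from dataclasses import dataclass

MAX_ENTRIES_PER_TENANT = 1_000  # assumed safety threshold, not a real value


@dataclass
class ConfigEntry:
    tenant_id: str
    key: str
    value: str


class RolloutHaltedError(RuntimeError):
    """Raised when a rollout would exceed the per-tenant entry cap."""


def expand_rollout(tenants: list[str], template: dict[str, str]) -> list[ConfigEntry]:
    """Fan a rollout template out into per-tenant entries, enforcing a cap."""
    entries: list[ConfigEntry] = []
    for tenant in tenants:
        tenant_entries = [
            ConfigEntry(tenant_id=tenant, key=k, value=v) for k, v in template.items()
        ]
        if len(tenant_entries) > MAX_ENTRIES_PER_TENANT:
            raise RolloutHaltedError(
                f"rollout would create {len(tenant_entries)} entries for {tenant}, "
                f"cap is {MAX_ENTRIES_PER_TENANT}"
            )
        entries.extend(tenant_entries)
    return entries


if __name__ == "__main__":
    # A small, well-behaved rollout passes the check.
    ok = expand_rollout(["tenant-a", "tenant-b"], {"feature.flag": "on"})
    print(f"generated {len(ok)} entries")
```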
All times are in UTC
08/25/2025
18:00 — Rollout halted after error rates increased.
19:00 — Targeted service restarts restored partial availability.
22:00 — Added backend capacity and began controlled rollouts.
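The 18:00 halt and the later controlled rollouts follow a common staged-rollout pattern: apply the change to a small cohort of tenants, watch error rates, and stop before wider propagation if they climb. The sketch below illustrates only that pattern; the cohort size, threshold, and callbacks are assumptions, not the actual rollout tooling.

```python
# Hypothetical sketch of a controlled, staged rollout: apply the change
# cohort by cohort and halt if the observed error rate rises.
# Names and thresholds are illustrative assumptions.
from typing import Callable, Sequence

ERROR_RATE_THRESHOLD = 0.02  # assumed 2% halt threshold


def staged_rollout(tenants: Sequence[str],
                   apply_to_tenant: Callable[[str], None],
                   current_error_rate: Callable[[], float],
                   cohort_size: int = 10) -> bool:
    """Roll out cohort by cohort; return False if halted on elevated errors."""
    for start in range(0, len(tenants), cohort_size):
        cohort = tenants[start:start + cohort_size]
        for tenant in cohort:
            apply_to_tenant(tenant)
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            print(f"halting rollout after cohort ending at tenant {cohort[-1]}")
            return False
    return True
```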
08/26–08/28/2025
Continued staged rollouts with adjusted capacity.
Cleaned up configuration entries for affected tenants (illustrated in the sketch below).
Tuned resource allocations for read/permissioning services.
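The cleanup work referenced above amounted to a per-tenant removal of the excess configuration entries. A minimal sketch of that kind of pass follows, assuming a hypothetical store interface and deleting in small, throttled batches so the cleanup itself does not add load; all identifiers are illustrative.

```python
# Hypothetical sketch of a batched per-tenant cleanup of excess
# configuration entries, in the spirit of the 08/26-08/28 remediation.
# The store interface, batch size, and pause are assumptions.
import time
from typing import Iterable, Protocol


class ConfigStore(Protocol):
    def list_excess_entry_ids(self, tenant_id: str) -> list[str]: ...
    def delete_entries(self, entry_ids: Iterable[str]) -> None: ...


def cleanup_tenant(store: ConfigStore, tenant_id: str,
                   batch_size: int = 100, pause_s: float = 0.5) -> int:
    """Delete excess entries for one tenant in small batches to limit load."""
    entry_ids = store.list_excess_entry_ids(tenant_id)
    deleted = 0
    for start in range(0, len(entry_ids), batch_size):
        batch = entry_ids[start:start + batch_size]
        store.delete_entries(batch)
        deleted += len(batch)
        time.sleep(pause_s)  # throttle so cleanup does not add memory/CPU pressure
    return deleted
```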
08/29/2025
14:00 — All clusters stabilized; monitoring confirmed normal performance.