Performance Degraded - Information rendering in the Auvik user interface for tenants on the US3 cluster

Incident Report for Auvik Networks Inc.

Postmortem

Service Degraded - API and UI interruption on the US3 Cluster

Root Cause Analysis

Duration of incident

Discovered: Jul 3, 2025, 17:10 UTC
Resolved: Jul 3, 2025, 18:30 UTC

Cause

A service component in the US3 region experienced a critical failure that triggered a crash loop, rendering an internal service inoperable.

Effect

Users experienced disruption to both the user interface and API functions in the US3 environment.

Action taken

All times are in UTC

Jul 3, 2025

17:10 Deployment of user interface (UI) change.

17:30 Engineering alerted to issues with the UI and APIs on US3.

17:40 Attempted rollback of the most recent deployment version.

17:50 Crash loop persisted despite the rollback, indicating the issue was not caused by a regression in the rolled-back deployment.

17:52 Introduced more detailed diagnostic logging and restarted the affected services.

18:12 Comparison against other environments revealed the same error had occurred without incident elsewhere, suggesting environmental context was a key factor.

18:30 Services stabilized and returned to normal.

Future consideration(s)

  • Review environmental differences that contributed to varying behavior between clusters.
  • Consider adding a safe-fail mechanism for deployments to prevent full-service crash loops.
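
As an illustration of the safe-fail idea only, a minimal sketch of a crash-loop guard follows. It is not Auvik's implementation: the class name `CrashLoopGuard`, the function `next_action`, the thresholds, and the action names are all hypothetical. The guard counts restarts inside a sliding time window and, once a limit is reached, recommends halting and paging an engineer instead of restarting again.

```python
from collections import deque


class CrashLoopGuard:
    """Hypothetical guard: trips when restarts cluster inside a time window."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300.0):
        self.max_restarts = max_restarts      # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._restarts = deque()              # timestamps of recent restarts

    def record_restart(self, now: float) -> None:
        """Record a restart at time `now` (seconds) and drop stale entries."""
        self._restarts.append(now)
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()

    def tripped(self) -> bool:
        """True once the number of recent restarts reaches the limit."""
        return len(self._restarts) >= self.max_restarts


def next_action(guard: CrashLoopGuard) -> str:
    # Hypothetical policy: stop restarting and escalate once the guard trips.
    return "halt_and_page" if guard.tripped() else "restart"
```

With these assumed defaults, three restarts within five minutes would trip the guard, while restarts spaced further apart than the window would not.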
Posted Jul 29, 2025 - 13:56 EDT

Resolved

Affected Services: Access to part of the UI
Cluster(s): US3

Description:
The performance issue affecting access to the UI has been fully resolved, and normal operations have resumed. All systems are functioning as expected.

Impact:
Users should no longer experience any performance-related issues.

Next Steps:
Service has been restored. We apologize for the disruption and appreciate your continued patience. If you continue to experience issues, please contact our support team.
Posted Jul 03, 2025 - 14:44 EDT

Monitoring

Affected Services: Access to part of the UI
Cluster(s): US3

Description:
Our team has implemented a fix for the performance issues affecting the user interface. We are monitoring the system to ensure stability and confirm that performance remains at expected levels.

Impact:
System performance should be restored to normal. We will continue to monitor for any irregularities.

Next Steps:
A final update will be provided once we confirm the issue is resolved.

We appreciate your patience as we work through this issue.
Posted Jul 03, 2025 - 14:25 EDT

Identified

Affected Services: Access to part of the UI
Service not impacted: Monitoring and alerting functionality

Description:
We are experiencing degraded performance with access to the user interface and are working on a solution to restore normal service levels.

Impact:
While we work on the resolution, users may continue to experience interruptions in the user interface.

Next Steps:
We will provide updates as the situation progresses.

Your patience is greatly appreciated, and we regret any inconvenience you may be experiencing.
Posted Jul 03, 2025 - 13:35 EDT

Investigating

Affected Services: Access to part of the UI
Service not impacted: Monitoring and alerting functionality

Description:
We are currently experiencing degraded performance with access to the user interface. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience issues accessing the user interface.

Next Steps:
We will update this information as more details become available.

Thank you for your patience as we work to restore full functionality.
Posted Jul 03, 2025 - 13:14 EDT
This incident affected: Network Mgmt (us3.my.auvik.com).