Service Degraded - Clients on the EU1 cluster using V2 alerting are not receiving device alerts

Incident Report for Auvik Networks Inc.

Postmortem

Service Degraded - Clients on the EU1 cluster using V2 alerting are not receiving device alerts.

Root Cause Analysis

Duration of the incident

Discovered: Aug 21, 2025 - 23:47 UTC
Resolved: Aug 22, 2025 - 12:00 UTC

Cause

A change to the alert-processing timing logic introduced a defect where time windows did not close properly. This prevented events from being processed promptly, causing alerts to queue up and delaying their delivery to the user interface.
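As an illustration of the failure mode described above, the following is a minimal sketch of time-windowed event processing. It is not the actual alert-processing code: the class name, the 60-second window size, and the watermark-based closing rule are all assumptions made for the example. It shows how events stay buffered until their window closes, so a closing condition that is never satisfied causes the backlog to grow.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size, chosen for illustration


class TumblingWindows:
    """Buffers events per fixed window and emits a window once it closes."""

    def __init__(self, window_seconds=WINDOW_SECONDS, allowed_lateness=0):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(list)  # window start -> buffered events
        self.watermark = 0  # highest event time seen so far

    def add(self, event_time, payload):
        """Buffer one event and return any windows that closed as a result."""
        start = event_time - (event_time % self.window_seconds)
        self.open_windows[start].append(payload)
        self.watermark = max(self.watermark, event_time)
        return self._close_ready_windows()

    def _close_ready_windows(self):
        closed = []
        # sorted() copies the keys, so popping inside the loop is safe.
        for start in sorted(self.open_windows):
            end = start + self.window_seconds
            # A window closes once the watermark passes its end plus the
            # allowed lateness. If this condition is never met, events stay
            # buffered and the backlog keeps growing.
            if self.watermark >= end + self.allowed_lateness:
                closed.append((start, self.open_windows.pop(start)))
        return closed

    def backlog(self):
        """Number of events still waiting in open windows."""
        return sum(len(events) for events in self.open_windows.values())
```

In this sketch, events at times 10 and 30 remain buffered until an event at time 65 moves the watermark past the end of the first window, at which point that window is emitted and only the newest event remains in the backlog.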

Effect

Customers on the EU1 cluster using V2 alerting experienced delays in receiving device alerts, with some alerts delayed by up to 12 hours.

Action taken

All times are in UTC

Aug 21, 2025

23:47 – Alert processing began lagging; backlog started building.

Aug 22, 2025

06:00 – Incident declared; engineers engaged to investigate.

12:00 – Adjustments made to processing pipeline; backlog cleared; all delayed alerts reprocessed; service restored.

Future consideration(s)

  • Add monitoring for time-window stalls and backlog growth.
  • Expand testing to cover out-of-order and skewed event scenarios.
  • Strengthen rollback plans for all future alert-processing changes.
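
The first consideration above could be approached along the following lines. This is a hedged sketch only: the sample fields, the 300-second stall threshold, and the rule of three consecutive increases are hypothetical values picked for the example, not settings from the production system.

```python
from dataclasses import dataclass


@dataclass
class PipelineSample:
    timestamp: float          # when the sample was taken (seconds)
    backlog_size: int         # events waiting to be processed
    last_window_close: float  # when a window most recently closed (seconds)


def check_pipeline(samples, max_stall_seconds=300, max_growth_samples=3):
    """Return warning labels based on the most recent samples.

    Thresholds are hypothetical and would need tuning against real traffic.
    """
    warnings = []
    if not samples:
        return warnings

    latest = samples[-1]
    # Stall: no window has closed for longer than the allowed interval.
    if latest.timestamp - latest.last_window_close > max_stall_seconds:
        warnings.append("window_stall")

    # Growth: backlog increased on each of the last N consecutive samples.
    recent = samples[-(max_growth_samples + 1):]
    if len(recent) == max_growth_samples + 1 and all(
        later.backlog_size > earlier.backlog_size
        for earlier, later in zip(recent, recent[1:])
    ):
        warnings.append("backlog_growth")

    return warnings
```

Checking both signals separately is a deliberate choice in this sketch: a stalled window can be flagged even while the backlog is still small, and steady backlog growth can be flagged even if windows are still closing slowly.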
Posted Sep 08, 2025 - 09:39 EDT

Resolved

The incident has been fully resolved. Regular service has been restored, and all systems are operating as expected.

Impact:
Users should no longer experience any issues related to this incident.
If you are still experiencing issues, please contact the support team to update your existing ticket or to report any problems you have not yet reported.

We apologize for the degradation in service and thank you for your understanding.
We will post an RCA after an internal investigation.
Posted Aug 22, 2025 - 06:36 EDT

Monitoring

Our team has implemented a fix for the disruption, and the service has returned to normal. We continue to monitor the situation to ensure stability and confirm that the service remains fully functional.

Impact:
Services should be operating normally; however, we continue monitoring for irregularities.
If you are still experiencing issues, please contact the support team to update your existing ticket or to report any problems you have not yet reported.

Next Steps:
We will provide a final update once the issue is resolved.

We appreciate your patience as we work through this issue.
Posted Aug 22, 2025 - 06:26 EDT

Update

Affected Services: Alerting V2
Cluster(s): EU1

Description:
We are currently experiencing degraded services. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users are not currently receiving or clearing device-only related V2 alerts.

Services: monitoring and the UI are not impacted. All legacy alerting is working.

Next Steps:
We will update this information as more details become available.

We appreciate your patience as we work to restore full functionality.
Posted Aug 22, 2025 - 06:09 EDT

Investigating

Affected Services: Alerting V2
Cluster(s): EU1

Description:
We are currently experiencing degraded services. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible.

Impact:
Users may experience a delay of up to 12 hours for device-only related V2 alerts.

Services: monitoring and the UI are not impacted.

Next Steps:
We will update this information as more details become available.

We appreciate your patience as we work to restore full functionality.
Posted Aug 22, 2025 - 05:32 EDT
This incident affected: Network Mgmt (eu1.my.auvik.com).