Service Disruption - Delay in processing TrafficInsights data in US4 cluster
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Traffic Insights (TI) Data Stopped Processing on US4 Cluster

Root Cause Analysis

Duration of incident

Discovered: Nov 6, 2023, 13:04 - UTC
Resolved: Nov 7, 2023, 06:32 - UTC

Cause

Updates to code caused an unexpected restart of services that affected TI data flow on the US4 cluster.

Effect

The service restart triggered a restart loop that prevented TI data in the US4 cluster from flowing into the user interface as expected.

Action taken

All times in UTC

11/06/2023

13:04 - After approval, the code is released into production.

13:05 - Services unexpectedly restart and enter a restart loop, which delays TI data updates in the US4 cluster.

13:34 - An internal alert fires, notifying Auvik Engineering that TI data on the US4 cluster is delayed.

15:52 - Engineering begins its investigation.

16:45 - Engineering adjusts the TI data flow for clients on the US4 cluster to bypass the restart issue.

16:48 - Engineering confirms that TI data is flowing again for clients on the US4 cluster and begins monitoring the reduction of TI data lag.

18:00 - Engineering continues to monitor the reduction in TI data lag for clients in the US4 cluster. Additional resources are allocated to speed up the lag reduction.

11/07/2023

02:38 - All TI data lag is confirmed to have caught up.

06:32 - All data processes are confirmed as up-to-date and working correctly. The incident is closed.
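
The intervals implied by the timeline above can be worked out with a short calculation. The following is a minimal sketch, not an Auvik tool: the timestamps are copied from the timeline, while the event labels (`released`, `alert_fired`, and so on) and the `elapsed` helper are hypothetical names chosen only for illustration.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps taken from the timeline; the labels are illustrative only.
timeline = {
    "released": utc("2023-11-06 13:04"),
    "alert_fired": utc("2023-11-06 13:34"),
    "investigation_started": utc("2023-11-06 15:52"),
    "flow_restored": utc("2023-11-06 16:48"),
    "lag_cleared": utc("2023-11-07 02:38"),
    "closed": utc("2023-11-07 06:32"),
}


def elapsed(start: str, end: str) -> str:
    """Return the time between two timeline events as 'Hh MMm'."""
    minutes = int((timeline[end] - timeline[start]).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    for start, end in [
        ("released", "alert_fired"),
        ("released", "flow_restored"),
        ("flow_restored", "lag_cleared"),
        ("released", "closed"),
    ]:
        print(f"{start} -> {end}: {elapsed(start, end)}")
```

Under these assumptions, the last pair corresponds to the span between the Discovered and Resolved times given under "Duration of incident".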

Future consideration(s)

  • Auvik will update CPU limits on services related to Traffic Insights to prevent resource bottlenecks.
  • Auvik will investigate and determine service dependencies and better document possible conflicts.
  • Auvik will update older code to take advantage of new services to prevent this type of incident with these services in the future.
Posted Nov 28, 2023 - 18:22 EST

Resolved
The resolution for the disruption to Traffic Insights data processing in US4 has been implemented. The source of the disruption has been resolved, services have been fully restored, and Traffic Insights data is flowing normally.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Nov 07, 2023 - 05:41 EST
Monitoring
We’ve identified the source of the service disruption in TrafficInsights data processing in the US4 cluster and are monitoring the situation. The Traffic Insights data is catching up to the current flow and is projected to become current around 03:00 UTC on Nov 7. Updates will follow.
Posted Nov 06, 2023 - 12:50 EST
Identified
We’ve identified the source of the service disruption with TrafficInsights data processing in US4. We are working to restore service as quickly as possible.
Posted Nov 06, 2023 - 11:00 EST
Update
We continue investigating the disruption of Traffic Insights data for clients in the US4 cluster. We will continue to provide updates as they become available.
Posted Nov 06, 2023 - 10:15 EST
Investigating
We’re experiencing a disruption to TrafficInsights data processing in US4. We will continue to provide updates as they become available.
Posted Nov 06, 2023 - 09:41 EST
This incident affected: Auvik TrafficInsights.