Service Disruption - Traffic Insights data on the EU1 Cluster
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Delay of TrafficInsights data on the EU1 Cluster

Root Cause Analysis

Duration of incident

Discovered: Jan 23, 2024 12:30 - UTC
Resolved: Jan 23, 2024 15:30 - UTC

Cause

The TrafficInsights back end in the EU1 cluster exhausted its available resources, which caused the services involved to stop processing data.

Effect

TrafficInsights stopped delivering data to the UI, so TrafficInsights data for the EU1 cluster was delayed.

Action taken

All times in UTC

01/23/2024

12:30 - Auvik Engineering receives an internal alert that TrafficInsights has stopped processing data on the EU1 cluster.

12:35 - Engineering confirms no current work has caused the stoppage of data flow.

12:40 - Engineering cancels the TrafficInsights processing job and restarts it to begin processing TrafficInsights data again on the EU1 cluster.

12:45 - The restart fails to complete successfully.

13:30 - Additional resources are added to the processes being called.

14:20 - The point from which data processing resumes is moved to the time of the failure, instead of an older safe point, so that TrafficInsights data on the EU1 cluster can be brought current in the shortest time.

14:38 - TrafficInsights data in the EU1 cluster begins flowing successfully.

15:30 - Data lag for TrafficInsights on the EU1 cluster has caught up with the current data being processed from the devices. The incident is marked as resolved.
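
The effect of the 14:20 adjustment can be illustrated with a rough estimate. This is a minimal sketch, not Auvik's tooling: the function name, the speedup factor of 3, the 06:30 safe point, and the use of the 12:30 alert time as the failure point are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def catch_up_time(resume_from: datetime, restarted_at: datetime, speedup: float) -> timedelta:
    """Estimate how long a restarted job needs to become current.

    `speedup` is a hypothetical ratio of processing rate to the rate at which
    new data arrives. The backlog spans resume_from..restarted_at and shrinks
    at (speedup - 1) units of data time per unit of wall-clock time.
    """
    if speedup <= 1:
        raise ValueError("the job never catches up unless speedup > 1")
    backlog = restarted_at - resume_from
    return backlog / (speedup - 1)


utc = timezone.utc
restarted_at = datetime(2024, 1, 23, 14, 38, tzinfo=utc)     # from the timeline
failure_point = datetime(2024, 1, 23, 12, 30, tzinfo=utc)    # assumed: time of the alert
older_safe_point = datetime(2024, 1, 23, 6, 30, tzinfo=utc)  # hypothetical

print(catch_up_time(failure_point, restarted_at, speedup=3))     # 1:04:00
print(catch_up_time(older_safe_point, restarted_at, speedup=3))  # 4:04:00
```

Under these assumed numbers, resuming from the failure point instead of an older safe point shortens the catch-up period because less already-processed data is replayed.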

Future consideration(s)

  • Auvik will improve resource monitoring for the TrafficInsights backend services.
  • Resources for the TrafficInsights backend services will be increased across clusters to accommodate increased data volume.
  • Auvik will begin outline and discovery work for deprecating the current processing engine used by TrafficInsights, to increase its resilience.
Posted Jan 31, 2024 - 05:38 EST

Resolved
The delay of Traffic Insights data in the EU1 cluster has been resolved. The source of the disruption has been addressed, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Jan 23, 2024 - 12:05 EST
Monitoring
We’ve identified the source of the service disruption with Traffic Insights on the EU1 cluster, have addressed the cause, and are monitoring the situation. Traffic Insights data is delayed and will need approximately two hours to catch up. We’ll keep you posted on a resolution.
Posted Jan 23, 2024 - 09:50 EST
Identified
We’ve identified the source of the performance issue with Traffic Insights on the EU1 cluster. Data is delayed. We are working to restore optimal service as quickly as possible.
Posted Jan 23, 2024 - 09:35 EST
Investigating
We’re experiencing disruption with Traffic Insights on the EU1 cluster. Data is delayed. We will continue to provide updates as they become available.
Posted Jan 23, 2024 - 08:46 EST
This incident affected: Network Mgmt (eu1.my.auvik.com).