Performance Issue - Map Rendering Delayed on US Cluster Customers

Incident Report for Auvik Networks Inc.

Postmortem

Service Disruption - Data update delays degraded performance and slowed map rendering.

Root Cause Analysis

Duration of incident

Discovered: Jun 1, 2024, 20:30 UTC
Resolved: Jun 7, 2024, 13:05 UTC

Cause

Updates performed during scheduled maintenance on June 1, 2024, introduced an improper assertion on data in the Auvik application’s data stream.

Effect

The service disruption caused a significant backlog in the streaming queue, leading to noticeable delays in data processing for our customers. This was particularly evident in map rendering and updates, impacting the real-time visibility our services provide to stakeholders.
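As a hypothetical illustration (not Auvik's actual pipeline), the failure mode above can be sketched as a stream consumer whose new assertion rejects some messages: each rejected message is retried instead of drained, so the queue grows even when processing capacity matches the arrival rate.

```python
from collections import deque

def handle(message):
    # Hypothetical assertion introduced by the update: it rejects
    # messages missing an "iface" field instead of tolerating them.
    assert "iface" in message, "unexpected message shape"

def run_tick(queue, incoming, budget):
    """One processing tick: enqueue new messages, then spend the
    processing budget. A message that fails the assertion is
    re-queued for retry, consuming budget without draining anything."""
    queue.extend(incoming)
    for _ in range(budget):
        if not queue:
            break
        msg = queue.popleft()
        try:
            handle(msg)
        except AssertionError:
            queue.append(msg)  # retried forever: budget is wasted

# 2 of every 10 messages lack "iface"; the budget (10 per tick)
# exactly matches the arrival rate, so retries make the backlog grow.
queue, lengths = deque(), []
for tick in range(10):
    batch = [{"iface": i} if i % 5 else {} for i in range(10)]
    run_tick(queue, batch, budget=10)
    lengths.append(len(queue))
print(lengths)  # backlog per tick keeps growing
```

The point of the sketch is that a single bad assertion converts a small fraction of malformed data into an unbounded backlog, which matches the observed symptom: CPU spikes and growing processing delay rather than an outright outage.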

Action taken

All times in UTC

06/01/2024

20:30 - Auvik support alerts the on-call engineering team to abnormal CPU spikes in data processing.

20:59 - The engineering team begins its initial investigation.

21:11 - Engineering confirms that the system is receiving increased data input.

21:15 - The team works to identify the cause of the increased input.

23:00 - The team identifies the specific data flows with increased input and disables the change presumed to have caused the issue.

06/02/2024

02:00 - The team deploys the change to one cluster and waits to validate that it resolves the issue.

11:28 - The change is reported not to have resolved the ongoing incident. The engineering team assembles to determine the root cause.

11:45-17:00 - Engineering continues investigating the issue to determine a fix.

17:00 - The root cause of the issues is determined, and the next steps to resolve the incident are formulated.

17:00-21:30 - A fix for the issues is written and tested successfully.

22:45 - A plan for deploying the fix to production is formulated.

06/03/2024

01:00-02:45 - The proposed fix is deployed to one cluster to test and validate its correctness in the production environment.

13:30 - The team validates the desired results in the test cluster and formulates a plan for the remaining clusters.

16:00-21:30 - The fix is deployed to the remaining clusters. The team waits for the backlog to clear.

06/04/2024

05:00-18:55 - Engineering makes several changes to increase resources and the velocity of backlog processing. During this time, all non-US clusters recover from their data delay.

23:15 - The US4 cluster recovers from its data delay.

06/06/2024

09:00 - The US3 and US5 clusters recover from their data delay.

06/07/2024

08:35 - The US1 cluster recovers from its data delay.

13:05 - The US2 cluster recovers from its data delay. The incident is closed on the status page.

Future consideration(s)

  • Adjust the alerting workflow to detect product problems more quickly.
  • Auto-scale resources to meet dynamic demand.
  • Improve the testing environment so that system changes are validated under conditions that better reflect their actual impact on production systems.
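As a sketch of the auto-scaling consideration (function names, thresholds, and rates below are illustrative, not Auvik's implementation), a backlog-aware controller would size the worker pool from queue lag rather than CPU alone, so a growing backlog triggers scale-out directly:

```python
import math

def desired_workers(backlog, drain_rate_per_worker, target_catchup_secs,
                    min_workers=2, max_workers=50):
    """Pick a worker count that clears the current backlog within the
    target window, clamped to a configured range. All parameters are
    illustrative; a real deployment would read them from live metrics."""
    if backlog <= 0:
        return min_workers
    needed = math.ceil(backlog / (drain_rate_per_worker * target_catchup_secs))
    return max(min_workers, min(max_workers, needed))

# Example: 1.8M queued messages, each worker drains 100 msg/s,
# target of clearing the backlog within one hour.
print(desired_workers(1_800_000, 100, 3600))  # → 5
```

Scaling on queue depth rather than CPU addresses this incident's shape directly: the symptom was a backlog that drained too slowly, which CPU-based scaling alone may not detect.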
Posted Jun 20, 2024 - 10:41 EDT

Resolved

The impact of any remaining lag is negligible for customers on the US2 cluster and should resolve itself.

All other clusters are running optimally.

We are closing this incident at this time.

A Root Cause Analysis (RCA) will follow after completing a full review.
Posted Jun 07, 2024 - 09:05 EDT

Update

The US1 cluster has fully recovered.

Most of the US2 cluster’s clients have fully recovered. The remaining delay affects only the processing of interface information in the map, and only for a small subset of clients. Recovery time for this final portion of the data lag depends on the influx of data the map component receives today.

All other parts of the product are running normally.

We continue actively monitoring the situation while waiting for this final component to recover from its data lag.

We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.
Posted Jun 07, 2024 - 05:53 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog.

Most clients on the US1 cluster have fully recovered. However, a very small subset of clients still has a data lag. Due to a heavy influx of data, the cluster is processing data but the backlog is holding steady rather than shrinking. The delay affects only the processing of interface information in the map.

The US2 cluster is still delayed, but its lag continues to decrease. Customers are experiencing delays only in the map’s interface information.

We anticipate a full recovery by 11:00 UTC (7:00 EDT) tomorrow.

Rest assured, the dashboard information and alerts remain unaffected, providing up-to-date and accurate information.

We are diligently and actively monitoring the situation. We are waiting for the remaining components to catch up and be current.

We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.
Posted Jun 06, 2024 - 18:44 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog.

Most clients on the US1 cluster have fully recovered. However, a very small subset of clients still has a data lag. Due to a heavy influx of data, the cluster is processing data but the backlog is holding steady rather than shrinking. The delay affects only the processing of interface information in the map.

The US2 cluster is still delayed, but its lag continues to decrease. Customers are experiencing delays only in the map’s interface information.

Rest assured, the dashboard information and alerts remain unaffected, providing up-to-date and accurate information.

We are diligently and actively monitoring the situation. We are waiting for the remaining components to catch up and be current.

We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.
Posted Jun 06, 2024 - 13:58 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog.

The US1 cluster has almost recovered, with only a small subset of customers experiencing interface information delays in the map.

The US2 cluster is still delayed; customers are experiencing delays only in the map’s interface information.

Customers on the US3 and US5 clusters have fully recovered since the last update.

Dashboard information and alerts are not affected and are providing up-to-date information.

We are actively monitoring the situation and waiting for the remaining components to catch up and be current.

We understand the impact of this incident on your experience with the product and we sincerely apologize for the inconvenience it has caused.
Posted Jun 06, 2024 - 05:24 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

The maps for a small percentage of customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay should clear in the next several hours.

Clients on the US1, US2, and US3 clusters continue to decrease their lag. We now estimate it will take another 10-12 hours for Map discovery and rendering to be current again on all clusters. Several map components are already current; we are waiting for the remaining components to catch up. We continue to monitor this.

We understand the impact this is having on your experience with the product and apologize for any impact this may be having on you and your clients.
Posted Jun 05, 2024 - 17:42 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

The maps for customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay is still dropping and should become current in the next 4 hours.

Clients on the US1, US2, and US3 clusters continue to decrease their lag. We now estimate it will take another 18-20 hours for Map discovery and rendering to be current again on all clusters. We will continue to monitor this.

We understand the impact this is having on your experience with the product and apologize for any impact this may be having on you and your clients.
Posted Jun 05, 2024 - 11:50 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

The maps for customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay is still dropping and should become current in the next 4-8 hours.

Clients on the US1, US2, and US3 clusters are continuing to decrease their lag. We estimate it will take another 24 hours for Map discovery and rendering to be current again on all clusters. We will continue to monitor it.

We apologize for the impact this may be causing you and your clients.
Posted Jun 05, 2024 - 06:00 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

We still expect clients on the US5 cluster to recover from their lag sometime during the evening, most likely in the next four hours.

Clients on the US1, US2, and US3 clusters are slowly decreasing their lag. We do not have an estimate of when their Map discovery and rendering will be current, but we continue monitoring it closely.

We apologize for the impact this may be causing you and your clients.

We continue to monitor progress and will post relevant updates.
Posted Jun 04, 2024 - 19:22 EDT

Update

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

We expect clients on the US5 cluster to recover from their lag at some point during the evening.

Clients on the US1, US2, and US3 clusters are slowly decreasing their lag. We do not have an estimate of when their Map discovery and rendering will be current, but we continue monitoring it closely.

We apologize for the impact this may be causing you and your clients.

We continue to monitor progress and will post relevant updates.
Posted Jun 04, 2024 - 15:13 EDT

Monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog.

The lag for clients on the US4 cluster should clear within the next hour.

Dashboard information and alerts are not affected and are providing up-to-date information.

We apologize for the impact this may be causing you and your clients.

We continue to monitor progress and will post updates throughout the delay.
Posted Jun 04, 2024 - 12:57 EDT

Identified

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US4, US5). We are waiting as the informational lag works its way through the data backlog.

Dashboard information and alerts are not affected and are providing up-to-date information.

All relevant resources have been upgraded to provide the most expedient resolution.

We apologize for the impact this may be causing you and your clients.

We will continue to monitor the progress and post updates throughout the day.
Posted Jun 04, 2024 - 10:31 EDT
This incident affected: Network Mgmt (us1.my.auvik.com, us2.my.auvik.com, us3.my.auvik.com, us4.my.auvik.com, us5.my.auvik.com).