Service Disruption - US4 cluster may be unavailable to some customers
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Some clients on the US4 cluster could not connect to their tenants

Root Cause Analysis

Duration of incident

Discovered: Dec 5, 2023, 15:47 UTC
Resolved: Dec 5, 2023, 19:00 UTC

Cause

The rollout of the new Juniper Mist capabilities to clients as part of the GA release.

Effect

The volume of data accumulated in the product for the Juniper Mist feature exceeded the processing capacity of the US4 cluster. For a few clients on this cluster, the amount of historical data pushed to production was too large, which caused a few backend nodes to fail. Clients associated with the failed backend nodes then experienced connectivity issues.

Action taken

All times in UTC

Dec 5, 2023

15:47 - Auvik Engineering begins to roll out the GA release for Juniper Mist monitoring.

15:55 - Backend services related to this upgrade show signs of stress.

16:15 - Errors occur on the US4 cluster, with some tenants having issues connecting.

16:30 - All other clusters complete the Juniper Mist release action except US4.

17:05 - A decision is made to roll back the Juniper Mist release changes on the US4 cluster. Engineering performs the rollback and waits for the changes to propagate across the US4 cluster.

17:26 - A few of the backend nodes continue to throw errors. Engineering restarts these backend nodes to clear the errors.

18:00 - The US4 cluster is running normally. The incident is closed.

Future consideration(s)

  • Auvik will complete the Juniper Mist GA release by rolling it out to the US4 cluster during its scheduled maintenance window on December 16, 2023.
  • Auvik will adjust how it rolls out new functionality that entails large amounts of data movement within the product. Such changes will be rolled out in discrete stages instead of to the cluster as a whole at once.
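
As an illustration only, a rollout in discrete stages could look something like the sketch below. It is not Auvik's actual tooling: the function names, the tenant identifiers, the stage size, and the `apply_change` and `is_healthy` callbacks are all hypothetical, and the sketch simply assumes that a change can be applied per tenant and that cluster health can be checked between stages.

```python
from typing import Callable, List, Sequence, Tuple


def make_stages(tenants: Sequence[str], stage_size: int) -> List[List[str]]:
    """Split the tenant list into consecutive stages of at most stage_size."""
    if stage_size < 1:
        raise ValueError("stage_size must be at least 1")
    return [list(tenants[i:i + stage_size])
            for i in range(0, len(tenants), stage_size)]


def staged_rollout(
    tenants: Sequence[str],
    stage_size: int,
    apply_change: Callable[[str], None],   # hypothetical: enables the feature for one tenant
    is_healthy: Callable[[], bool],        # hypothetical: cluster health check
) -> Tuple[List[str], List[str], List[str]]:
    """Apply a change one stage at a time, halting at the first unhealthy stage.

    Returns (completed, halted_stage, remaining):
      completed    - tenants in stages that finished with a healthy cluster
      halted_stage - tenants in the stage after which the health check failed
      remaining    - tenants that were never touched
    """
    stages = make_stages(tenants, stage_size)
    completed: List[str] = []
    for index, stage in enumerate(stages):
        for tenant in stage:
            apply_change(tenant)
        if not is_healthy():
            # Stop here so only this stage needs to be rolled back.
            remaining = [t for later in stages[index + 1:] for t in later]
            return completed, stage, remaining
        completed.extend(stage)
    return completed, [], []
```

With a check between stages, a data-volume problem like the one in this incident would be expected to surface after a small group of tenants rather than after the whole cluster has been changed, which keeps any rollback limited to the halted stage.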
Posted Dec 17, 2023 - 19:57 EST

Resolved
A fix has been implemented for the disruption to the US4 cluster, which left some customers hosted on this cluster unable to access their site. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Dec 05, 2023 - 14:42 EST
Monitoring
We’ve identified the source of the service disruption that prevented clients on the US4 cluster from connecting to their sites, and we are monitoring the situation. Clients should now be able to connect successfully. We’ll keep you posted on a resolution.
Posted Dec 05, 2023 - 14:22 EST
Identified
We’ve identified the source of the service disruption with the US4 cluster. Some customers hosted on this cluster may be unable to access their site. We are working to restore service as quickly as possible.
Posted Dec 05, 2023 - 13:24 EST
Investigating
We’re experiencing a disruption to the US4 cluster. Some customers hosted on this cluster may not be able to access their site. We will continue to provide updates as they become available.
Posted Dec 05, 2023 - 12:20 EST
This incident affected: Network Mgmt (us4.my.auvik.com).