Service Disruption - Cluster US3 clients fail to connect to tenants
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Clients Fail to Connect to Their Tenants on the US3 Cluster

Root Cause Analysis

Duration of incident

Discovered: Dec 19, 2023 15:58 UTC
Resolved: Dec 19, 2023 18:15 UTC

Cause

An action taken to clean up residual issues from the incident "Service Disruption - Devices Deleted from Auvik UI when seen as down in health check".

Effect

Resources on the backend of the US3 cluster were overloaded. Collectors then disconnected from their tenants on the backends, which caused logins to fail until the tenants were restarted. Engineering then rebooted the US3 cluster to regain cluster stability; the reboot behaved like an Auvik biweekly maintenance window.

During the cluster reboot, all collectors disconnected, and tenants could not log in for anywhere from a few minutes to a few hours, depending on the order in which they restarted.

No previously collected data was lost during this incident, up to the point of recovery.

Action taken

All times in UTC

12/19/2023

15:58 - Auvik Engineering notices issues with US3 customer tenants.

16:20 - Initial investigation into metrics on the US3 cluster.

17:00 - Engineering decides to reboot the US3 cluster.

17:00 - 18:15 - Engineering monitors the tenants after the reboot, much like after a maintenance window. Engineering manually brings up the larger tenants.

18:15 - The incident is deemed closed.

Future consideration(s)

  • Improve the timeliness of communication when making changes to production. Over-communicate actions at the time they occur, not too far ahead of the actions themselves.
  • Add documentation for this case of resource overload, along with protections against it being repeated in the future.
Posted Jan 09, 2024 - 06:01 EST

Resolved
Service on the US3 cluster has been restored.
Posted Dec 19, 2023 - 15:14 EST
Monitoring
We’ve identified the source of the service disruption on the US3 cluster and are monitoring the situation. We have taken steps to mitigate the cause. Tenants may still have issues connecting as we work through the issue. We’ll keep you posted on a resolution.
Posted Dec 19, 2023 - 14:11 EST
Identified
We’ve identified the source of the service disruption with the US3 Cluster. Access to tenants will be sporadic at this time. We are working to restore service as quickly as possible.
Posted Dec 19, 2023 - 13:39 EST
Investigating
We’re experiencing a disruption to tenants on the US3 cluster. Clients are failing to connect. We will continue to provide updates as they become available.
Posted Dec 19, 2023 - 13:31 EST
This incident affected: Network Mgmt (us3.my.auvik.com).