Service Disruption - EU1 cluster is experiencing an outage
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - EU1 Customers Experienced an Outage Following the April 20, 2024, Upgrade

Root Cause Analysis

Duration of incident

Discovered: Apr 20, 2024, 12:36 - UTC
Resolved: Apr 20, 2024, 16:10 - UTC
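
For reference, the outage window implied by the two timestamps above can be worked out directly. The following is a minimal sketch, not part of Auvik's tooling: the variable names are illustrative, and a fixed UTC-4 offset is assumed for EDT so the result can be compared with the status updates, which are posted in EDT.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the report (both in UTC).
discovered = datetime(2024, 4, 20, 12, 36, tzinfo=timezone.utc)
resolved = datetime(2024, 4, 20, 16, 10, tzinfo=timezone.utc)

# Elapsed time between discovery and resolution.
duration = resolved - discovered
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"Incident duration: {hours} h {minutes} min")  # Incident duration: 3 h 34 min

# Assumption: EDT is modelled as a fixed UTC-4 offset.
edt = timezone(timedelta(hours=-4), "EDT")
resolved_edt = resolved.astimezone(edt)
print(f"Resolved at {resolved_edt:%H:%M} {resolved_edt.tzname()}")  # Resolved at 12:10 EDT
```

The converted resolution time, 12:10 EDT, matches the timestamp on the "Resolved" status update.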

Cause

A scheduled upgrade was performed on the EU1 cluster to address software requirements for performance and security improvements.

Effect

Scheduled processes did not run, and clients on the EU1 cluster experienced network connectivity issues.

Action taken

All times in UTC
04/20/2024

10:31 - Planned upgrade occurring during scheduled maintenance.

12:36 - Issues from the upgrade are detected.

12:50 - Initial mitigation taken to address the issues.

13:00 - Initial mitigation step deemed insufficient. Investigation into the next steps started.

13:34 - Additional mitigation steps implemented.

14:15 - Engineering takes the concluding steps to address the disruption and clear out the failed upgrade.

15:10 - EU1 cluster and clients appear to be recovering.

16:00 - The old data is cleared from the internal pods.

16:10 - The incident is declared resolved.

Future consideration(s)

  • The order-of-operations list for upgrades to parts of the Auvik product will be reviewed and standardized.
  • Ensure that Subject Matter Expert (SME) approval has been signed off and that an SME is available when pertinent upgrades are scheduled.
  • Enforce the preferred roll-back processes wherever upgrades to the product are implemented.
Posted Apr 29, 2024 - 12:37 EDT

Resolved
The source of the disruption has been resolved, and services have been fully restored.
Posted Apr 20, 2024 - 12:10 EDT
Monitoring
We’ve identified the source of the service disruption and applied a fix. Sites are starting, and we are monitoring to ensure all systems are functional.
Posted Apr 20, 2024 - 11:38 EDT
Identified
We’ve identified the source of the service disruption to EU1. Sites continue to be down at this time. We are working to apply changes and restore service as quickly as possible.
Posted Apr 20, 2024 - 11:04 EDT
Investigating
We’re experiencing an outage on the EU1 cluster. Customers will be unable to access their sites at this time. We will continue to provide updates as they become available.
Posted Apr 20, 2024 - 10:04 EDT
This incident affected: Network Mgmt (eu1.my.auvik.com).