Service Disruption - Clients on the EU2 cluster are having issues accessing their sites and are receiving 502 errors.
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Clients on the EU2 Cluster Intermittently Receive 502 Errors When Connecting

Root Cause Analysis

Duration of incident

Discovered: Oct 31, 2023, 22:35 - UTC
Resolved: Nov 01, 2023, 18:58 - UTC

Cause

An update to an internal service caused repeated service reloads.

Effect

The repeated service reloads depleted the memory available to the service, which in turn caused intermittent connection issues to client websites on the EU2 cluster.

Action taken

All times in UTC

10/31/2023

19:15 - The kOps service is updated on the EU2 cluster.

20:00 - Issues begin with the affected services on the EU2 cluster. Memory usage starts to increase.

20:00 - Clients on the EU2 cluster start receiving 502 error pages when they attempt to log in.

22:35 - Auvik internal alerting reports disconnection issues with the EU2 cluster clients.

11/01/2023

08:30 - Auvik Engineering begins the investigation.

09:50 - Auvik declares an incident and posts to the status page.

10:15 - Engineering allocates additional memory to the service. This resolves the connection issues for the clients.

10:15 - 14:33 - Engineering continues investigating to determine the root cause and a permanent fix.

15:37 - The fix is tested successfully in a staging environment.

16:15 - Auvik notifies its clients on the EU2 cluster that it will implement the fix at 18:00 UTC, with possible service disruptions during the hour the work will take to complete.

18:00 - Auvik begins implementing the fix on the EU2 cluster.

18:58 - Auvik completes installing the fix and clean-up processes from the fix implementation. The incident is resolved.

Future consideration(s)

  • Auvik has reviewed and updated the documentation for upgrading the affected kOps service to prevent this incident from recurring.
  • Auvik will create improved internal alerting to notify Auvik when resources are being restarted repeatedly and abnormally.
  • Auvik will validate the restoration procedure for the kOps service in case it is required.
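
The alerting item above could be approached with a sliding-window restart counter. The following is a minimal sketch only, not Auvik's implementation: the class name `RestartMonitor`, the threshold of 3 restarts, and the 10-minute window are all hypothetical values chosen for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


class RestartMonitor:
    """Flags a resource that restarts too often within a sliding time window.

    Hypothetical helper for illustration; thresholds are assumptions.
    """

    def __init__(self, max_restarts: int, window: timedelta):
        self.max_restarts = max_restarts
        self.window = window
        self._events = deque()  # restart timestamps, oldest first

    def record(self, ts: datetime) -> bool:
        """Record a restart at `ts`; return True if an alert should fire."""
        self._events.append(ts)
        cutoff = ts - self.window
        # Drop restarts that have fallen out of the window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) > self.max_restarts


# Usage: four restarts one minute apart exceed a limit of 3 per 10 minutes.
monitor = RestartMonitor(max_restarts=3, window=timedelta(minutes=10))
start = datetime(2023, 10, 31, 20, 0)
alerts = [monitor.record(start + timedelta(minutes=i)) for i in range(4)]
```

In this sketch an isolated restart never alerts; only a burst inside the window does, which matches the "repeatedly restarted" condition described above.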
Posted Nov 09, 2023 - 13:24 EST

Resolved
The fix for clients having issues connecting to their tenants on cluster EU2 has been implemented. The source of the disruption has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Nov 01, 2023 - 15:04 EDT
Update
We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik was required to upgrade a degraded service. This work has been completed. Internal services are resetting for ingress connections.
We are actively monitoring the follow-up and will update this page when complete.
Posted Nov 01, 2023 - 14:49 EDT
Update
We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik was required to upgrade a degraded service. This work has been completed.
We are actively monitoring the follow-up and will update this page when complete.
Posted Nov 01, 2023 - 14:40 EDT
Monitoring
We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik is required to upgrade a degraded service. This work has begun. This overall action should take up to an hour, with any disruptions to any individual tenant lasting no more than one minute if an interruption occurs. We apologize for any unscheduled downtime that may arise due to this action.
We are actively monitoring and will update this message once complete.
Posted Nov 01, 2023 - 14:00 EDT
Identified
We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik is required to upgrade a degraded service. Auvik will perform this work at 18:00 UTC (6:00 PM GMT). This overall action should take an hour with any disruptions to any individual tenant lasting no more than one minute, if an interruption occurs. We apologize for any unscheduled downtime that may occur due to this action.
Posted Nov 01, 2023 - 12:32 EDT
Update
We’ve identified the source of the service disruption with access for clients to their sites on EU2 and are monitoring the situation. We have implemented changes to alleviate disruption. We continue to work on resolving the root cause. All clients should have no issue connecting. We’ll keep you posted on a resolution.
Posted Nov 01, 2023 - 10:28 EDT
Monitoring
We’ve identified the source of the service disruption with access for clients to their sites on EU2 and are monitoring the situation. We have implemented changes to alleviate disruption. We’ll keep you posted on a resolution.
Posted Nov 01, 2023 - 07:18 EDT
Investigating
We’re experiencing disruption for clients in the EU2 cluster. Access to their sites is impacted. We will continue to provide updates as they become available.
Posted Nov 01, 2023 - 05:53 EDT
This incident affected: Network Mgmt (eu2.my.auvik.com).