Service Disruption - Delay with the Delivery of Syslog Messages to Clients on Cluster US2
Incident Report for Auvik Networks Inc.
Postmortem

Service Disruption - Delay with the Delivery of Syslog Messages to Clients in US2 Cluster

Root Cause Analysis

Duration of incident

Discovered: Oct 26, 2023, 05:30 - UTC
Resolved: Oct 27, 2023, 18:52 - UTC

Cause

Disk space ran out on the processing disks for Syslog on the US2 cluster.

Effect

Syslog message delivery to clients on the US2 cluster stopped.

Action taken

All times in UTC

10/26/2023

05:30 - An internal alert is raised indicating that Syslog messaging is not working on the US2 cluster.

07:15 - Auvik Engineering begins its investigation.

08:30 - Engineering begins increasing disk space so that Syslog messages can be processed.

09:20 - Engineering alters data retention policy to ensure no data is lost due to the delay.

10:02 - Engineering triggers the new policy to test rollout.

11:10 - Engineering validates the new settings and confirms that the data lag is shrinking and customer data is flowing as expected.

11:15 - The initial incident is marked as closed.

10/27/2023

09:10 - Data was checked for the cluster as part of standard operating procedure. Data restored by the policy implementation was no longer there.

09:20 - The Auvik Engineering team proceeds to launch an investigation.

09:45 - Engineering confirms that Syslog data from the last 20 days was absent for US2 cluster clients.

10:10 - 10:35 - Engineering locates the log entry explaining why the Syslog data was deleted and identifies the location of the backup of that data.

10:43 - Engineering begins to restore the absent Syslog data to the US2 cluster.

10:43 - 18:50 - The data for the Syslog messages is restored to the US2 cluster for clients.

18:52 - The restoration is finished. The incident is closed.

Future consideration(s)

  • Auvik will add alerting for the particular disk space issues attributed to this incident, including the cause and repair process.
  • Auvik will upgrade the specific systems to avoid disk space issues like this occurring again.
  • Auvik will update the retention policy documentation to reflect timing issues with policy changes.
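
As an illustration of the disk space alerting mentioned in the first item above, the following is a minimal sketch of a threshold check. It is not Auvik's implementation: the function names, the monitored path, and the 80% and 90% thresholds are assumptions chosen for the example.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent_used: float,
                   warn_at: float = 80.0,
                   critical_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level.

    The thresholds are hypothetical defaults, not values from the incident.
    """
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # "/" is a placeholder; a real check would point at the Syslog processing disk.
    pct = disk_usage_percent("/")
    print(f"{classify_usage(pct)}: {pct:.1f}% used")
```

A check like this would typically run on a schedule, with the "warning" level set low enough to leave time to add capacity before processing stalls.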
Posted Nov 09, 2023 - 14:03 EST

Resolved
The disruption with delivering Syslog messages to clients on the US2 cluster has been resolved, and services have been fully restored.

A Root Cause Analysis (RCA) will follow after a full review has been completed.
Posted Oct 26, 2023 - 07:38 EDT
Monitoring
We’ve identified the source of the service disruption with the delivery of Syslog messages to clients on the US2 cluster and are monitoring the situation. The Syslog delay is now catching up. All clients should see current Syslog messages in the next couple of hours. We’ll keep you posted on a resolution.
Posted Oct 26, 2023 - 06:36 EDT
Identified
We’ve identified the source of the service disruption with the delivery of Syslog messages to clients on the US2 cluster. We are working to restore service as quickly as possible.
Posted Oct 26, 2023 - 06:10 EDT
Investigating
We’re experiencing disruption with the delivery of Syslog messages to clients on the US2 cluster. We will continue to provide updates as they become available.
Posted Oct 26, 2023 - 05:03 EDT
This incident affected: Network Mgmt (us2.my.auvik.com).