Discovered: Dec 6, 2023, 15:45 - UTC
Resolved: Dec 7, 2023, 01:30 - UTC
An internal service that injects network and IP data into the product (the Repartioner service) was in a crash loop.
This caused the Consolidation services, which attach IPs to devices, to conclude that the IPs had been deleted. The resulting data mismatch caused devices to lose their association with their actual IPs, resulting in orphaned devices.
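The failure mode above, where missing upstream data is read as a deletion, can be illustrated with a small sketch. This is not Auvik's implementation: the names `FeedStatus`, `should_delete_ip`, and the 15-minute `max_staleness` threshold are hypothetical, chosen only to show how a consumer might distinguish "the IP is gone" from "the feed has stopped updating".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FeedStatus:
    """Hypothetical health record for the upstream IP/network data feed."""
    last_update: datetime


def should_delete_ip(ip_in_feed: bool, feed: FeedStatus, now: datetime,
                     max_staleness: timedelta = timedelta(minutes=15)) -> bool:
    """Treat an absent IP as deleted only when the feed itself is fresh.

    If the feed has not updated within max_staleness (an assumed threshold),
    absence is ambiguous, so the existing association is kept.
    """
    if ip_in_feed:
        return False
    return (now - feed.last_update) <= max_staleness


# Illustration using the timestamps from this incident (UTC).
now = datetime(2023, 12, 6, 15, 45, tzinfo=timezone.utc)
stale_feed = FeedStatus(last_update=datetime(2023, 12, 5, 16, 15, tzinfo=timezone.utc))
fresh_feed = FeedStatus(last_update=now - timedelta(minutes=1))

keep_when_stale = not should_delete_ip(False, stale_feed, now)
delete_when_fresh = should_delete_ip(False, fresh_feed, now)
```

With a guard of this kind, a feed that has been stalled for a day would leave device-to-IP associations untouched until fresh data arrives.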
12/05/2023
16:15 - Backend services related to the Juniper Mist Release to GA on the US4 cluster begin to report errors. The backend Repartioner service falls into a crash loop.
12/06/2023
15:45 - Auvik Support reports that a client has devices whose attached IPs appear to have been deleted. Several more tickets follow in quick succession. Engineering is alerted to the issue and begins its investigation.
16:30 - An incident is declared and posted to the Auvik status page. Engineering continues to investigate the cause. Engineering turns off the consolidation engine on the US4 cluster to prevent any more deletions.
16:30 - 17:00 - Engineering identifies that the Repartioner service is crash-looping and restarts the service successfully. It is determined that the Repartioner service needs more resources to process the data lag accumulated over the previous day. Additional resources are provisioned.
17:00 - The lag is processed through the Repartioner service. The processed data is now attempting to catch up with the production environment.
17:30 - Injecting the delayed data back into the product on the US4 cluster will take time. Adjustments are made to the US4 cluster processing services so the lagged data can catch up more quickly. It is noted that devices with orphaned IPs are recovering.
12/06/2023 - 12/07/2023
17:30 - 01:30 - Engineering monitors the decreasing data lag and validates that the data can catch up.
12/07/2023
01:30 - Data lag for the IP and network data on cluster US4 has caught up.
09:41 - The Auvik status page posts that the incident has been closed.