Discovered: Jan 17, 2024 15:27 UTC
Resolved: Jan 18, 2024 04:30 UTC
A CORE settings change was implemented on 50% of Auvik clients after a successful initial rollout to 5% of Auvik clients the day before.
Clients in the 50% that received the settings change became inaccessible. Some of these clients were also automatically disabled, and so could not be reactivated, because of the number of restart attempts that accompanied the disconnection.
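The disable-after-too-many-restarts behaviour described above can be illustrated with a minimal sketch. It is not Auvik's implementation: the `Tenant` class, the `MAX_RESTARTS` threshold, and both function names are hypothetical and chosen only to show how repeated restart attempts could leave a tenant disabled until someone re-enables it.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real limit is not stated in this report.
MAX_RESTARTS = 5


@dataclass
class Tenant:
    """Illustrative tenant record (assumed structure, not Auvik's)."""
    name: str
    restart_attempts: int = 0
    enabled: bool = True


def record_restart_attempt(tenant: Tenant, max_restarts: int = MAX_RESTARTS) -> bool:
    """Count one restart attempt and disable the tenant once the limit is exceeded.

    Returns True while the tenant is still enabled.
    """
    if not tenant.enabled:
        return False
    tenant.restart_attempts += 1
    if tenant.restart_attempts > max_restarts:
        # Past the limit the tenant stays disabled until it is re-enabled.
        tenant.enabled = False
    return tenant.enabled


def re_enable(tenant: Tenant) -> None:
    """Manual recovery step: clear the counter and enable the tenant again."""
    tenant.restart_attempts = 0
    tenant.enabled = True
```

Under this assumption, a tenant whose backend connection is lost keeps accumulating restart attempts and stays disabled even after the underlying change is reverted, which is consistent with the later manual re-enable step in the timeline.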
All times in UTC
01/17/2024
15:27 – Auvik Engineering enables the CoreSettings change for 50% of tenants, following a successful initial rollout to 5% of its clients the previous day.
15:35 – Internal Auvik alerts notify Engineering of a significant service disruption.
15:36 – Engineering begins its investigation.
15:39 – The backend services of the clients where changes were implemented stop reporting metrics.
15:53 – Engineering reverts the change that was implemented.
16:30 – Engineering manually begins restarting clusters of the affected clients.
18:40 – Engineering begins manually repairing the connections to the backend services of clients that are not starting or reporting metrics properly.
21:00 – All clusters are recovered. Engineering is seeing successful reporting of services and believes the incident to be over. The incident is marked as resolved on the Status page.
21:51 – Auvik Support receives notice that one of the affected client’s tenants has been unexpectedly disabled.
01/18/2024
01:31 – Auvik continues to receive client reports of unexpectedly disabled tenants.
02:30 – The Auvik Engineering On-Call team is engaged.
03:37 – Engineering determines the number of tenants unexpectedly disabled to be just over 1000.
03:50 – Engineering re-enables the disabled tenants.
04:30 – The number of running tenants is back to its pre-incident level. This incident is officially closed.
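The key intervals in the timeline above can be worked out with a short calculation. The timestamps are taken directly from the timeline; the variable names and the `minutes_between` helper are illustrative only.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps from the incident timeline (all UTC).
change_applied = utc("2024-01-17 15:27")
first_alert = utc("2024-01-17 15:35")
change_reverted = utc("2024-01-17 15:53")
clusters_recovered = utc("2024-01-17 21:00")
incident_closed = utc("2024-01-18 04:30")

time_to_detect = minutes_between(change_applied, first_alert)            # 8 minutes
time_to_revert = minutes_between(change_applied, change_reverted)        # 26 minutes
time_to_clusters = minutes_between(change_applied, clusters_recovered)   # 333 minutes (5 h 33 min)
total_duration = minutes_between(change_applied, incident_closed)        # 783 minutes (13 h 3 min)
```

The gap between the 21:00 recovery and the 04:30 closure reflects the period in which the disabled tenants had not yet been identified and re-enabled.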