Discovered: Mar 09, 2024 11:00 - UTC
Resolved: Mar 10, 2024 03:21 - UTC
Significant maintenance upgrade to the system.
Maps were unavailable to customers across the platform.
All times in UTC
03/09/2024
11:00 - Regularly planned maintenance performed on the system. This included infrastructure upgrades.
13:00 - Maintenance completed. A few internal issues were noticed, and action was taken to address them.
13:40 - Internal issues noticed at the end of maintenance appear to be addressed and resolved.
15:20 - Auvik Engineering is aware of issues with maps not loading in the UI.
15:30 - Additional permission issues were also discovered. An incident is declared, and the on-call team is assembled.
15:30-18:40 - The engineering team begins its investigation and works to discover the incident's underlying cause.
18:40 - The Core data actors and injector are restarted. Engineering must wait for results as the system reloads data.
18:40-21:40 - Engineering observes the results as they update. It is determined that the restart did not provide the desired outcome and that recovery is progressing too slowly for a production environment.
Engineering decides to perform a complete system restart. The restart will involve staggering individual cluster restarts to prevent overloading the core part of the product.
21:40 - Engineering performs the complete system restart with staggered starts of each cluster.
03/10/2024
03:21 - All clusters have successfully restarted, and Map functionality is back at an acceptable product level. The incident is declared closed.
Improve tenant inspection after maintenance windows to validate that there are no adverse effects from the changes implemented, especially after a more significant or complex upgrade.
Create improved guidance on when a complete system restart is warranted, including specific criteria for applying it.
Investigate why changes to the system from this upgrade caused a delay in map rendering that forced a staggered reboot.
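As an illustration of the staggered restart described in the timeline, the following is a minimal sketch and not Auvik's actual tooling. It restarts clusters one at a time and waits for each to report healthy before moving on, so the core services are not hit by every cluster reloading at once. The function name `staggered_restart`, the `restart` and `is_healthy` callbacks, and the polling and timeout values are all hypothetical.

```python
import time
from typing import Callable, Iterable, List


def staggered_restart(
    clusters: Iterable[str],
    restart: Callable[[str], None],      # hypothetical: triggers a restart of one cluster
    is_healthy: Callable[[str], bool],   # hypothetical: True once the cluster has recovered
    poll_seconds: float = 30.0,          # assumed polling interval
    timeout_seconds: float = 1800.0,     # assumed per-cluster recovery budget
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart clusters sequentially, gating each step on a health check.

    Returns the clusters that were restarted and confirmed healthy, in order.
    Raises TimeoutError if a cluster does not recover within the budget,
    which halts the rollout before any further cluster is touched.
    """
    restarted: List[str] = []
    for cluster in clusters:
        restart(cluster)
        waited = 0.0
        # Poll until this cluster is healthy before starting the next one.
        while not is_healthy(cluster):
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"{cluster} not healthy after {waited:.0f}s; halting rollout"
                )
            sleep(poll_seconds)
            waited += poll_seconds
        restarted.append(cluster)
    return restarted
```

Gating on a health check rather than a fixed delay trades a longer total restart time for a bounded load on shared services, and the per-cluster timeout gives a concrete point at which to stop and escalate.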