Discovered: Oct 31, 2023, 22:35 - UTC
Resolved: Nov 01, 2023, 18:58 - UTC
An update to an internal service caused repeated service reloads.
The repeated reloads depleted the memory available to the service, which in turn caused intermittent connection issues to client websites on the EU2 cluster.
Action taken
All times in UTC
10/31/2023
19:15 - The kOps service is updated on the EU2 cluster.
20:00 - Issues begin on the affected services on the EU2 cluster. Memory usage starts to increase.
20:00 - Clients on the EU2 cluster begin receiving 502 error pages when they attempt to log in.
22:35 - Auvik internal alerting reports disconnection issues with the EU2 cluster clients.
11/01/2023
08:30 - Auvik Engineering begins the investigation.
09:50 - Auvik declares an incident and posts to the status page.
10:15 - Engineering allocates additional memory to the service. This resolves the connection issues for the clients.
10:15 - 14:33 - Engineering continues investigating to determine the root cause and a permanent fix.
15:37 - The fix is tested successfully in a staging environment.
16:15 - Auvik notifies its clients on the EU2 cluster that it will implement the fix at 18:00 UTC, with possible service disruptions during the hour it will take to complete.
18:00 - Auvik implements the fix on the EU2 cluster.
18:58 - Auvik finishes installing the fix and completes the associated clean-up. The incident is resolved.