Discovered: Oct 29, 2021 Time - 16:47 UTC
Resolved: Oct 29, 2021 Time - 21:30 UTC
A problematic flow from a device using the CA1 cluster for TrafficInsights caused the parser to crash and continually reboot.
TrafficInsights data processing was delayed on the CA1 cluster, resulting in stale information being shown on the dashboard. The service disruption did not result in any data loss. All other services were unaffected.
10/29/2021 - All times in UTC
16:42 TrafficInsights is enabled for a new device and a problematic flow is sent.
16:57 Auvik engineering team is made aware of the issue.
17:40 Auvik engineering team restarts the TrafficInsights service to see if the service will successfully bypass the problematic data causing the crash. The issue is not resolved.
18:00 Auvik engineering team determines the cause of the issue and begins identifying the offending device.
19:05 Auvik identifies a specific device causing the service interruption.
19:40 Auvik finishes writing the code to ignore the device.
20:10 Auvik finishes testing the code and deploys it to production.
20:16 Auvik scales up data processing to work through the backlog quickly.
20:42 Auvik confirms data is being processed correctly.
21:30 Auvik confirms the backlogged data has been fully processed with no data loss.
Auvik will improve its ability to handle unexpected data or behavior gracefully.
Auvik plans to develop internal tooling to identify offending devices more quickly and efficiently.
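
As an illustration only, the two follow-up items above could look roughly like the sketch below. It is not Auvik's actual code: the record layout (a dict with "device" and "bytes" fields) and the names parse_flow and process_batch are hypothetical, chosen to show the general idea of isolating a bad flow so one record cannot crash the whole parser, while counting failures per device so an offending device stands out quickly.

```python
from collections import Counter


def parse_flow(raw):
    """Hypothetical parser for one flow record; raises on malformed input."""
    if not isinstance(raw, dict):
        raise ValueError("flow record is not a mapping")
    if not isinstance(raw.get("bytes"), int) or raw["bytes"] < 0:
        raise ValueError("invalid byte count")
    # A missing "device" key raises KeyError, which the caller also handles.
    return {"device": raw["device"], "bytes": raw["bytes"]}


def process_batch(records):
    """Parse every record, skipping bad ones instead of crashing.

    Returns the successfully parsed flows and a per-device failure count.
    """
    parsed = []
    failures = Counter()
    for raw in records:
        try:
            parsed.append(parse_flow(raw))
        except (ValueError, KeyError, TypeError):
            # Attribute the failure to a device when the record allows it.
            device = raw.get("device", "unknown") if isinstance(raw, dict) else "unknown"
            failures[device] += 1
    return parsed, failures
```

With this shape, a malformed flow is dropped and counted rather than taking the service down, and the failure counter gives a direct answer to which device is sending the problematic data.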