Smoke detectors in large scale production systems | Abhijeet Mishra | Conf42 SRE 2022

Static alerting thresholds no longer cut it for modern distributed systems. With production systems scaling rapidly, relying on static alerts to observe critical systems is a recipe for disaster. Observability tools have recognized this need and provide a way to "magically" catch deviations from normal system behavior instead.

- What is that magic? What goes into deciding whether a spike or a drop violates a known "good" condition?
- How do we avoid alert fatigue?
- How do we factor in seasonality, such as low off-peak hours and high holiday traffic?

The rabbit hole goes deeper than I imagined. As part of the core data science team at Last9, I ran into scenarios where my assumptions about building anomaly detection engines were shattered and rebuilt with every interaction with production traffic.

In this talk, I will cover:

- What I learned while trying to answer the questions above.
- How known theoretical models map to real-world workloads, e.g. streaming services and high-frequency trading applications.
- The science behind choosing and calculating the right SLOs for different SLIs and sending out early warnings, and how to measure and improve leading and lagging indicators of system health.

Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022

0:00 Intro
2:13 Talk