List of videos

Monitorama PDX 2023 - Observability Data Engineering

Jack Neely's session from Monitorama PDX 2023. I find the data in Observability fascinating. In every aspect of SRE work I see problems to solve with data rather than brute force. In fact, all of us in the Observability space are really Data Engineers and Data Scientists in disguise. The only way to fully understand our complex systems is through math and visualizations. Let's explore the 4 Golden Signals, the math behind why they work well, and some tricks for bridging Observability and Business Intelligence. In this talk I'll cover each of the 4 Golden Signals and speak to the data engineering tools used for each, to give folks a broad platform to discover new math and new techniques for solving their own data problems:

- Traffic: Counters and Calculus -- The Physics Behind Why Counters Work.
- Errors: Counting vs Sampling and the Nyquist-Shannon Theorem. Your CPU metrics are wrong and I can prove it.
- Latency: Timers and Distributions -- Why averages are horrible and Anscombe's Quartet. Understanding Gamma Distributions.
- Saturation: Percentiles and Pipelines -- Visualizing percentiles of data, why we cannot combine percentiles, and the magic of histograms.

Finally, all of us in Observability have been asked at some point to participate in managing the data behind Business Intelligence. Data about the product likely comes directly from the product itself through its telemetry. Often the requirements are high in cardinality and high in volume, making it difficult to store raw data for months on end to produce accurate BI. We'll work around the limits of percentiles and show some tricks for extracting events, storing those as summary data, and producing monthly percentile-based reports with T-Digests.
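The point about why percentiles cannot be combined, and why histograms help, can be illustrated with a small sketch. This is not taken from the talk: the sample data, the bucket bounds in `BOUNDS`, and the helper names are all assumptions made up for illustration, and the nearest-rank method is only one of several percentile definitions.

```python
import bisect

# Hypothetical upper bucket bounds in milliseconds (an assumption for this sketch).
BOUNDS = [25, 50, 100, 250, 500, 1000]

def percentile(values, p):
    """Nearest-rank percentile for a non-empty list and an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]

def histogram(values):
    """Count values per bucket; the final slot is an overflow bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for v in values:
        counts[bisect.bisect_left(BOUNDS, v)] += 1  # first bound >= v
    return counts

def histogram_percentile(counts, p):
    """Upper bound of the bucket that contains the nearest-rank percentile."""
    rank = max(1, -(-p * sum(counts) // 100))
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= rank:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    raise ValueError("empty histogram")

# Made-up latencies: a busy fast host and a quiet slow host.
host_a = [10] * 1000
host_b = [500] * 10

# Averaging per-host p99 values ignores how much traffic each host served.
avg_of_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)

# Histograms with shared bounds merge by adding counts bucket by bucket.
merged = [a + b for a, b in zip(histogram(host_a), histogram(host_b))]

print(avg_of_p99)                        # 255.0
print(true_p99)                          # 10
print(histogram_percentile(merged, 99))  # 25
```

In this sketch the averaged p99 (255 ms) is far from the true combined p99 (10 ms), while the merged histogram gives an estimate (25 ms) that is off only by the width of one bucket. The merge is exact because both hosts use the same bucket bounds.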

Watch
Monitorama PDX 2023 - Unknown unknowns and how to know them

Dylan Ratcliffe's session from Monitorama PDX 2023. It feels like outages always happen in the worst possible place, and as soon as we understand them, they happen somewhere else. It turns out this is true. So how can we deal with something that seems to be actively working against us? By changing the way we build mental models. The STELLA Report (Woods DD, Ohio State University) explains how our mental models of how a system works are constantly challenged by outages, and why it’s always the unknown unknowns that cause the biggest problems. We can’t possibly know about these things in advance (otherwise they’d be known unknowns), so instead we need to be able to develop mental models more quickly, either in response to a specific outage or as preparation for a given change. I'll talk about why this is so important, and why current tools really aren’t suited to the job.

Watch
Monitorama PDX 2023 - A decade of monitoring

Joseph Ruscio's session from Monitorama PDX 2023. In the 10 years since the first Monitorama conference, the practice of Observability (née Monitoring) has gone through several successive evolutions. This talk starts by examining its origins in high performance computing, its adoption by the web operations community through shared OSS, and the subsequent rise of cloud computing and the multiple generations of commercial vendors it enabled. Each of these iterations came with distinct impacts and trade-offs that ultimately led to the next one, and we'll analyze these in some detail. We'll talk through a general theory of "innovation cycles", show how it applies to what's happened to date, and use that as a framework to predict what the next waves might look like.

Watch
Monitorama PDX 2023 - Thinking Critically About Alerting

Dan Ravenstone's session from Monitorama PDX 2023. Alert fatigue, oddly enough, is still a huge pain point in a lot of organizations. Why is that? Is there a gap in knowledge or understanding of how to define good alerts? Is it that many engineers who set these alerts up wear many hats, so when it comes to alert quality, it's unfamiliar territory? Is there missing guidance around what constitutes a good alert? In this talk, I first want to highlight the negative impact poor alerts have on engineers. Next, I want to put this theory into practice by having 3 to 5 engineering teams take this math logic, apply it to their current alerting, and see how or if this reduces alert fatigue.

Watch
Monitorama PDX 2023 - Stress, OnCall, and You

Mike Cox's session from Monitorama PDX 2023. In this talk we'll cover:

- A light introduction to the history of stress, chronic stress, and stress-related disease.
- A deeper dive into stress research and its eerie similarities with being on-call.
- Data-driven takeaways from that research for reducing stress during incidents and on-call rotations.

Watch
Monitorama PDX 2023 - Foster a Culture of Learning Through Observability

Brooke Sargent's session from Monitorama PDX 2023. Observability helps engineers understand the systems they build and work on, but as teams develop a practice of observability, it can also strengthen their team culture. This session tells the story of one platform engineering team’s observability adoption. Through this story, we will explore how observability creates a learning culture on software engineering teams, how it benefits different archetypes on the team, and how it enables onboarding new engineers and leveling up existing engineers more effectively.

Watch
Monitorama PDX 2023 - Pushing Observability Uphill

Paige Cruz's session from Monitorama PDX 2023. For years tech companies have chased the fabled “single pane of glass”, the one observability tool to understand your system from north to south and east to west. Leafing through promo materials promising instant insights and seamless turnkey integrations, you’d think increasing system observability is as easy as assembling a Lego set. In my experience, chasing the “single pane of glass” translates to “pain in the ass”. Survey data supports this, revealing that the majority of engineers cite tool sprawl as a minor or non-existent problem despite relying on several tools. As alluring as the siren call of the “single pane of glass” is, let's be practical and examine how to best observe systems across a myriad of tooling. From the telemetry buffet of metrics, events, traces and logs, learn when to reach for which type, and ways to bridge the gaps with links and enhance with context, to free yourself from the fool’s errand of a “single pane of glass”.

Watch
Monitorama PDX 2023 - Custom End to End Monitoring Made Easy with GitHub Actions & Playwright

Ben Rockwood's session from Monitorama PDX 2023. GitHub Actions is an excellent CI/CD platform, but it's much more than that: it can be used as an effective serverless platform for monitoring. Paired with Microsoft Playwright, an excellent testing platform for web interaction, and a couple of scripts, you can quickly and easily build a robust monitoring solution that exercises your entire solution, in and out of the browser. Join me to learn how we've used this pattern to build powerful customized monitoring with minimal time and cost. GitOps monitoring at its finest.

Watch
Monitorama PDX 2023 - Have no fear: Just the math you need to know for monitoring

Lerna Ekmekcioglu's session from Monitorama PDX 2023. We look at availability, throughput, and latency as indicators of the health of our systems. Aggregations on top of these measurements help us further reason about their well-being by zooming out and looking at how they behave given a meaningful slice of data. We talk about p50, p90 or p99 latency. We use graphs with a logarithmic scale. But do we actually understand the math behind these and how to use them with ease? In this talk, I will dive deep into just the math you need to know for monitoring.
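As a rough illustration of what p50, p90 and p99 mean, and of why a logarithmic scale suits latency data, here is a minimal sketch. It is not from the talk: the latency values are invented, and `percentile` is a hypothetical helper using the nearest-rank definition (other definitions interpolate between samples).

```python
import math

def percentile(values, p):
    """Nearest-rank percentile for a non-empty list and an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]

# Invented sample: most requests are fast, a few are slow, one is very slow.
latencies_ms = [20] * 90 + [200] * 9 + [5000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)

print(mean_ms)        # 86.0, a latency that no request actually had
print(p50, p90, p99)  # 20 20 200

# On a log10 axis, a tenfold change is one equal step, so the fast bulk
# and the slow tail both stay visible on the same graph.
decades = [round(math.log10(v), 2) for v in (p50, p99, max(latencies_ms))]
print(decades)        # roughly [1.3, 2.3, 3.7]
```

Reading the output: 90% of requests finished in 20 ms or less and 99% in 200 ms or less, while the mean of 86 ms describes neither group.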

Watch