Monitorama PDX 2023
2023
List of videos

Monitorama PDX 2023 - A Tale of Two Histograms
Adrian Cockcroft's session from Monitorama PDX 2023. Response times for real world systems form complex long tail distributions. The structure and behavior of your system will be visible in the responses, yet the mean and percentile measurements hide this structure. This talk will show how the structure can be measured and analyzed so that you can figure out a useful model of your system, understand when it is working well or about to collapse, what is driving the long tail of latencies, and come up with better service level agreements. The talk introduces a statistical concept called Finite Mixture Models without getting into the Greek equations, and shows how to explore this data using R. Looking at response times this way is a far, far better thing that I do, than I have ever done before.
Watch
Monitorama PDX 2023 - Observability Anti Patterns
Adriana Villela's session from Monitorama PDX 2023. Many organizations claim to be all in on Observability, when in fact, they are not really following Observability practices. Over the last couple of years, I've observed a number of Observability anti-patterns, and in this talk, I will discuss the following anti-patterns:
- Traces not treated as a first-class citizen
- The Wall o’ Dashboards
- Getting someone else to instrument your code
- Belief that Observability Tooling == Observability
- Observability theatre
I will also discuss how to avoid these anti-patterns, in order for teams to unlock Observability’s superpowers.
Watch
Monitorama PDX 2023 - Observability Data Engineering
Jack Neely's session from Monitorama PDX 2023. I find the data in Observability fascinating. In every aspect of an SRE's work I see problems to solve with data rather than brute force. In fact, all of us in the Observability space are really Data Engineers and Data Scientists in disguise. The only way to fully understand our complex systems is through math and visualizations. Let's explore the 4 Golden Signals and the math behind why they work well and some tricks to bridging Observability and Business Intelligence. In this talk I'll cover each of the 4 Golden Signals and speak to the data engineering tools used for each to give folks a broad platform to discover new math and new techniques for solving their own data problems:
- Traffic: Counters and Calculus -- The Physics Behind why Counters Work.
- Errors: Counting vs Sampling and the Nyquist-Shannon Theorem. Your CPU metrics are wrong and I can prove it.
- Latency: Timers and Distributions -- Why averages are horrible and Anscombe's Quartet. Understanding Gamma Distributions.
- Saturation: Percentiles and Pipelines -- Visualizing percentiles of data, why we cannot combine percentiles, and the magic of histograms.
Finally, all of us in Observability have been asked at some point to participate in managing the data behind Business Intelligence. Data about the product likely comes from the product itself directly through its telemetry. Often the requirements are high in cardinality and high in volume, making it difficult to store raw data for months on end to produce accurate BI. We'll work around the limits of percentiles and show some tricks for extracting events, storing those as summary data, and producing monthly percentile based reports with T-Digests.
Watch
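The point in the abstract above that percentiles cannot be combined, while histograms can, may be easier to see with a small sketch. Everything here is an assumption made for illustration: the host names, the latency values, and the nearest-rank percentile definition are not taken from the talk, and the "histogram" is an exact count per value rather than the bucketed or T-Digest structures a real pipeline would use.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_from_counts(counts, p):
    """Same nearest-rank percentile, computed from value -> count pairs."""
    total = sum(counts.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value


# Hypothetical latencies in ms from two hosts with very different traffic.
host_a = [10] * 99 + [500]    # 100 requests, one slow
host_b = [200] * 9 + [2000]   # 10 requests, one very slow

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

# Averaging per-host percentiles ignores how many requests each host served.
average_of_p99s = (p99_a + p99_b) / 2

# The percentile of the pooled raw data is the value we actually want.
true_p99 = percentile(host_a + host_b, 99)

# Count-based summaries can be added together and still give that answer.
merged_counts = Counter(host_a) + Counter(host_b)
p99_from_merged = percentile_from_counts(merged_counts, 99)

print(p99_a, p99_b, average_of_p99s, true_p99, p99_from_merged)
```

In this made-up data set the averaged p99 and the pooled p99 differ by a factor of about two, while the merged counts reproduce the pooled value. Bucketed histograms trade some of that exactness for bounded storage.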
Monitorama PDX 2023 - Unknown unknowns and how to know them
Dylan Ratcliffe's session from Monitorama PDX 2023. It feels like outages always happen in the worst possible place, and as soon as we understand them, they happen somewhere else. Turns out this is true. So how can we deal with something that seems to be actively working against us? By changing the way we build mental models. The STELLA Report (Woods DD, Ohio State University) explains how our mental models of how a system works are constantly challenged by outages, and why it’s always the unknown unknowns that cause the biggest problems. We can’t possibly know about these things in advance (otherwise they’d be known unknowns), so instead we need to be able to develop mental models more quickly, in response to a specific outage or as preparation for a given change. I'll talk about why this is so important, and why current tools really aren’t suited to the job.
Watch
Monitorama PDX 2023 - A decade of monitoring
Joseph Ruscio's session from Monitorama PDX 2023. In the 10 years since the first Monitorama conference, the practice of Observability (née Monitoring) has gone through several successive evolutions. This talk starts by examining its origins in high performance computing, adoption by the web operations community through shared OSS, the subsequent rise of cloud computing and the multiple generations of commercial vendors it enabled. Each of these iterations came with distinct impacts and trade-offs that ultimately led to the next one, and we'll analyze these in some detail. We'll talk through a general theory of "innovation cycles", show how it applies to what's happened to date, and use that as a framework to predict what the next waves might look like.
Watch
Monitorama PDX 2023 - Thinking Critically About Alerting
Dan Ravenstone's session from Monitorama PDX 2023. Alert fatigue, oddly enough, is still a huge pain point in a lot of organizations. Why is that? Is there a gap in knowledge or understanding on how to define good alerts? Is it that many engineers who set these alerts up wear many hats, so when it comes to alert quality, it's unfamiliar territory? Is there missing guidance around what constitutes a good alert? In this talk, I first want to highlight the negative impact poor alerts have on engineers. Next, I want to put this theory into practice by having 3 to 5 engineering teams take this math logic and apply it to their current alerting, to see how, or if, this reduces alert fatigue.
Watch
Monitorama PDX 2023 - Stress, OnCall, and You
Mike Cox's session from Monitorama PDX 2023. In this talk we'll cover:
- A light introduction to the history of stress, chronic stress, and stress related disease.
- A deeper dive into stress research and its eerie similarities with being on-call.
- Data-driven takeaways from that research for reducing stress during incidents and on-call rotations.
Watch
Monitorama PDX 2023 - Foster a Culture of Learning Through Observability
Brooke Sargent's session from Monitorama PDX 2023. Observability helps engineers understand the systems they build and work on, but as teams develop a practice of observability, it can also strengthen their team culture. This session tells the story of one platform engineering team’s observability adoption. Through this story, we will explore how observability creates a learning culture on software engineering teams, how it benefits different archetypes on the team, and how it enables onboarding new engineers and leveling up existing engineers more effectively.
Watch
Monitorama PDX 2023 - Pushing Observability Uphill
Paige Cruz's session from Monitorama PDX 2023. For years tech companies have chased the fabled “single pane of glass”, the one observability tool to understand your system from north to south and east to west. Leafing through promo materials promising instant insights and seamless turnkey integrations, you’d think increasing system observability is as easy as assembling a Lego set. In my experience, chasing the “single pane of glass” translates to “pain in the ass”. Survey data supports this, revealing that the majority of engineers cite tool sprawl as a minor or non-existent problem despite relying on several tools. As alluring as the siren call of “single pane of glass” is, let's be practical and examine how to best observe systems across a myriad of tooling. From the telemetry buffet of metrics, events, traces and logs, learn when to reach for which type and ways to bridge the gaps with links and enhance with context to free yourself from the fool’s errand of a “single pane of glass”.
Watch
Monitorama PDX 2023 - Custom End to End Monitoring Made Easy with GitHub Actions & Playwright
Ben Rockwood's session from Monitorama PDX 2023. GitHub Actions is an excellent CI/CD platform, but it's much more than that: it can be used as an effective serverless platform for monitoring. Paired with Microsoft Playwright, an excellent testing platform for web interaction, and a couple of scripts, you can quickly and easily build a robust monitoring solution that exercises your entire solution, in and out of the browser. Join me to learn how we've used this pattern to build powerful customized monitoring with minimal time and cost. GitOps monitoring at its finest.
Watch
Monitorama PDX 2023 - Have no fear: Just the math you need to know for monitoring
Lerna Ekmekcioglu's session from Monitorama PDX 2023. We look at availability, throughput, and latency as indicators of the health of our systems. Aggregations on top of these measurements help us further reason about their well-being by zooming out and looking at how they behave given a meaningful slice of data. We talk about p50, p90 or p99 latency. We use graphs with logarithmic scale. But do we actually understand the math behind these and how to use them with ease? In this talk, I will deep dive into just the math you need to know for monitoring.
Watch
Monitorama PDX 2023 - Performance Testing Experimentation At Scale
Cliff Moon's session from Monitorama PDX 2023. The traditional statistical models used in A/B testing are built to support product decision making around things like buttons clicked, messages sent, etc. In other words, normally distributed metrics. However, what happens when we want to make decisions about the performance impact of an experiment? Performance metrics are decidedly non-normal, and typically subject to a long tail. Averages of such a dataset only have enough precision to surface the most egregious performance degradations. In this talk we'll discuss the development of a system for catching performance degradation of A/B experiments in an environment with thousands of concurrent experiments at any given time.
Watch
Monitorama PDX 2023 - How to Scale Observability Without Bankrupting the Company
David Gildeh's session from Monitorama PDX 2023. Every company has struggled with controlling their Observability costs as the amount of data that needs to be collected, stored and queried in real time has gone up exponentially! Netflix has had to solve this problem as it rapidly scaled into a web-scale company with hundreds of millions of customers around the world. In this talk, David will share some of the innovations and strategies Netflix has used to ingest huge amounts of data, while providing sub-second queries to all its engineers, without bankrupting the company!
Watch
Monitorama PDX 2023 - Have you tried not storing all those metrics
Tony Rippy's session from Monitorama PDX 2023. When running services in production, the temptation is to keep all the metrics you can. It is difficult to decide in advance what metrics are needed to debug problems, or what system performance metrics will be important in the future. As the saying goes: “It’s better to have it and not need it, than need it and not have it.” There are several problems with this. First, it is wasteful; most monitoring data is written once and never read. Second, it can be wildly expensive! I have heard stories of big tech companies that end up spending millions of dollars a year on their monitoring bills, enough to impact the bottom line of the business. It also causes scalability problems, as high-cardinality data sets are notoriously difficult to store and query efficiently. But… What if you didn’t need to store this data? What if there was a way to reduce the amount of data but continue to meet your day-to-day requirements? This is the idea behind this talk. It covers experiments that use empirical distributions and samples in place of time series. We will discuss the pros & cons of this approach, and lessons learned trying to apply this in practice.
Watch
Monitorama PDX 2023 - Monitoring Mastodons: A story about Hachyderm
Hazel Weakly's session from Monitorama PDX 2023. Elephants may have a perfect memory, but humans sure don't; we're lucky to even see what we're looking at much less figure out what else is going on. Speaking of which, what ARE we even looking at over here at Hachyderm? We've had our share of major incidents and ongoing work; how has monitoring helped with that, and how has it not? These questions and more are going to be what we go over! Tune in to find out more about the monitoring stack we've built, why we chose it, and what we're doing next. While we're at it, we're going to be taking a deeper look at what monitoring is even for, how to know what you need, and the philosophical implications of it all.
Watch
Monitorama PDX 2023 - Alerts Don't Suck YOUR Alerts Suck!
Leon Adato's session from Monitorama PDX 2023. Nobody "likes" getting alerts. Best case, it tells you something went (or is about to go) wrong. But more often they are meaningless, trivial, or just plain wrong - a source of constant interruptions, false alarms, unplanned work, and noise. While some say this is the inherent nature of alerts (and monitoring in general), the truth is that well-crafted alerts based on insightful monitoring are a gift - saving hours of investigation and thousands of dollars. Whether your organization views alerts as a curse or a blessing depends on the design and implementation of those alerts, more so than any specific monitoring tool or technique. And, like most things in technology, good design can be taught and learned. In this talk, I'll give a brief tour of the alerting hall of horrors, and then provide real-world, vendor-agnostic techniques to make alerts meaningful, effective, valuable, and actionable (as a bonus, I'll show how to make them manageable, too!). By breaking a few bad habits; understanding how and why vendors put their tools together in particular ways; and learning a few new concepts, you'll have people emailing you to say "thank goodness I got that alert!". Now there's something you probably don't hear every day.
Watch
Monitorama PDX 2023 - Connecting the dots with OTel semantic conventions
Richard Benwell's session from Monitorama PDX 2023. How do we make use of the increasing volume of observability data that we collect? Observability was inspired by control theory, but current observability solutions are missing a key element of that theory: the system model. We can’t understand the state of the system, or ‘answer unknown unknowns’ if we don’t know how the system works. We’re drowning in data but starved of answers! Discover how graphs (think social network graph, not line graph) and OpenTelemetry semantic conventions can help us connect the dots.
Watch
Monitorama PDX 2023 - What we talk about when we talk about Observability Frameworks
Timothy Mahoney's session from Monitorama PDX 2023. It's important to have consistent data across an organisation, but to ensure data consistency, we have to ensure the people responsible for producing that data have a common understanding. In implementing an observability framework for IKEA, we had to overcome not only technical hurdles but issues with taxonomy, semantics and language barriers to ensure a common understanding among teams. I want to share my experiences as a senior engineer in an observability pipeline team and how we slowed down to speed up our company's observability journey.
Watch
Monitorama PDX 2023 - OpenResty, My Bestie: Tracing NGINX With Lua
Sam Handler's session from Monitorama PDX 2023. Shopify served 75.98 million requests per minute during Black Friday/Cyber Monday 2022, and our OpenResty deployments handled each of these requests before they hit an application server (OpenResty is a technology that lets you embed arbitrary Lua scripts into NGINX configuration files). Until recently, our routing stack was completely untraced, which left a huge blind spot in our view of our infrastructure. In 2022, we finally implemented tracing in our OpenResty deployments, and it wasn’t easy. In this talk, I’ll describe how we got a working tracing implementation. Along the way, I’ll explain the dangers of custom trace propagation formats, the joys of working in a well-specified open source project, the wonders (and challenges) of the OpenResty runtime, and the mental challenges that accompany the modification of NGINX, that famously performant HTTP server and reverse proxy.
Watch
Monitorama PDX 2023 - Building a real time cloud cost management program with observability
Emily Nakashima's session from Monitorama PDX 2023. Advances in monitoring and observability have given so many of us the confidence that we know what's happening in every corner of our systems, but for many teams, one system facet remains stubbornly un-observable: cost. This talk will tell the story of how a surprise giant AWS bill sent our growing startup engineering team on a mission to be able to observe our cloud spend with the same clarity and immediacy as performance, reliability, or any other important system characteristic. I’ll walk you through the techniques we’ve attempted to observe cost, sharing the pros and cons of each, and I’ll also talk about how we’ve used this data as a basis for re-shaping team practices, building out an internal training and support program that has helped our whole engineering organization get fluent in balancing cost against our other operational and business concerns.
Watch
Monitorama PDX 2023 - If you look like a problem, developers will solve you
Ivan Merrill's session from Monitorama PDX 2023. Adding observability and monitoring to services is often unfortunately seen as a 'necessary' task before going live and mandated by organisations. As a result a culture is created where this important task is seen as something that needs to be done to get a tick in a box, a problem to be solved, as opposed to a really valuable tool for developers. If the path to implementing your chosen toolset and meeting the mandated requirements is too much of a problem it's likely teams will look to solve this problem another way without you. This talk explores how we can change this mindset, and work to create a culture where the value of implementing and using monitoring is understood by developers and product management. I will share what I've found has worked in previous roles in large financial organisations and try to provide actionable advice. The path to better monitoring and greater levels of observability doesn't involve changing tools but is in fact one of human interactions.
Watch
Monitorama PDX 2023 - Meet Zeek, the extensible, scriptable network monitor
Christian Kreibich's session from Monitorama PDX 2023. Network monitoring is key for understanding your infrastructure, whether that's your home network or a thousand-seat corporate environment. Using its domain-specific scripting language, the Zeek network monitor helps you turn the packets in your network into streams of actionable logs, organized around protocols and themes that matter to you. Zeek is a mature, battle-hardened platform and ecosystem that runs on anything from Raspberry Pis to industrial-scale deployments, such as Microsoft Defender. I am the technical lead for the Zeek project, and in this talk I'll give an overview of Zeek, its architecture and capabilities, and the goals of the project.
Watch
Monitorama PDX 2023 - Know your data: The stats behind your alerts
Dave McAllister's session from Monitorama PDX 2023. Quick, what's the difference between the mean, the mode and the median? And which mean do you mean? Do you need a Gaussian or a normal distribution? And does your choice impact the alerts and observations you get from your observability tools? Come get refreshed on the impact some basic choices in statistical behavior can have on what gets triggered. Learn why a median might be the choice for historical anomaly or sudden change. Jump into Gaussian distributions, data alignment challenges and the trouble with sampling. Walk out with a deeper understanding of your metrics and what they might be telling you.
Watch
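As a rough illustration of the mean, median, and mode question raised in the abstract above, the sketch below compares the three on one window of latency samples. The sample values and the alert threshold are invented for this example and do not come from the talk. It uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical latency samples in ms for one alerting window;
# the last value is a single isolated spike.
window = [100, 102, 98, 101, 99, 100, 5000]

mean_latency = statistics.mean(window)      # pulled upward by the spike
median_latency = statistics.median(window)  # middle value, barely affected
mode_latency = statistics.mode(window)      # most frequent value

# Hypothetical alert threshold in ms.
THRESHOLD_MS = 500

mean_alert_fires = mean_latency > THRESHOLD_MS
median_alert_fires = median_latency > THRESHOLD_MS

print(mean_latency, median_latency, mode_latency)
print(mean_alert_fires, median_alert_fires)
```

With these made-up numbers a mean-based rule fires on one outlier while a median-based rule stays quiet. Which behavior is preferable depends on whether a single slow request should page someone.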