List of videos

Monitorama PDX 2023 - Performance Testing Experimentation At Scale

Cliff Moon's session from Monitorama PDX 2023. The traditional statistical models used in A/B testing are built to support product decision making around things like buttons clicked, messages sent, etc. In other words, normally distributed metrics. However, what happens when we want to make decisions about the performance impact of an experiment? Performance metrics are decidedly non-normal, and typically subject to a long tail. Averages of such a dataset only have enough precision to surface the most egregious performance degradations. In this talk we'll discuss the development of a system for catching performance degradation of A/B experiments in an environment with thousands of concurrent experiments at any given time.

Watch
Monitorama PDX 2023 - How to Scale Observability Without Bankrupting the Company

David Gildeh's session from Monitorama PDX 2023. Every company has struggled with controlling its observability costs as the amount of data that needs to be collected, stored and queried in real time has gone up exponentially! Netflix has had to solve this problem as it rapidly scaled into a web-scale company with hundreds of millions of customers around the world. In this talk, David will share some of the innovations and strategies Netflix has used to ingest huge amounts of data, while providing sub-second queries to all its engineers, without bankrupting the company!

Watch
Monitorama PDX 2023 - Have you tried not storing all those metrics

Tony Rippy's session from Monitorama PDX 2023. When running services in production, the temptation is to keep all the metrics you can. It is difficult to decide in advance what metrics are needed to debug problems, or what system performance metrics will be important in the future. As the saying goes: “It’s better to have it and not need it, than need it and not have it.” There are several problems with this. First, it is wasteful; most monitoring data is written once and never read. Second, it can be wildly expensive! I have heard stories of big tech companies that end up spending millions of dollars a year on their monitoring bills, enough to impact the bottom line of the business. It also causes scalability problems, as high-cardinality data sets are notoriously difficult to store and query efficiently. But… What if you didn’t need to store this data? What if there was a way to reduce the amount of data but continue to meet your day-to-day requirements? This is the idea behind this talk. It covers experiments that use empirical distributions and samples in place of time series. We will discuss the pros & cons of this approach, and lessons learned trying to apply this in practice.

Watch
Monitorama PDX 2023 - Monitoring Mastodons: A story about Hachyderm

Hazel Weakly's session from Monitorama PDX 2023. Elephants may have a perfect memory, but humans sure don't; we're lucky to even see what we're looking at much less figure out what else is going on. Speaking of which, what ARE we even looking at over here at Hachyderm? We've had our share of major incidents and ongoing work; how has monitoring helped with that, and how has it not? These questions and more are going to be what we go over! Tune in to find out more about the monitoring stack we've built, why we chose it, and what we're doing next. While we're at it, we're going to be taking a deeper look at what monitoring is even for, how to know what you need, and the philosophical implications of it all.

Watch
Monitorama PDX 2023 - Alerts Don't Suck YOUR Alerts Suck!

Leon Adato's session from Monitorama PDX 2023. Nobody "likes" getting alerts. Best case, an alert tells you something went (or is about to go) wrong. But more often they're meaningless, trivial, or just plain wrong - a source of constant interruptions, false alarms, unplanned work, and noise. While some say this is the inherent nature of alerts (and monitoring in general), the truth is that well-crafted alerts based on insightful monitoring are a gift - saving hours of investigation and thousands of dollars. Whether your organization views alerts as a curse or a blessing depends on the design and implementation of those alerts, more so than any specific monitoring tool or technique. And, like most things in technology, good design can be taught and learned. In this talk, I'll give a brief tour of the alerting hall of horrors, and then provide real-world, vendor-agnostic techniques to make alerts meaningful, effective, valuable, and actionable (as a bonus, I'll show how to make them manageable, too!). By breaking a few bad habits, understanding how and why vendors put their tools together in particular ways, and learning a few new concepts, you'll have people emailing you to say "thank goodness I got that alert!". Now there's something you probably don't hear every day.

Watch
Monitorama PDX 2023 - Connecting the dots with OTel semantic conventions

Richard Benwell's session from Monitorama PDX 2023. How do we make use of the increasing volume of observability data that we collect? Observability was inspired by control theory, but current observability solutions are missing a key element of that theory: the system model. We can’t understand the state of the system, or ‘answer unknown unknowns’ if we don’t know how the system works. We’re drowning in data but starved of answers! Discover how graphs (think social network graph, not line graph) and OpenTelemetry semantic conventions can help us connect the dots.

Watch
Monitorama PDX 2023 - What we talk about when we talk about Observability Frameworks

Timothy Mahoney's session from Monitorama PDX 2023. It's important to have consistent data across an organisation, but to ensure data consistency, we have to ensure the people responsible for producing that data have a common understanding. In implementing an observability framework for Ikea, we had to overcome not only technical hurdles but issues with taxonomy, semantics and language barriers to ensure a common understanding among teams. I want to share my experiences as a senior engineer in an observability pipeline team and how we slowed down to speed up our company's observability journey.

Watch
Monitorama PDX 2023 - OpenResty, My Bestie: Tracing NGINX With Lua

Sam Handler's session from Monitorama PDX 2023. Shopify served 75.98 million requests per minute during Black Friday/Cyber Monday 2022, and our OpenResty deployments handled each of these requests before they hit an application server (OpenResty is a technology that lets you embed arbitrary Lua scripts into NGINX configuration files). Until recently, our routing stack was completely untraced, which left a huge blind spot in our view of our infrastructure. In 2022, we finally implemented tracing in our OpenResty deployments, and it wasn’t easy. In this talk, I’ll describe how we got a working tracing implementation. Along the way, I’ll explain the dangers of custom trace propagation formats, the joys of working in a well-specified open source project, the wonders (and challenges) of the OpenResty runtime, and the mental challenges that accompany the modification of NGINX, that famously performant HTTP server and reverse proxy.

Watch
Monitorama PDX 2023 - Building a real time cloud cost management program with observability

Emily Nakashima's session from Monitorama PDX 2023. Advances in monitoring and observability have given so many of us the confidence that we know what's happening in every corner of our systems, but for many teams, one system facet remains stubbornly un-observable: cost. This talk will tell the story of how a surprise giant AWS bill sent our growing startup engineering team on a mission to be able to observe our cloud spend with the same clarity and immediacy as performance, reliability, or any other important system characteristic. I’ll walk you through the techniques we’ve attempted to observe cost, sharing the pros and cons of each, and I’ll also talk about how we’ve used this data as a basis for re-shaping team practices, building out an internal training and support program that has helped our whole engineering organization get fluent in balancing cost against our other operational and business concerns.

Watch