List of videos

Monitorama PDX 2024 - Is Your Kernel Being Honest? Understanding & measuring low level bottlenecks

Ferris Ellis' session from Monitorama PDX 2024. User-space is a pleasant and happy place. It is full of simple numbers like CPU utilization and memory usage. Numbers that the Linux kernel tells you and swears are true. But beneath user-space lies a deep and dark land that is full of strange caches, queues, clocks, and counters. Are you prepared to observe this hidden land and find secrets the kernel keeps from us all? Presented as a guided exploration, this talk will focus on observing and understanding low level CPU performance. Starting from a common user-space program and working our way down we will demystify memory pages, CPU caches, instructions-per-cycle, and more. Along the way we will pull in Linux tooling to see these in a live system — catching the kernel in its lies. Finally, we will discuss how we can instrument production systems to measure low level performance and know when the kernel is telling us the truth and when we are silently leaving CPU cycles unused.

Monitorama PDX 2024 - The complexity of success and failure: the story of the Gimli Glider

Thai Wood's session from Monitorama PDX 2024. A look at monitoring, attention, incidents, and complexity through the lens of the story of the Gimli Glider, a plane that ran out of fuel mid-flight but was still landed safely. I'll cover things like:
* How dashboards and gauges can mislead, especially depending on the context around their usage and viewing.
* How expertise contributes to interpreting monitoring and responding to incidents.
* The negotiable, changing definitions of failure in an incident, which also mean some traditional forms of monitoring are insufficient.
* How we, as an industry, can help make sure we have the right monitoring and resources in place to make a safe landing even when things go wrong.

Monitorama PDX 2024 - Pugs, Poe's and pipelines; An engineering perspective on big-data streams...

David Josephsen's session from Monitorama PDX 2024. At scale, all Observability projects are data-engineering problems, which require big-data tools and techniques to solve. But many of these tools -- particularly general-purpose streaming frameworks -- make less-than-ideal trade-offs that hurt latency or make troubleshooting difficult. Meanwhile, many purpose-built monitoring toolchains either don't scale as well as they claim, or require nefarious hacks to reach your target ingestion rate. In either case, these systems invariably wind up costing more than you intended. Whether you're designing a high-volume telemetry pipeline from scratch, or shoring up an existing system that's having trouble scaling, I want to share with you a powerful, reductive thought-pattern that has helped me build and maintain 5 different massive-scale telemetry pipelines in as many years. In this talk, I'll introduce you to my friends, Poe (point of enrichment) and Pug (Point uf aGGregation), and together we'll learn about how they can help you define your tooling requirements, reduce your end-to-end latency, and perhaps most importantly, control spend.

Monitorama PDX 2024 - How we tricked engineers into utilizing distributed tracing

Noa Levi's session from Monitorama PDX 2024. Strava implemented distributed tracing over 4 years ago, yet the tool has been historically underutilized. We’ve improved documentation, presented tech talks, and created training modules to increase the adoption of the tool, yet there was little to no change in tracing usage. Though underutilized, we’ve chosen to continue using and investing in improving our distributed tracing tooling given how impactful the tool is to those that do use it. This talk will explain how we inadvertently increased adoption and improved our distributed tracing tooling while migrating from OpenTracing to OpenTelemetry. The migration involved updating configuration, redeploying, and testing over 100 microservices. Strava’s platform team did not have the context to safely deploy and test those services, so that work was assigned out to each team at Strava. Practically what this meant was that every team ended up poking around our trace visualization tooling, many of them for the first time. Engineers were able to learn about tracing within the context of a service they were familiar with. Almost immediately, we noticed more engineers using the tooling and even making suggestions for improvements to enhance their data. While our intention of asking teams to migrate their own services was to prevent production downtime, there was a happy accident of increasing adoption.

Monitorama PDX 2024 - Experiments in Backing Prometheus with Clickhouse

Colin Douch's session from Monitorama PDX 2024. Columnar datastores like Clickhouse are becoming increasingly popular when it comes to bulk storage of Observability data. This is due to their efficient support for the high cardinality and high dimensionality datasets that appear natively in Observability workloads. At the same time, one of the chief complaints about Prometheus is its seeming inability to handle these workloads, with common advice being to reduce labelsets as much as possible to reduce the load on Prometheus. With the invention of Prometheus Remote Write and Read, however, we are empowered to experiment with alternative backing stores for Prometheus. In this talk, I will introduce a methodology for offloading Prometheus storage and retrieval to Clickhouse. Using that method, I will cover the ups and downs of using Clickhouse as an external Prometheus storage, including storage usage, query efficiency, and operational overhead, and opine on how parts of Clickhouse's design might help inform Prometheus improvements in the future.

Monitorama PDX 2024 - From Alerts to Insights: Performing Trace-Based Causation at Scale

Logan Rosen's session from Monitorama PDX 2024. Our team focuses on improving the observability experience for engineers at our company, specifically in triaging/performing causal analysis for alerts. Previously, we have taken a time-based correlation approach, which can yield success but does not always point to the right problem. Especially in a multi-layered stack of many microservices, we would often end up with red herrings when comparing metrics that seemed to match up in shape but didn't end up being related. It is important that we point engineers in the right direction, and quickly, in order to reduce the amount of time it takes to resolve site impact. To remedy this, we focused on leveraging distributed tracing for determining causation: by stream processing the events emitted at each hop of a request, we could deterministically point to the problems that led to the alerts very soon after they arise. Specifically, we could look at chains of errors and/or latency on a per-trace basis and use aggregate counts over time to provide a causal graph to engineers debugging site issues. We were able to build this by leveraging Go and Kafka, and it required significant amounts of performance tuning and careful coding to make it as efficient as possible, given the immense scale of trace events being processed. The data coming out of this pipeline has shown tremendous promise, and we aim to surface it to our users by the end of this year/early 2024. This will be a structured talk that walks audience members through our journey of maturing our triage/RCA approaches and building this pipeline, as well as through the technical challenges we encountered/how we surmounted them. It is targeted at engineers looking to improve their observability tooling within their own organizations.

Monitorama PDX 2024 - From Polling to Streaming: Network Monitoring with Real-Time Telemetry

Tushar Gupta's session from Monitorama PDX 2024. In this talk, attendees will dive into the world of streaming telemetry, learning its basics and how it's better than old-school monitoring for watching over networks in real time. We'll walk through the essentials of integrating streaming telemetry into network deployments, from setting up data sources and choosing the right transport protocols to interpreting the rich data streams to make informed decisions that can optimize network performance and reliability. This segment will also cover the best practices, based on insights from real-world implementations, needed to design and deploy robust telemetry solutions that enhance observability and drive operational excellence. The latter part of the talk will touch on some of the considerations and common hurdles that come with implementing streaming telemetry, such as managing the large amounts of data generated, its storage, and its prioritization. The talk will then delve into practical tips on how to handle the data effectively, aiming to equip attendees with the know-how to smoothly integrate streaming telemetry into their systems. The end goal of the session is for attendees to be able to assess whether streaming telemetry aligns with their needs and, if so, be able to implement it to leverage its advantages while sidestepping potential pitfalls.

Monitorama PDX 2024 - Serverless Observability: a case study of SLOs

Virginia Diana Todea's session from Monitorama PDX 2024. Fine tuning the Service Level Objectives is a golden rule for the industries adopting them. This talk explores the case study of migrating to Kubernetes and how SLOs and transforms can backfire if not managed correctly. The talk starts by presenting the reasons why SLOs are important in the software development framework. We will then present our case study on how migrating to Kubernetes exposed our system to SLO definition and the challenges our system faced when trying to measure and adhere to our SLOs. We will dive deep into the main bottlenecks we encountered when trying to apply our SLOs across our multi-cluster infrastructure, and how we dealt with these challenges by correlating our SLOs with burn rates and transforms. We highlight that the observability system can backfire if the metrics are not properly adjusted. We wrap up our talk by drawing the main lessons from our case study and the benefits to the ecosystem: SLOs are paramount for the software development lifecycle, and implementing them correctly should follow a process of continuous improvement alongside adopting the right tools and metrics.

Monitorama PDX 2024 - Distributed Tracing - All The Warning Signs Were Out There!

Vijay Samuel's session from Monitorama PDX 2024. We've all heard it time and time again - "distributed tracing is hard". The sad reality is that there is a lot of truth to this statement. All too often we keep hearing buzz words and statements around how magical distributed tracing is and how it can easily cut down time to triage. However, when we talk about tracing at scale with ~5000 micro services and 100s of databases, the nuances of tracing start bubbling up pretty fast. Some of these include:
* Infinitely large trace waterfalls that become hard to grasp
* A handful of mis-instrumented applications that lead to lost confidence in the system
Many early adopters told us that it would be hard, and we tried different techniques to mitigate some of those issues early, but we ultimately hit the wall of pain. This talk describes how these real problems affect the Observability team at eBay and how we are approaching these problems to hopefully make tracing work in the future.
