List of videos

Improve Resilience with automated chaos engineering | Gunnar Grosch | Conf42 SRE 2021

Gunnar Grosch - Developer Advocate @ AWS
The transition into more complex systems is accelerating, and chaos engineering has proved to be a great-to-have option in our toolbox to handle this complexity. But the speed at which we’re developing and deploying makes it hard to keep up through manual chaos experiments, so we turn to automation. In this session, we’ll look at how automated chaos experiments help us cover a more extensive set of experiments than we can cover manually and how they allow us to verify our assumptions over time as unknown parts of the system change.
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021
0:00 Intro 1:16 Talk

Peek into Observability from testers lens | Parveen Khan | Conf42 SRE 2021

Parveen Khan - Senior QA Consultant @ ThoughtWorks
It is common, yet still quite new, to hear the term ‘Observability’. But what does it mean? Is it just another new acronym for monitoring? In today’s technology world we work with many different types of systems: microservices, distributed systems and many others that are like huge spider webs. Imagine testing these distributed systems with no clue of what’s going on under the hood. Gone are the days when testers could rely only on testing the user interface or the APIs. I worked on a distributed system where, every time there was a production issue, no one had an answer to what was going wrong. We had some monitoring and logging in place, but we had no clue where to look when things went wrong in production. We needed more powerful insights into the internals of the system and, more than that, visibility into what was happening under the hood, which gives the team the superpowers to predict the future. This is where we took our first steps into ‘Observability’. In this talk, I’ll share my journey of adopting a culture of observability within the engineering team.
Key takeaways
* What is observability
* Why is it important
* How does it help the team
* How observability can support testing
0:00 Intro 1:16 Talk

Monitoring your platform from multiple locations | Andrei Danilov | Conf42 SRE 2021

Andrei Danilov - Platform Engineer @ Typeform
Setting up a system for monitoring production availability in multiple regions might seem pretty straightforward if we look at it strictly from a tooling perspective: Postman, DataDog, New Relic, AWS, Uptrends, Cloudflare (and the list goes on) all offer solutions that are super easy to set up, manage and maintain. However, the real challenges lie behind the use of the tool:
- How do we select the regions and locations to monitor?
- Do we want to monitor specific locations (e.g. a city) or broad areas (e.g. countries)?
- What do we consider a failure (typical HTTP error codes, certain time thresholds)?
- Do all the selected areas have the same importance to the business?
- Do all the monitored areas trigger alerts with the same priority?
- How often do we want our platform to be monitored?
- Who owns the actions to fix this in case of failure?
- How do I flag these monitors so I don’t mess up company data?
- How do we design the monitors so that, in case of failure, we don’t get a cascade of alerts?
In this session I plan to answer all of the above questions, combined with examples from my personal work experience on the topic.
An agenda would look something like this:
- About the Speaker
- Why is the topic important?
- Building your own strategy: things to consider
- Choosing the tool
- Showcase of an already working solution
- Conclusions
- Q&A on Discord
After the session you’ll be able to:
- Get a good grip of why this topic is important and in which context
- Understand how to create a strategy for monitoring different world areas that applies to the realities of your own company
- Understand what tools you can use to achieve that
0:00 Intro 1:16 Talk

The Murphy's Laws of Observability | Dave McAllister | Conf42 SRE 2021

Dave McAllister - Senior Technical Evangelist @ Splunk
We’re all familiar with Murphy’s Law: “Anything that can go wrong, will go wrong.” Over time, Murphy’s Law has been extended, abstracted and applied to numerous disciplines. Within Observability, the extension of data and monitoring focused on deep insights into our complex apps and environments, Murphy’s Law also reigns supreme. But what laws apply, and how do we mitigate the impact of points like:
- Anything that can go wrong, will go wrong, at the worst possible time
- Computers always side with the hidden flaw
- You can never run out of things that can go wrong
- Availability is a function of time
We’ll look into examples of the laws in practice and highlight practices that might just give you the one-up on Murphy.
0:00 Intro 1:16 Talk

Observability in Serverless Application | Ozioma Uzoegwu | Conf42 SRE 2021

Ozioma Uzoegwu - Solutions Architect @ AWS
In this session, we will go through an introduction to observability and how to apply its concepts to serverless applications on AWS. We will cover the key services and tools and how to implement observability at both the infrastructure and application layers.
0:00 Intro 1:16 Talk

System State Clustering using eBPF data | Sujith Samuel | Conf42 SRE 2021

Sujith Samuel - Principal Software Engineer @ Ericsson
The field of system observability has been greatly enhanced by the application of eBPF. eBPF generates data at critical points in the execution of a system, and that data is used for observation via software like Sysdig and Cilium. I propose to utilize the generated data for system state clustering: applying machine learning to this data to understand whether the system is behaving properly or not. Combining machine learning with real-time system data generation would open the door to a plethora of applications, such as system state prediction and ML-aided preventive replacement of system components. This talk will take the attendees through an idea of how this could be done.
0:00 Intro 1:16 Talk

Investigating Performance Issues In Microservices Arch. | Dotan Horovits | Conf42 SRE 2021

Dotan Horovits - Product Evangelist @ Logz.io
Running a multi-tenant SaaS at scale is no easy task. With the massive scale-out we’ve started encountering performance issues. Investigating those issues turned out to be tricky with our microservices architecture running on Kubernetes and Docker containers across multiple regions and multiple cloud providers. We run an Observability SaaS platform, both using it internally and offering it to others. Our system is instrumented inside and out around logging and metrics, but that proved not to be the right tool for the job. We needed another weapon for our performance blitz. This is how we got to the world of distributed tracing, first as practitioners and then by offering it as part of our Observability platform. In this talk I’ll share our journey to distributed tracing and the Jaeger open source project, how it helped us overcome performance issues in our application across the stack, from Node.js down to the Java and database backends, and how it has become an integral part of our daily routine. If you’re battling performance issues, or if you’re considering taking your first steps into distributed tracing, this talk is for you. I’ll show useful examples, best practices and tips to make your life easier in battling performance issues and gaining better observability into your system, as well as how to make this a gradual and smooth journey, even into high-scale production systems.
0:00 Intro 1:16 Talk

Improve observability for your container workloads | Suraj Muraleedharan | Conf42 SRE 2021

Suraj Muraleedharan - Senior DevOps Consultant @ AWS
In this session we will see how to improve observability of container workloads, focusing on the three pillars of monitoring, logging and traceability. For operational performance, we will discuss how to detect behaviours that deviate from normal operating patterns.
0:00 Intro 1:16 Talk

Reducing Trauma in Production | Julie Gunderson & Mandi Walls | Conf42 SRE 2021

Julie Gunderson - Senior Reliability Advocate @ Gremlin
Mandi Walls - DevOps Advocate @ PagerDuty
Customer experience is the responsibility of the entire team. Many organizations leave reliability up to the SRE team; however, reliability should be built in from the very beginning. In this talk Julie and Mandi will discuss what Service Level Objectives are, why they are important to the organization, and how to define and set them. Going beyond SLOs, attendees will learn what Chaos Engineering is and practical ways to ensure compliance and resilience with best practices. We’ll show you how to focus your goals and error budgets with examples that will lead to reliability and improved user experience.
0:00 Intro 1:16 Talk
