List of videos

Driving Service Ownership with Distributed Tracing | Daniel 'spoons' Spoonhower | Conf42 SRE 2020
Daniel "Spoons" Spoonhower CTO & Co-Founder @ LightStep LinkedIn: https://www.linkedin.com/in/spoons/ While many organizations are rolling out Kubernetes, breaking up their monoliths, and adopting DevOps practices with the hope of increasing developer velocity and improving reliability, it’s not enough just to put these tools in the hands of developers: you’ve got to incentivize developers to use them. Service ownership provides these incentives, by holding teams accountable for metrics like the performance and reliability of their services as well as by giving them the agency to improve those metrics. In this talk, I’ll cover how distributed tracing can serve as the backbone of service ownership. For SRE teams that are setting standards for their organizations, it can help drive things like documentation, communication, on-call processes, and SLOs by providing a single source of truth for what’s happening across the entire application. For embedded SRE teams, it can also accelerate root cause analysis and make alerts more actionable by showing developers what’s changed – even if that change was a dozen services away. Throughout the talk, I’ll use examples drawn from more than a decade of experience with SRE teams in organizations big and small. — 🥇 Gold Sponsors: LightStep Google 🥈 Silver Sponsors: MayaData Aval Digital Labs Elastic The Pathwayz Group 🤝 Media Partner JetBrains — 0:00 Intro 0:35 Talk — Website 🚀🪐 https://www.conf42.com Reach out 📧📭 mark@conf42.com Conf42 Discord 🧑🤝🧑💬 https://discord.com/invite/dT6ZsFJ5ZM LinkedIn 👨💼💼 https://www.linkedin.com/company/49110720/ Twitter 🎵🐦https://twitter.com/conf42com Conf42Cast @ Spotify 🎧 https://tinyurl.com/bnyj6a8y
From Application to Product Ownership: an SRE Team's Journey | Nikolaus Rath | Conf42 SRE 2020
Nikolaus Rath Tech Lead @ Google In the past, my team was responsible for specific applications/executables. Now, we are responsible for specific end-user workflows, no matter which executables they involve. I will describe the technical and social changes that necessitated and enabled this change of paradigm. At the end of 2019, our team supported on the order of 200 executables/microservices that provide functionality for Google's products in the advertising space. This number was the result of continuous growth over more than 10 years. Over this time, we nevertheless maintained constant operational load by making the executables as homogeneous as possible, automating almost all non-emergency maintenance tasks, and routinely involving developer teams when service-specific knowledge was required. However, this growth has come at a price elsewhere. While we maintained technical expertise and overview, we lost familiarity with the user-facing products that our executables implement. For example, we could reliably assess the impact of an issue through service level indicators (SLIs) for availability and latency, but we could not immediately tell what product users were experiencing - were they without autocompletion for a text input, were they missing labels on a graph, or did an entire page become non-functional? To address this issue, we fundamentally changed the way that we operate and engage as an SRE team - without compromising on the number of services that we support. We began by redefining our team's scope based on products rather than executables. While in the past we took the pager for specific executables, we now take responsibility for products as a whole. This means we are monitoring the most important interactions with a product - no matter which executables provide them - but not necessarily all functionality that resides in any one executable.
Correspondingly, we then changed SLIs to map to user interactions instead of reflecting the implementation architecture. This enabled us to describe any SLO misses in the same terms that an external user would use. Finally, we have reduced continuous work on specific binaries in favor of time-limited "deep-dives" that address specific reliability issues, while leaving day-to-day operation increasingly to application developers with SRE taking a supporting role to advise and handle large-scale incidents.
SREs ❤️ Chaos Engineering | Mikolaj Pawlikowski | Conf42 SRE 2020
Mikolaj Pawlikowski Software Engineer Project Lead @ Bloomberg LP ✋ Opinions and views expressed by Mikolaj are his own, and are not representative of his employer. What's Chaos Engineering? Is it part of SRE? Is it breaking things randomly in production? In this talk, we'll try to settle these questions once and for all, and to give you a life-like demo of what Chaos Engineering looks like in practice!
Building and Leading Remote Teams | Amber Vanderburg | Conf42 SRE 2020
Amber Vanderburg Founder @ The PathWayz Group ; BusinessPerson, Coach, Speaker https://www.linkedin.com/in/amber-vanderburg-86817833/ The world of work is constantly changing as we create new products, provide excellent service, and collaborate on new ventures. I'll give you tools to overcome remote team challenges, from confronting communication frustrations and setting expectations to strategically building and equipping the right-fit remote team. You will be able to walk away with practical tools, tips, and tricks that you can implement within your team to help perform with more clarity and direction, more straightforward communication and expectations, and better performance individually and as a team. This high energy chat will give you insights from a little bit of theory and case studies, a load of practical application, and lots of laughter. You will be able to walk away with the knowledge of how to:
- Be strategic in building a team that can thrive in a remote environment
- Equip and prepare your team with the tools that they need to be more successful working remotely
- Aid in the logistical challenges of a remote team
- Aid in the relational challenges of a remote team
- Best engage and motivate a remote team
Detailed Program Description:
- Intro activity
- The benefits and challenges of a remote team (a brief data review)
- Questions and criteria for building your great remote team
- How to onboard, equip, and bring value to your remote team
- Practical tips and tricks to aid in remote logistical challenges
- Practical tips and tricks to aid in remote relationship challenges
- Relational challenges activity
- Engaging your remote team
- Conclusion and Q&A
Applied Security | Aaron Rinehart & Jamie Dicken | Conf42 SRE 2020
Aaron Rinehart - CTO @ Verica & Jamie Dicken, Manager of Security Engineering @ Cardinal Health 👾💾 Warning: Cheesy 80s memes! Modern systems pose a number of thorny challenges, and securing the transformation from legacy monolithic systems to distributed systems demands a change in mindset and engineering toolkit. The security engineering toolkit is unfortunately out of style and out of date with today's approach to building, securing, and operating distributed systems. The speed, scale, and complex operations within microservice architectures make it tremendously difficult for humans to mentally model their behavior. Security Chaos Engineering helps teams realign the actual state of operational security as well as build confidence that their security actually works the way they think it does. Join Jamie Dicken and Aaron Rinehart to learn how they implemented Security Chaos Engineering as a practice at their organizations to proactively discover system weaknesses before they were exploited by malicious adversaries. In this session Jamie and Aaron will introduce a new concept known as Security Chaos Engineering and share their experiences in applying Security Chaos Engineering to create highly secure, performant, and resilient distributed systems.
Machine Learning in Production - MLOps | Ryan Dawson | Conf42 SRE 2020
Ryan Dawson Core Member @ Seldon Open Source Team LinkedIn: https://www.linkedin.com/in/ryan-dawson-501ab9123/ Reliably deploying and maintaining machine learning applications is complex. There's a dizzying array of tools and they look different from the usual DevOps tools. To apply SRE skills to ML, we need to understand the specific challenges of ML build-deploy-monitor workflows. We'll use reference examples to understand the cycle in terms of data prep, training, rollout and monitoring. We'll see that some key challenges relate to training models from slices of large and varying data domains - a problem alien to the mainstream DevOps world.
Security Chaos Engineering | Yury Niño Roa | Conf42 SRE 2020
Yury Niño Roa SRE @ Aval Digital Labs Chaos Gamedays have proven successful in training operations and on-call teams. However, they have not been fully explored for failures related to cyberattacks. In this talk we are going to explore how to adapt the Chaos Gamedays methodology to security experiments. First, the talk describes the foundations of security attacks: viruses, malware, ransomware, trojans, and cyberattacks. Second, it explores current techniques for high-severity incident management: recording, triaging, tracking, and assigning business value to problems that impact critical systems. Third, it covers the classical methodologies for training security engineering teams: red/blue teaming, purple teaming, and tabletop exercises. Fourth, it presents a framework based on these classical methods and Chaos Gamedays. The intention is to ensure that teams operate effectively during a cyberattack and respond with resilience strategies. Fifth, it includes some recommendations drawn from what we learned practicing these types of exercises.
Increasing Kubernetes Resilience for an SRE | Umasankar Mukkara | Conf42 SRE 2020
Uma Mukkara COO @ MayaData LinkedIn: https://www.linkedin.com/in/uma-mukkara/ An SRE's main task is to keep operations up and running. An SRE dealing with Kubernetes faces many challenges in keeping resilience at the desired level and improving it over time. In this talk we will go through techniques to measure and improve the resilience of Kubernetes platforms in a cloud-native way. The number of microservices in a Kubernetes environment can easily grow into the hundreds. Continuous upgrades of these microservices and of the Kubernetes platform itself call for a system to measure the resilience of the deployment. SREs need to practice chaos engineering in a cloud-native way that is easily manageable and reuses as many chaos experiments and workflows as possible. This talk is intended for SREs who would like to practice, or are already practicing, chaos engineering in their environments. In this talk, we will introduce the chaos hub for SREs and discuss how to construct complex chaos workflows using the Litmus and Argo projects. A live demo will take the audience through the construction of an end-to-end chaos workflow involving a Kubernetes node failure, a CPU hog, and network slowness in an e-commerce application, and show how resilience is measured and monitored during this process.
Tinkerbell - an automated Provisioning Engine | Aman Parauliya | Conf42 SRE 2020
Aman Parauliya Senior Software Engineer @ Infracloud In the cloud-native world, bare metal servers are critical for performance- and security-related applications. Tinkerbell solves the problem of provisioning and lifecycle management for bare metal. Starting with bare metal concepts, we will cover provisioning from small IoT devices to big rack servers. The talk covers bare metal concepts, provisioning and lifecycle management of bare metal servers with Tinkerbell. The following is a quick overview of my talk:
* Bare metal concepts
* The need for Tinkerbell and how it works
* Tinkerbell components, concepts and their usages
* Architecture overview
* How it uses:
  - a YAML-based definition to automate provisioning
  - a control plane to manage servers at scale
  - latest technologies like Docker containers, gRPC, iPXE, etc.
* Demo: provisioning a bare metal server
  - Walkthrough of the workflow: YAML-based definition of automated provisioning
  - Run the workflow
* What’s next for Tinkerbell and comparison with other solutions
I will showcase a demo of creating and running a workflow on a bare metal machine.