Conf42 Site Reliability Engineering (SRE) 2021

2021

List of videos

Premiere - Conf42 Site Reliability Engineering 2021

Improve the resilience of your systems with Conf42! We've got 49 SRE-centric talks for you!
Interactive event schedule: https://www.conf42.com/sre2021
Discord server: https://discord.gg/DnyHgrC7jC
Chapters below:
00:00 Sponsored Segment
6:40 Welcome! (Discord, Sponsors)
Keynotes 🪐
7:41 Emily Arnott & Christina Tan - Blameless
8:33 Cristina Buenahora Bustamante - Cortex
9:07 Uma Mukkara - ChaosNative
9:37 Robert Ross - FireHydrant
10:06 Alayshia Knighten - Honeycomb.io
10:17 Itiel Shwartz - Komodor
10:41 Allen Vailliencourt - Teleport
Observability 🦅
11:44 Parveen Khan
12:08 Andrei Danilov
12:34 Dave McAllister
13:20 Ozioma Uzoegwu
13:41 Sujith Samuel
14:28 Dotan Horovits
14:56 Suraj Muraleedharan
Processes 🦘
15:12 Julie Gunderson & Mandi Walls
15:44 Chris Riley
16:22 Hari Krishnan
17:03 Josh Armitage
17:33 Yishai Beeri
18:08 Marco Coulter
Tools 🐒
18:53 Andrew Kirkpatrick
19:39 Filipi Pires
19:57 Rob Richardson
20:16 Yshay Yaacobi
20:57 Dewan Ahmed
21:34 Arthur Grishkevich
Culture 🐙
22:08 Ajuna Kyaruzi
22:31 Quintessence Anx
22:47 Austin King
22:31 Dmitry Vinnik
Best Practices 🦙
23:46 Mandi Walls
24:52 Shubhankar Sumar
25:22 Dmitry Chuyko
25:45 Pranjal Deo
26:11 Stefano Doni
26:42 Tim Davis
27:01 Vishnu Vardhan Chikoti
27:23 Asif Mujawar
27:59 Rafał Leszko
28:15 Joshua Arvin Lat
Lessons Learned 🦉
28:35 Annie Talvasto
29:14 JJ Asghar
29:40 Daniel Selans
30:11 Noaa Barki & Shimon Tolts
30:47 Robert Barron
31:21 Raounak Sharma
31:54 Ricardo Castro
32:32 Andrew Robinson
33:00 Let the conference begin!
Join the Discord to interact with folks!

Elephant in the Blameless War Room - Accountability | Emily Arnott & Christina Tan | Conf42 SRE 2021

Emily Arnott - Blog Content Writer @ Blameless & Christina Tan - Strategy @ Blameless
How do you reconcile the ideals of blamelessness with a demand for blame? When is accountability actually required? We’ll navigate these challenges by explaining:
- How to empathize with blameful people: we’ll look at how their goals align with yours, even if their methods are archaic
- How to skilfully respond to a demand for blame: blameful people’s goals can be achieved blamelessly, and here’s how to communicate that
- When accountability is necessary: sometimes accountability is part of the best way forward, so let’s figure out when
- How to be blamelessly accountable: true accountability requires blamelessness, and we’ll show you why
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Evangelizing the SRE mindset | Cristina Buenahora Bustamante | Conf42 SRE 2021

Founding Engineer @ Cortex
Most engineers respond to messages or emails from an SRE or security engineer with disdain. They often see the work of these teams as another hurdle to getting code out the door and a tax on their productivity. We know they’re wrong. We need to spread the SRE mindset and approach to all engineering teams and pivot their thinking towards “How can I build a solution that is resilient, secure, and scalable?”, and “How can I partner with my SRE and security teams to make this a reality?”. This talk will take a deep dive into the core principles of SRE thinking and how to create a culture of reliability and ownership, with practical takeaways that you can use with your own teams. Key Discussion Points (Outline):
- How do SREs define their role?
- How do engineers define the SRE role?
- How do we bring these two together?
- What does it mean to foster a culture of reliability and ownership?
- How can you apply this to your delivery machine?
- What products or services can help achieve these goals?
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Improving resilience through SLO validation using chaos engineering | Uma Mukkara | Conf42 SRE 2021

Uma Mukkara - CEO @ ChaosNative In this session, Uma Mukkara will talk about how Site Reliability Engineers can use Chaos Engineering to do continuous validation of Service-level Objectives and thereby improve the resilience of the systems they are operating on. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Pragmatic Incident Response: 5 lessons learned from failures | Robert Ross | Conf42 SRE 2021

Robert Ross - CEO @ FireHydrant Incident response is overwhelming. So where do you start? There’s a lot of advice out there, but it’s mostly theories that aren’t taking reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share stories (good and bad) from his experience as an SRE and what 5 pragmatic tips he’s learned along the way on building a successful incident response process. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

SLI Negotiation Tactics for Engineers | Alayshia Knighten | Conf42 SRE 2021

Alayshia Knighten - Team Lead, Onboarding Engineering @ Honeycomb.io Service level indicators are quantitative measures of a service, which, in turn, are measured against SLOs. This is not the talk you think it is. As engineers, we have our own SLIs, Survival Level Indicators, which measure and define whether we are okay or not okay at a job. What happens when the rockstar engineer, who performs essential tasks A and B, hasn’t taken a vacation in 9 months? Over time, not meeting SLIs can take its toll on engineers. How do we avoid burnout, turnover, and wider destruction in our teams? In this session, I will review different strategies to identify human burnout versus company personal objectives. Engineers share the same importance as customers, and we should provide technical love to them as well. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

How to Empower Developers to Troubleshoot K8s Independently | Itiel Shwartz | Conf42 SRE 2021

Itiel Shwartz - CTO @ Komodor
In the “good old days”, Ops/IT teams were responsible for handling issues when applications crashed. In the world of microservices, however, developers are required to take on a bigger role in identifying and fixing issues in production, sometimes without having proper tools, privileges or even training. So how can we empower developers to troubleshoot efficiently and independently? Join us as Itiel Shwartz, co-founding CTO at Komodor, discusses:
- How to turn developers into k8s troubleshooting experts
- What culture and mindset changes are needed to succeed
- The link between change management and troubleshooting
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Keys or Certs for SSH Access? Why should I care? | Allen Vailliencourt | Conf42 SRE 2021

Allen Vailliencourt - Solutions Engineer @ Teleport Today’s software developers, DevOps teams, SREs, and SysAdmins are familiar with the concept of public-key cryptography for gaining access to remote resources, but using public and private keys has its limitations and can be difficult to scale and manage. Consider leveraging short-term certificates for SSH access over keys and rest easy! This talk will go over the pros and cons of using keys vs. certificates and how to get started using short-term certificates! Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Improve Resilience with automated chaos engineering | Gunnar Grosch | Conf42 SRE 2021

Gunnar Grosch - Developer Advocate @ AWS The transition into more complex systems is accelerating, and chaos engineering has proved to be a great-to-have option in our toolbox to handle this complexity. But the speed at which we’re developing and deploying makes it hard to keep up through manual chaos experiments, so we turn to automation. In this session, we’ll look at how automated chaos experiments help us cover a more extensive set of experiments than we can cover manually and how it allows us to verify our assumptions over time as unknown parts of the system change. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Peek into Observability from testers lens | Parveen Khan | Conf42 SRE 2021

Parveen Khan - Senior QA Consultant @ ThoughtWorks
The term ‘Observability’ is now common to hear, yet still quite new. But what does it mean? Is it just another new name for monitoring? In today’s technology world we work with many different types of systems, such as microservices and distributed systems, which resemble huge spider webs. Imagine testing these distributed systems with no clue of what’s going on under the hood. Gone are the days when testers could rely only on testing the user interface or the APIs. I worked on a distributed system where no one could say what was going wrong whenever there was a production issue. We had some monitoring and logging in place, but we had no clue where to look when things went wrong in production. We needed more powerful insights into the internals of the system and, more than that, visibility into what was happening under the hood, which would give the team the superpowers to predict the future. This is where we took our first steps into ‘Observability’. In this talk, I’ll share my journey of adopting a culture of observability within the engineering team. Key takeaways:
* What is observability
* Why is it important
* How does it help the team
* How observability can support testing
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Monitoring your platform from multiple locations | Andrei Danilov | Conf42 SRE 2021

Andrei Danilov - Platform Engineer @ Typeform
Setting up a system for monitoring production availability in multiple regions might seem pretty straightforward if we look at it strictly from a tooling perspective: Postman, DataDog, New Relic, AWS, Uptrends, Cloudflare (and the list goes on) all offer solutions that are super easy to set up, manage and maintain. However, the real challenges lie behind the use of the tool:
- How do we select the regions and locations to monitor?
- Do we want to monitor exact locations (e.g. a city) or broad areas (e.g. countries)?
- What do we consider a failure (typical HTTP error codes, certain time thresholds)?
- Do all the selected areas have the same importance to the business?
- Do all the monitored areas trigger alerts with the same priority?
- How often do we want our platform to be monitored?
- Who owns the actions to fix this in case of failure?
- How do I flag these monitors so I don’t mess up company data?
- How do we design the monitors so that, in case of failure, we don’t get a cascade of alerts?
In this session I plan to answer all of the above questions, with examples from my personal work experience on the topic. The agenda looks something like this:
- About the Speaker
- Why is the topic important?
- Building your own strategy: things to consider
- Choosing the tool
- Showcase of an already working solution
- Conclusions
- Q&A on Discord
After the session you’ll be able to:
- Get a good grip of why this topic is important and in which context
- Understand how to create a strategy for monitoring different world areas that applies to the realities of your own company
- Understand what tools you can use to achieve that
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

The Murphy's Laws of Observability | Dave McAllister | Conf42 SRE 2021

Dave McAllister - Senior Technical Evangelist @ Splunk
We’re all familiar with Murphy’s Law. “Anything that can go wrong, will go wrong.” And over time, Murphy’s Law has been extended, abstracted and applied to numerous disciplines. Within Observability, the extension of data and monitoring focused on deep insights into our complex apps and environments, Murphy’s Law also reigns supreme. But what laws apply, and how do we mitigate the impact of points like:
- Anything that can go wrong, will go wrong, at the worst possible time
- Computers always side with the hidden flaw
- You can never run out of things that can go wrong
- Availability is a function of time
We’ll look into examples of the laws in practice and highlight practices that might just give you the one-up on Murphy.
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Observability in Serverless Application | Ozioma Uzoegwu | Conf42 SRE 2021

Ozioma Uzoegwu - Solutions Architect @ AWS In this session, we will go through an introduction to observability and how to implement the concepts to serverless applications on AWS. We will cover the key services and tools and how to implement observability at both the infrastructure and application layers Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

System State Clustering using eBPF data | Sujith Samuel | Conf42 SRE 2021

Sujith Samuel - Principal Software Engineer @ Ericsson The field of system observability has been greatly enhanced by the application of eBPF. eBPF generates data at critical points in the execution of a system, and that data is used for observation via software like Sysdig and Cilium. I propose to utilize the generated data for system state clustering. This is an application of machine learning to that data to understand whether the system is behaving properly or not. The amalgamation of machine learning and real-time system data generation would open the doors to a plethora of applications, such as system state prediction and ML-aided preventive replacement of system components. This talk will take the attendees through an idea of how this could be done. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Investigating Performance Issues In Microservices Arch. | Dotan Horovits | Conf42 SRE 2021

Dotan Horovits - Product Evangelist @ Logz.io Running a multi-tenant SaaS at scale is no easy task. With the massive scale-out we’ve started encountering performance issues. Investigating those issues turned out tricky with our microservices architecture running on Kubernetes and Docker containers across multiple regions and multiple cloud providers. We run an Observability SaaS platform, both using it internally and offering it to others. Our system is instrumented inside and out around logging and metrics, but that proved not to be the right tool for the job. We needed another weapon for our performance blitz. This is how we got to the world of distributed tracing, first as practitioners and then also started offering it as part of our Observability platform. In this talk I’ll share our journey to distributed tracing and Jaeger open source project, how it helped us overcome our performance issues in our application across the stack from the Node.js down to the Java and database backends, and how it has become an integral part of our daily routine. If you’re battling performance issues, if you’re considering making your first steps into distributed tracing - this talk is for you. I’ll show useful examples, best practices and tips to make your life easier in battling performance issues and gaining better observability into your system, as well as how to make this a gradual and smooth journey, even into high-scale production systems. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Improve observability for your container workloads | Suraj Muraleedharan | Conf42 SRE 2021

Suraj Muraleedharan - Senior DevOps Consultant @ AWS In this session we will see how to improve the observability of container workloads, focusing on the three pillars of monitoring, logging and traceability. On the operational performance side, we will discuss how to detect behaviours that deviate from normal operating patterns. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Reducing Trauma in Production | Julie Gunderson & Mandi Walls | Conf42 SRE 2021

Julie Gunderson - Senior Reliability Advocate @ Gremlin Mandi Walls - DevOps Advocate @ PagerDuty Customer experience is the responsibility of the entire team. Many organizations leave reliability up to the SRE team; however, reliability should be built in from the very beginning. In this talk, Julie and Mandi will discuss what Service Level Objectives are, why they are important to the organization, and how to define and set them. Going beyond SLOs, attendees will learn what Chaos Engineering is and practical ways to ensure compliance and resilience with best practices. We’ll show you how to focus your goals and error budgets with examples that will lead to reliability and improved user experience. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Incident Response, Management & Alerts - where they fit in CloudOps? | Chris Riley | Conf42 SRE 2021

Chris Riley - Senior Technology Advocate @ Splunk
Responding to incidents is not just about wiring up the right tools; it’s also a strategy and process for knowing how to respond, how to record details of incidents, and how to learn from them after everything has been resolved. There has been a lot of confusion about the relationship of incident response to incident management and alerting. In this session we will talk about the differences and the best practices for these key processes in healthy cloud operations environments. Expect to learn in this session:
1) What is an alert and how does it feed incidents
2) What is the difference between incident response (IR) and incident management (IM)
3) What are the best practices for IR and IM tooling
4) How incidents are being handled in modern dev environments
We will also talk about on-call scheduling, shift-left, and how machine learning supports incident response strategies.
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Shift Left your Performance Testing | Hari Krishnan | Conf42 SRE 2021

Hari Krishnan - CEO @ Polarizer Technologies Does your team have to deal with performance issues late in their dev cycle? Does this lead to a lot of unplanned work in your sprints? What if I told you, that your team can validate performance-related hypotheses right within your sprints and on their local machines? Attend this talk to know more. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Lean Product Development Through SLOs | Josh Armitage | Conf42 SRE 2021

Josh Armitage - AWS Practice Lead @ Contino
How can you strip over-processing out of your value stream? How can you understand and communicate what “good enough” looks like? How can you systematically challenge and remove assumptions from your consumers? In supporting enterprises in DevOps & agile transformations across three continents, I discovered a common theme: stories talk of what is to be enabled, but not how well it needs to perform. Recently, upon joining a platform team drowning under the weight of their backlog, I experimented with SLOs being the beating heart of our product development flow. To do this, we:
- Revisited all previous core use cases and determined meaningful SLIs and SLOs
- Made SLOs an inherent part of defining features
- Built a lightweight infrastructure to translate technical changes to business events
- Made a dashboard to transparently broadcast our SLOs
- Made SLOs the core of conversations with our consumers
By doing this, we:
- Removed assumptions from consumers, driving more meaningful conversations about value
- Created a beachhead for a more data-driven culture
- Created a common understanding of what “good enough” looked like
- Became better focused on maximising value delivery, over redundant gold plating of features
You’ll learn:
- How to adopt an SLO-driven product development flow
- From the mistakes we made along the way
- Tips and tricks for defining SLOs for low-frequency use cases
- How to drive more impactful conversations
Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

One Metric to Rule them All: Cycle Time | Yishai Beeri | Conf42 SRE 2021

Yishai Beeri - CTO @ LinearB If you only measure one metric this is it: Cycle Time. It is probably the most underrated and least understood metric in engineering. Yet this is the metric that comprises the most important aspects to measure in your engineering process. Here’s why. High performing teams know it’s a marathon and not a sprint, that a steady and swift flow of value wins out in the end, and shipping many small improvements quickly in aggregate translates to great gains for your organization (in the spirit of kaizen). The way to do this is by ruthlessly removing productivity killers like context switches, work in progress culture, and dead value that isn’t shipped. By measuring the process, and not the individuals you are putting your emphasis on the team effort, which is great for culture - in contrast to highly toxic individual performance metrics. This talk will focus on how to fine-tune and get Cycle Time right. It’s a mix of improving your planning, interfaces between product and engineering, communication, review and quality gates and release mechanisms, essentially everything your engineering process encompasses. If you can game this metric, then you will realize huge gains for your entire dev process and engineering organization. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Avoiding Goodhart's Law - Use SLO's as Tools Not Cudgels | Marco Coulter | Conf42 SRE 2021

Marco Coulter - Technical Evangelist @ Tech-Whisperer.com The concepts of SLI, SLO and Error Budget are there to balance risk (rates of change) and reward (business contentment). Using such metrics as red lines to punish teams, or force acceptance of risk by the business is missing the point. My experiences from SLA’s in service contracts for hospitals inform this conversation identifying that SLI, SLO and Error Budgets are better as a basis for conversations about the stress an application can withstand, and the three dimensions the measures should cover. This session takes Goodhart’s law from economic policy as a frame for reconsidering SLI’s and SLO’s, and offers a few hints for approaching the negotiation meetings. Leave this session inspired to approach your SLO negotiations in the best possible way. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Self-service PR-based automated Terraform | Andrew Kirkpatrick | Conf42 SRE 2021

Andrew Kirkpatrick - Staff Engineer @ PartnerStack Maintaining your whole infrastructure using Terraform and reusable modules makes most of our lives easier, but when those less familiar with “DevOps” want to create or update resources, you usually either have to train and enable them to use Terraform, or handle the request yourself. However what if you could offload the execution of those changes to a centralised tool and just review both the code and output being submitted for review? Atlantis, Terraform Cloud or env0 can act as a PR-based feedback loop for a hosted Terraform executor to make self-service a little bit easier. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Keep your code safe during the development path using OS tools | Filipi Pires | Conf42 SRE 2021

Filipi Pires - Principal Security Engineer @ Talkdesk A practical demonstration of how a developer can use a SAST tool for static analysis of code vulnerabilities, running it on source code, byte code and/or binaries and identifying security holes during the development process. The tool analyzes many languages and formats, such as C, C#, Java, Kotlin, Python, Ruby, Golang, JavaScript, JSON… It also searches for key leaks and security flaws in all files of your project, as well as in Git history, and provides a managerial view with all this analysis information. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Level-up Your DevOps with GitHub Actions and Kubernetes | Rob Richardson | Conf42 SRE 2021

Rob Richardson - Developer Advocate @ Cyral Are you looking to rapidly deploy your content? Are Docker containers in your future? Come for this demo-only presentation where we start from scratch, build up a DevOps pipeline in GitHub Actions, and deploy to Kubernetes. Once it is set up, commit and watch the magic flow into place. You too can automate your deployments. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Instant Self-contained Development Environments for Everyone | Yshay Yaacobi | Conf42 SRE 2021

Yshay Yaacobi - CTO @ Livecycle It has become increasingly difficult and time-consuming to start working on a new codebase, especially in a polyglot microservice world. Using several patterns and developer containers, we can create an amazing developer experience that allows anyone to instantly deep-dive into coding on any machine. This talk will introduce the concept of a self-contained repository: a repository that contains all relevant information for workstation/dependency configuration, build, debug, CI/CD, secrets (encrypted), docs and more. We’ll explore the idea of a development container and how we can use free OSS technologies (like Git, Docker & Docker Compose, Kubernetes, GPG, VSCode, Tilt…) to create a stable development environment that works out-of-the-box and provides all the power, speed and capabilities of modern development & cloud native tooling. Live code examples will be showcased as part of this talk. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Deploy N applications to N clusters using Argo CD ApplicationSet | Dewan Ahmed | Conf42 SRE 2021

Dewan Ahmed - Developer Advocate @ Red Hat As the scaling needs grow, you have to deploy your application to more than one Kubernetes cluster. How about deploying multiple applications across multiple clusters? This talk covers Argo CD + ApplicationSets that allow you to manage deployments of a large number of applications, repositories, or clusters, all from a single Kubernetes resource. Starting with a brief introduction to GitOps and Argo CD, you will learn about the challenges of having to use new manifest files every time you want to deploy to another cluster. With Argo CD’s Application resource, users are limited to deploying from a single Git repository to a single cluster/namespace. In contrast, you’ll learn in this talk, the ApplicationSet resource uses templates, and automated generation of template parameters, to allow you to manage many Argo CD Applications simultaneously from multiple Git repositories. The demo will feature OpenShift GitOps which bundles Argo CD, ApplicationSets and other tools, to enable teams to implement GitOps workflows for cluster configuration and application delivery. Although some knowledge about Kubernetes and GitOps will help, the demo will explain the concepts in action. ApplicationSets Generator demos will show how your applications can be deployed and managed across single/multiple clusters from one or more Git repositories. You will leave the talk with the necessary resources and knowledge on Argo CD, ApplicationSets and OpenShift GitOps, and learn how these tools can help you manage large numbers of applications through templating and automation. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Use Voice AI For Incident Reporting, Monitoring & Alerts | Arthur Grishkevich | Conf42 SRE 2021

Arthur Grishkevich - Citizen Developer Advocate @ Dasha.ai In this talk you’ll learn how to automate incident reporting, monitoring, escalations and alerts using the Dasha voice AI platform. This novel approach allows an SRE to automate all these workflows: collect incident information, do follow-up calls, acknowledgement, escalation and more in 30 mins. As promised in the session, here is the source code with instructions on how to run it: https://github.com/dasha-samples/site-reliability-monitor-handler Let us know at https://community.dasha.ai if you have any questions, or share feedback on how you were able to use your Dasha app. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Sustainable Incident Management for happy SRE teams | Ajuna Kyaruzi | Conf42 SRE 2021

Ajuna Kyaruzi - Developer Relations @ Datadog How you respond to production outages can affect both team morale and development velocity. With the proper Incident Response processes in place, you can reduce this stress, make it easier to ramp up new teammates, and focus on reducing toil. This talk will look at Incident Management at its core, covering Incident Command and how to scale it sustainably with a growing organization. We’ll go over common areas of pain for Incident Responders and how to ease them to reduce friction between Product and SRE teams, including best practices for playbooks, on-call rotations, error budgets, postmortems and incident communication to streamline incident resolution. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Don't Panic! Effective Incident Response | Quintessence Anx | Conf42 SRE 2021

Quintessence Anx - Developer Advocate @ PagerDuty Incidents are the pits. They’re frequently unexpected and many of us scramble to respond to the alarm. But what if you could refine your response process so that it felt routine? Today I’ll be talking about how to do just that - by adding formalized structure to your process. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Games We Play to Improve on Incident Response | Austin King | Conf42 SRE 2021

Austin King - Founder @ OpsDrill.com Incident Response is a core competency for SRE teams, but how can teams practice and improve? Inefficient incident response can be costly to a company. It causes lost revenue and destroys customer trust. We will discuss mainstream games, a conceptual framework for creating team-specific drills, and finally introduce an innovative research topic: outage simulation. An outage simulator gives SRE teams a tool for practicing incident response. Core incident response skills are severity triage, communication, delegation, and system familiarity. Drilling on these improves skills, knowledge, efficiency, team cohesion and resilience. SRE Managers and ICs will learn why games such as “Keep Talking and Nobody Explodes” are played by many SRE teams. They will learn in detail how to create their own fire drills from existing runbook entries. An overview of chaos testing and gamedays will provide the broader context. We will cover the pros and cons of each of these methods. Lastly, we will present the concept of an incident response simulator as an open research topic. The hypothesis will be that an incident simulator is a good trade-off between non-domain-specific games and full-blown gamedays. This is a talk illustrated with slides. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Fixing Broken Windows | Dmitry Vinnik | Conf42 SRE 2021

Dmitry Vinnik - Developer Advocate @ Facebook We have all encountered the “Broken Windows” theory in practice. The original idea was that if someone breaks a window in a neighbourhood and this window is not repaired right away, the entire area will start getting messier at an accelerated rate. The same theory also holds for Software Development. How many times have you looked at a legacy system with no code coverage, and decided not to write any tests because “this is how we do things here”? These bad practices behave just like those “Broken Windows.” They cause our code to degrade and become unusable. In this talk, we discuss how to break away from bad development practices and how to address major gaps in your legacy and current systems. We look at ways to successfully lead by example and to introduce a refactoring culture into your team and organization. We cover tips and tricks that help to improve the development culture and to emphasize the general health of the codebase. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Improve Your Automation to Reduce Toil | Mandi Walls | Conf42 SRE 2021

Mandi Walls - DevOps Advocate @ PagerDuty In the course of your day as an SRE, your knowledge and expertise are in high demand. You can’t do every task every person in your org needs from you without the help of comprehensive automation. Automation can be tricky. Some systems aren’t built with automation in mind, but assume that a human being will be there to keep an eye on things and fix errors on the fly, and we can’t be everywhere when there’s too much to do. Plus, you want to provide access to automation for the right folks and keep a record of when the tools were used. In this talk, we’ll cover some things to keep in mind when you’re building out your automation library, characteristics of good automation, and give you a look at PagerDuty Rundeck, a platform that will help you share your expertise with other folks in your organization. Build automation that works for you and gives you your time back! Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Building Near Real-time Analytics Solution on AWS | Shubhankar Sumar | Conf42 SRE 2021

Shubhankar Sumar - Senior Solutions Architect @ AWS To create value, companies must derive real-time insights from a variety of data sources that produce data at high velocity and volume, so they can react faster to events affecting the business. The need to analyse heterogeneous data from multiple sources (internal and external) is greater than ever. As a result, the analytics landscape keeps evolving, with numerous technologies and tools, and platforms grow more and more complex. Building a future-ready analytics solution is therefore not only time-consuming but also costly: it involves selecting the right stack, acquiring talent, and ongoing platform management and monitoring. In this session, we’ll discuss and demo how you can leverage the AWS stack to create a near real-time analytics solution for an e-commerce website with minimal to no coding, with the option to integrate pre-existing data sources. The solution offers the following advantages: - easy to build - elastic and fully managed - highly available and durable - seamless integration with AWS Services - pay for what you use Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Chasing the Grail | Dmitry Chuyko | Conf42 SRE 2021

Dmitry Chuyko - Senior Performance Engineer @ BellSoft JDK 16 features full musl support but doesn’t include AOT and Graal JIT. Is it all gone? Can you still write your own JVMCI compiler? The GraalVM licensing model is changing, and alternatives appear fast. In his talk, Dmitry Chuyko shows what you will achieve with native image. He’ll look into the practices of building tiny and performant microservice containers using Graal and the associated tools. The future is now. Don’t miss it! Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Engineering Reliable Mobile Applications | Pranjal Deo | Conf42 SRE 2021

Pranjal Deo - Engineering Program Manager @ Google Why Mobile and SRE (Site Reliability Engineering)? - Mobile is nonuniform and uncontrollable - SRE responsibilities differ in critical ways from infrastructure or server-side application reliability engineering - Focus on where your end users are (i.e. today many users access services through mobile applications) - Specific challenges of mobile * Monitoring * Release management * Incident management Case studies - Doodle outage, etc. Future of SRE for Mobile - Visibility into mobile application performance - Find and react to issues before the user does - Measure SLIs throughout the product stack (from client to service) Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Let the machines optimize the machines | Stefano Doni | Conf42: SRE 2021

Stefano Doni - CTO @ Akamas SREs’ main goal is to achieve optimal application performance, efficiency and availability. A crucial role is played by configurations (e.g. JVM and DBMS settings, container CPU and memory, etc.): wrong settings can cause poor performance and incidents. But tuning configurations is a manual and lengthy task, as there are hundreds of settings in the stack, all interacting in counterintuitive ways. In this talk, we present a new approach that leverages machine learning to find optimal configurations of the tech stack. The optimization process is automated and driven by performance goals and constraints that SREs can define (e.g. minimize resource footprint while matching latency and throughput SLOs). We show examples of optimizing Kubernetes microservices for cost efficiency and latency by tuning container sizing and JVM options. With the help of ML, SREs can achieve higher application performance, in days instead of months, and have a lot of fun in the process! Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Pitfalls of Infrastructure as Code (And how to avoid them!) | Tim Davis | Conf42 SRE 2021

Tim Davis - DevOps Advocate @ env0 Are you looking to start your journey into Infrastructure as Code? Or have you already jumped in head-first? Either way, this session is for you! We’ll talk about many of the common pitfalls of IaC, and how you can avoid them. We’ll go over: * What IaC is * Types of pitfalls you may have * Infrastructure pitfalls * Coding pitfalls * Basic mitigation strategy for each pitfall We’ll go over all kinds of things that you may or may not have even thought of yet. Get your questions ready, because I’m here to help you be successful in your IaC journey! Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Enterprise SRE adoption framework | Vishnu Vardhan Chikoti | Conf42 SRE 2021

Vishnu Vardhan Chikoti - Senior SRE Manager @ Fanatics This talk is about a new enterprise SRE adoption framework, named Arctic. Given the growing focus on infrastructure and service/application reliability, more and more enterprises are adopting Site Reliability Engineering (SRE). It would be beneficial for enterprises to have a framework for SRE adoption, just as Scrum, XP or Kanban exist for Agile adoption. Without such a framework, adoption is challenging: enterprises need to spend a lot of effort upfront to understand how to go about SRE adoption and to plan before they begin the actual journey. This talk covers the following: - The two pillars of the framework - Other frameworks/concepts that can go hand-in-hand with this - What to look for when hiring SREs, both in terms of personality types and skill sets - A way to do the goal setting for the transformation. Note that as of the date of submission for this talk, this framework has not been used in any enterprise and has been conceptualised very recently. The hope is to seed the thought around frameworks for SRE adoption, present the current version of this framework to the larger SRE community, gather feedback and start the usage of this framework by enterprises. This talk suits various audiences: those who have already started their SRE journey, those who are looking to start on it, and even those who are still exploring to understand more about SRE. What is the problem that I am trying to address? Currently, there is no standard framework for SRE adoption similar to frameworks like Scrum, XP, Kanban, etc. that exist for Agile adoption. Having a standardised framework will eliminate quite a bit of upfront effort spent thinking about “how to adopt SRE” at enterprises. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Microsoft SQL Server HA/DR on AWS | Asif Mujawar | Conf42 SRE 2021

Asif Mujawar - Specialist Solutions Architect - Database @ AWS In this talk, we will go through the several options for high availability and disaster recovery that MSSQL Server offers, which help make it one of the most popular relational database technologies for enterprises’ mission-critical workloads. However, these options are primarily designed for traditional on-premises environments. This session explains those options in the context of Cloud deployments, the pros and cons of each, and what other possibilities emerge in a modern Cloud platform that would have been unthinkable on-premises. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
5 Levels of High Availability: from Multi-instance to Hybrid Cloud | Rafal Leszko | Conf42 SRE 2021

Rafal Leszko - Cloud-Native Tech Lead @ Hazelcast Does running your application on multiple machines mean it’s highly available? Technically yes, but the term HA now means more than that. Take a Kubernetes installation: if you install it on AWS, it’s not considered HA unless your master nodes are in different availability zones, not only on different machines. And there is much more to the topic. In this session I’ll present 5 high availability levels: * 1. Multi instance * 2. Multi zone * 3. Multi region * 4. Multi cloud * 5. Hybrid cloud I’ll discuss real-life use cases we experienced while developing Hazelcast and present examples of the related technical features you may need: in-memory partition backups, zone-aware partition groups, WAN replication. In this session you’ll learn: * Why can Kubernetes be deployed in multiple zones but never in multiple regions? * What options do you have while designing for high availability (for both Cloud and On-premise infrastructures)? * What are the trade-offs when choosing between high availability and strict consistency? * What are the best practices for deploying consistent systems in Hybrid Cloud? Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Pragmatic SRE for Kubernetes in the Cloud | Joshua Arvin Lat | Conf42 SRE 2021

Joshua Arvin Lat - CTO @ NuWorks Interactive Labs Managing infrastructure resources in the cloud becomes more challenging once we start to deal with significantly more resources and complex integration requirements. In this session, we will discuss the different solutions when dealing with production SRE requirements for Kubernetes in the cloud. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Top new CNCF projects to look out for | Annie Talvasto | Conf42 SRE 2021

Annie Talvasto - Cloud-Native Technology Marketer @ Cast AI The Cloud Native Computing Foundation (CNCF) brought you such fan favorites as Kubernetes & Prometheus. In this talk Annie Talvasto will introduce you to the most interesting and coolest upcoming CNCF tools and projects. This compact and demo-filled talk will give you ideas and inspiration so that you can 1) discover new technologies and tools to use in your future projects as well as 2) be the coolest kid on the block, by being up to date with the latest and greatest. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Migrating a monolith to Cloud-Native (and the stumbling blocks) | JJ Asghar | Conf42 SRE 2021

JJ Asghar - Developer Advocate @ IBM So your company has finally decided to move to the Cloud Native ecosystem. You’ve landed on containerization as your first step. You heard that all you needed to do was containerize your first app and then push it to Kubernetes/OpenShift/Nomad, and the cost savings would just come. You’ve done this, and well, things have not gone as planned. Some of the tech didn’t do what you expected, and wait, what do you mean our OpEx has gone up? Simply said: the promise of containerization or migrating to the Cloud Native ecosystem can be a lie if you don’t do your homework. Sadly, most companies don’t. In this talk, I’ll explain a few gotchas that a “few” enterprises, in the guise of AsgharLabs, hit moving towards the Cloud Native world, and hopefully you’ll learn from their mistakes, so your trip down this path will be more comfortable and closer to the promise. Outline Introductions - What is AsgharLabs and where they started, what they thought they needed to do - Where I came into the conversation to help AsgharLabs - Questions you should ask after getting your app containerized - Where are the architectural advantages and disadvantages? - Are we doubling up on things? - Isn’t automation good here? Why is this thing so complicated now? - Questions you should ask about the cultural shift that will happen - How the economics of the Cloud can differ from your Datacenter - What do you mean our support is now Stack Overflow? - What do you mean our goal is to move away from the CCB? - Some tangible things you can start with to help become more successful - Build that pipeline extension - Collaborate with other teams - Visibility and Monitoring - Conclusion and where you can go from here Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Reliability Nirvana | Daniel Selans | Conf42 SRE 2021

Daniel Selans - Co-Founder @ Batch You’ve heard of event-driven architectures but what do they actually look like? Learn what is involved in building and operating an event-driven distributed system and why your next high-scale project should make use of this exciting tech. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
Reading 100+ Kubernetes Post-Mortems - lessons learned | Noaa Barki & Shimon Tolts | Conf42 SRE 2021

Noaa Barki - Full-stack Developer & Shimon Tolts - CEO @ Datree A smart person learns from their own mistakes, but a truly wise person learns from the mistakes of others. When launching our product, we wanted to learn as much as possible about typical pains in our ecosystem, and did so by reviewing many post-mortems (100+!) to discover the recurring patterns, anti-patterns, and root causes of typical outages in Kubernetes-based systems. In this talk we have aggregated the insights we gathered, and in particular will review the most obvious DON’Ts and some less obvious ones that may help you prevent your next production outage by learning from others’ real-world (horror) stories. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Designing the International Space Station for Reliability | Robert Barron | Conf42 SRE 2021

Robert Barron - AIOps, ChatOps & SRE @ IBM The International Space Station has been orbiting the Earth for over 20 years. It was not launched fully formed, as a monolith in space. Instead, it is built out of dozens of individual modules, each with a dedicated role - life support, engineering, science, commercial applications and more. Each module (or container) functions as a microservice, adding additional capabilities to the whole. Not only do the modules need to function together, delivering both functional and non-functional capabilities, they were designed, developed and built by different countries on Earth and once launched into space (deployed in multiple different ways), had to work together - perfectly. Despite the many (minor) reliability issues which have occurred over the decades, the ISS remains a highly reliable platform for cutting edge scientific and engineering research. In this session I will describe the way the space station was developed and the lessons Site Reliability and DevOps Engineers can learn from it. Other talks at this conference 🚀🪐 https://www.conf42.com​/sre2021 — 0:00 Intro 1:16 Talk

Watch
GitOps: yea or nay? | Ricardo Castro | Conf42 SRE 2021

Ricardo Castro - Senior SRE @ FARFETCH GitOps is a paradigm or a set of practices that empowers developers to perform tasks which typically (only) fall under the purview of operations. It’s a way to do Kubernetes cluster management and application delivery by using Git as a single source of truth for declarative infrastructure and applications. With Git at the center of delivery pipelines, engineers use familiar tools to make pull requests that accelerate and simplify both application deployments and operations tasks on Kubernetes. GitOps software agents (e.g. ArgoCD, Flux and Jenkins X) can alert on any divergence between Git and what’s running in a cluster, and if there’s a difference, Kubernetes reconcilers automatically update or roll back the cluster depending on the case. This talk will include a demo of ArgoCD/Flux/Jenkins X on how to configure and use it to accelerate and simplify application deployments. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk

Watch
Fault isolation using shuffle sharding | Andrew Robinson | Conf42 SRE 2021

Andrew Robinson - Principal Solution Architect @ AWS Distributing user requests to resources in a combinatorial way to reduce the impact of failures and provide logical isolation. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2021 — 0:00 Intro 1:16 Talk
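
As a rough illustration of the combinatorial assignment described above (not taken from the talk itself), here is a minimal Python sketch. The function name `shuffle_shard`, the worker names, and the shard size are all hypothetical choices made for this example; a production system would likely use a more careful assignment scheme.

```python
import hashlib
import math
import random


def shuffle_shard(customer_id: str, workers: list[str], shard_size: int) -> tuple[str, ...]:
    """Deterministically pick `shard_size` workers for one customer (illustrative only)."""
    # Seed a private RNG from a stable hash so a customer always maps to the same shard.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sample without replacement; sorting just makes shards easy to compare.
    return tuple(sorted(rng.sample(workers, shard_size)))


# Hypothetical fleet of 8 workers, each customer served by 2 of them.
workers = [f"worker-{i}" for i in range(8)]
shards = {name: shuffle_shard(name, workers, 2) for name in ("alice", "bob", "carol")}

# Number of distinct 2-worker shards that can be drawn from 8 workers.
possible_shards = math.comb(len(workers), 2)
```

With 8 workers and a shard size of 2 there are 28 possible shards, so a customer whose traffic takes down both of its workers fully affects only the customers that happen to share the exact same pair; customers sharing just one of the two workers still have a healthy worker to fall back on.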

Watch