Conf42 Chaos Engineering 2020
List of videos

How to be WRONG | Russ Miles | Conf42 Chaos Engineering 2020
Russ Miles, CEO @ ChaosIQ
Slides: https://580d9e60-4356-475f-aa71-085d1b84e2cf.filesusr.com/ugd/f3e158_71f38b7241f3468db93d89e19ec8c12f.pdf
LinkedIn: https://www.linkedin.com/in/russmiles/

Being wrong is often seen as the WORST THING THAT CAN HAPPEN(tm), especially when you’re building business-critical applications and services. But the increased velocity of modern software development, plus the increased need for our systems to be resilient, reliable, and RIGHT, has increased the pressure on developers exponentially. Never before have software owners had such an opportunity, or the power, to BE WRONG! We need to get better at being wrong, and that’s what this keynote is all about. In this keynote talk Russ Miles, CEO of ChaosIQ, will share the tools and techniques he uses to turn inevitably BEING WRONG into BEING SUCCESSFUL at BEING WRONG. BEING WRONG can be turned to our advantage, and in this talk Russ will share stories of how this has happened, and also the challenges to look out for. The myth of always being right when you create and operate software is over! You’re going to BE WRONG most of the time. It’s time to get better at BEING WRONG, learning to turn “accidents” such as outages into opportunities…

🥇 Gold Sponsors: ChaosIQ, PagerDuty
Website 🚀🪐 https://www.conf42.com
Reach out 📧📭 mark@conf42.com
Conf42 Discord 🧑🤝🧑💬 https://discord.com/invite/dT6ZsFJ5ZM
LinkedIn 👨💼💼 https://www.linkedin.com/company/49110720/
Twitter 🎵🐦 https://twitter.com/conf42com
Conf42Cast @ Spotify 🎧 https://tinyurl.com/bnyj6a8y
Watch
Chaos Monkey for Spring Boot | John Fletcher & Manuel Wessner | Conf42 Chaos Engineering 2020
John Fletcher & Manuel Wessner, Chaos Monkey Evangelists

Everything you want to know about the useful and popular chaos engineering tool Chaos Monkey for Spring Boot (CMSB) from two of its maintainers. Featuring:
- How to easily get started with your first Chaos Experiments.
- More exotic applications like dual attacks.
- How to integrate CMSB with automation tools like Chaos Toolkit and Chaos Mesh in order to run tests in your build chain.
- An overview of the history and the changes in the latest version.
- Who should get involved in the project, and how.
- A sneak peek into the next release.
The talk consists primarily of live coding.

0:00 Preamble
0:26 Live Demo - https://github.com/fletchqc/mediator
18:33 Project history
20:18 Get involved - https://github.com/codecentric/chaos-monkey-spring-boot
20:35 Recent Changes - Scheduling Runtime Assaults Via Cron & Memory Assault
22:31 Roadmap / Upcoming
23:36 When CMSB just doesn't cut it...
24:34 Alternative Tools: Traffic Control (TC) & Stress CPU
25:42 Pumba
26:24 And everything else... like KubeInvaders!
27:32 Chaos Blade
28:39 Chaos as a Service: ChaosIQ & chaosmesh
30:16 Thank you!
Watch
Application-Level Chaos Engineering in JVM | Long Zhang | Conf42 Chaos Engineering 2020
Long Zhang, PhD Student in Computer Science @ KTH Royal Institute of Technology
LinkedIn: https://www.linkedin.com/in/gluckzhang/

In this talk, I will introduce our recent research on chaos engineering, which focuses on application-level chaos engineering in the JVM. For example, ChaosMachine provides unique and actionable analysis of exception-handling capabilities in production, at the level of try-catch blocks. TripleAgent combines monitoring, perturbation, and failure-obliviousness for automated resilience improvement, at the level of methods. Currently, we are exploring a new idea about chaos experiments for containerized Java applications. By the time of the conference, we hope to share some interesting findings from this work as well. For the sake of open science, the code is publicly available at https://github.com/KTH/royal-chaos

0:00 Preamble - The Space of Chaos Engineering
1:37 Royal-Chaos @ GitHub - https://github.com/KTH/royal-chaos
2:03 ChaosMachine - https://arxiv.org/abs/1805.05246
5:02 The Overview of ChaosMachine
6:15 ChaosMachine - Hypotheses
8:26 - What Can be Learned
9:02 - Experiments on TTorrent
10:10 TripleAgent - https://arxiv.org/abs/1812.10706
11:33 A Chinese Kungfu in Chaos Engineering - https://en.wikipedia.org/wiki/Zhou_Botong
12:39 TripleAgent - Example
12:39 - Evaluation
15:50 - Overhead
16:17 POBS - https://arxiv.org/abs/1912.06914 + Quick command
18:24 - Empirical Study
19:24 - Design
20:42 Demo time!
23:13 Summary
24:04 Thanks for listening! Reach out: longz@kth.se
Watch
Journey to Resilience | Vilas Veeraraghavan | Conf42 Chaos Engineering 2020
Vilas Veeraraghavan, Director of Engineering @ Walmart Labs

Chaos engineering has come a long way from its early days at Netflix. Its importance is no longer questioned in the community, but as it has gone mainstream, teams quickly learn that adoption is not a given. In this talk, we cover the challenges that we encountered at Walmart and the techniques used to break through them. We will discuss our successes and failures on the journey to resilience, highlighting the major barriers to adoption. The talk will also discuss the strategies we used to build tooling that guides teams, along with a gamified approach to motivate them.

0:00 Preamble
0:30 Chaos Engineering
0:50 Walmart in numbers
1:46 Where do we start the journey?
3:52 Goal
4:19 Every Outage is a Chaos Exercise
5:11 Downtime is Expensive
7:10 The Homework
7:25 Observability
8:53 More On-Call Prerequisites
11:04 Generating production-like Load
12:29 CI/CD Workflow - invest in it
14:34 Build a Maturity Model
15:11 Support Costs
15:51 Build the Right Tools
16:34 Build the Right Mindset
18:02 What we learnt on our Journey?
18:19 Eliminate Vanity Positions
19:35 Don't Assume. Verify.
21:30 Are Teams ready for Exercises?
22:24 Where are we now? - Report card
24:45 Thank you!
Watch
Who is responsible for Chaos? | Joyce Lin | Conf42 Chaos Engineering 2020
Joyce Lin, Developer Advocate Lead @ Postman

If you’re thinking of starting a chaos program, you might be wondering which job functions are typically responsible for managing chaos within their organizations. This talk will look across a number of companies to determine who historically initiates chaos programs, as well as reveal new trends in this space.

0:00 Preamble
0:50 Who is responsible for chaos?
1:07 Which job titles are doing chaos?
1:23 By job title - diagram
1:53 Quote by Kolton Andrus (CEO @ Gremlin)
2:18 Responsibilities
4:14 Roles
4:35 Why aren't testers doing chaos? - Chaos Testing
5:00 Software Development Lifecycle
5:14 It was called chaos testing
5:45 Quote by Abby Bangser (PTE @ MOO)
6:29 Testers doing chaos?!
7:15 Who can start a chaos program?
8:22 Do I need to wait for a catastrophe
8:34 Quote by Casey Rosenthal (CEO @ Verica)
9:02 Final thoughts
11:08 Thank you! + Sources
Watch
Post Mortem Culture: Learning from Failure | Yury Niño Roa | Conf42 Chaos Engineering 2020
Yury Niño Roa, DevOps Engineer @ Aval Digital Labs

Practicing Chaos Engineering and reproducing outages have taught us that the culture of postmortems must be open and blameless. That is difficult, in part, because of the social stigma associated with publicly acknowledging people's contributions to outages. And although the scenarios simulated in a gameday are entirely realistic, it's hard to write up postmortems that summarize all events, hint at human factors, recognize that there is no single root cause, and provide action items. At Aval Digital Labs, we are implementing a toolbox that automates the steps involved in chaos game days and generates postmortems using tools available in the market.

0:00 About me
1:17 Have you written a Postmortem?
1:53 Agenda
2:29 What is a Postmortem?
3:53 If Postmortems are good, why don't we do them?
5:28 How to change a blameful culture?
5:43 Chaos Engineering
6:01 Chaos GameDays
7:12 What does it mean in practice?
9:00 Gaveta
11:50 Gaveta uses a hexagonal architecture
19:00 Promoting postmortem culture
19:35 Thank you!
Watch
Shipping Quality Software in Hostile Environments | Luka Kladaric | Conf42 Chaos Engineering 2020
Luka Kladaric, Founder & Chaos Management @ Sekura Collective

Everyone loves features, right? Product loves features. Management loves features. The board loves features. Features are what make the users use and the investors invest, right? They certainly make the media pay attention. What happens when, for 8 years straight, all you care about is features? Productivity grinds to a halt, production outages are a given, post-mortems are a joke, and job satisfaction and happiness are flatlining. This talk shares lessons learned unravelling layers and layers of terribleness to rediscover productivity and job satisfaction while also improving the security and robustness of the products.

0:00 Who am I?
0:45 Hostile Environments?
1:08 Tech Debt?
2:30 Where does it come from?
4:55 What's the Harm?
7:09 Case Study
12:34 How do you even begin to fix this?
17:18 The Moral of this Story
18:48 How do we do Better?
20:22 Time for a New Approach
21:20 New Name: Sustainability Work
21:54 Budget VS Planning
22:51 Helps with Morale
22:59 "Sustainability Work" over "Tech Debt"
Watch
Chaos Engineering for SQL Server | Andrew Pruski | Conf42 Chaos Engineering 2020
Andrew Pruski, SQL Server DBA @ Channel Advisor
Slides: https://580d9e60-4356-475f-aa71-085d1b84e2cf.filesusr.com/ugd/f3e158_da0c445802b346f08f85bef1310bd909.pdf
LinkedIn: https://www.linkedin.com/in/andrewpruski/

In this session we’ll look at how Chaos Engineering can be implemented with regard to SQL Server. SQL Server has various high availability solutions, but can we be sure that they’ll react as expected to a real-world issue? Has the HA architecture only ever been tested in a planned maintenance window? We’ll explore SQL Server’s built-in high availability features and take a look at Kubernetes, a brand new platform for SQL Server. We’ll also have some fun by looking at KubeInvaders, a chaos engineering tool for Kubernetes…using Space Invaders!

0:00 About me
0:45 Session Aim
1:47 Agenda
2:39 Identifying Weaknesses - Incident Analysis
4:04 Likelihood - Impact Map
12:10 Defining an Experiment
13:15 Running an Experiment
14:45 Demo
22:50 SQL Server running on Kubernetes
24:14 KubeInvaders
28:08 Resources
Watch
Getting out of the Starting Blocks | Adrian Hornsby | Conf42 Chaos Engineering 2020
Adrian Hornsby, Principal Technologist, Architecture @ Amazon Web Services (AWS)
LinkedIn: https://www.linkedin.com/in/hornsby/

Architectures are growing increasingly distributed and hard to understand. As a result, software systems have become extremely difficult to debug and test, which increases the risk of failure. With these new challenges, chaos engineering has become attractive to many organizations as a mechanism for understanding the behavior of systems under unexpected circumstances. Whilst interest is growing, few have managed to build sustainable chaos engineering practices. In this talk, I will review the state of chaos engineering and the issues customers are facing, based on my experience as an AWS Solution Architect and Technologist focusing on Chaos Engineering, and explain why I started to build tools to help with failure injection.

0:00 Preamble
1:55 What prevents the wide adoption of chaos engineering?
2:54 Why is production chaos?
3:45 #0 - (Don't) call it (Chaos) Engineering
4:47 #1 - Look at the Bigger Picture
13:20 #2 - Change begins with understanding
24:30 #3 - Choose your Trojan Horse
28:16 #4 - Over-index on the Hypothesis
32:15 #5 - Introduce Chaos Engineering Early in the Journey
35:31 #6 - Blast-Radius Reduction Mindset
36:23 #7 - If you haven't verified it, it's probably Broken
39:43 https://github.com/adhorn
40:50 Getting out of the starting blocks
43:46 Thank you!
Watch
Cloud Native Chaos Engineering | Umasankar Mukkara | Conf42 Chaos Engineering 2020
Uma Mukkara, Co-Founder and COO @ MayaData

The cloud native approach has pleasantly surprised the DevOps world, with Kubernetes adopted across all roles, from developers to SREs to VPs of digital transformation. As the huge mass of legacy applications moves to cloud-native platforms, an important problem arises: how do SREs make sure the systems do not have weaknesses and have the required level of resilience? A well thought out chaos engineering methodology is the right answer. But for a large number of fast-changing applications and infrastructure, finding the right set of chaos experiments and determining whether chaos has exposed a weakness in the system is an almost impossible task. In Cloud Native Chaos Engineering, developers write chaos tests as an extension of the development process. These tests are developed using standard Kubernetes Custom Resources (CRs) so that they are easier to adapt to the environment. These chaos experiments are groomed in CI pipelines and finally published in the Chaos Hub so that they are available to SREs running the cloud-native applications in production. SREs use such chaos experiments for various microservices to schedule chaos in a random fashion and find weaknesses in their deployments, which leads to increased reliability.

0:00 Preamble
1:20 Chaos Eng is...
3:14 Agenda
4:28 Cloud Native Chaos Engineering
5:48 Cloud Native environment(s)
11:45 Cloud Native Chaos Engineering
13:31 Principles
12:23 Litmus project
21:44 Cloud Native Chaos Engineering - Example
22:36 ChaosHub
26:55 How can you contribute
Watch
Psychology of Chaos Engineering | Matty Stratton | Conf42 Chaos Engineering 2020
Matty Stratton, DevOps Advocate @ PagerDuty

Chaos Engineering, failure injection, and similar practices have verified benefits to the resilience of systems and infrastructure. But can they provide similar resilience to teams and people? What are the effects and impacts on the humans involved in the systems? This talk will delve into both positive and negative outcomes for all the groups of people involved, including users, engineers, product, and business owners. Using case studies from organizations where chaos engineering has been implemented, we will explore the changes in attitude that these practices create. This talk will include a brief overview of chaos engineering practices for unfamiliar members of the audience, but the main focus will be on human elements. I will discuss successful implementations, as well as challenges faced in teams where chaos was a “success” from a technical perspective but had a negative impact on the people involved.

0:00 Preamble
2:24 let's set some agreement
3:08 Chaos Engineering - definition
3:40 Chaos @ Netflix
4:57 perceptions
5:29 "isn't all engineering chaotic?"
6:19 it's not about breaking things
6:43 look, I know you know this
7:30 I'm gonna say it anyway
7:44 these are experiments
8:59 how we talk about things matters
12:00 people get nervous
12:31 You want to do 'what' in production??
13:30 use your monitoring like it's for real because it is
15:43 but what about people?
16:21 what do you feel knowing Netflix uses chaos engineering?
16:42 what about your bank?
16:58 blast radius - twitter poll
16:24 data, such as it is
18:32 management can get... nervous - consider your words
19:34 it's all about philosophy
21:59 safety first
25:05 - https://speaking.mattstratton.com
Watch