Conf42 Incident Management 2022

List of videos

Premiere - Conf42 Incident Management 2022

Conf42 Incident Management starts here!
Schedule, Lineup & RSVP: https://www.conf42.com/im2022
Join Discord to interact: https://discord.gg/DnyHgrC7jC
0:00 sponsored segment
1:00 intro, sponsors & partners
keynotes
1:59 Ryan McDonald
2:26 Vanessa Huerta Granda
3:11 Nick Mason & Emily Arnott
4:04 Ankit Jain
4:45 Jeff Nickoloff
5:22 Lisa Karlin Curtis
culture
5:43 Hila Fish
6:30 Mandi Walls
7:05 Amir Shaked
tools
7:30 Ricardo Castro
8:19 Valera Bronshtein
deep dive
9:01 Chris Nesbitt-Smith
9:29 Hari Krishnan
10:02 Nishant Roy
10:42 Eran Levy
lessons learned
11:21 Brian Contos
12:17 Kurt Andersen
13:05 panel Discussion teaser: Nora Jones, Erin McKeown & Charity Majors
13:25 Thank you, Join our Discord to interact! https://discord.gg/DnyHgrC7jC

Panel Discussion | Nora Jones, Erin McKeown & Charity Majors | Conf42 Incident Management 2022

Nora Jones, Erin McKeown & Charity Majors will discuss the Incident Management practices in their respective organizations. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022

Incidents: the customer empathy workshop | Ryan McDonald | Conf42 Incident Management 2022

Organizations are focusing on incidents more than ever but failing to leverage them to their full potential. But by framing incidents and post-incident reviews as customer empathy-building opportunities, we can facilitate more creative technical problem-solving, unlock improvements to your response process, and enable organizational agility that otherwise might have gone unnoticed. This talk will deliver actionable methods to increase customer empathy before, during, and after an incident. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022 — 0:00 Intro 1:40 Talk

One more step in Learning from Incidents | Vanessa Huerta Granda | Conf42 Incident Management 2022

One more step in Learning from Incidents: Sharing incident findings effectively
Oftentimes post-incident activities involve a post-mortem meeting and document. These two vary in quality, from focusing only on a single root cause, maybe a 5 Whys, to, on the bright side, a thorough investigation that takes into account multiple points of view. Regardless, once the meeting is complete, the output usually ends up in a document hidden in a drive that no one ever opens, and the knowledge that was discovered during the postmortem stays only with those who attended the meeting. If learning from incidents allows us to turn outages into opportunities, how can we make those learnings reach the most people? We do this by carefully and thoughtfully sharing our findings. By sharing our findings we allow for more equitable learning (accounting for people with conflicts or illnesses who couldn’t attend the review meeting), can get buy-in for next steps, and help people in the org gain a more well-rounded knowledge of how things work. This talk is meant for anyone who is involved in incidents in any way: responders, subject matter experts, impacted users, and facilitators, but mostly for technologists who want to make the most out of their incidents through learning! Throughout this talk I will give an overview of why sharing matters and go deep into the different ways that we can share incident learnings depending on your needs and audiences, as well as provide examples, with the hope that folks are able to start applying these forms of communication in their own orgs. Furthermore, I want folks who watch this talk to leave with a sense that change can happen and that we aren’t meant to keep repeating the same problems over and over again. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 0:39 Talk

Too many people in the room? | Nick Mason and Emily Arnott | Conf42 Incident Management 2022

When something goes wrong, it can be tempting to gather as many people as you can to fix it. Each person can contribute tremendous value through diverse viewpoints, but too many people can overcrowd your response, leading to miscommunication, redundant work, and much more. This talk will teach you to avoid overcrowding incidents through smarter escalation policies, role-based tasks to organize efforts, and more efficient communication. A lean, focused team of relevant players can achieve much more than a bloated, confused one. Only then will you start to reduce the burden for your on-call team and keep customers happy. Blameless is drawing! Participate in a raffle to win Beats Solo Wireless: https://info.blameless.com/raffle-conf42-incident-management Other talks at this conference 🚀🪐 https://www.conf42.com/im2022 — 0:00 Intro 0:39 Talk

Automate merging to keep builds healthy at scale | Ankit Jain | Conf42 Incident Management 2022

Code-submission processes can significantly affect developer productivity, especially as engineering teams scale and codebase complexity grows. Often, teams that work on a monorepo struggle with keeping their main branch stable, especially as the number of engineers merging changes (and consequently, the number of code submissions per day) grows. This happens because incompatibilities emerge when multiple changes are combined, causing builds to break frequently. This in turn causes costly rollbacks, blocked deployments, and lost engineering hours. Poly-repo setups present their own challenges: synchronizing merges when changes span multiple repositories, rolling back related changes across repos, and testing across multiple build/test pipelines can become coordination time-sinks for developers. This talk will feature a distillation of various merge strategies that help teams scale, and their associated developer-productivity trade-offs. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Get Ready to Recover with Reliability Management | Jeff Nickoloff | Conf42 Incident Management 2022

The best way to minimize the impact of incidents is to prepare to respond ahead of time. But it is difficult and expensive to prepare for every possible symptom or cause. This talk will cover how to test the reliability of your system, evaluate your incident readiness, and prioritize future preparation. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022 — 0:00 Intro 1:40 Talk

Using incidents to level-up your teams | Lisa Karlin Curtis | Conf42 Incident Management 2022

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems. The first part of the talk will walk through the different things you can learn from incidents, including:
- Taking you to the edges of the systems your team owns, and beyond - incidents help broaden your understanding of the context in which you're building
- Showing you how systems fail, so you can learn to identify and build software with good observability, and considerations of failure modes
- Expanding your network inside your organisation, making connections with different people, who you can learn from and collaborate with
We'll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without 'getting in the way'. Finally, we'll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams? Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Incident Management - Talk the Talk, Walk the Walk | Hila Fish | Conf42 Incident Management 2022

Remember when we were at school, and people said, "Actively listening in class guarantees 50% prep for the upcoming test"? The same goes for being proactive at work in ways that will instantly prepare you to manage incidents better (at night or in general). In this talk, I will lay out the foundations of incident management, including key questions that, if you can answer them, will let you manage incidents easily, no matter the time and place. I will also show the best practices I've finalized over the years that helped me get a clear vision of how to manage production incidents in the quickest and most efficient way possible. Embracing the tips I'll give you will guarantee you'll not only talk the talk but also walk the walk when it comes to incident management. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Plan for Unplanned Work: Game Days & Chaos Engr. | Mandi Walls | Conf42 Incident Management 2022

How do you plan for unplanned incidents? You practice with Chaos Engineering. Strong incident response doesn't just happen; you have to build the skills and train your team. Practicing for major incidents gives your team insight into how your applications will behave when something goes wrong as well as how the team will interact to solve problems. Combining your Incident Response practices with Chaos Engineering roots your response practice in real-world scenarios, helping your team build confidence. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Create a learning culture | Amir Shaked | Conf42 Incident Management 2022

Building and maintaining a five 9s system isn’t just about the tools and technologies. Development culture has a big part in how you keep a system available while scaling it up and supporting more features, users, and locations. A healthy learning culture, supporting the development, not repairing mistakes, and identifying weak points is another tool in the engineering toolbox. In this talk, we will discuss how to create a learning culture using debriefs, what to avoid, and how to instill change in an engineering organization. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Relia...bility? | Ricardo Castro | Conf42 Incident Management 2022

Technology ecosystems are complex, and it is really important to understand every change and how it affects our systems, as well as the service provided. Users expect systems to be up, responsive, fast, consistent, and reliable. Reliability for systems means that they are doing what their users need them to do. A system's reliability is essentially how happy users are, and we know that happy users are good for business. If reliability is one of the most important requirements of any system, users determine what reliability means, and it’s okay to not be perfect all the time, then we need a way of thinking that accounts for all of this, since we have limited resources to spend, be they financial, human, or political. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Build a low-cost CI/CD solution on top of AWS | Valera Bronshtein | Conf42 Incident Management 2022

A typical start-up builds its initial infrastructure quick and dirty to get relevant and grow fast. That's awesome, but the bill comes later, in the form of tools that don't follow best practices and consume a lot of time and money to manage. In this session we will show how to switch from a huge one-node Jenkins server to a high-performance Jenkins fleet based on on-spot agents. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Policy as [versioned] Code | Chris Nesbitt-Smith | Conf42 Incident Management 2022

In this talk Chris will trace back the origins of how policies are often incepted, how they can get out of hand, be slow if not impossible to update and measure compliance against, and often lead us to question whether the policy is helping or hindering. From this talk you'll learn how to apply a software development pattern and product ways of thinking to how your organization manages policy: achieve continual updates to policy, allowing the risk mitigations to move as fast as the risk does, without getting in the way, while keeping compliance easy to measure. Key takeaways:
- Policy often causes more harm than good, is slow to update, exemptions are harder still to manage, measuring compliance at scale is near on impossible.
- Throwing some curly braces at a problem is not the solution. Policy, if it is articulated as code, needs to embrace all the best practices of code.
- Purposeless policy is potentially practically pointless. (now say it 5 times quickly)
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 0:39 Talk

Contract Driven Development | Hari Krishnan | Conf42 Incident Management 2022

Our largest hurdle in deploying a MicroService was the Integration Testing stage. Just one incompatible API was enough to break the integration environment and block the path to production for all services. While adopting OpenAPI helped address some of the communication gaps in API specs between teams, the deviations during implementation continued to persist. We needed an approach that changed the way teams collaborated on API Specs and also removed the need for integration testing. To fill this need we came up with Contract Driven Development, which consists of:
1. Contract as Test - Contract (Example: OpenAPI) translated to Test Scenarios against the API implementation. Ensures that Provider (API implementation) adheres to Contract.
2. Smart Service Virtualisation - Verify Stub Data against OpenAPI Spec. Ensures the Consumer (API Client) is compatible with Provider's Contract.
3. Backward Compatibility Testing - OpenAPI vs OpenAPI (no code) to check if versions are backward compatible. Helps teams analyse if a change will break integration.
Takeaways:
1. Issues with Integration Testing - The problem statement
2. Executable API Specifications - The role of API Specification Standards in eliminating Integration Tests
3. What is Contract Driven Development? Metrics to understand ROI
Target Audience - CTOs / Heads of Engineering / Technology Leaders, Senior Engineers
Pre-requisites - API Design Basics, Backward Compatibility, Service Virtualisation, Experience with Contract testing will be a bonus
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Running an effective incident management process | Nishant Roy | Conf42 Incident Management 2022

When working on technical systems, it's inevitable that something will at some point break. Therefore, it is extremely important to be prepared for how to handle such a situation, and ensure you and your team are doing what you can to minimize the downtime for your users and/or customers. Outages and similar drops in availability can occur in any system, whether it's user facing, employee facing, revenue generating, or recommendation generating. By minimizing the time taken to respond to and resolve such an incident, you can make sure to minimize the impact to your topline, customers, and users. This is a tech talk intended to share some ideas with developers, SREs, and managers, on how to run a healthy incident management process within your organization. When things go wrong, it's extremely important to remain calm, reduce and resolve the impact, retroactively identify the top learnings, and follow up to make your systems more resilient to such outages in the future. As the engineering manager for the Ads Serving Platform team at Pinterest, which owns multiple business critical services, and a member of the incident manager oncall for Pinterest, I have experienced several high-severity incidents, and have learned a lot from this process. I hope my learnings can be of use to others in similar positions to run an effective incident response process. By the end of this talk, the audience should be able to answer the following questions:
1. What differentiates an incident from a bug?
2. How can we empower our team(s) to reduce adverse impact to our customers/users/business?
3. How can we build a healthy culture around learning from our mistakes and growing together?
4. How can we measure and track improvements to our incident management process?
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Use Chaos Engineering to improve incident response | Eran Levy | Conf42 Incident Management 2022

As engineers, we used to write code that was interacting with a well defined set of other applications. You usually had a set of services that were running in well defined environments. The evolution of cloud native technologies and the need to move fast led organizations to redesign their structure. Engineers are now required to write services that are just one of many other services that usually solve a certain customer problem. Your services are smaller than what they used to be, they aren’t alone in a vacuum, and you have to understand the problem space that your service is living in. These days engineers aren’t just writing code. They are expected to know how to deal with Kubernetes and Helm, containerize their service, ship to different environments, and debug in a distributed cloud environment. In order to enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called “On-Call like a king”, which aims to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving, and finding the root cause. In this talk I will share how we got there, what we are doing, and how it improves our engineering teams' expertise. Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Cameras & Clocks: Enterprise IoT Security Sucks | Brian Contos | Conf42 Incident Management 2022

Enterprise Internet of Things (IoT) security today is analogous to IT security in the mid 1990s. It was a time when security awareness was limited, countermeasures and best practices weren’t broadly applied, and attackers explored, compromised, controlled, and exfiltrated data from systems with minimal resistance. In short, enterprise IoT security sucks as bad today as that unpatched Windows NT 3.51 server with an RS-232 connected modem that IT forgot about. Working globally with Fortune 500 enterprises and government agencies we’ve interrogated over two million production IoT devices. Across these two million devices we’ve identified threats and trends, compiled statistics, summarized compelling cases, and evaluated common offenders. We’ve also assembled tactics that organizations can employ to recognize value from their IoT devices while minimizing risk and ensuring that devices that are secure today will stay secure tomorrow. Security issues are compounded by the quantity of IoT devices. Our analysis indicates that most organizations have about five IoT devices per employee. The global IoT market has grown from $100 billion in 2017 to over $1 trillion in 2022. There are over 46 billion connected devices today and 30 billion (65%) of those devices are IoT. We are increasingly dependent on consumer, enterprise, industrial, and military IoT devices for cost reduction, supply chain logistics, productivity gains, security, and everything in between. Despite the criticality of IoT, our security hasn’t kept pace. 
In the enterprise, we’ve identified that we simply don’t know:
● What IoT devices we have - guesses based on legacy asset discovery solutions are consistently off by at least 50%
● When our firmware was last updated - in many cases the firmware is end of life and the average IoT firmware age is six years
● If our credentials follow organizational policies - passwords that are default, low-quality, don’t have scheduled rotations, and lack centralized management are the norm
● How vulnerable our IoT devices are - at least half of the IoT devices we’ve interrogated have known, high to critical level CVEs
While enterprise IoT security currently sucks, it doesn’t have to be that way. By evaluating the security risks and the inherent limitations of IoT, you can leverage tactics that will have a rapid and positive impact on security. Attendee takeaways:
● Discover your IoT devices, diagnose their security, and define their limitations
● Employ tactics to improve your IoT security and communicate their status to stakeholders
● Restate key findings derived from the interrogation of two million production IoT devices
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 1:40 Talk

Varieties of Incident Response | Kurt Andersen | Conf42 Incident Management 2022

Have you ever wondered if there was a better way to respond to incidents? When you are in the midst of an incident, does "the process" help you and your teammates or is it more of a burden? There have been a variety of approaches to organizing people and teams over the 30+ years of online services. Each of them has benefits and drawbacks. This talk will dive into a representative set of these approaches to examine them and give the audience a wider context in which to evaluate their own arrangements for incident response. The talk will also look at incident response from a more abstract, task/intent-focused perspective to give a framework against which processes can be examined and adjusted to be more enabling, less burdensome. (And no, this is not a lite beer commercial ;-)) Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
0:00 Intro 0:39 Talk