Running an effective incident management process | Nishant Roy | Conf42 Incident Management 2022
When you work on technical systems, something will inevitably break at some point. It is therefore extremely important to be prepared to handle such a situation, and to ensure that you and your team are doing what you can to minimize downtime for your users and customers. Outages and similar drops in availability can occur in any system, whether it is user facing, employee facing, revenue generating, or recommendation generating. By reducing the time taken to respond to and resolve an incident, you minimize the impact on your topline, customers, and users.

This tech talk shares ideas with developers, SREs, and managers on how to run a healthy incident management process within an organization. When things go wrong, it is extremely important to remain calm, reduce and resolve the impact, retroactively identify the top learnings, and follow up to make your systems more resilient to such outages in the future.

I am the engineering manager for the Ads Serving Platform team at Pinterest, which owns multiple business-critical services, and a member of the incident manager on-call rotation for Pinterest. In these roles I have experienced several high-severity incidents and have learned a lot from the process. I hope my learnings can help others in similar positions run an effective incident response process.

By the end of this talk, the audience should be able to answer the following questions:
1. What differentiates an incident from a bug?
2. How can we empower our team(s) to reduce adverse impact to our customers/users/business?
3. How can we build a healthy culture around learning from our mistakes and growing together?
4. How can we measure and track improvements to our incident management process?

Other talks at this conference 🚀🪐 https://www.conf42.com/im2022

0:00 Intro
1:40 Talk