Lessons in building resilient systems at Amazon and Meta | Zuodong Xiang | Conf42 IM 2024
Read the abstract ➤ https://www.conf42.com/Incident_Management_2024_Zuodong_Xiang_52_resilient_systems_amazon
Other sessions at this event ➤ https://www.conf42.com/im2024
Support our mission ➤ https://www.conf42.com/support
Join Discord ➤ https://discord.gg/mvHyZzRGaQ

Chapters
0:00 Introduction: What Can Possibly Go Wrong?
0:58 Real-Life Scenario: Flood of Traffic
3:01 Real-Life Scenario: Retry Storm
5:10 Real-Life Scenario: Plan B Went Poorly
6:24 Real-Life Scenario: Bad Commit
8:21 Real-Life Scenario: Lack of Sufficient Ownership
9:21 Real-Life Scenario: Script Errors
10:20 Prevention Strategies: Defensive Coding Practices
11:09 Logging and Error Handling Best Practices
12:35 Setting Effective Alerts
15:04 Mitigation Strategies for Alerts
15:46 Preparing for High Velocity Events
17:27 Conducting a Self Review
19:42 Conclusion and Takeaways