The Hunt for the Cluster Killer Bug | Dániel Szoboszlay | Code BEAM Europe 2022
This video was recorded at Code BEAM Europe 2022 - https://codesync.global/conferences/code-beam-sto-2022/

The Hunt for the Cluster Killer Bug | Dániel Szoboszlay

ABSTRACT
We know Erlang is all about fault tolerance. A well-engineered Erlang system, such as Kred, at the heart of Klarna's business, will never stop, no matter what. Yet about a year ago a short Kafka outage shook our mighty Kred so badly that it knocked out all but one node. A few days later a second outage took down the entire cluster. How could this happen? This is the story of our hunt for the cluster-killer bug before it could strike again. It is a story of unexpected twists and of descending to the deepest depths of the technology stack powering an Erlang application.

OBJECTIVES
Give some new tools for debugging low-level issues in an Erlang stack. Teach about Erlang's memory model.

AUDIENCE
Developers who would like to add some new tricks to their debugging toolbox.

• Timecodes
00:00 - 04:39 - Intro and Fault Tolerance
04:40 - 08:27 - System Architecture
08:28 - 09:15 - Troubleshooting
09:16 - 13:29 - Identify
13:30 - 15:26 - Fix
15:27 - 20:36 - Alert + Identify + Fix
20:37 - 21:32 - The incident
21:33 - 27:26 - Symptoms
27:28 - 29:45 - Validate
29:46 - 34:10 - The Path of Metrics
34:11 - 40:19 - Testing lock-ups
40:20 - 46:36 - The Mystery Term

• Follow us on social:
Twitter: https://twitter.com/CodeBEAMio
LinkedIn: https://www.linkedin.com/company/27159258

• Looking for a unique learning experience? Attend the next Code Sync conference near you! See what's coming up at: https://codesync.global

• SUBSCRIBE TO OUR CHANNEL
https://www.youtube.com/channel/UC47eUBNO8KBH_V8AfowOWOw