Conf42 Site Reliability Engineering (SRE) 2024

List of videos

Premiere - Conf42 Site Reliability Engineering (SRE) 2024

Support our mission ➤ https://www.conf42.com/support Schedule, Lineup & RSVP ➤ https://www.conf42.com/sre2024 Join Discord ➤ https://discord.gg/DnyHgrC7jC Upcoming CFPs ➤ https://www.papercall.io/events?cfps-scope=&keywords=conf42 0:00 Intro ai 0:53 Michele Dodic & Anastasia Archangelskaya - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Michele_Dodic_Anastasia_Archangelskaya_journey_nextgen_aio 1:30 Asutosh Mourya - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Asutosh_Mourya_futureproofing_integrating_resilience 1:53 Indika Wimalasuriya - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Indika_Wimalasuriya_amplifying_genai_reliability chaos 2:30 Peter De Tender - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Peter_De_Tender_stresstesting_azure_chaos 3:15 Hareesh Iyer - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Hareesh_Iyer_chaos_engineering_practical cloud 3:44 Ederson Brilhante - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Ederson_Brilhante_building_secure_flexible 4:31 Joshua Fox - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Joshua_Fox_wafs_web_application 5:04 Nikolay Sivko - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Nikolay_Sivko_zeroinstrumentation_observability_ebpf 5:45 Alex Dejanu - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Alex_Dejanu_k8s_strategies_ask culture 6:08 Jorge Luis Castro Toribio - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Jorge_Luis_Castro_Toribio_building_reliable_community 6:46 Evgenii Korneev - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Evgenii_Korneev_resilient_teams_blueprint 7:17 David Argent - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_David_Argent_avoid_agile_victim deep dive 7:53 Dan Slimmon - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Dan_Slimmon_clinical_troubleshooting_diagnose 8:34 Pravar Agrawal - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pravar_Agrawal_debugging_cluster_oncall 9:02 Aleksei Popov - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Aleksei_Popov_maze_complexity_distributed_systems 9:38 Harel Safra - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Harel_Safra_infrastructure_ends_practical 10:02 Serter Kazim Solak - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Serter_Kazim_Solak_visualization_techniques_datasets reliability 10:38 Marco Pierobon - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Marco_Pierobon_challenges_platform_tips 11:17 Jaiprakash Pherwani - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Jaiprakash_Pherwani_manage_service_risk 11:55 Dmitrii Pakhomov - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Dmitrii_Pakhomov_resilience_fintech_strategies scaling 12:32 German Urikh - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_German_Urikh_code_review_tips 13:13 Pranay Prateek - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pranay_Prateek_scaling_opentelemetry_kafka (no intro) Adam Gardner - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Adam_Gardner_reality_platform_engineering transformation 13:56 Pini Reznik - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pini_Reznik_error_carbon_empowering (no intro) Ricardo Castro - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Ricardo_Castro_slos_eventbased_navigating 14:42 Chinmay Naik - https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Chinmay_Naik_devops_mlops_scaling 15:20 thank you!

Journey to Next-Gen AIOps: eBPF & GenAI | Michele Dodic & Anastasia Archangelskaya | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Michele_Dodic_Anastasia_Archangelskaya_journey_nextgen_aio Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:34 speakers 1:12 agenda motivation 1:43 key sre challenges 3:11 why do sres need aiops? aiops journey 4:29 transformation from monitoring to zero-touch introduction to ebpf 9:58 what is ebpf? 10:27 how does ebpf work? next-gen aiops 11:38 aiops: current challenges on the market 14:08 use case *1: risk assessment in containerized applications 15:47 use case *2: securing devops pipelines 17:13 use case *3: sre copilot powered by genai 18:36 demo architecture 23:54 conclusion takeaways 24:40 thank you!

Future-Proofing SRE: Integrating AI for Resilience and Efficiency | Asutosh Mourya | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Asutosh_Mourya_futureproofing_integrating_resilience Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:36 how ai fits into sre workflow 1:38 intelligent filtering and prioritization 4:56 anomaly detection 7:29 analysis and summarisation 9:41 forecasting and predictive analysis 12:04 reducing toil 14:17 challenges 16:36 thank you

SRE 2.0: Amplifying Reliability with GenAI | Indika Wimalasuriya | Conf42 SRE 2024

Read the abstract ➤ [abstract link] Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 2:01 sre 2.0: amplifying reliability with genai 2:31 agenda 2:52 quick intro about myself 3:26 gartner sre hype cycle 4:24 sre 9:10 navigating digital transformation: managing ever-growing complexity 10:36 operations is a software problem 13:10 genai emerges: unveiling the power of next-gen artificial intelligence 13:40 unveiling the potential: the capabilities of llm 15:15 navigating challenges: risks associated with llms 16:15 addressing model challenges: finding effective solutions 16:38 retrieval-augmented generation (rag) / knowledge bases 18:50 llm agents 20:57 prompt engineering best practices 21:43 prompt engineering properties 21:59 sre 2.0 23:33 genai in observability 26:17 use case - analyze log data to automatically identify root causes of performance issues 27:37 genai in sli, slo, and error budgets 29:43 use case - recommend optimal error budget allocations based on business priorities and user expectations 30:46 genai in system architecture and recovery objectives 32:33 use case - predict the impact of different failure scenarios on system availability and performance 33:23 genai in release & incident engineering 35:45 use case - provide real-time incident response recommendations based on the current situation and historical data 36:52 genai in automation 39:23 use case - analyze the effectiveness of automation workflows and recommend improvements based on performance metrics 40:22 genai in resilience engineering 41:43 use case - automate the execution of chaos experiments based on identified risk factors and failure scenarios 42:32 genai in blameless postmortems 44:07 use case - analyze historical post-mortem data to identify recurring patterns and trends in incidents 45:02 measure progress with business outcomes 46:15 best practices 47:20 pitfalls to avoid 49:13 thank you.

Stress-testing Azure Resources using Chaos Studio | Peter De Tender | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Peter_De_Tender_stresstesting_azure_chaos Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:34 peter de tender 1:29 what is sre 2:40 the role of an sre? 3:19 what is chaos engineering 5:07 the curious case of cpu pressure 9:29 is chaos engineering - devops 3.0? 10:38 welcome to azure chaos studio 12:23 chaos experiments 14:28 azure chaos studio demo 32:26 resources 32:59 thank you!

Chaos Engineering in Action: Building Fault-Tolerant Systems | Hareesh Iyer | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Hareesh_Iyer_chaos_engineering_practical Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:53 agenda 1:15 why chaos engineering? 3:19 update about the october 4 outage 5:00 business impact of resilience is bigger than ever 5:19 why are these issues not surfaced during testing? 5:42 testing 7:19 what is chaos engineering? 10:03 testing vs experiments 11:37 chaos engineering: how to 13:09 #1: observe steady state 14:02 #2: plan hypothesis around the steady state 14:54 #3: run experiments 17:15 #4: verify and act 17:51 chaos engineering tools 18:59 thank you!

Building Secure Multi-Cloud Images with Multi-Boot Mode | Ederson Brilhante | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Ederson_Brilhante_building_secure_flexible Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:38 whoami 0:54 what will we see in this session? 1:14 general cicd overview 3:06 our context 7:49 our approach overview 9:56 project structure 10:39 builder suite structure 13:08 tester suite structure why this stack? 14:52 github actions 16:11 terraform 19:36 packer 21:30 ansible workflows 22:46 build image pipeline 23:30 test image pipeline 23:40 full pipeline workflow for regression tests: examples 24:14 regression tests: use case workflow for promotion image: example 27:24 promotion image: use case dev builds, vm debug and manual tests: examples 29:46 dev builds: use case 30:53 manual tests: use case code samples 32:47 github action - repos structure samples 35:48 composite action - aws build sample 37:43 reusable workflow - build sample 39:12 calling reusable workflow - build sample 40:59 regression tests in base image - pipeline sample 41:56 debug vm - pipeline sample 42:33 crucial takeaways 44:34 more about the stack 45:00 where to find me? 45:32 thank you

No WAFs: Don’t use a Web Application Firewall, and when you should | Joshua Fox | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Joshua_Fox_wafs_web_application Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:33 about joshua fox 1:03 doit 1:25 article 1:36 scenario 2:02 what is a waf? 2:16 drivers for getting a waf 2:22 hacker attack 2:35 penetration test 3:09 urgency 3:18 expertise 3:32 outside requirement/audit 4:00 security blanket 4:18 web threats 4:50 walkthrough: cross site scripting 5:09 without waf 5:44 demo waf architecture 5:59 make it safe! 6:28 a simple chat message is executed 6:34 with waf 7:05 sql injection 8:02 ddos 8:10 why distributed? 8:33 application-level threats 8:39 broken access control 9:15 toss in a waf 9:20 how cloud armor works 9:25 architecture 9:57 policies and rules 10:16 rules 10:41 types of rules 11:28 preconfigured rules (use these!) 11:48 sensitivity (paranoia) 12:08 standard signatures 12:41 sample signature 13:03 rule language 13:28 waf won't protect you! 13:37 blocking your own app 14:24 false positives 15:06 job zero 15:33 secure your app 16:33 but the most important 16:43 ddos 17:27 ip address 17:45 geo 17:52 dry run 17:56 preview 18:26 problem with preview 18:47 false negatives 18:54 imperfection detection 19:17 the worst: broken access control 19:40 attackers shift 19:57 attackers are smart 20:08 flexibility? 21:22 waf adds risk, man-in-the-middle 21:37 risk: complacency 22:01 risk to performance 22:12 pricing 23:03 at long last... 23:07 eternal requirement 23:18 third-party apps 23:33 central supervision 24:36 the one go-to feature 24:43 consider advanced services 25:11 if you're going to do it, do it now 25:19 prefer your cloud's waf 25:46 minuses of waf 26:09 plusses of a waf 26:31 conclusion 26:45 we're hiring!

Zero-instrumentation observability based on eBPF | Nikolay Sivko | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Nikolay_Sivko_zeroinstrumentation_observability_ebpf Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:33 observability is ... 1:06 systems a while ago 1:42 modern systems 2:31 making a system observable 4:32 collecting telemetry data 10:50 a quick intro into ebpf 11:42 how to use ebpf 15:24 coroot-node-agent (apache 2.0 license) 16:32 how the agent leverages ebpf 21:32 ssl 22:59 ebpf: performance impact 25:13 ebpf-based metrics 28:18 ebpf-based traces (spans) 28:47 ebpf-based tracing limitations 31:05 ebpf-based continuous cpu profiling 32:10 ebpf-based cpu profiling 33:10 how coroot works 34:10 conclusion 34:54 thank you, let's connect!

2024: I Don't Know K8S and at This Point, I'm Too Afraid To Ask | Alex Dejanu | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Alex_Dejanu_k8s_strategies_ask Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:55 agenda 1:14 whoami 1:34 history 3:37 are you looking for a job? 4:18 kube_flex 5:59 architecture 6:48 architecture 7:28 controller 7:50 operators 8:37 operator under 5min - demo 19:50 more memes 20:58 conclusions 21:55 thank you

Building reliable product through SRE community | Jorge Luis Castro Toribio | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Jorge_Luis_Castro_Toribio_building_reliable_community Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 1:21 jorge luis castro toribio 2:15 in this talk 3:41 let's take care of the basics 3:55 what is sre? 10:03 #real_life 11:59 some of our problems 14:20 community of practices (cop) 15:21 our first thoughts 17:05 sre community of practices (srecop) 19:42 sre cop team 22:04 cop is an investment 22:24 sre cop aligned to business strategy 29:22 some metrics recommendation 29:55 how do we make cop last longer and more engaging? 30:15 meg: minimum enjoyable game 30:44 octalysis 31:52 what we achieved? 32:42 learned lessons 33:42 to sum up 34:07 books 34:26 thank you!

Resilient Systems & Teams: A CTO’s Blueprint for SRE Excellence | Evgenii Korneev | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Evgenii_Korneev_resilient_teams_blueprint Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:42 topics 1:34 do startups need sre? 4:25 on-call duties 7:17 when do you need a dedicated sre engineer or sre team? 9:47 communications 11:24 strong leads are key to success 12:26 system resilience is a comprehensive set of measures 13:40 thank you!

How to Avoid Being an Agile Victim | David Argent | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_David_Argent_avoid_agile_victim Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:44 agenda 1:02 introduction 2:13 a (very) brief review of agile 4:25 how to fail 4:40 failure 101 6:03 failure 201 7:49 failure 301 10:54 think before you code 11:19 design 101 15:04 design 201 17:42 design 301 20:08 balancing tactics and strategy 20:37 tactics 101 23:53 strategy 101 25:34 strategy 201 27:13 code is not your only deliverable 27:38 deliverables 101 28:23 deliverables 201 30:23 summary 31:29 thank you

Clinical troubleshooting: diagnose production issues | Dan Slimmon | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Dan_Slimmon_clinical_troubleshooting_diagnose Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:59 who's this guy? 1:39 clinical troubleshooting 26:42 reach out to dan

Debugging cluster issues as an on-call SRE | Pravar Agrawal | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pravar_Agrawal_debugging_cluster_oncall Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:40 agenda 1:21 whoami 1:39 introduction to sre 2:55 understanding on-call process 4:23 some common cluster issues 6:32 approach to debugging 8:10 automation to the rescue? 9:16 shades of automation 12:06 advice for beginners 13:09 thank you!

Mastering the Maze: Navigating Complexity in Distributed Systems | Aleksei Popov | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Aleksei_Popov_maze_complexity_distributed_systems Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:33 agenda 0:49 what is a distributed system? 1:41 what is complexity? 2:24 monolithic architecture 2:56 disadvantages 3:54 microservices architecture 4:42 what do distributed systems give us? 6:58 challenges, quality attributes 9:05 what are main troubles? 11:28 unreliable networks 12:44 strategy: timeout 13:13 strategy: retry 14:01 strategy: idempotency 14:51 strategy: circuit breaker 15:42 concurrency and lost writes 16:22 strategy: snapshot isolation 18:49 strategy: compare and set 20:03 strategy: lease 20:51 dual write problem 22:25 strategy: transactional outbox 23:08 strategy: log tailing 23:29 unreliable clocks 25:16 availability and consistency 25:41 high availability 25:55 failure 26:24 consistency types 26:44 linearizability 27:23 strategy: distributed consensus algorithm, e.g. raft 29:16 more complexities 30:03 eventual consistency 30:56 strategy: read from leader 31:19 process pauses 32:00 strategy: fencing 34:01 observability 34:24 strategy: distributed tracing 35:40 strategy: orchestration over choreography 36:36 evolvability and cybernetics principles 37:32 systems thinking 38:30 feedback loops 39:13 adaptability and learning 39:46 goal-oriented design 40:40 big ball of mud 41:30 hierarchy 42:04 fallacy: all microservices are the same 42:22 strategy: service types 45:30 sre principles 48:24 infrastructure as code 49:11 chaos engineering and testing: jepsen tests 50:39 simplicity and measuring complexity 51:58 thank you for attending!

When Infrastructure as Code Ends - Creating Terraform Providers | Harel Safra | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Harel_Safra_infrastructure_ends_practical Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:40 agenda 0:49 about me 1:22 infrastructure as code 2:24 terraform providers 3:05 architecture 4:01 the api 4:23 golang 5:36 documentation 6:05 writing the provider 6:10 about the demo we'll use 7:27 terraform plugin framework 8:07 create / read / update / delete 10:53 schema & attributes 13:25 types 14:58 local running and debugging 18:40 acceptance tests 21:41 publishing 23:09 my journey 23:54 thank you for your time!

Advanced Visualization Techniques for Complex Data Sets | Serter Kazim Solak | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Serter_Kazim_Solak_visualization_techniques_datasets Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 3:01 introduction 5:46 types of data visualization 6:48 interactive visualization 12:02 visualizing large data sets 22:52 data visualization best practices 24:51 conclusion

Challenges of Platform teams (and a few tips to overcome them) | Marco Pierobon | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Marco_Pierobon_challenges_platform_tips Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:39 marco pierobon 0:59 thoughtworks' purpose 1:21 technology excellence then, now and next 1:44 we wrote the book on it 2:04 agenda the challenges 2:24 devex 4:05 business 5:16 technology 6:57 platform teams how to overcome these challenges 8:44 devex 10:29 business 11:52 technology 13:39 platform teams practices & product thinking 15:35 agile in platform teams 17:17 sdlc in platform teams 19:47 team topologies 21:41 product (portfolio) management 24:10 summary

Manage service reliability by managing risk | Jaiprakash Pherwani | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Jaiprakash_Pherwani_manage_service_risk Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:47 slo's - realistic? 1:41 risk analysis 2:57 risk catalog 5:20 typical risk catalog 7:45 rate your risks 9:58 accepting risks 12:31 leverage chaos engineering 14:00 thank you

Scala-Powered Strategies for Building Fault-Tolerant Systems | Dmitrii Pakhomov | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Dmitrii_Pakhomov_resilience_fintech_strategies Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:31 about me 1:07 what is resilience? 1:29 bulkhead 4:57 cache 5:48 inmemory cache 10:37 reloadable config 11:41 fallback 12:57 retry 14:10 circuit breaker 15:24 summarize 15:51 fin

Code Review: How To Not Drive Your Colleagues Into Depression | German Urikh | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_German_Urikh_code_review_tips Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:43 what is code review? 1:06 why do we need it? 2:55 best practices 9:23 thank you!

Scaling Opentelemetry Collectors using Kafka | Pranay Prateek | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pranay_Prateek_scaling_opentelemetry_kafka Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC SigNoz on slack ➤ https://signoz-community.slack.com/join/shared_invite/zt-2gag5t3k4-WE5I6xpNbczyDJNdLLJkAg#/shared-invite/email Chapters 0:00 intro 0:26 preamble 0:33 about me 0:51 signoz - open source observability platform 1:46 what is opentelemetry? 2:19 why is opentelemetry important? 4:14 introduction to opentelemetry collector 6:00 architecture of signoz cloud (single tenant) without kafka 7:09 issues with scaling with just opentelemetry collector 8:17 architecture of signoz cloud with kafka 10:01 how kafka can help 12:32 kafka setup, records 13:44 monitoring consumer lag is important 15:03 scaling based on consumer lag 16:15 monitoring producer - consumer latency 16:44 kafka based architecture is working well so far... 17:29 potential improvements 18:51 get involved in a growing community 19:20 thank you

The reality of Platform Engineering at Enterprise Scale | Adam Gardner | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Adam_Gardner_reality_platform_engineering Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:36 platform engineering at dynatrace 0:54 agenda 1:21 whoami 1:38 the beginning... hands on training days 3:01 the platform 4:28 platform deployment 4:38 platform capabilities 5:35 use cases across dynatrace 7:18 discoveries 8:48 enforce policies / guide to best practices 9:30 lessons learned 10:22 don't gold plate! 11:16 mvsp exceptions 12:31 where next? 13:47 thanks!

Carbon Budgets: Empowering SREs for Eco-Efficiency | Pini Reznik | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Pini_Reznik_error_carbon_empowering Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:43 how bad are bananas 0:59 why? 3:01 regulation 4:39 fines and carbon taxes? 5:15 understanding it emissions 8:54 software carbon intensity (sci) specification 10:16 wrong ways to reduce emissions 13:42 a better process 14:53 focus areas 15:17 sre 15:25 purpose of sre 15:49 key metrics in sre 16:55 slo 17:29 sla & sli 17:52 measurement tools 18:58 aether - carbon observability 19:58 continuous profiling 20:16 profiling talks 20:46 how to achieve even more 21:19 the sustainable 50% challenge 21:56 other waste patterns 25:19 extending hardware life 25:31 wrapping up 25:58 some random examples 29:58 what can i do now?

SLOs for Event-Based Systems | Ricardo Castro | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Ricardo_Castro Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble

DevOps to MLOps: Scaling ML Models to 2 Million+ Requests per Day | Chinmay Naik | Conf42 SRE 2024

Read the abstract ➤ https://www.conf42.com/Site_Reliability_Engineering_SRE_2024_Chinmay_Naik_devops_mlops_scaling Other sessions at this event ➤ https://www.conf42.com/sre2024 Support our mission ➤ https://www.conf42.com/support Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 0:26 preamble 0:36 chinmay naik 1:12 agenda 1:38 what is mlops 2:30 mlops steps 4:05 simplest mlops flow 6:11 production work ahead 6:37 case study - ekyc saas apis 7:01 ml model apis 8:20 architecture 9:54 ekyc saas apis - requirements 10:51 cloud agnostic architecture 13:45 why cloud agnostic? 14:34 scaling journey 16:19 eliminate single points of failure 18:36 capacity planning 20:35 cost optimization and autoscaling 24:27 production issue 1 - gpu utilization in nomad 27:41 production issue 2 - high latency issue 30:57 lessons 33:33 keep learning
