List of videos

Contract Driven Development | Hari Krishnan | Conf42 Incident Management 2022
Our largest hurdle in deploying a microservice was the integration testing stage. Just one incompatible API was enough to break the integration environment and block the path to production for all services. While adopting OpenAPI helped address some of the communication gaps in API specs between teams, deviations during implementation persisted. We needed an approach that changed the way teams collaborated on API specs and also removed the need for integration testing. To fill this need we came up with Contract Driven Development, which consists of:
1. Contract as Test - Contract (Example: OpenAPI) translated to Test Scenarios against the API implementation. Ensures that Provider (API implementation) adheres to Contract.
2. Smart Service Virtualisation - Verify Stub Data against OpenAPI Spec. Ensures the Consumer (API Client) is compatible with Provider's Contract.
3. Backward Compatibility Testing - OpenAPI vs OpenAPI (no code) to check if versions are backward compatible. Helps teams analyse if a change will break integration.
Takeaways:
1. Issues with Integration Testing - The problem statement
2. Executable API Specifications - The role of API Specification Standards in eliminating Integration Tests
3. What is Contract Driven Development? Metrics to understand ROI
Target Audience - CTOs / Heads of Engineering / Technology Leaders, Senior Engineers
Pre-requisites - API Design Basics, Backward Compatibility, Service Virtualisation. Experience with Contract testing will be a bonus.
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
Chapters: 0:00 Intro 1:40 Talk
Watch
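The "Backward Compatibility Testing" idea from the Contract Driven Development abstract above (comparing one OpenAPI document against another, with no implementation code involved) can be illustrated with a minimal sketch. This is not the tooling presented in the talk. It assumes the specs are already parsed into plain dictionaries, and it checks only three kinds of breaking change: removed paths or operations, newly required parameters, and removed response codes. The function names and the sample specs are hypothetical.

```python
# Hypothetical sketch: compare two already-parsed OpenAPI documents (plain dicts).
# Only a small subset of possible breaking changes is detected.

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def required_params(operation: dict) -> set:
    """Names of parameters marked required on a single operation."""
    return {p["name"] for p in operation.get("parameters", []) if p.get("required")}


def breaking_changes(old_spec: dict, new_spec: dict) -> list:
    """List reasons why new_spec could break clients written against old_spec."""
    problems = []
    new_paths = new_spec.get("paths", {})
    for path, old_item in old_spec.get("paths", {}).items():
        if path not in new_paths:
            problems.append(f"{path}: path removed")
            continue
        for method, old_op in old_item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "summary"
            label = f"{method.upper()} {path}"
            new_op = new_paths[path].get(method)
            if new_op is None:
                problems.append(f"{label}: operation removed")
                continue
            # A parameter that becomes required breaks clients that never sent it.
            for name in sorted(required_params(new_op) - required_params(old_op)):
                problems.append(f"{label}: new required parameter '{name}'")
            # A response code that disappears breaks clients that handle it.
            for status in old_op.get("responses", {}):
                if status not in new_op.get("responses", {}):
                    problems.append(f"{label}: response {status} removed")
    return problems


# Hypothetical sample specs: version 2 adds a path (safe) and a required
# query parameter (breaking).
old_spec = {"paths": {"/orders": {"get": {
    "parameters": [{"name": "limit", "in": "query", "required": False}],
    "responses": {"200": {}},
}}}}
new_spec = {"paths": {
    "/orders": {"get": {
        "parameters": [
            {"name": "limit", "in": "query", "required": False},
            {"name": "customerId", "in": "query", "required": True},
        ],
        "responses": {"200": {}},
    }},
    "/orders/{id}": {"get": {"responses": {"200": {}}}},
}}

print(breaking_changes(old_spec, new_spec))
# → ["GET /orders: new required parameter 'customerId'"]
```

A check like this could run in a pipeline whenever the spec file changes, so an incompatible version is flagged before any provider or consumer code is written against it.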
Running an effective incident management process | Nishant Roy | Conf42 Incident Management 2022
When working on technical systems, it's inevitable that something will break at some point. It is therefore extremely important to be prepared to handle such a situation, and to ensure you and your team are doing what you can to minimize downtime for your users and customers. Outages and similar drops in availability can occur in any system, whether it's user facing, employee facing, revenue generating, or recommendation generating. By minimizing the time taken to respond to and resolve such an incident, you minimize the impact on your top line, customers, and users.
This is a tech talk intended to share some ideas with developers, SREs, and managers on how to run a healthy incident management process within your organization. When things go wrong, it's extremely important to remain calm, reduce and resolve the impact, retroactively identify the top learnings, and follow up to make your systems more resilient to such outages in the future.
As the engineering manager for the Ads Serving Platform team at Pinterest, which owns multiple business-critical services, and a member of the incident manager on-call rotation for Pinterest, I have experienced several high-severity incidents and have learned a lot from this process. I hope my learnings can help others in similar positions run an effective incident response process.
By the end of this talk, the audience should be able to answer the following questions:
1. What differentiates an incident from a bug?
2. How can we empower our team(s) to reduce adverse impact to our customers/users/business?
3. How can we build a healthy culture around learning from our mistakes and growing together?
4. How can we measure and track improvements to our incident management process?
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
Chapters: 0:00 Intro 1:40 Talk
Watch
Use Chaos Engineering to improve incident response | Eran Levy | Conf42 Incident Management 2022
As engineers, we used to write code that interacted with a well-defined set of other applications. You usually had a set of services running in well-defined environments. The evolution of cloud native technologies and the need to move fast led organizations to redesign their structure. Engineers are now required to write services that are just one of many services that together solve a certain customer problem. Your services are smaller than they used to be, they don’t live in a vacuum, and you have to understand the problem space your service lives in. These days engineers aren’t just writing code. They are expected to know how to deal with Kubernetes and Helm, containerize their service, ship to different environments, and debug in a distributed cloud environment. To enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called “On-Call like a king”, which aims to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving, and finding the root cause. In this talk I will share how we got there, what we are doing, and how it improves our engineering teams' expertise.
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
Chapters: 0:00 Intro 1:40 Talk
Watch
Cameras & Clocks: Enterprise IoT Security Sucks | Brian Contos | Conf42 Incident Management 2022
Enterprise Internet of Things (IoT) security today is analogous to IT security in the mid 1990s. It was a time when security awareness was limited, countermeasures and best practices weren’t broadly applied, and attackers explored, compromised, controlled, and exfiltrated data from systems with minimal resistance. In short, enterprise IoT security sucks as bad today as that unpatched Windows NT 3.51 server with an RS-232 connected modem that IT forgot about. Working globally with Fortune 500 enterprises and government agencies we’ve interrogated over two million production IoT devices. Across these two million devices we’ve identified threats and trends, compiled statistics, summarized compelling cases, and evaluated common offenders. We’ve also assembled tactics that organizations can employ to recognize value from their IoT devices while minimizing risk and ensuring that devices that are secure today will stay secure tomorrow. Security issues are compounded by the quantity of IoT devices. Our analysis indicates that most organizations have about five IoT devices per employee. The global IoT market has grown from $100 billion in 2017 to over $1 trillion in 2022. There are over 46 billion connected devices today and 30 billion (65%) of those devices are IoT. We are increasingly dependent on consumer, enterprise, industrial, and military IoT devices for cost reduction, supply chain logistics, productivity gains, security, and everything in between. Despite the criticality of IoT, our security hasn’t kept pace. 
In the enterprise, we’ve identified that we simply don’t know:
● What IoT devices we have - guesses based on legacy asset discovery solutions are consistently off by at least 50%
● When our firmware was last updated - in many cases the firmware is end of life and the average IoT firmware age is six years
● If our credentials follow organizational policies - passwords that are default, low-quality, don’t have scheduled rotations, and lack centralized management are the norm
● How vulnerable our IoT devices are - at least half of the IoT devices we’ve interrogated have known, high to critical level CVEs
While enterprise IoT security currently sucks, it doesn’t have to be that way. By evaluating the security risks and the inherent limitations of IoT, you can leverage tactics that will have a rapid and positive impact on security.
Attendee takeaways:
● Discover your IoT devices, diagnose their security, and define their limitations
● Employ tactics to improve your IoT security and communicate their status to stakeholders
● Restate key findings derived from the interrogation of two million production IoT devices
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
Chapters: 0:00 Intro 1:40 Talk
Watch
Varieties of Incident Response | Kurt Andersen | Conf42 Incident Management 2022
Have you ever wondered if there was a better way to respond to incidents? When you are in the midst of an incident, does "the process" help you and your teammates, or is it more of a burden? There have been a variety of approaches to organizing people and teams over the 30+ years of online services. Each of them has benefits and drawbacks. This talk will dive into a representative set of these approaches to examine them and give the audience a wider context in which to evaluate their own arrangements for incident response. The talk will also look at incident response from a more abstract, task/intent-focused perspective to give a framework against which processes can be examined and adjusted to be more enabling and less burdensome. (And no, this is not a lite beer commercial ;-))
Other talks at this conference 🚀🪐 https://www.conf42.com/im2022
Chapters: 0:00 Intro 0:39 Talk
Watch
Premiere - Conf42 Incident Management 2023
Schedule, Lineup & RSVP ➤ https://www.conf42.com/im2023
Join Discord ➤ https://discord.gg/DnyHgrC7jC
Upcoming CFPs ➤ https://www.papercall.io/events?cfps-scope=&keywords=conf42
0:00 Sponsors and Partners
security
2:33 Christopher Haller
ai
3:05 Christine Yen
3:52 Joshua Arvin Lat & Sophie Soliven
4:22 Anurag Gupta
4:56 Mahdi Jelodari
lessons learned
5:33 Paige Cruz
6:09 Andre Carvalho
culture
6:43 Einat Mahat
7:15 Tanya Janca
7:52 Dor Amram
8:48 thank you, join Discord ➤ https://discord.gg/DnyHgrC7jC
Watch
Atomic Red Team: Closing the Gap with Threat Actors | Chris Haller | Conf42 Incident Management 2023
Read the abstract ➤ https://www.conf42.com/Incident_Management_2023_Christopher_Haller_atomic_red_team_threat_actors Other sessions at this event ➤ https://www.conf42.com/im2023 Join Discord ➤ https://discord.gg/DnyHgrC7jC Reach out to Chris ➤ chris.haller@strongcrypto.com Chapters 0:00 intro 2:08 preamble 2:41 agenda 3:00 who is chris 5:18 the problem 8:05 knowns matrix 9:41 a solution 12:09 mitre att&ck 13:32 procedures 14:40 atomic red team 18:02 breach attack simulation (on a budget) 19:52 atomic test #22 - winpwn - powersharppack - seatbelt 21:30 atomic test #3 - dump active directory database with ntdsutil 23:18 conclusion 24:36 questions?
Watch
Leveraging SRE and Observability for the World of Building on LLMs | Christine Yen | Conf42 IM 2023
Read the abstract ➤ https://www.conf42.com/Incident_Management_2023_Christine_Yen_sre_observability_building_llms Other sessions at this event ➤ https://www.conf42.com/im2023 Join Discord ➤ https://discord.gg/DnyHgrC7jC Honeycomb Blogpost about LLMs ➤ https://www.honeycomb.io/blog/hard-stuff-nobody-talks-about-llm Chapters 0:00 intro 2:08 preamble 2:34 magic of llms llms 3:39 - like apis we know and love 7:43 - even more unpredictability 9:03 - how do we define "correct"? observability 10:36 about 11:48 - what's in the box 14:20 - endless feedback loops 16:01 why believe me? query assistant 16:18 - timeline 16:31 - goals 19:14 laws of building on llms 20:06 how do we go forward? instrumentation 21:24 instrumentation for llms 25:01 emerging behaviors 28:06 a truth for llms 28:34 service level objectives 29:00 slos: a quick definition 30:06 slos for developing with llms 32:07 from others in the wild 32:20 duolingo 33:34 intercom 35:01 so in the end: 35:38 thanks!
Watch
Pragmatic Automation Strategies | Joshua Arvin Lat & Sophie Soliven | Conf42 IM 2023
Read the abstract ➤ https://www.conf42.com/Incident_Management_2023_Joshua_Arvin_Lat_Sophie_Soliven_automation_strategies Other sessions at this event ➤ https://www.conf42.com/im2023 Join Discord ➤ https://discord.gg/DnyHgrC7jC Chapters 0:00 intro 2:08 preamble 2:30 about sophie and joshua 3:27 opening case 3:50 let's define incident 4:26 managing incidents 6:13 automation strategies 7:16 chatbots 8:02 pragmatic automation 9:34 leverage existing tools 10:52 enhance incident management with ai 11:40 ai-powered root cause analysis 13:07 automated remediation 14:13 automated tagging 14:51 ai-powered monitoring and alerting 15:37 accelerate documentation 16:26 that's it, thank you!
Watch