Conf42 Site Reliability Engineering (SRE) 2022
2022
List of videos

Premiere - Conf42 Site Reliability Engineering 2022
Conf42 Site Reliability Engineering 2022 is here!
Schedule, Lineup & RSVP: https://www.conf42.com/sre2022
Join Discord to interact: https://discord.gg/DnyHgrC7jC
0:00 sponsored segment
5:02 intro, sponsors & partners
Keynotes
6:07 Kurt Andersen - Blameless
6:58 Malcolm Preston - FireHydrant
7:28 Cristina Buenahora - Cortex
8:06 Kolton Andrus - Gremlin
8:56 Priya Singh - Mulesoft
10:01 Rajalakshmi Srinivasan - Site24x7, Zoho Corp.
10:47 Michael McAllister - Teleport
11:16 Niladri Choudhuri - DOIS by Xellentro
getting started
11:48 Christian Elsen & Lerna Ekmekcioglu
12:37 Miriah Peterson
13:04 Ricardo Castro
14:01 Kazuki Higashiguchi
14:49 Navendu Pottekkat
15:17 Richard Lewis
15:44 Rohit Sinha
security
16:25 Martin Wimpress
17:07 Peter ONeill
tools
17:36 Bryan Barkley & Vivek Deshpande
18:20 Piyush Verma
observability
19:04 Samuel Arogbonlo
19:40 Dotan Horovits
20:14 Abhijeet Mishra
20:46 Aliaksandr Valialkin
21:15 Dave McAllister
lessons learned
21:51 Hila Fish
22:36 Lisa Karlin Curtis
22:51 Xe Iaso
23:30 Ramon Medrano Llamas
23:58 Eran Levy
24:37 Henrik Rexed
25:41 Travis Gosselin
26:35 Andrew Knight
26:53 Paul Marsicovetere
27:20 Thank you, Join our Discord to interact! https://discord.gg/DnyHgrC7jC
Implementing a Learning Team: A real-world case-study | Kurt Andersen | Conf42 SRE 2022
When you have an executive breathing down your neck, looking for answers, but no one person has the necessary information to provide those answers, what do you do? This talk uses a case study from a cross-team collaboration effort at LinkedIn. The work was intended (and succeeded) to address specific knowledge, process, and skill gaps between teams via the methodology of a "learning team" approach. With this specific case study as a focal point, the talk will connect sources from the wider literature about organizational learning and learning teams to highlight processes, organizational structures, and skills that can be used to foster a healthy work environment with inclusivity and inter-team camaraderie while also achieving important business metrics and getting answers for that executive! Participants will learn: - principles guiding the implementation of learning teams, - strengths and applicability for learning teams, - how to foster a more humane and effective workplace by appreciating the importance of "work as done" above work as imagined Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
We can’t all be Shaq: it’s time for SRE hero to pass the ball | Malcolm Preston | Conf42 SRE 2022
Shaquille O’Neal is one of the most celebrated NBA players of all time — and for good reason. When the team needed to put up quick points, they knew they could throw the ball to Shaq, and let him go to work. The skills are different, but there are a lot of engineers playing the Shaq role at their company. They’re the heroes who come in at 2 a.m. knowing just what to do to remediate fast and get back on track. Although that might win games and resolve incidents, it’s not setting your team up for sustained success. Attend this talk to understand why it’s time to pass the ball and learn three ways you can take fast action to get there today. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
4 key metrics for measuring your team's performance | Cristina Buenahora | Conf42 SRE 2022
DORA metrics have become the standard for gauging the efficacy of your software development teams, and can provide crucial insights into areas for growth. These metrics are essential for organizations looking to modernize, as well as those looking to gain an edge against competitors. In this talk, we’ll dive into each one, and discuss what these metrics can reveal about your development teams (and my engineering skills). Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
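The description above does not list the four metrics; they are commonly given as deployment frequency, lead time for changes, change failure rate, and time to restore service. As a minimal sketch of how they could be computed, assuming a hypothetical `Deployment` record (the field names and sample data below are illustrative, not taken from the talk):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; real data would come from CI/CD and incident tooling.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # whether it caused a production failure
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics over a reporting window of window_days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(hours(d.deployed_at - d.committed_at) for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore_hours": median(restore_times) if restore_times else None,
    }


# Illustrative data: four deployments in a two-day window, one of which failed.
t0 = datetime(2022, 6, 1, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True, restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(deployments, window_days=2)
```

Medians are used here instead of means so that a single slow change or long outage does not dominate the figure; that is a design choice of this sketch, not something the talk prescribes.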
Reliability 101 | Kolton Andrus | Conf42 SRE 2022
It is 7 AM; you awake after a night of uninterrupted slumber. Being on-call, you check for issues: was your pager out of batteries? Nope, things are quiet. Imagine a world where outages are a myth, where a failure occurs but there is no customer impact and no engineer is engaged. This is the aspiration of Reliability Engineering: to operate complex distributed systems effectively, without customer-facing outages or heavy operational burden. In this 101 talk, I will share the basics every team should know to start their reliability journey off on the right foot. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Build the next evolution of composable architecture using GraphQL | Priya Singh | Conf42 SRE 2022
With an ever-growing number of APIs, databases, and applications to interact with AND an ever-growing number of consumers that need to access data from them, companies are realizing they need a way to streamline consumption from all these new data sources. But where do we go from here? This is where technologies like GraphQL truly shine: they are the evolution of consumption of a composable architecture. Using GraphQL, you can easily unify a massive number of APIs into a single endpoint, which is governed by a unified schema. This unified schema allows consumers to request data from ANY field that exists within ANY of the APIs associated with it. This means actions that would normally have taken 5, 10, or even 20 different queries can now be done in just 1! In this session we are going to see how developers can automate the API unification process, how you can leverage a GraphQL endpoint to federate your REST APIs, and a quick peek behind the curtain to show how MuleSoft’s GraphQL solution is able to provide you with the performance you expect. At MuleSoft, we’re helping customers go digital faster by turning every asset in the organization into reusable building blocks. These building blocks are the foundation of composability. This concept of composability is seen first-hand through MuleSoft’s API-led architecture. In fact, the API-led or “API-first” approach works so well that it has pretty much become the industry standard. However, innovation in how development teams use and interact with the application network is far from over. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Four Golden Signals: Monitoring the health of your service | Michael McAllister | Conf42 SRE 2022
As Site Reliability Engineers it is our mission to ensure our services are highly available, secure, and scalable. With hundreds or thousands of different metrics across a (potentially) distributed system that you could monitor and alert on, where do we begin? How do we define what it means for a service to be "healthy"? This lightning talk focuses on the four golden signals of monitoring that serve as a solid foundation for actionable monitoring of the health of your service. In this talk we'll explore what the signals are, what they mean for you and your customers, and put what we've learnt into action with monitoring a demo application. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
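The four golden signals, as named in Google's SRE book, are latency, traffic, errors, and saturation. A rough sketch of summarizing them for one time window follows; the `Request` record, the choice of p99 for latency, and CPU utilization as the saturation measure are assumptions made for illustration, not details from the talk or its demo application.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request log entry.
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile: the smallest value with at least 99% of samples at or below it.
    rank = max(1, math.ceil(len(latencies) * 99 / 100))
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_utilization,  # assumed to be sampled elsewhere, in the range 0.0 to 1.0
    }


# Illustrative window: 100 requests in 10 seconds, every 20th one failing with a 500.
sample = [Request(latency_ms=float(i), status=500 if i % 20 == 0 else 200) for i in range(1, 101)]
signals = golden_signals(sample, window_seconds=10, cpu_utilization=0.7)
```

In practice these values would usually come from a metrics system instead of raw request logs; the sketch only shows what each signal measures.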
Future of observability in an experience-driven economy | Rajalakshmi Srinivasan | Conf42 SRE 2022
AI has started to scratch the surface when it comes to deciding the future of observability. But it’s a long, strenuous journey ahead, as the technology is handicapped without the much-needed guidance of the DevOps and SRE (observability) folks. In the next few years, the future of observability will be determined by shrewd IT teams prepping their systems and software by collecting, streamlining, and optimising the data (metrics, logs and traces) for the much-hyped AI-driven world. Troubleshooting and securing applications will get tougher as applications get heavy and complex with changing business needs, and as the experience economy kicks in. What does this mean for DevOps and SRE teams? This talk will discuss what vendors and IT teams ought to do in preparing for an experience economy. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
SRE Anti-patterns | Niladri Choudhuri | Conf42 SRE 2022
In my experience, many organizations have completely missed the mark. Some are doing the same as they used to do, and others are doing SRE activities in bits and pieces. In this session, Niladri will be talking about a few such anti-patterns and what needs to change, why, and how. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Optimize connectivity for multi-region archs | Christian Elsen & Lerna Ekmekcioglu | Conf42 SRE 2022
Enhancing global user experience, meeting data residency requirements, and ensuring business continuity are a few reasons for building multi-region applications on AWS. Consistency in application performance and availability for end users can be one key consideration in multi-region architecture design. In this session, we cover how to optimize the availability and performance of end user connectivity to a multi-region application on AWS. We provide best practice guidance based on the type of application and go over practical use cases leveraging AWS Global Accelerator and Amazon Route 53. We dive deep into their key features benefiting multi-region architectures, along with a demo. The session is targeted at cloud teams who are looking to build performant and resilient multi-region architectures on AWS for their end users, including those in regulated industries where security and reliability are critical, such as Financial Services and Healthcare. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
What is Data Reliability Engineering? Why it is Crucial? | Miriah Peterson | Conf42 SRE 2022
Software practitioners work to make their systems reliable. We hear teams boasting of having four or five 9s of uptime. This is not the case for Data Services. Data is not often 99.999% reliable. Systems are often out of date or out of sync. Pipelines and automated jobs fail to run. And, sometimes, the data sources are just not accurate. All these situations are examples of Data Downtime and lead to misleading results and false reporting. Data Reliability Engineering is the practice of building resilient systems. By treating data systems as an engineering problem we can borrow tools and practices from SRE to build better systems. Together let’s explore how to take this natural extension of data engineering to make our data systems stronger and more reliable. We will explore three major topics to strengthen any pipeline: - Data Downtime: What is Data Downtime? How does it affect your bottom line? And how do you minimize it? - Data Service Level Metrics: We will talk about metadata for your data pipeline and how to report on pipeline transactions in ways that can lead to preventative data engineering practices. - Data monitoring: What to look out for and how to be aware of system failures versus data failures. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
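To put the "nines" mentioned above into numbers, here is a small worked calculation. The function names and the stale-data example are made up for illustration; the arithmetic itself is simply the unavailable fraction multiplied by the length of a year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime allowed by an availability target of N nines (4 means 99.99%)."""
    unavailability = 10.0 ** -nines
    return unavailability * MINUTES_PER_YEAR


def availability(downtime_minutes: float) -> float:
    """Availability achieved over a year, given the total minutes of downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


four_nines = allowed_downtime_minutes(4)
five_nines = allowed_downtime_minutes(5)

# Hypothetical data-downtime case: a dataset that is stale for 2 hours every week.
stale_minutes = 2 * 60 * 52
data_availability = availability(stale_minutes)
```

Four nines allows roughly 52.6 minutes of downtime a year and five nines roughly 5.3 minutes, while the hypothetical dataset that is stale two hours a week sits at about 98.8%, short of even two nines.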
Alerting on SLOs and Error Budget Policies | Ricardo Castro | Conf42 SRE 2022
Assessing your system's reliability through SLOs is a great way to really understand and measure how happy users are with your service(s). Error Budgets give you the amount of reliability you have left before users are unhappy. Ideally, you want to be alerted way before users are dissatisfied and take the appropriate measures to ensure they aren't. How can you achieve that? That's where alerting on SLOs and Error Budget Policies comes into the picture. By tracking how happy your users are, through SLOs, and alerting way before their level of dissatisfaction reaches critical levels you’ll be able to define policies to deal with issues in a timely manner, ensuring operational excellence. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
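The error budget idea can be made concrete with a short sketch, assuming a request-based SLO. The function names and sample numbers are hypothetical. The default paging threshold of 14.4 is an assumption borrowed from commonly cited burn-rate guidance: at that rate, 2% of a 30-day budget is consumed in one hour (0.02 × 720 hours = 14.4).

```python
def error_budget_remaining(failed: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    if total == 0:
        return 1.0
    allowed_failures = (1 - slo) * total
    return 1 - failed / allowed_failures


def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on pace to use it all."""
    if total == 0:
        return 0.0
    return (failed / total) / (1 - slo)


def should_page(failed: int, total: int, slo: float, threshold: float = 14.4) -> bool:
    """Page when the burn rate over a short window reaches the threshold."""
    return burn_rate(failed, total, slo) >= threshold


SLO = 0.999  # 99.9% of requests should succeed

# Over the whole SLO period: 250 failures out of 1,000,000 requests.
remaining = error_budget_remaining(failed=250, total=1_000_000, slo=SLO)

# Over the last hour: 30 failures out of 1,000 requests.
last_hour_rate = burn_rate(failed=30, total=1_000, slo=SLO)
page_now = should_page(failed=30, total=1_000, slo=SLO)
```

Here about 75% of the budget remains for the period, yet the last hour is burning roughly 30 times faster than sustainable, so the alert fires well before the budget is exhausted.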
A guide to join operational works in your new DevOps team | Kazuki Higashiguchi | Conf42 SRE 2022
This talk will explain my experience of joining a new engineering team. By a DevOps team, I mean a team which includes developers and IT operations working collaboratively throughout the product lifecycle. Bottom-up approach: Imagine you don’t have much room to learn everything about your service at once. Therefore, let's chunk continuous learning. The learning process is: observe, record, act. In the observe phase, let's jump into alerts and exceptions even if you’re not familiar with them. Gradually, you will learn what the critical points are for your service (e.g. website down). In the recording phase, let's write up missing documentation on behalf of teammates. Even if you cannot solve issues directly, it would be helpful to your team. Writing runbooks gives you a chance to understand your service and participate in operations. I would recommend you write the "Architecture" part to organize your understanding of your service's technical design. In the action phase, let's try service operation (e.g. fixing broken data by manual operations). Pair operation is a good way to jump into service operational work. Top-down approach: Check the big picture of your service reliability (e.g. is there any SLA? SLO? SLI?) Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
SRE Best Practices for API Design | Navendu Pottekkat | Conf42 SRE 2022
In modern development teams, site reliability engineers (SREs) are the glue that holds developer and operations teams together. It is the goal of the SREs to increase the reliability of their services to meet production standards by setting up monitoring, ensuring proper resource allocation, rolling out updates gradually, and anticipating the cost of failures. As the size of the APIs increases, the need for making them reliable and robust also increases. In this talk, Navendu will talk about best practices for API design that lean towards reliability. Attendees will learn about: - Reliability issues in traditional API design. - How SREs fit in the API development pipeline. - Modern API development best practices using API gateways. - How SREs can combine DevOps practices to build more reliable services. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
LMAO Helps During Outages | Richard Lewis | Conf42 SRE 2022
This short, fast-paced talk is aimed at those who might be on pager duty. Richard will cover the four things that will help you survive any outage nightmare. Because when your company has an outage, there is no need to sweat: just remember to LMAO and you’ll get through it. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Transformation & Cultural Shift using SRE & Data Science | Rohit Sinha | Conf42 SRE 2022
Transforming an elephant-style monolithic organization is a difficult task. In my current role, leading the transformation to an SRE culture is a challenge; this is a journey, and as an organization we shall continue to evolve. The SRE implementation did not focus only on having an SRE team for the organization. Instead, a bootstrap concept was introduced: SREs are embedded in each and every unit (every team) and work under the guidance of the SRE Bootstrap. This has been effective in ensuring penetration into the day-to-day working of the organization, and that is when the wheel starts turning. The outcome is that every tribe across the organization has a designated SRE setup. AI/ML is used for infrastructure-based analysis of ITSM data as well as infrastructure log data, so that the SREs are armed with the best predictions for the infrastructure and its components and can take the best decisions. This has helped reduce cost and has also ensured that the legacy tooling is slowly shaved out. This has been a data-driven approach to a systematic rollout of the SRE setup. An in-house tooling culture was made part of the release train to ensure that the demand and interest from the operations team stay intact. The tooling could not be changed overnight; it had to accommodate the old and the new to ensure no service disruption. This was achieved with an architecture that combines the old and the new, ensuring the old automation setup keeps working while the new-age DevOps-based tooling and API solutions are also integrated.
The prediction-based setup would definitely change how ITIL looks at operations forever, as there would be no static thresholds in the future and the incident-based setup would be gone. (More about it later ;)) Problems are plenty, especially for an organization that provides services: revenue defines each big customer, and sometimes the units that cater to a particular customer start acting like individual companies and try to define their own standards and ways of working. While an SRE rollout is not unique in the industry, this setup surely is. The uniqueness lies in the use of AI/ML for ITSM and infrastructure data: prediction uses the logs of different system parameters combined with a correlation matrix of close to 30 parameters of a machine. This in-house-built setup, combined with other parameters, makes for a unique experience and gives SREs an edge to perform what they are good at. The analysis is there; it is about applying the principles. This was unique because monolithic organizations look for an immediate benefit, and we were able to provide it by combining a restructuring of the internal services with ensuring that SRE is not just introduced as a concept from the top down but that an SRE culture is designated at every tribe level. This was orchestrated by a group of the best SREs (under the SRE Bootstrap), who provided constant guidance to the tribe-designated SREs and also led the bigger tooling and AI-based initiatives. This setup works well because, culturally, we are changing and penetrating at every team level. It had its own set of challenges and pushback, which were part of the learning. Modernization using SRE: the entire detailed setup is an amalgamation of SRE and an AIOps-based structure.
While the SRE part is more about following the culture and principles, the analytics-based tech setup is like providing them with ammunition to be proactive and predictive in problem solving. The above thought process covers the 2 most important principles: Monitoring (advanced) and Simplicity (something which is an underlying horizontal). Emphasis is on visibility engineering, to ensure everything is seen and can be tracked, predicted, and controlled. This is not limited to the landscapes and ecosystem; it is being used for ITSM components as well. Process simplification has been kick-started based on data analysis of why and how this time can be reduced, which ties back to another principle, elimination of toil. A new addition is the usage and definition of error budgets, and how a proper, guideline-based error budget can help. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
5 Security Best Practices For Production Ready Containers | Martin Wimpress | Conf42 SRE 2022
No developer wants to be part of the morning news due to a security breach, but most application developers would rather write great code and ship cool new features than patch security issues. Thankfully, with simple best practices, developers can significantly reduce the attack surface of their containers before shipping them to production. At Slim.AI, we analyzed hundreds of container images accounting for billions of pulls annually to better understand the risks facing developers today. This talk provides highlights of our investigation and simple steps developers can take to address security issues in their containers BEFORE they get to production. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Terraform apply secured by Open Policy Agent | Peter ONeill | Conf42 SRE 2022
Terraform has unprecedented control over the mission-critical infrastructure for our businesses and organizations. Think about the last time a misconfiguration went unnoticed for long enough to impact customers or cause an outage. Everyone should have a second set of eyes when deploying code that has the potential to create a negative impact. Let Open Policy Agent (OPA) be that second set of eyes. OPA is an open source general-purpose policy engine that is especially adept at working with configuration data like Terraform manifest files. Using OPA, we can write policies that will ensure that resources created by any team and any engineer are compliant with the organization’s rules and requirements. Implementing policy can be challenging, but it doesn’t have to be. OPA comes paired with a purpose-built dedicated policy language called Rego. This talk will show how to get started by deploying an OPA into your CI/CD pipeline and writing your first Rego policies to secure some of the primary AWS resources we use every day. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Hodor: Detecting and addressing overload | Bryan Barkley & Vivek Deshpande | Conf42 SRE 2022
When pushed hard enough any system will eventually suffer, and ultimately fail unless relief is provided in some form. At LinkedIn, we have developed a framework for our microservices to help with these issues: Hodor (Holistic Overload Detection & Overload Remediation). As the name suggests, it is designed to detect service overloads from multiple potential root causes, and to automatically improve the situation by dropping just enough traffic to allow the service to recover. Hodor then maintains an optimal traffic level to prevent the service from reentering overload. All of this is done without manual tuning or specifying thresholds. In this talk, we will introduce Hodor, provide an overview of the framework, describe how it detects overloads, and how requests are dropped to provide relief. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Building Openmetrics Exporter | Piyush Verma | Conf42 SRE 2022
Openmetrics-exporter (https://last9.io/openmetrics-exporter), or OME, is an Observability-as-Code framework that reduces the toil of finding-and-combining useful metrics from layers and hundreds of components involved in modern cloud-native systems. Every source, component, or metric is just a simple configuration file because the only “code” you should focus on is for your customers. It leverages plugin architecture to support data sources. It relies heavily on data frame processing to combine metrics from various metrics sources before they are all converted into Openmetrics format, ready to be piped out by a Prometheus. Traditionally, such correlation and post-processing have been the responsibility of additional Data Pipelines, but with OME, it’s as simple as writing a configuration file. At its core, OME uses Hashicorp Configuration Language (HCL) to build a DSL that can allow declarative input to build metric Pipelines. The talk is mainly about what you can solve using OME. But it also takes a concise journey “behind the scenes”: the need to build Openmetrics-exporter, picking a configuration language that was easily editable by humans, creating a DSL around it, and, more importantly, leveraging Golang for Data Science needs. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Exposing Log-Metrics To Prometheus With Best Practice | Samuel Arogbonlo | Conf42 SRE 2022
In this age of fast-growing advancement in cloud implementations, there is a great need to manage logs effectively. In some cases, you have to study the metrics to know what the system is doing; this helps with decision-making, post-mortem analysis, and several other functions. First off, you should understand that Vector is a high-performance, end-to-end (agent & aggregator) observability data pipeline that puts you in control of your observability data. It directly orchestrates the operation of collecting, transforming and routing all your logs, metrics, and traces to any vendors you want today or tomorrow. Vector enables dramatic cost reduction, novel data enrichment, and data security where you need it, not where it is most convenient for your vendors. Additionally, it is open source and up to 10x faster than every alternative in the space. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
The State of DevOps and Observability in 2022 | Dotan Horovits | Conf42 SRE 2022
There are many opinions on DevOps, open source, and observability, but what is actually being practiced? What can we learn from the collective experience of the community? We went and surveyed over 1000 engineers across the globe about their DevOps practices, challenges, and more, with special focus on enterprise observability. This session will share data and insights from the survey, with key trends (compared to previous years’ DevOps Pulse surveys), points of interest, and challenges that developers experience on a daily basis. This session will help you learn from the collective experience and emerging best practices in the community, to help guide decisions on processes, tooling and architecture choices. The survey analyzes topics such as: - What are your challenges with running Kubernetes in production? - How long does it take to troubleshoot production issues? - Which tools do you use for ticketing, event correlation and notifications? - Who is responsible for ensuring observability? - How do enterprises handle shared services? And much more. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Smoke detectors in large scale production systems | Abhijeet Mishra | Conf42 SRE 2022
Static alerting thresholds no longer cut it for modern distributed systems. With production systems scaling rapidly, using static alerting to observe critical systems is a recipe for disaster. Observability tools have recognized this need and provide a way to "magically" catch deviations from normal system behavior instead. - What is that magic? What goes into deciding whether a spike or a drop is violating a known "good" condition or not? - How do we avoid alert fatigue? - How do you factor in seasonality - low off-peak hours and high holiday traffic? The rabbit hole goes deeper than I imagined. As a part of the core data science team at Last9, I ran into scenarios where my assumptions of building anomaly detection engines were shattered and rebuilt with every interaction with production traffic. In this talk, I will talk about: - What I learnt when trying to find answers to the above questions. - How known theoretical models map to real world workloads e.g. streaming services, high frequency trading applications etc. - The science that goes behind choosing and calculating the right SLOs for different SLIs and sending out early warnings and how to measure and improve leading and lagging indicators pertaining to system health. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
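The contrast between a static threshold and a seasonality-aware check can be sketched in a few lines. This is not the approach described in the talk or used by any particular tool; it is a deliberately simple baseline, assuming traffic history bucketed by hour of day and a hypothetical tolerance of `k` standard deviations.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # stdev needs at least two samples, so sparse buckets are skipped.
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def is_anomalous(baseline, hour, value, k=3.0):
    """Flag a value more than k standard deviations away from that hour's usual level."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, 1e-9)


# Illustrative history: quiet traffic at 03:00, busy traffic at 20:00 (requests per minute).
history = [(3, v) for v in (100, 110, 90, 105, 95)] + [(20, v) for v in (1000, 1100, 900, 1050, 950)]
baseline = build_baseline(history)

evening_normal = is_anomalous(baseline, hour=20, value=1020)  # ordinary evening peak
night_spike = is_anomalous(baseline, hour=3, value=400)       # four times the usual night traffic
```

A static "alert above 500" rule would fire on every ordinary evening and miss the night-time spike, while the per-hour baseline treats the evening peak as normal and flags the spike. Real workloads also have weekly and holiday patterns that this sketch ignores.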
Kubernetes monitoring - how to improve it | Aliaksandr Valialkin | Conf42 SRE 2022
The popularity of Kubernetes changed the way people deploy and run software. It also brought additional complexity: Kubernetes itself, microservice architecture, and short release cycles all became a challenge for monitoring systems. The truth is, the adoption and popularity of Kubernetes had a severe impact on the monitoring ecosystem, on its design and tradeoffs. The talk will cover the monitoring challenges of operating Kubernetes, such as increased metrics volume, service ephemerality, pod churn, distributed tracing, etc., and how modern monitoring solutions are designed specifically to address these challenges, and at what cost. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Adding OpenTelemetry to Production Apps: Lessons Learned | Dave McAllister | Conf42 SRE 2022
Observability is increasingly important in our modern apps/cloud-native world. However, when adding observability to existing production apps, there are a number of tradeoffs in approaches and in tools. Often, these tradeoffs are an exercise in confusion, leading to decision paralysis. We took on the challenge of adding observability to NGINX MARA, investigating choices, discovering and addressing challenges while keeping to open source solutions whenever possible. You'll come away with an understanding of how the three classes of data (Metrics, Traces, Logs) work together, why we chose the solutions we used and how we extended past the normal space into health checks, introspection and core dumps. Come learn from our experience in dealing with OpenTelemetry and related tools, from traces, metrics and logs, in working with production class apps and discover what approach finally worked for us. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
One Woman Show of Migrating an Entire R&D SCM From Bitbucket to GitLab | Hila Fish | Conf42 SRE 2022
Writing code is something that we learned. Managing a project E2E - Probably not that much. In this talk, I’ll share my journey of migrating the entire R&D’s codebase from Bitbucket to GitLab on my own - But with the great help of people along the way - Planning, implementation, and handoffs. I’ll share best practices for managing a technical project with a lot of takeaways you could adopt so your project will be handled smoothly and successfully. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Using incidents to level-up your teams | Lisa Karlin Curtis | Conf42 SRE 2022
Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams together to solve unexpected and challenging problems. The first part of the talk will walk through the different things you can learn from incidents, including: - Taking you to the edges of the systems your team owns, and beyond - incidents help broaden your understanding of the context in which you're building - Showing you how systems fail, so you can learn to identify and build software with good observability, and considerations of failure modes - Expanding your network inside your organisation, making connections with different people, who you can learn from and collaborate with We'll then talk about how to get the best value from the incidents which you do have as an individual, thinking about when is an appropriate time to ask questions, and how to get your own learnings without 'getting in the way'. Finally, we'll discuss how to make this part of the culture of an organisation: as part of the leadership team, what can you do to encourage this across your teams? Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
How Static Code Analysis Prevents You From Waking Up at 3AM | Xe Iaso | Conf42 SRE 2022
Computer programming is a powerful field. You can tell the computer to do just about anything you want as long as you can describe it. The real problem comes when your intentions and what the computer understands from them differ. This talk will cover ways that static analysis tooling can prevent bad code from being sent into production, with a particular focus on Go because that is the language the speaker is most experienced with. Waking up at 3 AM because an obviously wrong bit of code is hitting a weird failure case and causing downstream issues is uniquely frustrating, enough that it deserves to be categorically eliminated as much as possible. Static code analysis is an important part of reliability: it makes it easier to build reliable systems, because code that can't be put into production can't fail at 3 AM while you are trying to sleep. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch
Postmortem Culture at Google | Ramon Medrano Llamas | Conf42 SRE 2022
Writing postmortems after incidents and outages is an essential part of Google's SRE culture. They are blameless, widely shared internally, and allow us as an organization to maximize the insights from failures. We touch on how postmortems are written and used at Google, as well as how they can help in making decisions and driving improved reliability. We also show how you can get started with your own lightweight postmortem process. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch
On Call Like a King- utilize Chaos Engineering to be a better engineer | Eran Levy | Conf42 SRE 2022
The evolution of cloud native technologies and the need to scale engineering are leading organizations to restructure their teams and embrace new architectural approaches. Being a cloud native engineer is fun! But also challenging. These days engineers aren’t just writing code and building packages but are expected to know how to write the relevant Kubernetes resource YAMLs, use Helm, containerize their app and ship it to a variety of environments. It isn't enough to know it at a high level. Being a cloud native engineer means that it’s not enough to just know the programming language you are working in well; you should also keep adapting your knowledge and understanding of the cloud native technologies you depend on. Engineers are now required to write services that are just one of many other services that usually solve a certain customer problem. In order to enhance engineers' cloud native knowledge and best practices for dealing with production incidents, we started a series of workshops called “On-Call like a king”, which aims to enhance engineers' knowledge while responding to production incidents. Every workshop is a set of chaos engineering experiments that simulate real production incidents, and the engineers practice investigating, resolving and finding the root cause. In this talk I will share how we got there, what we are doing and how it improves our engineering teams' expertise. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch
Freedom of K8s requires Chaos Engineering to shine in production | Henrik Rexed | Conf42 SRE 2022
Like any other technology transformation, k8s adoption typically starts with small “pet projects”. One k8s cluster here, another one over there. If you don’t pay attention, you may end up, like many organizations these days, with something that spreads like wildfire: hundreds or thousands of k8s clusters, owned by different teams, spread across on-premises and in the cloud, some shared, some very isolated. When we start building applications for k8s, we often lose sight of the larger picture of where they will be deployed and, moreover, what the technical constraints of our targeted environment are. Sometimes, we even think that k8s is that magician that will make all our hardware constraints disappear. In reality, Kubernetes requires you to define quotas on nodes and namespaces, and resource limits on your pods, to make sure that your workload will be reliable. In case of heavy pressure, k8s will evict pods to remove pressure on your nodes, but eviction could have a significant impact on your end-users. How can we proactively test our settings and measure the impact of k8s events on our users? The simple answer to this question is chaos engineering. During this presentation we will use real production stories to explain: - The various Kubernetes settings that we could implement to avoid major production outages - How to define the chaos experiments that will help us validate our settings - The importance of combining load testing and chaos engineering - The observability pillars that will help us validate our experiments Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch
Unleashing Deploy Velocity with Feature Flags | Travis Gosselin | Conf42 SRE 2022
A lot of development teams have built out fully automated CI/CD pipelines to deliver code to production fast! Then you quickly discover that the new bottleneck in delivering features is their existence in long-lived feature branches, and no true CI is actually happening. This problem compounds as you start spinning up microservices, building features across your multi-repo architecture, and coordinating some ultra-fancy release schedule so it all deploys together. Feature flags provide you the mechanism to reclaim control of the release of your features and get back to short-lived branches with true CI. However, what you're not told about feature flags in those simple "if/else" getting started demos is that there is an upfront cost to your development time, additional complexities and some pitfalls to be careful of as you begin expanding feature flag usage to the organization. If you know how to navigate these complexities, you will start to unleash true velocity across your teams. In this talk, we'll get started with some of the feature flagging basics before quickly moving into some practical feature flagging examples that demonstrate its usage beyond the basic scenarios as we talk about UI, API, operations, migrations, and experimentation. We will explore some of the hard questions around "architecting feature flags" for your organization. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
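As a rough illustration of the "if/else" style of flag the description mentions, the sketch below shows an in-memory flag store in Go with a kill switch and a percentage rollout. It is not taken from the talk. The Flag and Flags types, the flag name "new-checkout", and the hashing scheme are all assumptions made for this example; a production setup would normally use a flag service or library instead.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Flag is a hypothetical feature flag definition.
type Flag struct {
	Enabled bool   // kill switch: false turns the feature off for everyone
	Percent uint32 // share of users (0-100) who get the feature when enabled
}

// Flags maps flag names to their definitions.
type Flags map[string]Flag

// IsEnabled reports whether the named flag is on for the given user.
// Unknown or disabled flags are always off. For enabled flags, the user is
// placed in a stable bucket from 0 to 99 by hashing the flag name and user
// ID, so the same user gets the same answer on every call.
func (f Flags) IsEnabled(name, userID string) bool {
	flag, ok := f[name]
	if !ok || !flag.Enabled {
		return false
	}
	h := fnv.New32a()
	h.Write([]byte(name + ":" + userID))
	return h.Sum32()%100 < flag.Percent
}

func main() {
	flags := Flags{
		"new-checkout": {Enabled: true, Percent: 100}, // fully rolled out
		"dark-launch":  {Enabled: true, Percent: 0},   // deployed, not released
	}

	// The unfinished code path ships to production behind the flag, so the
	// branch can merge early while the release stays a separate decision.
	if flags.IsEnabled("new-checkout", "user-42") {
		fmt.Println("serving new checkout flow")
	} else {
		fmt.Println("serving old checkout flow")
	}
	fmt.Println("dark-launch on:", flags.IsEnabled("dark-launch", "user-42"))
}
```

Hashing the flag name together with the user ID keeps each user's bucket independent across flags, so raising Percent gradually widens the rollout without flipping users back and forth.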
Watch
Open Testing: What if we open our tests like we open our source? | Andrew Knight | Conf42 SRE 2022
Testing is a vibrant discipline with well-established practices, but many times, nobody but the testers who write the tests ever sees them. Tests could offer so much value if they were openly shared - with developers, product owners, and perhaps even end users. So, why don’t we open our tests like we open our source? There are so many parallel benefits: helping others learn, helping teams develop higher quality software, and helping users gain confidence in the products they use. Opening tests includes sharing the tools, frameworks, and even test cases themselves. In this talk, we will look at several ways a team could be more open about testing: - Breaking down barriers between folks of different roles - Embracing living documentation with specification by example - Publicly releasing test reports - Sharing test tools and frameworks as open source projects - Building and sharing fully-generic test suites based on AI/ML to run against any app Not every team may be able to open up in all these ways, but any team could still benefit from the openings that shift-left practices can bring. Open Testing could be revolutionary. Let’s make it a reality! Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch
Combat sports principles that apply to SRE | Paul Marsicovetere | Conf42 SRE 2022
As a Senior Cloud Infrastructure Engineer, I find that my life is vastly different from that of professional fighters and athletes competing in combat sports, for many obvious reasons. However, there are some principles from the combat sports world that have an interesting application to professional life in Site Reliability Engineering (SRE). This talk will demonstrate how these principles have helped me navigate difficult situations in SRE, and how they can help new engineers who are starting out in SRE. While there are not a lot of obvious overlaps on paper between being in combat sports and being in SRE, there is advice and guidance from those throwing punches that can help us knock out certain SRE challenges. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk
Watch