List of videos

SRE Anti-patterns | Niladri Choudhuri | Conf42 SRE 2022

In my experience, many organizations have completely missed the mark on SRE. Some carry on exactly as they used to, while others adopt SRE activities only in bits and pieces. In this session, Niladri talks about a few such anti-patterns and about what needs to change, why, and how. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 - 0:00 Intro 2:13 Talk

Optimize connectivity for multi-region archs | Christian Elsen & Lerna Ekmekcioglu | Conf42 SRE 2022

Enhancing global user experience, meeting data residency requirements, and ensuring business continuity are a few reasons for building multi-region applications on AWS. Consistent application performance and availability for end users can be a key consideration in multi-region architecture design. In this session, we cover how to optimize the availability and performance of end-user connectivity to a multi-region application on AWS. We provide best-practice guidance based on the type of application and go over practical use cases that leverage AWS Global Accelerator and Amazon Route 53. We dive deep into the key features that benefit multi-region architectures and include a demo. The session is aimed at cloud teams looking to build performant and resilient multi-region architectures on AWS for their end users, including those in regulated industries where security and reliability are critical, such as Financial Services and Healthcare.

What is Data Reliability Engineering? Why it is Crucial? | Miriah Peterson | Conf42 SRE 2022

Software practitioners work to make their systems reliable. We hear teams boasting of four or five 9s of uptime. This is not the case for data services. Data is rarely 99.999% reliable. Systems are often out of date or out of sync. Pipelines and automated jobs fail to run. And sometimes the data sources are just not accurate. All these situations are examples of Data Downtime, and they lead to misleading results and false reporting. Data Reliability Engineering is the practice of building resilient systems. By treating data systems as an engineering problem, we can borrow tools and practices from SRE to build better systems. Together, let's explore how to take this natural extension of data engineering to make our data systems stronger and more reliable. We will explore three major topics to strengthen any pipeline:
- Data Downtime: What is Data Downtime? How does it affect your bottom line? And how do you minimize it?
- Data Service Level Metrics: What metadata should you keep for your data pipeline, and how can reporting on pipeline transactions lead to preventative data engineering practices?
- Data monitoring: What to look out for, and how to be aware of system failures versus data failures.

Alerting on SLOs and Error Budget Policies | Ricardo Castro | Conf42 SRE 2022

Assessing your system's reliability through SLOs is a great way to understand and measure how happy users are with your service(s). Error Budgets tell you how much unreliability you can still afford before users become unhappy. Ideally, you want to be alerted well before users are dissatisfied and take appropriate measures to ensure they aren't. How can you achieve that? That's where alerting on SLOs and Error Budget Policies come into the picture. By tracking how happy your users are through SLOs, and alerting well before their dissatisfaction reaches critical levels, you'll be able to define policies to deal with issues in a timely manner, ensuring operational excellence.
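
As a rough illustration of the idea in this abstract (not taken from the talk itself), the sketch below computes how much error budget is left and how fast it is being burned, and pages only when both a long and a short look-back window burn faster than a threshold. The names `Window`, `budget_remaining`, `burn_rate` and `should_page` are hypothetical, and the default threshold of 14.4 is an assumption borrowed from the commonly cited multi-window burn-rate guidance for a 30-day SLO with 1-hour and 5-minute windows.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts seen in one look-back window (hypothetical shape)."""
    total: int
    failed: int


def _check_target(slo_target: float) -> None:
    # An SLO of exactly 0 or 1 leaves no meaningful error budget to reason about.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def budget_remaining(slo_target: float, window: Window) -> float:
    """Fraction of the error budget still unspent; negative if overspent."""
    _check_target(slo_target)
    allowed_failures = (1.0 - slo_target) * window.total
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - window.failed / allowed_failures


def burn_rate(slo_target: float, window: Window) -> float:
    """Observed error rate divided by the allowed error rate (1.0 = on budget)."""
    _check_target(slo_target)
    if window.total == 0:
        return 0.0
    return (window.failed / window.total) / (1.0 - slo_target)


def should_page(slo_target: float, long_window: Window, short_window: Window,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (burn_rate(slo_target, long_window) >= threshold
            and burn_rate(slo_target, short_window) >= threshold)
```

For example, with a 99.9% SLO, `should_page(0.999, Window(10_000, 200), Window(1_000, 20))` pages because both windows burn at roughly 20 times the sustainable rate, while a short window of `Window(1_000, 1)` would suppress the page because the errors have already stopped.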

A guide to join operational works in your new DevOps team | Kazuki Higashiguchi | Conf42 SRE 2022

"This talk will explain my experience when I join in a new engineering team. I mean a DevOps team as a team which includes developers and IT operations working collaboratively throughout the product lifecycle. Bottom-up approach Imagine, you don’t have much room to learn everything about your service at once. Therefore, let's chunk continuous learning. Learning process is observed - recorded - action. In the observe phase, let's jump into alerts and exceptions even if you’re not familiar with them. Gradually, you will start to know what are critical points for our service (i.e. website down). In the recording phase, let's write up missing documentation behalf of teammates. Even if you cannot solve issues directly, it would be helpful to your team. There, writing Runbooks give you chances to understand your service and participate in operations. I would recommend you write the ""Architecture"" part to organize your understanding of your service's technical design. In the action phase, let's try service operation (e.g. fixing broken data by manual operations). Pair operation is a good idea to jump into service operational works. Top-down approach Check the big picture of your service reliability (e.g. is there any SLA? SLO? SLI?) Other talks at this conference πŸš€πŸͺ https://www.conf42.com/sre2022 β€” 0:00 Intro 2:13 Talk

SRE Best Practices for API Design | Navendu Pottekkat | Conf42 SRE 2022

In modern development teams, site reliability engineers (SREs) are the glue that holds developer and operations teams together. The goal of SREs is to increase the reliability of their services to meet production standards by setting up monitoring, ensuring proper resource allocation, rolling out updates gradually, and anticipating the cost of failures. As APIs grow in size, the need to make them reliable and robust also grows. In this talk, Navendu talks about best practices for API design that lean towards reliability. Attendees will learn about:
- Reliability issues in traditional API design.
- How SREs fit in the API development pipeline.
- Modern API development best practices using API gateways.
- How SREs can combine DevOps practices to build more reliable services.

LMAO Helps During Outages | Richard Lewis | Conf42 SRE 2022

This short, fast-paced talk is aimed at those who might be on pager duty. Richard covers the four things that will help you survive any outage nightmare. Because when your company has an outage, there is no need to sweat: just remember to LMAO and you'll get through it.

Transformation & Cultural Shift using SRE & Data Science | Rohit Sinha | Conf42 SRE 2022

Transforming an elephant-style monolithic organization is a difficult task. Leading the transformation to an SRE culture in my current role is a challenge; it is a journey, and as an organization we will continue to evolve. The SRE implementation did not focus only on having a single SRE team for the organization. Instead, a bootstrap concept was introduced: SREs sit in each and every unit (every team) and work under the guidance of the SRE Bootstrap. This has been effective in getting SRE into the day-to-day working of the organization, and that is when the wheel starts turning. As a result, every tribe across the organization has a designated SRE setup. AI/ML is used to analyze ITSM data as well as infrastructure log data, so that SREs are armed with the best predictions for the infrastructure and its components and can make the best decisions. This has reduced cost and has also helped ensure that legacy tooling is slowly shaved off. It has been a data-driven approach to a systematic rollout of the SRE setup. An in-house tooling culture was made part of the release train to keep the demand and interest from the operations team intact. The tooling could not be changed overnight; it had to accommodate the old and the new so that there was no service disruption. This was achieved with an architecture that combines old and new, ensuring the old automation setup keeps working while the new-age DevOps-based tooling and API solutions are integrated.
The prediction-based setup would definitely change how ITIL looks at operations forever, as there would be no static thresholds in future and the incident-based setup would be gone (more about that later). Problems are plenty, especially for an organization that provides services: revenue defines each big customer, and sometimes the units that cater to a particular customer start acting like individual companies and try to define their own standards and ways of working. While an SRE rollout is not unique in the industry, this setup is. Its uniqueness lies in the use of AI/ML on ITSM and infrastructure data: predictions use logs of different system parameters combined with a correlation matrix of close to 30 parameters of a machine. This in-house setup, combined with other parameters, gives SREs an edge to do what they are good at. The analysis is there; it is about applying the principles. It was also unique because monolithic organizations look for an immediate benefit, and we were able to provide one by combining a restructuring of the internal services with ensuring that SRE was not just introduced as a concept from the top down but established as a culture at every tribe level. This was orchestrated by a group of the best SREs (under the SRE Bootstrap), who provided constant guidance to the tribe-designated SREs and also led the bigger tooling and AI-based initiatives. This setup works well because, culturally, we are changing and penetrating at every team level. It had its own set of challenges and pushback, which were part of the learning. Modernization using SRE: the entire setup is an amalgamation of SRE and an AIOps-based structure.
While the SRE part is more about following the culture and principles, the analytics-based tech setup is like providing SREs with ammunition to be proactive and predictive in problem solving. The above thought process covers the two most important principles: monitoring (advanced) and simplicity (an underlying horizontal). The emphasis is on visibility engineering, to ensure everything can be seen, tracked, predicted, and controlled. This is not limited to the landscapes and ecosystem; it is being used for ITSM components as well. Process simplification has been kick-started based on data analysis of why and how this time can be reduced, which ties back to another principle: elimination of toil. A new addition is the usage and definition of an error budget, and how a proper, guideline-based error budget can help.

5 Security Best Practices For Production Ready Containers | Martin Wimpress | Conf42 SRE 2022

No developer wants to be part of the morning news due to a security breach, but most application developers would rather write great code and ship cool new features than patch security issues. Thankfully, with simple best practices, developers can significantly reduce the attack surface of their containers before shipping them to production. At Slim.AI, we analyzed hundreds of container images accounting for billions of pulls annually to better understand the risks facing developers today. This talk provides highlights of our investigation and simple steps developers can take to address security issues in their containers BEFORE they get to production.
