
From Application to Product Ownership: an SRE Team's Journey | Nikolaus Rath | Conf42 SRE 2020

Conference: Conf42 Site Reliability Engineering (SRE) 2020

Year: 2020

Nikolaus Rath, Tech Lead @ Google

In the past, my team was responsible for specific applications/executables. Now, we are responsible for specific end-user workflows, no matter which executables they involve. I will describe the technical and social changes that necessitated and enabled this change of paradigm.

At the end of 2019, our team supported on the order of 200 executables/microservices that provide functionality for Google's products in the advertising space. This number was the result of continuous growth over more than 10 years. Over this time, we nevertheless maintained constant operational load by making the executables as homogeneous as possible, automating almost all non-emergency maintenance tasks, and routinely involving developer teams when service-specific knowledge was required.

However, this growth has come at a price elsewhere. While we maintained technical expertise and overview, we lost familiarity with the user-facing products that our executables implement. For example, we could reliably assess the impact of an issue through service level indicators (SLIs) for availability and latency, but we could not immediately tell what product users were experiencing: were they without autocompletion for a text input, were they missing labels on a graph, or did an entire page become non-functional?

To address this issue, we fundamentally changed the way that we operate and engage as an SRE team, without compromising on the number of services that we support. We began by redefining our team's scope based on products rather than executables. While in the past we took the pager for specific executables, we now take responsibility for products as a whole. This means we are monitoring the most important interactions with a product, no matter which executables provide them, but not necessarily all functionality that resides in any one executable.
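As an illustration only (not taken from the talk), the difference between measuring availability per executable and per user interaction can be sketched in a few lines of Python. The request records, the executable names (`suggest-server`, `frontend`, `label-server`), and the interaction labels (`autocomplete`, `render_graph`) are all hypothetical, and the sketch assumes every backend request is already tagged with the user interaction it serves.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    executable: str   # hypothetical backend that handled the request
    interaction: str  # hypothetical user-facing interaction it served
    ok: bool          # whether the request succeeded


def availability_by(requests, key):
    """Return {group: fraction of successful requests}, grouped by `key`."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in requests:
        group = getattr(r, key)
        totals[group] += 1
        successes[group] += r.ok  # True counts as 1, False as 0
    return {g: successes[g] / totals[g] for g in totals}


# Made-up sample data: autocomplete is failing, graph rendering is healthy.
SAMPLE = [
    Request("suggest-server", "autocomplete", True),
    Request("suggest-server", "autocomplete", False),
    Request("frontend", "autocomplete", False),
    Request("frontend", "render_graph", True),
    Request("frontend", "render_graph", True),
    Request("label-server", "render_graph", True),
]

# Architecture-oriented view: which binaries look unhealthy?
per_executable = availability_by(SAMPLE, "executable")

# Product-oriented view: which user interactions are degraded?
per_interaction = availability_by(SAMPLE, "interaction")
```

With this made-up data, the per-executable view reports two partially degraded binaries, while the per-interaction view says directly that autocompletion is what users are missing and that graphs still render.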
Correspondingly, we then changed SLIs to map to user interactions instead of reflecting the implementation architecture. This enabled us to describe any SLO misses in the same terms that an external user would use. Finally, we have reduced continuous work on specific binaries in favor of time-limited "deep-dives" that address specific reliability issues, while leaving day-to-day operation increasingly to application developers, with SRE taking a supporting role to advise and handle large-scale incidents.

Gold Sponsors: LightStep, Google
Silver Sponsors: MayaData, Aval Digital Labs, Elastic, The Pathwayz Group
Media Partner: JetBrains

0:00 Intro
0:35 Talk

Website: https://www.conf42.com
Reach out: mark@conf42.com
Conf42 Discord: https://discord.com/invite/dT6ZsFJ5ZM
LinkedIn: https://www.linkedin.com/company/49110720/
Twitter: https://twitter.com/conf42com
Conf42Cast @ Spotify: https://tinyurl.com/bnyj6a8y