Transformation & Cultural Shift using SRE & Data Science | Rohit Sinha | Conf42 SRE 2022

Transformation of Elephant style monolithic organization is a difficult task. As part of current role to lead the transformation to an SRE cultural is a challenge, this is a journey and as an organization we shall continue to evolve. SRE Implementation was not focused on only having SRE team for the organization but a bootstrap concept was introduced which would manage the SREs who are in each and every unit (Every team) under the guidance of Bootstrap. This has been effective to ensure penetration in day to day working of the organization and that's when the wheel starts turning. The outcome has been every tribe has a designated set up of SRE across the organization. The usage of AI/ML for infrastructure based analysis of ITSM data as well as Infrastructure log data to ensure the SREs are armed with best predictions for the infra and component to take the best decisions. This has helped in cost reduction but also helped in ensuring that the legacy tooling is slowly shaved out. This has been a data driven approach to systematic roll out of SRE set up. In house tooling culture was put as part of the release train to ensure the demand & the interest from the operations team is in tact. The tooling cannot be suddenly changed over night but had to accommodate the old and the new to ensure no service disruption. This was achieved by having an architecture which combines old and the new, ensuring the old set up of automation works while the new age DevOps based tooling and integrate API solutions are also integrated. The predication based set up would Definitely change how ITIL looks at Operations for ever as there would be no static thresholds in future and incident based set up would be gone .(More about it later ;)) Problems are plenty and especially for an organization which provides services, as revenue defines each big customer and sometimes these units which cater to a particular customer starts acting like individual companies and tries to define standards and ways of working. While SRE rollout in industry is not unique but surely the set up is unique. The usage of AI/ML for ITSM and Infra data is the uniqueness. The prediction using the logs of different parameters of the system and combining with corraltion matrix of close to 30 odd parameters of a machine . This in house built set up, combined with other parameters makes a unique experience gives SRE an edge already to perform what they are good at . The analysis is there , its about applying the principles. This was unique as a monolithic organizations try for an immediate benefit and we were able to provide it with combination of re structure in the internal services & ensuring SREs are not just introduced as a concept from top down but going to designate SRE culture at every tribe level. This was orchestrated by a bunch of best SREs ( under the SRE bootstrap) who provided constant guidance to the tribe designated SREs and also lead the bigger tooling and AI based Usage initiatives. The predication based set up would Definite change how ITIL looks at Operations for ever as there would be no static thresholds in future and incident based set up would be gone .This set up works well as then culturally, at every team level we are changing and penetrating. It had it own set of challenges and push backs which is out of learning . Modernization using SRE - The entire detailed set up is an amalgamation of SRE + usage of AIOPS based structure. While SREs part is more about following the culture and principles , the analytics based tech set up is like providing them with ammunitions to be proactive and predictive in problem solving. The above thought process covers the 2 most important principles, Monitoring (advanced & Simplicity (something which is an underlying horizontal). Emphasis is on visibility engineering, to ensure everything is seen, can be tracked, predicted and can be controlled. This is not just limited to the landscapes, ecosystem only but is being used for ITSM components as well. Process simplification has been kick started based on data analysis of why and how this time can be reduced which ties back to the another Principle, elimination of toil. New addition, usage and definition of Error Budget definition, how proper, guideline based error budget can help. Other talks at this conference 🚀🪐 https://www.conf42.com/sre2022 — 0:00 Intro 2:13 Talk