Monitorama PDX 2024 - Tracing Service Dependencies at Salesforce
Sudeep Kumar's session from Monitorama PDX 2024. This talk focuses on our strategic choice to use a streaming pipeline for inferring service dependencies from trace telemetry data. We'll also delve into a key use case that shows how service dependencies are visualized and managed through the streaming pipeline on our distributed tracing platform. These service dependency views are crucial for monitoring the deployment status of services and the health of related services. By providing a clear overview of service interactions, they also facilitate risk assessment for new feature rollouts, improving both product development and operational stability. We will also discuss how integrating a service mesh on Kubernetes led to comprehensive coverage and completeness of our service dependency data.
We'll explore how a Flink-based streaming platform and a Druid backend are used to gather all trace telemetry data, enabling us to process 100% of trace data (from 300 million spans collected per minute) and deduce complete trace contexts. By establishing unique trace contexts, we create a trace state that represents every request occurring within the system. This state, expressed as dependency edges, contains the information required to map out the path of transactions as they move through different services and components within Salesforce.
The talk will walk through the process of converting individual trace states into service dependency edge records through Flink and Druid, revealing the complex web of interactions between services. Attendees will learn methods for uncovering key interactions, such as identifying the services or operations that most frequently initiate contact with other services. We will also explore strategies for using service dependency topology to understand the relationships and dependencies among services and components in a distributed system.
In addition, the audience will learn how service mesh coverage on Kubernetes infrastructure can be leveraged to deduce service dependencies accurately. This underscores the importance of infrastructure design in improving traceability and reliability within distributed systems. With this understanding, participants will be better positioned to improve system performance and reliability. The session aims to give attendees actionable insights and methods for managing and navigating the intricate service dependencies that characterize modern distributed systems.
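To make the idea of turning trace states into dependency edge records more concrete, here is a minimal in-memory sketch in Python. It is not the Flink and Druid pipeline described in the talk. The `Span` fields and the `dependency_edges` and `top_callers` helpers are hypothetical names chosen for illustration, and the sketch assumes each span carries a trace ID, its own ID, an optional parent span ID, and a service name.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical minimal span shape; real trace telemetry carries more fields.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service: str


def dependency_edges(spans: Iterable[Span]) -> Counter:
    """Count caller -> callee service edges across all traces."""
    # Group spans by trace so each trace context is resolved on its own.
    traces = defaultdict(dict)
    for span in spans:
        traces[span.trace_id][span.span_id] = span

    edges = Counter()
    for by_id in traces.values():
        for span in by_id.values():
            parent = by_id.get(span.parent_span_id)
            # Only cross-service parent/child pairs form a dependency edge.
            if parent is not None and parent.service != span.service:
                edges[(parent.service, span.service)] += 1
    return edges


def top_callers(edges: Counter, n: int = 3) -> list:
    """Rank services by how many outbound cross-service calls they initiate."""
    outbound = Counter()
    for (caller, _callee), count in edges.items():
        outbound[caller] += count
    return outbound.most_common(n)


sample = [
    Span("t1", "a", None, "gateway"),
    Span("t1", "b", "a", "orders"),
    Span("t1", "c", "b", "payments"),
    Span("t1", "d", "b", "orders"),  # same-service call, not an edge
    Span("t2", "a", None, "gateway"),
    Span("t2", "b", "a", "orders"),
]
sample_edges = dependency_edges(sample)
sample_top = top_callers(sample_edges)
```

A streaming implementation would presumably key the stream by trace ID and hold per-trace state until the trace is considered complete, but the parent/child join that produces each edge follows the same logic as this batch version.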