Remember, establish ground truth, then make it better! While it's easy to view logs for a particular Trace ID across all services, the Zipkin UI provides a summarized view of the duration of each call without having to look through hundreds of log statements. When the RPC call reaches the server, the processor will identify and note whether the incoming call has tracing data so it can respond appropriately. With these tags in place, aggregate trace analysis can determine when and where slower performance correlates with the use of one or more of these resources. To make the TDist integrations with our existing services easier and less error-prone, we relied on Guice and implemented several modules that our clients simply had to install. This, in turn, lets you shift from debugging your own code to provisioning new infrastructure or determining which team is abusing the infrastructure that's currently available. Observability lets you understand why something is wrong, compared with monitoring, which simply tells you when something is wrong. What happened? The idea of straining production systems with instrumentation data made us nervous. Learn more about New Relic's support for OpenTelemetry, OpenCensus, and Istio. What Amdahl's Law tells us here is that focusing on the performance of operation A is never going to improve overall performance by more than 15%, even if operation A were fully optimized. It covers the key distributed data management patterns including Saga, API Composition, and CQRS. Because of this, we can query for logs across all of the trace-enabled services for a particular call. Not having to maintain a custom compiler lowered our development cost significantly. What is the root cause of errors and defects within a distributed system? The time and resources spent building code to make distributed tracing work were taking time away from the development of new features. Head-based sampling: Where the decision to collect and store trace data is made randomly while the root (first) span is being processed. The previous blog post talked about why Knewton needed a distributed tracing system and the value it can add to a company. At the time of implementation, Kinesis was a new AWS service and none of us were familiar with it. Planning optimizations: How do you know where to begin? Perhaps the most common cause of changes to a service's performance is a deployment of that service itself. My virtual bootcamp, distributed data patterns in a microservice architecture, is now open for enrollment! Throughout the development process and rollout of the Zipkin infrastructure, we made several open-source contributions to Zipkin, thanks to its active and growing community. The data stay there for a configurable time and are queried by the Zipkin query service to display on the UI. Jetty services requests by routing them to a Servlet. Upon receipt of a request (or right before an outgoing request is made), the tracing data are added to an internal queue, and the name of the thread handling the request is changed to include the Trace ID by a DataManager. This also meant that our clients never had to instantiate any of our tracing-enabled constructs. Lightstep automatically surfaces whatever is most likely causing an issue: anything from an n+1 query to a slow service to actions taken by a specific customer to something running in sequence that should be in parallel. It makes it easy to use the Saga pattern to manage transactions and the CQRS pattern to implement queries.
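The DataManager behavior described above (queueing trace data and renaming the handling thread to include the Trace ID so it shows up in every log line) can be sketched in a few lines of plain Java. This is an illustrative sketch only; the class and method names are assumptions, not TDist's actual API:

```java
// A minimal sketch (hypothetical names, not TDist's real API) of the
// DataManager idea: keep the trace context for the current request in a
// ThreadLocal and append the Trace ID to the thread name so that every
// log statement emitted by that thread can be correlated with the trace.
public final class TraceDataManager {

    public record TraceContext(String traceId, String spanId) {}

    private static final ThreadLocal<TraceContext> CURRENT = new ThreadLocal<>();
    private static final ThreadLocal<String> ORIGINAL_THREAD_NAME = new ThreadLocal<>();

    /** Called on receipt of a request, or right before an outgoing request is made. */
    public static void attach(TraceContext context) {
        CURRENT.set(context);
        Thread thread = Thread.currentThread();
        ORIGINAL_THREAD_NAME.set(thread.getName());
        thread.setName(thread.getName() + "[traceId=" + context.traceId() + "]");
    }

    /** The trace context of the request this thread is currently servicing, if any. */
    public static TraceContext current() {
        return CURRENT.get();
    }

    /** Restores the thread once the request has been fully handled. */
    public static void clear() {
        String original = ORIGINAL_THREAD_NAME.get();
        if (original != null) {
            Thread.currentThread().setName(original);
        }
        CURRENT.remove();
        ORIGINAL_THREAD_NAME.remove();
    }
}
```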
start time, end time) about the requests and operations performed when handling an external request in a centralized service. It provides useful insight into the behavior of the system, including the sources of latency. It enables developers to see how an individual request is handled by searching across. Aggregating and storing traces can require significant infrastructure. The drawback is that it's statistically likely that the most important outliers will be discarded. And because we didn't want other teams at Knewton incurring the cost of this upgrade, the distributed tracing team had to implement and roll out all the changes. It lets all tracers and agents that conform to the standard participate in a trace, with trace data propagated from the root service all the way to the terminal service. This section will go into more technical detail as to how we implemented our distributed tracing solution. There are two approaches to sampling distributed traces. Child span: Subsequent spans after the root span. We put a lot of thought into how we laid out our Guice module hierarchies so that TDist didn't collide with our clients, and we were very careful whenever we had to expose elements to the outside world. The tracing data store is where all our tracing data ends up. Proactive solutions with distributed tracing. Each thread servicing or making a request to another service gets assigned a Span that is propagated and updated by the library in the background. How to understand the behavior of an application and troubleshoot problems? New Relic gave us all the insights we needed, both globally and into the different pieces of our distributed application. We felt this was the ideal place to deal with tracing data. W3C Trace Context is becoming the standard for propagating trace context across process boundaries. These symptoms can be easily observed, and are usually closely related to SLOs, making their resolution a high priority. Another hurdle was that certain services, such as the Cassandra client library Astyanax, depended on third-party libraries that in turn depended on Thrift 0.7.0. It can help map changes from those inputs to outputs, and help you understand what actions you need to take next. Our tracing solution at Knewton has been in all environments for a few months now. Visit our website to sign up for access today. The first approach involved a modified Thrift compiler, and the second involved modified serialization protocols and server processors. We had a lot of fun implementing and rolling out tracing at Knewton, and we have come to understand the value of this data. Why Does Your Business Need Distributed Tracing? One common insight from distributed tracing is to see how changing user behavior causes more database queries to be executed as part of a single request. Get more value from your data with hundreds of quickstarts that integrate with just about anything. Knewton built the tracing library, called TDist, from the ground up, starting as a company hack day experiment. Your team has been tasked with improving the performance of one of your services: where do you begin? According to section 5 of RFC 2047, the only guideline for adding custom headers is to prefix them with an `X-`. Modern software architectures built on microservices and serverless introduce advantages to application development, but there's also the cost of reduced visibility.
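Head-based sampling (defined earlier) makes the keep-or-drop decision randomly while the root span is being processed, and every downstream service then honors that decision rather than re-deciding. Here is a minimal sketch of that root-span decision; the class name and the example 10% rate are assumptions for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Head-based sampling: the keep-or-drop decision is made once, while the
// root (first) span is being processed, and is then propagated with the
// trace so downstream services honor it instead of re-deciding.
public final class HeadBasedSampler {

    private final double sampleRate; // e.g. 0.10 keeps roughly 10% of traces

    public HeadBasedSampler(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    /** Called exactly once per trace, when the root span is created. */
    public boolean shouldSampleRootSpan() {
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }
}
```

Because the decision is made before anything is known about how the trace will turn out, this mechanism is exactly what makes it statistically likely that the most important outliers are discarded.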
A single trace typically captures data about: Collecting trace data would be wasted effort if software teams didn't have an easy way to analyze and visualize the data across complex architectures. In general, distributed tracing is the best way for DevOps, operations, software, and site reliability engineers to get answers to specific questions quickly in environments where the software is distributed, primarily microservices and/or serverless architectures. Eventuate is Chris's latest startup. Here's a diagram showing how the payload is modified to add tracing data: When we were adding tracing support to Kafka, we wanted to keep the Kafka servers, also referred to as brokers, as a black box. Tail-based sampling, where the sampling decision is deferred until the moment individual transactions have completed, can be an improvement. It uses distributed tracing and other telemetry data to gain full visibility into its data-ingestion pipeline, which collects 1 billion data points every day. The Span ID may or may not be the same as the Trace ID. By themselves, logs fail to provide the comprehensive view of application performance afforded by traces. Whenever a TDist client forgot to bind something, Guice would notify them at compile time. By being able to visualize transactions in their entirety, you can compare anomalous traces against performant ones to see the differences in behavior, structure, and timing. See code. Unlike head-based sampling, we're not limited by decisions made at the beginning of a trace, which means we're able to identify rare, low-fidelity, and intermittent signals that contributed to service or system latency. However, software teams discovered that instrumenting systems for tracing, then collecting and visualizing the data, was labor-intensive and complex to implement. This instrumentation might be part of the functionality provided by a Microservice Chassis framework. As part of an end-to-end observability strategy, distributed tracing addresses the challenges of modern application environments. Kinesis seemed like an attractive alternative that would be isolated from our Kafka servers, which were only handling production, non-instrumentation data. For example, there's currently no way to get aggregate timing information or aggregate data on the most-called endpoints, services, etc. It is written in Scala and uses Spring Boot and Spring Cloud as the Microservice chassis. Second, open standards for instrumenting applications and sharing data began to be established, enabling interoperability among different instrumentation and observability tools. In other words, we wanted to pass the data through the brokers without them necessarily knowing, and therefore without having to modify the Kafka broker code at all. And even with the best intentions around testing, they are probably not testing performance for your specific use case. Because distributed tracing surfaces what happens across service boundaries: what's slow, what's broken, and which specific logs and metrics can help resolve the incident at hand. After the data is collected, correlated, and analyzed, you can visualize it to see service dependencies, performance, and any anomalous events such as errors or unusual latency. The biggest disadvantage of customizing protocols and server processors was that we had to upgrade to Thrift 0.9.0 (from 0.7.0) to take advantage of some features that would make it easier to plug our tracing components into the custom Thrift processors and protocols.
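Because the goal was to pass tracing data through the Kafka brokers without modifying them, the tracing data has to ride inside the message payload itself. The following is a rough, byte-level illustration of that idea; the marker byte and layout are invented for this sketch and are not TDist's actual wire format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: prepend a small tracing header to the message payload
// so the Kafka brokers can pass it along without knowing it is there. The
// marker byte and layout are invented for this sketch, not TDist's format.
public final class TracingPayload {

    private static final byte TRACING_MARKER = 0x1;

    /** Producer side: wrap the application payload with the tracing data. */
    public static byte[] wrap(String traceId, byte[] originalPayload) {
        byte[] trace = traceId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(1 + Integer.BYTES + trace.length + originalPayload.length);
        buffer.put(TRACING_MARKER);   // "this payload carries tracing data"
        buffer.putInt(trace.length);  // length of the tracing header
        buffer.put(trace);            // the tracing data itself
        buffer.put(originalPayload);  // the untouched application payload
        return buffer.array();
    }

    /** Consumer side: strip the tracing header and hand back the original payload. */
    public static byte[] unwrap(byte[] wire) {
        ByteBuffer buffer = ByteBuffer.wrap(wire);
        if (buffer.get() != TRACING_MARKER) {
            return wire; // not a tracing payload; return it untouched
        }
        byte[] trace = new byte[buffer.getInt()];
        buffer.get(trace); // the consumer-side tracer would record this
        byte[] payload = new byte[buffer.remaining()];
        buffer.get(payload);
        return payload;
    }
}
```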
It's not as fast as Kafka, but the nature of our data made it acceptable to have an SLA of even a few minutes from publication to ingestion. Span ID: The ID for a particular span. As a service owner, your responsibility will be to explain variations in performance, especially negative ones. To achieve this, we require clients to wrap their serializers/deserializers in tracing equivalents that delegate reading and writing of the non-tracing payload to the wrapped ones. Distributed tracing starts with instrumenting your environment to enable data collection and correlation across the entire distributed system. When any incoming request comes with tracing data headers, we construct span data from it and submit it to the DataManager. The first step is going to be to establish ground truths for your production environments. We were considering Kafka because Knewton has had a stable Kafka deployment for several years. The Microservices Example application is an example of an application that uses client-side service discovery. This lets your distributed tracing tool correlate each step of a trace, in the correct order, along with other necessary information to monitor and track performance. For Astyanax, we had to shade the JARs using Maven and change package names so that they didn't collide with the newer Thrift library. Want to see an example? Instrumenting your microservices environment means adding code to services to monitor and track trace data. You can learn more about the different types of telemetry data in MELT 101: An Introduction to the Four Essential Telemetry Data Types. As soon as a handful of microservices are involved in a request, it becomes essential to have a way to see how all the different services are working together. Traditional log aggregation becomes costly, time-series metrics can reveal a swarm of symptoms but not the interactions that caused them (due to cardinality limitations), and naively tracing every transaction can introduce both application overhead and prohibitive cost in data centralization and storage. Which services have problematic or inefficient code that should be prioritized for optimization? Sampling: Storing representative samples of tracing data for analysis instead of saving all the data. In August, I'll be teaching a brand new public microservices workshop over Zoom in an APAC-friendly (GMT+9) timezone. We're creators of OpenTelemetry and OpenTracing, the open standard, vendor-neutral solution for API instrumentation. Sometimes it's internal changes, like bugs in a new version, that lead to performance issues. While logs have traditionally been considered a cornerstone of application monitoring, they can be very expensive to manage at scale, difficult to navigate, and only provide discrete event information. This dynamic sampling means we can analyze all of the data but only send the information you need to know. Projects such as OpenCensus and Zipkin are also well established in the open source community. A comprehensive observability platform allows your teams to see all of their telemetry and business data in one place. This means tagging each span with the version of the service that was running at the time the operation was serviced.
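The serializer-wrapping approach described above can be illustrated with a small delegating wrapper. The Serializer interface below is a stand-in for whichever serializer interface a client already implements, and it reuses the hypothetical TraceDataManager and TracingPayload sketches from earlier; none of these are TDist's real types:

```java
// Sketch of the serializer-wrapping approach. The Serializer interface is a
// stand-in for whatever serializer interface a client already implements, and
// TraceDataManager / TracingPayload refer to the hypothetical sketches above.
interface Serializer<T> {
    byte[] serialize(T value);
}

public final class TracingSerializer<T> implements Serializer<T> {

    private final Serializer<T> delegate; // the client's original serializer

    public TracingSerializer(Serializer<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] serialize(T value) {
        // Delegate the non-tracing payload to the wrapped serializer, then
        // attach the current trace context in front of it.
        byte[] payload = delegate.serialize(value);
        TraceDataManager.TraceContext context = TraceDataManager.current();
        return context == null ? payload : TracingPayload.wrap(context.traceId(), payload);
    }
}
```

A matching deserializer wrapper would do the reverse: peel off the tracing header, hand it to the DataManager, and pass the remaining bytes to the wrapped deserializer.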
Conventional distributed tracing solutions will throw away some fixed amount of traces upfront to improve application and monitoring system performance. Child spans can be nested. All the planning in the world won't lead to perfect resource provisioning and seamless performance. Distributed tracing refers to methods of observing requests as they propagate through distributed systems. But this is only half of distributed tracing's potential. A separate set of query and web services, part of the Zipkin source code, in turn query the database for traces. Chris helps clients around the world adopt the microservice architecture through consulting engagements, and training classes and workshops. With the insights of distributed tracing, you can get the big picture of your service's day-to-day performance expectations, allowing you to move on to the second step: improving the aspects of performance that will most directly improve the user's experience (thereby making your service better!). Ready to get started now? A typical server will have server and client code, with the server code often depending on other client libraries. In this approach, we experimented with modifying the C++ Thrift compiler to generate additional service interfaces that could pass along the tracing data to the user. All of our services use it to enable tracing. Track requests across services and understand why systems break. Having visibility into the behavior of your service's dependencies is critical to understanding how they are affecting your service's performance. Engage Chris to create a microservices adoption roadmap and help you define your microservice architecture. To make the trace identifiable across all the different components in your applications and systems, distributed tracing requires trace context. There are open source tools, small business and enterprise tracing solutions, and of course, homegrown distributed tracing technology. Put all over the place in its place: monitor your entire stack on a single platform. "[As] we move data across our distributed system, New Relic enables us to see where bottlenecks are occurring as we call from service to service." Muhamad Samji, Architect, Fleet Complete. We also soon realized that allowing the user access to the tracing data might not be desirable or safe, and data management might be better left to TDist for consistency. In the next section, we will look at how to start with a symptom and track down a cause. The tracing message bus is where all our client services place tracing data prior to its being consumed by the Zipkin collector and persisted. Our initial estimates put us in the range of over 400,000 tracing messages per second with only a partial integration. Our Thrift solution consisted of custom, backwards-compatible protocols and custom server processors that extract tracing data and set them before routing them to the appropriate RPC call. For instance, a request might pass through multiple services and traverse back and forth through various microservices to reach completion. Thrift is the most widely used RPC method between services at Knewton. It also tells Spring Cloud Sleuth to deliver traces to Zipkin via RabbitMQ running on the host called rabbitmq. The next few examples focus on single-service traces and using them to diagnose these changes.
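Trace context is what makes a trace identifiable as it crosses component boundaries. The W3C Trace Context standard mentioned earlier encodes it in a traceparent header of the form version, trace ID, parent ID, and flags. Here is a minimal sketch of generating such a header value; the class name is made up, and a production tracer would also reject the all-zero IDs the spec forbids:

```java
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of a W3C Trace Context "traceparent" header value:
// version "-" trace-id (16 random bytes as hex) "-" parent-id (8 bytes) "-" flags.
public final class TraceParent {

    public static String newTraceparent(boolean sampled) {
        String traceId = randomHex(16);        // 32 hex characters
        String parentId = randomHex(8);        // 16 hex characters
        String flags = sampled ? "01" : "00";  // 01 means the trace was sampled
        return "00-" + traceId + "-" + parentId + "-" + flags;
    }

    private static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        ThreadLocalRandom.current().nextBytes(bytes);
        StringBuilder hex = new StringBuilder(numBytes * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

An example value looks like 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01, where the trailing 01 indicates the trace was sampled.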
Chris teaches comprehensive workshops, training classes and bootcamps for executives, architects and developers to help your organization use microservices effectively. However, we still had to release all Knewton services before we could start integrating them with our distributed tracing solution. Engineering organizations building microservices or serverless at scale have come to recognize distributed tracing as a baseline necessity for software development and operations. Whether you're a business leader, DevOps engineer, product owner, site reliability engineer, software team leader, or other stakeholder, you can use this ebook to get a quick introduction into what distributed tracing is all about, how it works, and when your teams should be using it. Read the white paper Gain an Edge with Distributed Tracing. Spoiler alert: it's usually because something changed. Notice that the Trace ID is consistent throughout the tree. With the Apache HTTP Client, we use an HttpRequestInterceptor and HttpResponseInterceptor, which were designed to interact with header contents and modify them. Overall, we've been satisfied with its performance and stability. You have applied the Microservice architecture pattern. TDist currently supports Thrift, HTTP, and Kafka, and it can also trace direct method invocations with the help of Guice annotations. As part of this routing, Jetty allows the request and response to pass through a series of Filters. When reading a message, the protocols will extract the tracing data and set them on a ThreadLocal for the thread servicing the incoming RPC call, using the DataManager. Still, that doesn't mean observability tools are off the hook. However, the downside, particularly for agent-based solutions, is increased memory load on the hosts because all of the span data must be stored for the transactions that are in progress. Lightstep aims to help people design and build better production systems at scale. Then two things happened: First, solutions such as New Relic began offering capabilities that enable companies to quickly and easily instrument applications for tracing, collect tracing data, and analyze and visualize the data with minimal effort. We experimented with Cassandra and DynamoDB, mainly because of the institutional knowledge we have at Knewton, but ended up choosing Amazon's ElastiCache Redis. Both of these projects allow for easy header manipulation. When we started looking into adding tracing support to Thrift, we experimented with two different approaches. Observability creates context and actionable insight by, among other things, combining four essential types of observability data: metrics, events, logs, and traces. Traces, or more precisely distributed traces, are essential for software teams considering a move to (or already transitioning to) the cloud and adopting microservices. Once a symptom has been observed, distributed tracing can help identify and validate hypotheses about what has caused this change. This means assigning a unique ID to each request, assigning a unique ID to each step in a trace, encoding this contextual information, and passing (or propagating) the encoded context from one service to the next as the request makes its way through an application environment. That request is distributed across multiple microservices and serverless functions. New Relic supports the W3C Trace Context standard for distributed tracing.
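On the client side, header manipulation is exactly what the Apache HTTP Client interceptors mentioned above are designed for. Here is a hedged sketch of an outgoing-request interceptor (HttpClient 4.x) that copies the current trace context into Zipkin-style B3 headers; TraceDataManager is the hypothetical helper from the earlier sketch:

```java
import java.io.IOException;
import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.protocol.HttpContext;

// Sketch of an outgoing-request interceptor for Apache HttpClient 4.x. The
// header names follow Zipkin's B3 convention; TraceDataManager is the
// hypothetical helper sketched earlier, not a real TDist class.
public final class TracingRequestInterceptor implements HttpRequestInterceptor {

    @Override
    public void process(HttpRequest request, HttpContext context)
            throws HttpException, IOException {
        TraceDataManager.TraceContext trace = TraceDataManager.current();
        if (trace != null && !request.containsHeader("X-B3-TraceId")) {
            // Copy the current trace context onto the outgoing request so the
            // downstream service can continue the same trace.
            request.addHeader("X-B3-TraceId", trace.traceId());
            request.addHeader("X-B3-SpanId", trace.spanId());
        }
    }
}
```

With the 4.x builder API this could be registered via HttpClientBuilder#addInterceptorFirst, and a matching HttpResponseInterceptor would handle the response side.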
At other times it's external changes, be they changes driven by users, infrastructure, or other services, that cause these issues. At the time, our Kafka cluster, which we've been using as our student event bus, was ingesting over 300 messages per second in production. In addition, traces should include spans that correspond to any significant internal computation and any external dependency. Combining traces with the other three essential types of telemetry data (metrics, events, and logs, which together with traces create the acronym MELT) gives you a complete picture of your software environment and performance for end-to-end observability. Upgrading libraries when using a dependency framework is relatively easy, but for an RPC framework like Thrift and a service-oriented architecture with a deep call dependency chain, it gets a little more complicated. They provide various capabilities including Spring Cloud Sleuth, which provides support for distributed tracing. Thrift appends a protocol ID to the beginning, and if the reading protocol sees that the first few bytes do not indicate the presence of tracing data, the bytes are put back on the buffer and the payload is reread as a non-tracing payload. Fleet Complete is the fastest-growing telematics provider in the world, serving more than 500,000 subscribers and 35,000 businesses in 17 countries, while experiencing tenfold growth in the past several years. However, we would have had to recompile all of our Thrift code and deviate from the open-source version, making it harder to upgrade in the future. Tags should capture important parts of the request (for example, how many resources are being modified or how long the query is) as well as important features of the user (for example, when they signed up or what cohort they belong to). The answer is observability, which cuts through software complexity with end-to-end visibility that enables teams to solve problems faster, work smarter, and create better digital experiences for their customers. Now that you understand how valuable distributed tracing can be in helping you find issues in complex systems, you might be wondering how you can learn more about getting started. And unlike tail-based sampling, we're not limited to looking at each request in isolation: data from one request can inform sampling decisions about other requests. Check out Chris Richardson's example applications. How can your team use distributed tracing to be proactive? Our protocols essentially write the tracing data at the beginning of each message. Experienced software architect, author of POJOs in Action, creator of the original CloudFoundry.com, and author of Microservices Patterns. Finding these outliers allowed us to flag cases where we were making redundant calls to other services that were slowing down our overall SLA for certain call chains. This allows you to focus on work that is likely to restore service, while simultaneously eliminating unnecessary disruption to developers who are not needed for incident resolution, but might otherwise have been involved. Our solution has two main parts: the tracing library that all services integrate with, and a place to store and visualize the tracing data. Spans represent a particular call from client start through server receive, server send, and, ultimately, client receive. These are changes to the services that your service depends on.
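The detect-or-fall-back behavior described above (peek at the first few bytes, and if the tracing marker is not there, put the bytes back and reread the payload as a plain message) can be sketched with a mark/reset check. The two-byte marker below is invented for illustration and is not Thrift's actual protocol ID:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch of the "peek at the first few bytes" check. The two-byte marker is
// invented for illustration; Thrift's real protocol ID is different.
public final class TracingProtocolDetector {

    private static final byte[] TRACING_PREFIX = {'T', 'D'};

    /**
     * Peeks at the start of the stream. If the marker is absent, the bytes are
     * put back (via mark/reset) so the payload can be reread as non-tracing.
     */
    public static boolean hasTracingData(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(TRACING_PREFIX.length);
        byte[] head = new byte[TRACING_PREFIX.length];
        int read = in.read(head);
        in.reset(); // put the bytes back either way
        return read == TRACING_PREFIX.length && Arrays.equals(head, TRACING_PREFIX);
    }
}
```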
Several different teams own and monitor the various services that are involved in the request, and none have reported any performance issues with their microservices. This was quite simple, because HTTP supports putting arbitrary data in headers. Remember, your service's dependencies are, just based on sheer numbers, probably deploying a lot more frequently than you are. It is important to ask yourself the bigger questions: Am I serving traffic in a way that is actually meeting our users' needs? Ready to start using the microservice architecture? For our solution, we chose to match the data model used in Zipkin, which in turn borrows heavily from Dapper. Before we dive any deeper, let's start with the basics. For those unfamiliar with Guice, it's a dependency management framework developed at Google. OpenTelemetry, part of the Cloud Native Computing Foundation (CNCF), is becoming the one standard for open source instrumentation and telemetry collection. Observing microservices and serverless applications becomes very difficult at scale: the volume of raw telemetry data can increase exponentially with the number of deployed services. Time to production was shorter, given that we didn't have to roll out and maintain a new cluster, and integration with Zipkin was easier, with less code. Lightstep was designed to handle the requirements of distributed systems at scale: for example, Lightstep handles 100 billion microservices calls per day on Lyft's Envoy-based service architecture. A quick guide to distributed tracing terminology. The last type of change we will cover is upstream changes. As above, it's critical that spans and traces are tagged in a way that identifies these resources: every span should have tags that indicate the infrastructure it's running on (datacenter, network, availability zone, host or instance, container) and any other resources it depends on (databases, shared disks). Conventionally, distributed tracing solutions have addressed the volume of trace data generated via upfront (or head-based) sampling. We elected to continue the Zipkin tradition and use the following headers to propagate tracing information: Services at Knewton primarily use the Jetty HTTP Server and the Apache HTTP Client. Tracing tells the story of an end-to-end request, including everything from mobile performance to database health. It's a diagnostic technique that reveals how a set of services coordinate to handle individual user requests. The most important reasons behind our decision were time to production and ease of integration with Zipkin. Ben Sigelman, Lightstep CEO and Co-founder, was one of the creators of Dapper, Google's distributed tracing solution. And isolation isn't perfect: threads still run on CPUs, containers still run on hosts, and databases provide shared access. Most of our services talk to each other through this framework, so supporting it while still maintaining backwards compatibility was critical for the success of this project. Zipkin supports a lot of data stores out of the box, including Cassandra, Redis, MongoDB, Postgres, and MySQL. A tracing protocol can detect whether the payload contains tracing data based on the first few bytes. The upgrade required a lot of coordination across the organization. This means that you should use distributed tracing when you want to get answers to questions such as these. As you can imagine, the volume of trace data can grow exponentially over time as the volume of requests increases and as more microservices are deployed within the environment.
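On the server side, Jetty's Filter mechanism is the natural place to pick the propagated headers back up. The sketch below assumes Zipkin's standard B3 header names, since the original list of headers is not reproduced in this text, and reuses the hypothetical TraceDataManager helper from earlier:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Sketch of a servlet Filter of the kind Jetty routes requests through.
// Header names are Zipkin's B3 convention (an assumption here), and
// TraceDataManager is the hypothetical helper sketched earlier.
public final class TracingFilter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;
        String traceId = http.getHeader("X-B3-TraceId");
        String spanId = http.getHeader("X-B3-SpanId");
        if (traceId != null) {
            // Construct span data from the incoming headers and hand it to the DataManager.
            TraceDataManager.attach(new TraceDataManager.TraceContext(traceId, spanId));
        }
        try {
            chain.doFilter(request, response);
        } finally {
            TraceDataManager.clear(); // restore the thread for the next request
        }
    }

    @Override
    public void destroy() { }
}
```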
So, while microservices enable teams and services to work independently, distributed tracing provides a central resource that enables all teams to understand issues from the user's perspective. Teams can manage, monitor, and operate their individual services more easily, but they can easily lose sight of the global system behavior. As mentioned above, the thread name of the current thread servicing a request is also changed, and the trace ID is appended to it. If throughout this article you have been thinking that integrating with TDist sounds complicated, know that a lot of the time all our clients needed to do was install additional Guice modules that would bind our tracing implementations to existing Thrift interfaces. It's also a useful way of identifying the biggest or slowest traces over a given period of time. Because of this, upgrades needed to start from the leaf services and move up the tree to avoid introducing wire incompatibilities, since the outgoing services might not know whether the destination service would be able to detect the tracing data coming through the wire. Service X is down. Simply by tagging egress operations (spans emitted from your service that describe the work done by others), you can get a clearer picture when upstream performance changes.
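To make the "install a Guice module" experience concrete, here is a hedged sketch of what such a module might look like. The interface and implementation names are invented for illustration; TDist's real modules bind its tracing-enabled Thrift constructs rather than the toy types shown here:

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;

// Sketch of the "just install a Guice module" experience. The interface and
// implementation names are invented; TDist's real modules bind its
// tracing-enabled Thrift constructs rather than these toy types.
public final class TracingModule extends AbstractModule {

    interface StudentEventClient { // stands in for an existing Thrift interface
        void send(String event);
    }

    static final class TracingStudentEventClient implements StudentEventClient {
        @Override
        public void send(String event) {
            // A real implementation would attach the current trace context
            // (see the TraceDataManager sketch) before delegating to Thrift.
        }
    }

    @Override
    protected void configure() {
        // Installing this module is all a client has to do: the existing
        // interface is now served by the tracing-enabled implementation.
        bind(StudentEventClient.class).to(TracingStudentEventClient.class);
    }

    public static void main(String[] args) {
        StudentEventClient client = Guice.createInjector(new TracingModule())
                .getInstance(StudentEventClient.class);
        client.send("example-event");
    }
}
```

Because the binding happens inside the module, client code keeps injecting the same interface it always has, which is also why clients never had to instantiate any tracing-enabled constructs themselves.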