In this guide, you will learn what distributed tracing is, how it works and who uses it, why logs are not enough to debug your microservices, and what open-source OpenTelemetry has to do with it.
Distributed Tracing: Table of Contents
- What is Distributed Tracing
- How Distributed Tracing Works
- Why We Need Distributed Tracing
- Distributed Tracing Users
- When to Use Distributed Tracing
- Distributed Tracing vs. Logs
- Distributed Tracing and OpenTelemetry
- Open Source Distributed Tracing Tools
- Getting Started with Distributed Tracing
- Distributed Tracing FAQs
Distributed Tracing: Introduction
As an engineer, you have probably experienced firsthand the agility microservices enable. You have probably also experienced firsthand the complexities they cause.
A few years ago, when monolith development reigned, agile development was a major challenge. You developed, you pushed code, and after a long QA and testing process, a new version was released every few months (or years). Deployment was tracked in a single place, and monitoring tools audited a single application.
Now, engineering teams are developing microservices. Service division is structured according to business logic, services scale independently and new versions are being pushed out every few weeks across multiple pipelines. This has made development fast, exciting, and impactful, allowing devs to rise above the agility obstacle. But with the distribution of services came complexity. A new challenge emerged – how to find the root cause of issues, fast.
Solutions like metrics and logs are limited in their abilities when it comes to microservices. While they still provide insights about the application, they lack important information about the interactions between the services.
Engineers lack visibility into the journey that their requests and messages go through across the microservices architecture. They do not always know which components a request touched, or how those components affect their own code. As a result, it is becoming harder to understand and investigate the context of errors, where they began, and how to fix them. This is where distributed tracing makes the difference.
What is Distributed Tracing
Tracing provides information about the journey requests go through in our system, end-to-end. This includes data about the services they touched, the data progression and flow between the services, and errors that occurred.
In distributed architectures, requests span multiple services. These services can operate in different environments, containers, machines, cloud infrastructures, and more. This complexity makes the root cause of a performance issue or an error much harder to understand and investigate.
Distributed tracing is the solution.
We can think of traces as ‘call-stacks’ for microservices.
With tracing, you can identify how long each request took, which components and services it interacted with, and the latency created during each step.
By analyzing and investigating tracing data, engineers gain deep visibility into what happened to requests between services and their relationships. They are able to pinpoint, fix and even prevent issues and bottlenecks and optimize operations.
How Distributed Tracing Works
In a distributed architecture, communication takes place through requests that trigger operations/actions in the services. Such actions include HTTP requests, database queries, and more. Usually, a single request will trigger multiple actions across multiple services, which together make up a user journey.
A trace aggregates the information from all the actions triggered by a single request, in the correct order. Each of these actions is recorded as a span. A span that triggers another span is called a “parent span”, while the span it triggers is called a “child span”.
In other words, traces track requests across services, aggregate “parent” and “child” spans in the right order, and collect and monitor data along the journey, from start to finish.
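To make the parent/child relationship concrete, here is a minimal sketch in plain Python. The `Span` class, its field names, and the sample service names are assumptions made for illustration, not the data model of any specific tracing tool. It groups spans by their parent and prints them as an indented, call-stack-like outline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One action within a trace (hypothetical, simplified model)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int


def render_trace(spans: list[Span]) -> list[str]:
    """Return the spans as an indented outline, parents before children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are ordered by start time so the outline follows the journey.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans often arrive out of order; the IDs are what let us rebuild the journey.
spans = [
    Span("t1", "c", "a", "db: SELECT orders", 30, 70),
    Span("t1", "a", None, "GET /checkout", 0, 120),
    Span("t1", "b", "a", "auth-service: verify token", 5, 25),
    Span("t1", "d", "a", "payment-service: charge", 75, 115),
]
outline = render_trace(spans)
print("\n".join(outline))
```

The root span (`GET /checkout`) has no parent, and every other span points at the span that triggered it. This is why a trace reads like a call stack for a request.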
To aggregate and track requests, every operation is tagged with IDs. All operations that originated from the same request share the same trace ID, and each operation also gets its own unique span ID.
The data that is collected about each operation includes the involved services’ names and addresses, start and end times of the operations, contextual metadata (e.g. tags) that explains the activity, and more.
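The sketch below shows one way these IDs can travel between services, assuming the request is sent over HTTP and the trace context is carried in a `traceparent` header in the W3C Trace Context format (`version-traceid-parentid-flags`). The helper functions, the span dictionaries, and the service names are hypothetical and only illustrate the idea.

```python
import secrets
import time
from typing import Optional


def new_trace_id() -> str:
    """Random 16-byte ID shared by every span in one trace (32 hex chars)."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte ID that is unique to a single span (16 hex chars)."""
    return secrets.token_hex(8)


def start_span(service: str, name: str, trace_id: str, parent_id: Optional[str]) -> dict:
    """Record the kind of data a span carries (simplified, hypothetical shape)."""
    return {
        "trace_id": trace_id,
        "span_id": new_span_id(),
        "parent_id": parent_id,
        "service": service,
        "name": name,
        "start": time.time(),
        "tags": {},
    }


def inject(headers: dict, span: dict) -> None:
    """Write the current trace context into the outgoing request headers."""
    headers["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"


def extract(headers: dict) -> tuple[str, str]:
    """Read (trace_id, parent_span_id) from the incoming request headers."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id


# Service A: the request enters the system here, so a new trace begins.
frontend_span = start_span("frontend", "GET /cart", new_trace_id(), None)
outgoing_headers: dict = {}
inject(outgoing_headers, frontend_span)

# Service B: continues the same trace using the IDs found in the header.
trace_id, parent_id = extract(outgoing_headers)
inventory_span = start_span("inventory", "db: SELECT stock", trace_id, parent_id)
inventory_span["tags"]["db.system"] = "postgresql"
inventory_span["end"] = time.time()
frontend_span["end"] = time.time()

print(outgoing_headers["traceparent"])
```

Both spans end up with the same trace ID, while the second span gets its own span ID and records the first span as its parent. That is the information a tracing backend needs to stitch the two services into one journey.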
After the data is collected in traces, it can be analyzed and visualized. This provides visibility into the microservices architecture and enables engineers to monitor and debug any performance issues, errors, and latency problems.
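As a rough illustration of what such analysis can look like, the sketch below takes the spans of one trace and computes each span's "self time" (its own duration minus the time spent in its direct children), then reports the slowest span and any spans that failed. The span fields and sample values are made up for the example, and the calculation assumes child spans run one after another without overlapping.

```python
from collections import defaultdict

# Hypothetical spans from a single trace, with times in milliseconds.
spans = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout",
     "start_ms": 0, "end_ms": 480, "error": False},
    {"span_id": "b", "parent_id": "a", "name": "auth-service: verify token",
     "start_ms": 10, "end_ms": 40, "error": False},
    {"span_id": "c", "parent_id": "a", "name": "payment-service: charge",
     "start_ms": 50, "end_ms": 470, "error": True},
    {"span_id": "d", "parent_id": "c", "name": "db: INSERT payment",
     "start_ms": 60, "end_ms": 110, "error": False},
]


def self_times(trace_spans: list[dict]) -> dict[str, int]:
    """Map each span name to its duration minus its direct children's durations."""
    child_total: dict[str, int] = defaultdict(int)
    for span in trace_spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["end_ms"] - span["start_ms"]
    return {
        span["name"]: (span["end_ms"] - span["start_ms"]) - child_total[span["span_id"]]
        for span in trace_spans
    }


times = self_times(spans)
bottleneck = max(times, key=times.get)
failed = [span["name"] for span in spans if span["error"]]

print("self time per span:", times)
print("bottleneck:", bottleneck)
print("failed spans:", failed)
```

In this sample, the root request takes 480 ms, but almost all of that time is spent inside `payment-service: charge`, which is also the span that failed. Visualization tools present the same kind of breakdown graphically.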
Why We Need Distributed Tracing
Distributed tracing provides multiple benefits to engineering teams:
- Insights – tracing provides data-driven, detailed insights into the system and the journey requests go through. These insights help engineers troubleshoot issues in distributed services so they can develop higher-quality code and push it faster, with confidence.
- Visibility – engineers gain new and comprehensive visibility into the system architecture and request journeys, enabling them to see and understand how services interact and how requests are being handled across services.
- Reduced MTTR – the ability to understand request flows end-to-end helps identify bottlenecks and errors, debug existing issues faster and prevent future problems before they impact customers. This makes development in a complex microservices architecture simpler and faster, with higher-quality results. It also improves engineering productivity.
- Cross-team collaboration – in a microservices architecture, it is common for each service to be the responsibility of a different team. When each team can see how their requests are being handled across services that are the responsibilities of other teams, they have information to work together to fix and prevent issues.
Distributed Tracing Users
Distributed tracing can be used by:
- Developers
- DevOps engineers
- SREs
- IT teams
- Operations teams
Each of these users can benefit from traces, as a way to gain understanding, investigate and troubleshoot their system.
When to Use Distributed Tracing
Distributed tracing is especially impactful in distributed architectures where engineers lack visibility into the complex relationships within their services. Monolith architectures are somewhat simpler in this regard, and there are more tools for monitoring and testing them. As a result, the connections, dependencies, and correlations are clearer.
In microservices, on the other hand, engineers do not know what the impact of their changes will be. When services are distributed across environments, virtual machines, containers, etc., it is hard to pinpoint the root cause of issues or to predict the impact of pushing code. As a result, when an issue occurs, as often happens, they find themselves spending a lot of time and resources trying to understand where and why it happened.
Other data types, like logs and metrics, do not answer the same needs. They are often too high-level (metrics), too low-level and service-specific (logs), or do not cover the relevant part of your system, so they cannot give engineers the answers they need.
Leverage traces to get answers to the following questions:
- What’s the impact of the code I’m pushing on other services? Will I break anything?
- What’s the root cause of this error?
- What does my architecture look like?
- Which areas in my system do I need to improve and optimize?
Distributed Tracing vs. Logs
Logs are like a breadcrumb trail we leave within our application. They provide engineers with data about events that occur in their services. Thanks to logs, engineers can be informed that an error occurred, and they can get an understanding as to why.
But in a distributed architecture, our code is distributed, and with it, our logs. This is where traces complement logs. While logs provide information about activities within services, distributed traces provide information about what goes on between services, helping us put everything in context.
In addition, unlike logs, traces are:
- Automated – adding logs is a manual process, whereas traces can be generated automatically with the right tools.
- Observable / Visual – seeing traces provides immediate visibility and insights into the system, whereas logs are difficult to navigate and understand.
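Because logs and traces complement each other, a common pattern is to stamp every log line with the ID of the trace it belongs to, so a log entry can be tied back to the request journey. The sketch below shows one way to do this with Python's standard `logging` and `contextvars` modules. The `current_trace_id` variable, the `TraceIdFilter` class, the logger name, and the sample trace ID are all assumptions made for the example.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


# Logs are captured in memory here so the example is self-contained.
log_output = io.StringIO()
handler = logging.StreamHandler(log_output)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# In a real service, this value would come from the incoming request's trace context.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("charging card")
logger.error("card declined")

print(log_output.getvalue(), end="")
```

With the trace ID on each line, an engineer who spots the error in the logs can look up the matching trace and see what happened in the other services during the same request.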
Distributed Tracing and OpenTelemetry
OpenTelemetry is an open-source project led by the CNCF (Cloud Native Computing Foundation). OpenTelemetry enables the automated collection and generation of traces, logs, and metrics with a single specification. With OpenTelemetry, engineers do not have to add traces manually or devise custom solutions for their own architectures and data types.
Once the traces are created using OpenTelemetry, they can be exported to a visualization tool for engineers to use for monitoring and troubleshooting. These tools can be open source, commercial, or sometimes homegrown.
For distributed tracing, OpenTelemetry automatically captures data from each service it is installed on and creates spans, which are aggregated into traces. Analyzing these OpenTelemetry-based traces in external tools provides insights into the system and services.
Open Source Distributed Tracing Tools
As mentioned, distributed tracing needs to be visualized so engineers can understand how their requests behaved and how services interact. The two main open-source tools that enable this are Jaeger and Zipkin. Both Jaeger and Zipkin:
- Integrate with OpenTelemetry
- Correlate data from spans
- Visualize request journeys throughout the architecture
- Provide web-based visualization
Jaeger Tracing, and mainly its UI for visualization, is the more popular of the two. To try Jaeger yourself and learn how to set it up in your system, read this complete Jaeger guide.
Getting Started with Distributed Tracing
OpenTelemetry is becoming the standard tool for tracing, which makes it a good starting point for learning and implementing traces. To get started with OpenTelemetry, check out this free, vendor-neutral OpenTelemetry Bootcamp.
The Bootcamp includes:
- Episode 1: OpenTelemetry Fundamentals
- Episode 2: Integrate Your Code (logs, metrics and traces)
- Episode 3: Deploy to Production + Collector
- Episode 4: Sampling and Dealing with High Volumes
- Episode 5: Custom Instrumentation
- Episode 6: Testing with OpenTelemetry
Distributed Tracing FAQs
What is distributed tracing?
Distributed tracing is the practice of collecting information about how a single request interacts with the different services in a microservices architecture, from end to end. Each action a request triggers is recorded as a span, which holds data about that action. A trace aggregates all the spans of a request in the right order.
Why is distributed tracing important?
Traces enable engineers to investigate and troubleshoot their microservices faster and with more confidence than before.
How do you implement traces?
OpenTelemetry is the open-source specification for creating traces. Once you implement the OpenTelemetry SDK in your application, trace data is collected automatically. With this data, the microservices architecture can then be visualized in an open-source or commercial tool.
What is Jaeger?
Jaeger is described as “open source, end-to-end distributed tracing”. It is a suite of open source projects managing the entire distributed tracing “stack”: client, collector, and UI. Jaeger UI is the most commonly used open-source tool for visualizing traces.
What is the primary use case for distributed tracing?
Distributed tracing is used for monitoring and debugging distributed systems. It allows engineers to pinpoint the root cause of issues faster and even prevent future errors and bottlenecks.
Who uses OpenTelemetry?
OpenTelemetry can be used by DevOps, SREs, IT teams, developers, and operations. They can gain observability into microservices to troubleshoot their system.
Distributed tracing is the solution for microservices visibility. Microservices enable agile development, but their complexity can impede code quality. By implementing distributed tracing, engineers can see and understand the systems they are developing, to find and fix issues.
OpenTelemetry is the open-source specification for distributed tracing. It is automated and supports multiple languages and vendors. To get started with OpenTelemetry, you can learn from our free, vendor-neutral OpenTelemetry Bootcamp here.