In this blog post, you will learn how to get started with OpenTelemetry. We will go over what OpenTelemetry is and how it works, explore what distributed tracing is, and take a look under the OpenTelemetry hood: its structure, deployment methods, and some best practices.
Guide to OpenTelemetry: Contents
- What is OpenTelemetry?
- What is OpenTelemetry used for?
- The three types of telemetry
- What is distributed tracing?
- Distributed traces with Jaeger Tracing
- How does OpenTelemetry tracing work?
- The OpenTelemetry stack
- How does the OpenTelemetry SDK work?
- OpenTelemetry deployment strategies
- OpenTelemetry auto and manual instrumentation
- OpenTelemetry free course
What is OpenTelemetry?
OpenTelemetry is a set of APIs and SDKs that allow us to collect and export traces, logs, and metrics (also known as the three pillars of observability).
It is a CNCF community-driven open-source project (Cloud Native Computing Foundation, the folks in charge of Kubernetes).
OpenTelemetry enables us to instrument our cloud-native applications. Instrumenting your code means collecting telemetry data from the events that happen in your systems, which ultimately helps you understand your software’s performance and behavior.
The OpenTelemetry project is unique for 3 main reasons:
- It is open-source.
- It collects logs, metrics, and traces and acts as the glue that brings them together.
- It follows one specification that is respected by all vendors – a standard observability framework.
You can follow the project’s status here.
OpenTelemetry operates as a single library that captures all of this information in a high-quality, unified way, under a single specification, and ships it to your dedicated location (a backend, a collector, etc.). This means the OpenTelemetry data you collect can be sent to many open-source tools (such as Jaeger and Prometheus) as well as vendors.
It combines all three signals needed for proper monitoring (leading with tracing, followed by metrics), and developers can use it with most modern programming languages.
As of today, many vendors are aligning to OpenTelemetry, which means:
- You can stay vendor-agnostic, not tied down to a single tool or platform.
- You can send OpenTelemetry data to different vendors and open-source tools for testing and comparison purposes with only a simple configuration change.
- You can use fewer platforms and get the most out of each one.
P.S. If you got the idea and want to try OpenTelemetry yourself, here are a few language-specific guides:
What is OpenTelemetry used for?
As developers, one of the biggest aspects of our job is to respond quickly to incidents that occur in production and to resolve these issues as fast as possible.
To be able to do that, we need to gather a lot of data fast so we can understand the full picture of what’s going on in production and tackle these incidents as soon as they arise.
But as the number of services in many organizations increases, collecting the data in a way that truly helps us understand and troubleshoot our system has become a complicated task.
OpenTelemetry serves as a standard observability framework that captures all telemetry data under a single specification.
It provides several components, including:
- APIs and SDKs per programming language for generating telemetry
- The OpenTelemetry Collector, which receives, processes, and exports telemetry data to different destinations
- The OTLP protocol for shipping telemetry data
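To give a feel for how these pieces fit together, here is a minimal sketch of a Collector configuration: telemetry arrives at an OTLP receiver, passes through a batch processor, and is exported to a tracing backend. The `jaeger:4317` endpoint is a placeholder you would replace with your own backend address.

```yaml
receivers:
  otlp:                 # accept telemetry over the OTLP protocol
    protocols:
      grpc:
      http:

processors:
  batch:                # batch data before exporting to reduce overhead

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```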
OpenTelemetry is currently the open-source standard for collecting distributed traces, which ultimately helps us solve issues in our systems.
Distributed tracing is becoming a crucial tool for pinpointing, troubleshooting, and solving performance issues, errors, and more in distributed systems.
The Three Types of Telemetry
Telemetry is the data we use to monitor our applications. It’s a term meant to cover a wide range of data types, such as metrics, logs, and traces.
To troubleshoot our distributed system as fast as possible, we want to have these three data types – logs, metrics, and traces.
Logs: Logs are the trail of breadcrumbs we leave within our application to read them later and understand how the application is behaving. For example, if your application failed to write to the database, you’d be able to read that in your logs.
In distributed systems, your code is also getting distributed, and with that, your logs. You end up having a distributed trail of breadcrumbs, which is extremely difficult to follow.
Our ability to understand where a problem lies diminishes as we continue to distribute our applications. To be more specific, we lose the ability to correlate where an operation started, where a request came from, and the process it went through.
Metrics: Metrics provide us with a high-level view of our system’s health and if it behaves within our desired boundaries.
Metrics are great at showing you when behavior has changed. However, since metrics are so high level, we only know which application is experiencing a change in behavior (e.g., database) and what metric was changed (e.g. high CPU).
We do not have the relevant information as to why it’s happening, what the root cause is, and how we can fix it.
OpenTelemetry brings us distributed tracing, a third signal that, together with logs and metrics, helps us get the complete picture.
Distributed traces: The context of what is happening and why, and the story of the communication between services. Traces allow us to visualize the progression of requests as they move through our system.
Having a holistic view of all three gives you the visibility you need to pinpoint problems in production and solve them as fast as possible.
What is Distributed Tracing?
Distributed tracing tells us what happens between different services and components and showcases their relationships. This is extremely important in a distributed services architecture, where many issues are caused by failed communication between components.
Traces specify how requests propagate through our services, closing many of the gaps we had when we relied solely on metrics and logs.
Each trace consists of spans. A span describes an event in our system that spans over time, for example, an HTTP request or a database operation (it starts at time X and has a duration of Y milliseconds). Usually, a span is the parent and/or child of another span.
A trace is a tree of spans representing the progression of a request as it is handled by the different services and components in our system. For example, sending an API call to user-service resulted in a DB query to users-db. Traces are essentially ‘call stacks’ for distributed services.
Traces tell us how long each request took, which components and services it interacted with, and the latency introduced during each step, giving you a complete picture, end-to-end.
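The parent–child structure of a trace can be sketched in plain Python. This is a toy model to make the idea concrete, not the OpenTelemetry API; the `Span` type, helper function, and service names are all illustrative.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    # A span describes one timed event in the system (toy model)
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    duration_ms: float = 0.0

def child_of(parent: Span, name: str, duration_ms: float) -> Span:
    # A child span shares its parent's trace ID and records the parent's span ID
    return Span(name, parent.trace_id,
                parent_id=parent.span_id, duration_ms=duration_ms)

# Root span: the API call that starts the trace
root = Span("orders-service: POST /orders", trace_id=uuid.uuid4().hex)
verify = child_of(root, "users-service: GET /users/verify", 12.5)
stock = child_of(root, "stock-service: PUT /stock", 30.0)
db = child_of(stock, "users-db: find", 8.2)

# Every span in the tree carries the same trace ID, tying the story together
assert {s.trace_id for s in (root, verify, stock, db)} == {root.trace_id}
```

The key property is visible in the last line: all spans share one trace ID, while parent IDs encode the tree shape that visualization tools like Jaeger draw.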
To learn more about the benefits of distributed tracing vs. logs, read this quick article.
Distributed Traces with Jaeger Tracing
Traces lend themselves to visualization, allowing us to see our system and better understand the relationships between services, making it easier to investigate and pinpoint issues.
Here’s what traces look like in Jaeger Tracing – an open-source tool that allows you to visualize traces.
Let’s go over what exactly is provided to us:
The Root Request and Event Order
This tracing data “tree” explains what the root request is and what was triggered by this action.
It essentially lays out the hierarchy and order in which each request was executed.
Within this tree, we have a parent-child relationship for each span. Each child span happened because of its parent span.
- The very first request is made by the orders-service, shown as the first span in the hierarchy. This is the “parent” span, which initiated the trace.
- The orders-service then communicates with the users-service, calling its /users/verify endpoint.
- The orders-service then sends an API call to the stock-service to update the stock, which leads to the stock-service running a .find operation via Mongoose.
This visualization details the length of each request, when it starts and ends, and what happened in parallel or sequence. That’s super important when we’re trying to identify performance issues and optimize different operations.
Each row within the timeline graph represents a span and the corresponding request listed in the hierarchy. You can now see which request took the most time.
Each operation can produce its own logs and metrics. But by adding traces, we’re getting the full story and visual context of the whole process.
How does OpenTelemetry tracing work?
Let’s start by covering the OpenTelemetry stack.
The OpenTelemetry Stack
Generally speaking, the OpenTelemetry stack consists of 3 layers:
- Your application (in which you’d implement the OpenTelemetry SDK)
- The OpenTelemetry collector
- A visualization layer
Your application: As soon as the SDK is implemented in your application and traffic starts to flow, data from all your services will immediately be sent to the OpenTelemetry collector.
OpenTelemetry collector: Once gathered, you can choose to send the data to a dedicated location (in the image above, we are sending the data to a database).
Visualization: You can work with a third party to visualize the traces (as we did above with Jaeger).
How does the OpenTelemetry SDK work?
So now that you know what OpenTelemetry is and what the stack looks like, it’s time to take a look at what’s under the hood.
Let’s say your application has two services: service A and service B.
Service A sends an API call to service B and once that happens, service B writes to the database.
Both services have the OpenTelemetry SDK and we’re using the OpenTelemetry collector.
Once service A makes an API call to service B, service A also sends the collector a span describing that call, essentially letting it know that it sent an API call (making it the parent in the trace).
There’s now a “parent / child” relationship between Service A and Service B (Service A being the parent of all the actions that follow).
OpenTelemetry injects the details about the parent span into the API call to service B, using an HTTP header to carry the trace context (trace ID, span ID, and more). By default, it follows the W3C Trace Context specification, which defines the traceparent header.
Once service B receives the HTTP call, it extracts the context from that header. From that point on, any action service B takes is reported as a child of that span.
All this magic happens out of the box.
This action brings another key aspect of OpenTelemetry – context propagation.
Essentially, it’s the mechanism that allows us to correlate spans across services. The context is transferred over the network using metadata such as HTTP headers.
The header will include a trace ID, which identifies the whole sequence of HTTP calls that were performed. It will also include a span ID, which represents the event – or span – that just took place.
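The inject/extract round trip can be sketched with the standard library alone. This is an illustration of the W3C traceparent header format, not the OpenTelemetry propagator API; the helper function names are made up for this example.

```python
import re
import secrets

def make_traceparent(trace_id, span_id):
    # W3C Trace Context format: version-traceid-spanid-flags
    return f"00-{trace_id}-{span_id}-01"

def parse_traceparent(header):
    # Validate and split the header back into (trace_id, span_id)
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", header)
    if not m:
        raise ValueError("malformed traceparent header")
    return m.group(1), m.group(2)

# Service A creates the context and injects it into the outgoing HTTP headers
trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
span_id = secrets.token_hex(8)     # 8 random bytes  -> 16 hex chars
headers = {"traceparent": make_traceparent(trace_id, span_id)}

# Service B extracts the context: any span it creates reuses the trace ID
# and records the incoming span ID as its parent
extracted_trace, parent_span = parse_traceparent(headers["traceparent"])
assert extracted_trace == trace_id and parent_span == span_id
```

The SDKs do this injection and extraction for you automatically; the sketch just shows what travels over the wire.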
Because service B also has the SDK implemented, it sends data to the collector as well, informing it that it received an API call from service A (its parent).
The same goes for service B’s call to the DB: it creates another span carrying the same trace ID that service A originally created.
OpenTelemetry Deployment Strategies
Depending on your deployment strategy, you may not need to use all aspects of the stack. When discussing deploying OpenTelemetry, there are two components we need to consider:
- The SDK: in charge of collecting the data.
- The Collector: the component responsible for how to receive, process, and export telemetry data.
For each component, you can choose between the open source-only path, the vendor path, or a combination of the two.
The first thing you should ask yourself is: am I going to work with a vendor or do I want to only work with open source solutions?
Using both Vendor and Open-Source
How to collect the data (Native SDK or Distro)
- Vendor’s Distro – With this option, you use the vendor’s OpenTelemetry distribution (also called a distro) to instrument your services.
An OpenTelemetry distribution (distro) is a customized version of an OpenTelemetry component. Essentially, it is the OpenTelemetry SDK with some vendor- or end-user-specific customizations.
- OpenTelemetry Native SDK – if you prefer not to use a vendor for this part of the process, you can use the OpenTelemetry native SDK.
Where to send the data (Collector)
The OpenTelemetry collector receives the data that is being sent by the SDK. It then processes the data and sends it to any destination you’ve configured it to, like a database or a visualization tool.
If you choose to install the OpenTelemetry native SDK, you can send data directly to the vendor or send data to your own OpenTelemetry collector that then sends it to the vendor for visualization.
In any case, you must install some OpenTelemetry SDK in your services. It can be either the OpenTelemetry out-of-the-box SDK or a vendor distro.
Pro tip: go first with vendor-neutral unless you get significant value from using the distro.
Using pure Open-source
If you choose to go with pure open-source, you will be using the native OpenTelemetry SDK, the native OpenTelemetry collector, and an open-source visualization layer, such as Jaeger.
If you choose this path, you’ll need to run the whole stack on your own. This allows you to be extremely flexible but requires a lot of management and manpower. Everything you need to do to run a backend application, you will need to do here as well.
To dive a bit deeper into the pros and cons of using a vendor’s distro and managing a collector, read this quick guide.
OpenTelemetry Auto and Manual Instrumentation
Instrumenting: collecting data from different libraries (e.g., the AWS SDK or a Redis client) and producing spans that represent their behavior.
There are two ways spans are created – Automatic instrumentation and manual instrumentation.
Here’s a quick TL;DR
- Auto instrumentations – you don’t need to write code to collect spans
- Manual instrumentations – you do need to write code to collect spans
Auto instrumentations are ready-to-use libraries developed by the OpenTelemetry community. They automatically create spans from your application libraries (e.g., a Redis client or an HTTP client).
Once you send an HTTP GET request using an HTTP library, the instrumentation automatically creates a new span with the corresponding data.
Manual instrumentation works by manually adding code to your application in order to define the beginning and end of each span as well as the payload.
Pro tip: aim to use auto instrumentations. Check the OpenTelemetry registry to find all available instrumentation libraries: https://opentelemetry.io/registry/
However, there are a few common cases that call for manual instrumentation:
- Unsupported libraries – Not all libraries in all languages have ready-to-use instrumentation. In that case, you’ll want to instrument these libraries manually, and perhaps even create and contribute instrumentations of your own.
- Internal libraries – many organizations create their own libraries (for various reasons), which require you to create your own instrumentation. If that is the case, here’s what you can do:
- Take inspiration from a similar open-source library that already has instrumentation.
- Follow the OpenTelemetry specification so that your visualization tool works as expected.
Before you start working on manual instrumentation, you should know that manual instrumentations:
- Require good knowledge of OpenTelemetry
- Are time-consuming and hard to maintain
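To make the manual approach concrete, here is a toy, stdlib-only sketch of the pattern: a context manager that marks the start and end of a span and attaches the payload. It illustrates the idea only; the real OpenTelemetry API looks different (e.g., Python’s `tracer.start_as_current_span`) and exports spans to a collector rather than a list.

```python
import time
from contextlib import contextmanager

# Finished spans are collected here; a real SDK would export them instead
finished_spans = []

@contextmanager
def start_span(name, attributes=None):
    # Manual instrumentation: the developer marks the beginning and end of
    # the operation and attaches the payload (attributes) by hand
    span = {"name": name, "attributes": attributes or {}}
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(span)

# Wrap the operation you want to measure; the order ID is a made-up example
with start_span("process-order", {"order.id": "1234"}):
    time.sleep(0.01)  # the work being timed

assert finished_spans[0]["name"] == "process-order"
```

Even in this tiny form you can see why manual instrumentation is time-consuming: every operation worth tracing needs its own explicit wrapper.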
Learn hands-on how to use instrumentations in the OpenTelemetry Bootcamp:
- Episode 2 – How to run an auto instrumentation
- Episode 5 – How to create your own custom instrumentation.
Get Started with OpenTelemetry
In this guide, we covered the very fundamentals of OpenTelemetry. There is a lot more to learn and understand, especially if you plan on implementing OpenTelemetry in your company.
If you want to learn more, you can check out this free, 6-episode, OpenTelemetry Bootcamp (vendor-neutral).
The Bootcamp contains live coding examples that you can follow along with.
It’s basically your OpenTelemetry playbook where you will learn everything, from very hands-on basics to scaling and deploying to production:
- Episode 1: OpenTelemetry Intro and Basic Deployment
- Episode 2: Integrate Your Code (logs, metrics, and traces)
- Episode 3: Deploy to Production + Collector
- Episode 4: Sampling and Dealing with High Volumes
- Episode 5: Custom Instrumentation
- Episode 6: Testing with OpenTelemetry