OpenTelemetry Sampling: Introduction
So you’re using OpenTelemetry to gain insights into your microservices’ interactions.
To gain those insights, you need to store the created traces and spans somewhere, and later on, have some form of visualization.
It may be a cloud-hosted Jaeger or some observability vendor.
In either case, you would most likely pay more for more data, which leads us to the main and most obvious motivation for using sampling: the cost.
If you decide to not sample, you risk incurring heavy charges from the relevant cloud provider/vendor.
Sampling also helps performance: unsampled spans are still collected but never sent, which reduces network load and saves some CPU and memory in the process that would otherwise send them.
What to Expect
- What is sampling?
- OpenTelemetry sampling: When and where
- OpenTelemetry Samplers
- Managing OpenTelemetry Cost with Aspecto
- Wrap up: Which sample rate should you choose
So what does it mean when we say sampling?
It means deciding not to save every trace your application creates, on the assumption that a subset of them is enough to understand patterns in your services and gain better insights.
There are two main motivations for sampling: saving costs (as mentioned above) and getting cluttering data out of your way.
In this article, I will give you all the information I think you need to know about sampling, even if you start without any knowledge about it.
OpenTelemetry Sampling: When and Where
Let’s talk a bit about when (and where) we can make the decision to sample. There are 2 approaches – Head-Based Sampling and Tail-Based Sampling.
I will review them both.
As the name suggests, head-based sampling means deciding whether or not to sample upfront, at the beginning of the trace.
This is the most common way of sampling today because of its simplicity. But since we don’t know everything in advance, we’re forced to make arbitrary decisions (like sampling a random percentage of all traces) that may limit our ability to understand everything. This is done at the OTEL distro level.
A disadvantage of head-based sampling is that you can’t choose to sample only traces with errors, because the decision to sample happens before any error has occurred.
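To make the head-based idea concrete, here is a minimal Python sketch using only the standard library. It is not the OpenTelemetry SDK: the function name `head_sample` and the 25% rate are assumptions chosen for illustration. Note that the decision function receives nothing about the outcome of the request, which is exactly why it cannot single out traces with errors.

```python
import random


def head_sample(sample_rate: float, rng: random.Random) -> bool:
    """Decide at the start of a trace, before any span has run.

    The only input is the sample rate: no status code, no duration,
    no error flag, because none of those exist yet.
    """
    return rng.random() < sample_rate


# Seeded only so the sketch is reproducible; a real service would not seed.
rng = random.Random(42)
decisions = [head_sample(0.25, rng) for _ in range(10_000)]
kept = sum(decisions)
print(f"kept {kept} of {len(decisions)} traces")
```

With a 25% rate, roughly a quarter of the traces are kept, regardless of whether they later turn out to be interesting.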
Tail-based sampling is the opposite of head-based sampling: we make the decision at the end of the entire flow, when we have already gathered the data. This type of sampling is done at the collector level (the collector is the backend that receives all the spans).
This is useful when the decision depends on measurements. For example, to sample based on latency we must know the exact start and end times, which cannot be known in advance.
Also, what was a disadvantage of head-based sampling is an advantage of tail-based sampling: being able to sample only traces with errors in them.
More information about the OTEL collector’s tail sampling processor can be found here.
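The tail-based flow can be sketched in a few lines of plain Python. This is a simplified illustration, not the collector's tail sampling processor: the `Span` and `TailSampler` names, the 500 ms threshold, and the assumption that the caller knows when a trace is complete are all hypothetical. A real collector typically waits for a configured period before deciding.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    start_ms: int
    end_ms: int
    is_error: bool = False


class TailSampler:
    """Buffers spans per trace and decides once the trace is complete."""

    def __init__(self, latency_threshold_ms: int):
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer = defaultdict(list)  # trace_id -> list of spans

    def on_span(self, span: Span) -> None:
        # Every span must be held in memory until the decision is made.
        self._buffer[span.trace_id].append(span)

    def decide(self, trace_id: str) -> bool:
        """Keep the trace if any span failed or the whole trace was slow."""
        spans = self._buffer.pop(trace_id, [])
        if not spans:
            return False
        has_error = any(s.is_error for s in spans)
        duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
        return has_error or duration > self.latency_threshold_ms


sampler = TailSampler(latency_threshold_ms=500)
sampler.on_span(Span("t1", "GET /users", 0, 120))
sampler.on_span(Span("t1", "db.query", 10, 90, is_error=True))
sampler.on_span(Span("t2", "GET /health", 0, 5))
sampler.on_span(Span("t3", "GET /report", 0, 900))

keep_t1 = sampler.decide("t1")  # contains an error span
keep_t2 = sampler.decide("t2")  # fast and error-free
keep_t3 = sampler.decide("t3")  # slower than the threshold
```

The buffer is the price of this approach: nothing can be dropped until the whole trace has been seen.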
So where should sampling be implemented?
Well, that depends on your specific use case so there is no one solution that fits all.
If you choose to do it at the OTEL distro level, you remove redundant data at the source, never needing to worry about it again. You also minimize data transported in the network.
However, when you need to update the sample rate you have to redeploy your services each time.
If you implement it in the collector, you have a centralized place that controls sampling, so you don’t need to redeploy your services when you change your sample rate.
However, making the sampling decision requires buffering the data until a decision can be made and thus adds overhead.
At Aspecto, we took the best of both worlds, sampling at the SDK level but allowing our users to control the sample rate from the user interface, eliminating the need for redeploying the services on each change.
So the OpenTelemetry way of implementing the actual sampling is through “Samplers”.
A sampler’s purpose is to lower the load (or the amount of created traces), and to reduce costs.
There are a few types of samplers, and we will discuss some of them here:
The AlwaysOn sampler, as its name suggests, essentially means not sampling at all: it takes 100% of the spans. In a perfect world, we would use only this one, without any cost considerations.
If you’re reading this, we’re most likely not in a perfect world yet, so let’s move on to the next one.
Also as the name suggests, the AlwaysOff sampler samples 0% of the spans. This means that no data will be collected whatsoever. You probably won’t be using this one much, but it could be useful in certain cases.
For example, when you run load tests and don’t want to store the traces created by them.
Parent Based Sampler
This is the most popular sampler and is the one recommended by the official OpenTelemetry documentation.
When a trace begins, we decide whether or not to sample it. Whatever the decision, it is propagated down inside the process and to other processes (for example, other services).
There is a notable advantage to this: you always get the complete picture.
How this works: for the first span which is the root of the trace, we decide whether or not it will be sampled.
The decision is bubbled through the rest of the child spans in this trace via context propagation, making each child know if it needs to be sampled or not.
It is important to understand that this is a composite sampler, meaning it does not make decisions on its own but delegates to other samplers that you define for each case. For example, you can define what to do when there is no parent by setting the root sampler, and you can define what to do for a remote parent or a local parent with different samplers.
See more about this in the official specification.
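The delegation logic of a parent-based sampler can be sketched as follows. This is a simplified stand-in written in plain Python, not the SDK's class: `ParentBasedSketch`, `ParentInfo`, and the function-style samplers are hypothetical names. The defaults shown (follow the parent's decision) match what the specification describes for the four parent cases.

```python
from typing import Callable, Optional

# A sampler here is just a function from trace ID to a keep/drop decision.
Sampler = Callable[[int], bool]


def always_on(trace_id: int) -> bool:
    return True


def always_off(trace_id: int) -> bool:
    return False


class ParentInfo:
    """What a child span learns about its parent via context propagation."""

    def __init__(self, sampled: bool, is_remote: bool):
        self.sampled = sampled
        self.is_remote = is_remote


class ParentBasedSketch:
    """Composite sampler: picks a delegate based on the parent, if any."""

    def __init__(
        self,
        root: Sampler,
        remote_parent_sampled: Sampler = always_on,
        remote_parent_not_sampled: Sampler = always_off,
        local_parent_sampled: Sampler = always_on,
        local_parent_not_sampled: Sampler = always_off,
    ):
        self.root = root
        self.remote_parent_sampled = remote_parent_sampled
        self.remote_parent_not_sampled = remote_parent_not_sampled
        self.local_parent_sampled = local_parent_sampled
        self.local_parent_not_sampled = local_parent_not_sampled

    def should_sample(self, parent: Optional[ParentInfo], trace_id: int) -> bool:
        if parent is None:
            # No parent: this span is the root, so the root sampler decides.
            return self.root(trace_id)
        if parent.is_remote:
            delegate = (
                self.remote_parent_sampled
                if parent.sampled
                else self.remote_parent_not_sampled
            )
        else:
            delegate = (
                self.local_parent_sampled
                if parent.sampled
                else self.local_parent_not_sampled
            )
        return delegate(trace_id)


sampler = ParentBasedSketch(root=always_off)
root_decision = sampler.should_sample(None, trace_id=1)
child_of_sampled = sampler.should_sample(ParentInfo(True, is_remote=True), 1)
child_of_unsampled = sampler.should_sample(ParentInfo(False, is_remote=False), 1)
```

Because children follow the parent's decision by default, a trace is either kept whole or dropped whole, which is the "complete picture" advantage described above.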
As of writing these lines, the documentation and specification mentioned above recommend using the parent-based sampler with TraceIDRatioBased sampler as root sampler.
The TraceIDRatioBased sampler uses the trace ID to calculate whether or not the trace should be sampled, with respect to the sample rate we choose.
But be warned: the specific algorithm has not yet been specified in the specification, so different implementations may produce a different result for the same input (see here).
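Since the algorithm is not pinned down by the specification, the following is only one possible approach, sketched in plain Python: treat the low 64 bits of the trace ID as a number and keep the trace when it falls below a threshold derived from the sample rate. The function name `ratio_sample` and the choice of the low 64 bits are assumptions for illustration, and other implementations may decide differently for the same trace ID.

```python
import random

TRACE_ID_LOW_BITS = (1 << 64) - 1  # mask for the low 64 bits of a 128-bit ID


def ratio_sample(trace_id: int, sample_rate: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    threshold = round(sample_rate * (1 << 64))
    return (trace_id & TRACE_ID_LOW_BITS) < threshold


# Generate some random 128-bit trace IDs (seeded for reproducibility).
rng = random.Random(7)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(ratio_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces at a 10% rate")
```

Because the decision depends only on the trace ID and the rate, every service using the same algorithm reaches the same conclusion for a given trace without any coordination. That property is lost when services mix implementations that use different algorithms.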
In terms of performance – even if you decide to use 0% sampling, there’s a minimal overhead because a span is created anyway but is not sent. This is done for the propagation of the context.
Managing OpenTelemetry Cost with Aspecto
You can cut your tracing costs with Aspecto remote sampling configuration without changing your code. Easily sample traces based on languages, libraries, routes, errors, and more.
Remote Head-based sampling
Set it up at the OpenTelemetry SDK level. This is how you can do that in Aspecto: for example, create a rule to sample 100% of the traces whose http.status_code is in the 4xx range and whose duration is greater than 15 seconds.
You can add other attributes such as Request Path, agent, message broker, method, and more.
To test out sampling in Aspecto, create a new free-forever account and follow the documentation.
Remote Tail-based sampling
Set it up at the Collector level. This is how you can do that in Aspecto: for example, create a rule to sample 100% of the traces whose http.status_code is in the 4xx range and whose duration is greater than 15 seconds.
You can then choose to create extra conditions for your rules, for example based on errors or on attribute key-value pairs (e.g. traces that contain an attribute userId equal to 12345).
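To show what evaluating such a rule amounts to, here is a small Python sketch. It is purely illustrative and does not reflect Aspecto's rule format or API: the function `matches_rule`, its parameters, and the dictionary-based attributes are hypothetical. It encodes the example rule from the text (4xx status and duration above 15 seconds) plus an optional attribute condition.

```python
from typing import Any, Dict, Optional


def matches_rule(
    attributes: Dict[str, Any],
    duration_seconds: float,
    required_attribute: Optional[tuple] = None,
) -> bool:
    """Return True when a trace should be kept under the example rule."""
    status = attributes.get("http.status_code")
    is_4xx = isinstance(status, int) and 400 <= status <= 499
    if not (is_4xx and duration_seconds > 15):
        return False
    if required_attribute is not None:
        # Extra condition: a specific attribute key must equal a given value.
        key, value = required_attribute
        return attributes.get(key) == value
    return True


slow_404 = matches_rule({"http.status_code": 404}, duration_seconds=20.0)
fast_404 = matches_rule({"http.status_code": 404}, duration_seconds=2.0)
slow_200 = matches_rule({"http.status_code": 200}, duration_seconds=20.0)
slow_404_user = matches_rule(
    {"http.status_code": 404, "userId": 12345},
    duration_seconds=20.0,
    required_attribute=("userId", 12345),
)
```

Only the slow 4xx traces match. The fast 404 and the slow 200 are dropped by this rule and would fall through to whatever default rate is configured.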
Since tail-sampling is set up at the Collector, there are two options for you to set it up:
- Use the Aspecto Collector: Spin it up in your environment or we can manage it for you in our backend. With this option, you will be able to remotely configure new tail-sampling rules via the Aspecto UI without code changes.
- Do it yourself: You can spin up the OpenTelemetry Collector in your environment, and then only the sampled traces will be sent to Aspecto.
Wrapping Up: which sample rate should you choose?
This is a very popular question we run into. Unfortunately, there’s no magic number here either.
It really depends on your budget, the way your services are built, and the amount of traffic you have.
If you have a very “noisy” endpoint with a lot of traffic – you want to set a small percentage, to avoid the cost and the noise.
If you have a core endpoint with not a lot of traffic, you want to set a high percentage, since it won’t cost much but will most likely be valuable.
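A quick back-of-the-envelope calculation shows why per-endpoint rates make sense. The endpoint names and traffic numbers below are made up for illustration and are not measurements from any real system.

```python
def stored_traces(requests_per_day: int, sample_rate: float) -> int:
    """Expected number of traces stored per day at a given sample rate."""
    return round(requests_per_day * sample_rate)


# (requests per day, sample rate) for two hypothetical endpoints
endpoints = {
    "/health (noisy)": (2_000_000, 0.01),
    "/checkout (core)": (5_000, 1.0),
}

daily = {
    name: stored_traces(requests, rate)
    for name, (requests, rate) in endpoints.items()
}
total = sum(daily.values())

for name, count in daily.items():
    print(f"{name}: {count} traces/day")
print(f"total: {total} traces/day")
```

In this made-up scenario, keeping every trace of the core endpoint adds less volume than keeping 1% of the noisy one, so a high rate on the valuable endpoint is cheap.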
But the bottom line is this: you need to make a wise guess and keep adjusting it until you reach the sweet spot that you’re satisfied with.
That’s it for today. I hope you now have a better understanding of why and how to use sampling.
If something is not clear or you have any questions, feel free to contact me via Twitter DMs @thetomzach.
To dig even deeper into sampling (and deploying to production) watch our 4th episode of the OpenTelemetry Bootcamp
Tom Zach is a Software Engineer at Aspecto. Feel free to follow him on Twitter for more great articles like this one: How to Use OpenTelemetry to Improve Your Integration Tests.