Right off the bat, you might find this blog post a bit strange. We are not used to managing the cost of our metrics and logs (we surely do not need someone to write a blog post about it). So what is unique about OpenTelemetry that requires a blog post for managing cost?
What to Expect
- How do we manage the cost of logs, metrics, and traces?
- How to manage OpenTelemetry cost?
- Managing OpenTelemetry Cost with Aspecto
- Implementing OpenTelemetry Sampling
Well, to begin with, the one thing we should all know is that OpenTelemetry can be expensive.
Quick note: even though it is costly, with the correct adjustments (as we will later see), we can make the most out of it and minimize the cost – worth it!
But what makes it potentially costly?
We can answer that question by looking at a few ways traces differ from logs and metrics:
- OpenTelemetry's automatic nature
- Tracing verbosity
- Tracing severity
OpenTelemetry can collect a few different types of signals: logs, metrics, and traces (with more to come, for example, code profiling).
How do we manage the cost of logs, metrics, and traces?
Metrics
1) Metrics provide a high-level view of our system’s health and whether it behaves within our desired boundaries. They are great at showing you when behavior has changed. Metrics are cost-effective because they are a numeric summary/aggregation of what we measure.
2) Application layer metrics (those that developers manually write) are custom coded – we control their rate.
Logs
1) With logs, the cost starts to be more significant than with metrics. Since logs are essentially text messages for individual events rather than an aggregation, the more events we have, the more logs (and costs) pile up.
2) Usually, we control how many logs we write and their verbosity. We can quite easily comment out highly verbose log records.
3) Using log severity to control the log rate is a great option. Most production environments would not log any debug message but would log info or above.
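To make the severity point concrete, here is a minimal sketch using Python's standard `logging` module. The service name and messages are made up for illustration; the only thing that matters is the `INFO` threshold, which silently drops the debug record.

```python
import io
import logging

# Capture output in memory so the effect of the level is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)  # production-style threshold: drop DEBUG

logger.debug("cart contents: %s", ["sku-1", "sku-2"])  # below threshold, dropped
logger.info("order placed")                             # kept
logger.error("payment failed")                          # kept

emitted = stream.getvalue().splitlines()
print(emitted)  # ['INFO order placed', 'ERROR payment failed']
```

One setting controls how much we write and pay for. Keep this in mind, because spans have no equivalent knob.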
So far, so good. Logs and metrics are good old pals of ours. We know how to manage them.
Before we move forward: if you are somewhat new to OpenTelemetry, here are the two terms you need to know for the next part:
Span: The most basic unit. A span represents an event in our system (e.g., an HTTP request or a database operation that spans over time). A span would usually be the parent of another span, its child, or both. Traces represent a tree of spans connected in a child/parent relationship.
Instrumentation: Instrumentation libraries gather the data and generate spans based on the different libraries used in our applications (Kafka, Mongo, Gin, etc.). There are two types of instrumentation: manual and automatic.
Traces
Tracing is the troublemaker signal of the family (sorry, traces).
1) Traces are very costly as they are mostly automated and are large in size.
2) Auto instrumentations generate spans automatically: when your service receives an HTTP call, the instrumentation creates a corresponding span. As a developer, you don’t need to write a single line of code to make it happen, which is a tremendous value in terms of adoption, but in terms of cost, it creates a firehose of spans.
3) Spans don’t have a severity level. A span can represent an error, but not a whole range of severities. This means you cannot choose to collect only spans that are “warn” and above, which makes it harder to cut down on verbose spans.
How to manage OpenTelemetry cost?
So OpenTelemetry automatically creates a considerable number of spans with no severity. What can we do to manage its cost?
Sampling tracing data is the answer we are after. So instead of paying for every fish in the pool, we choose only the fascinating fish (first time I am using a fish analogy, I swear).
I will not go into the technicalities of how it works, but in general, you have two options:
1) I want to sample X percent of the telemetry data.
In this case, all data is equal. You pick X% of your entire trace data. You will probably find that you end up sampling mostly the most common traces rather than the insightful ones.
2) I want to sample by rules.
For example, you want to sample 100% of traces with errors or 50% of traces with a latency above 1 second. This option requires more work on your end but brings better results.
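The two options can be sketched in a few lines of plain Python. This is only an illustration, not how any particular SDK or vendor implements sampling: `TraceSummary`, both function names, and the `default_rate` parameter (what to do with traces that match no rule) are assumptions made up for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical, simplified view of one finished trace."""
    has_error: bool
    duration_s: float


def sample_by_percent(percent: float, rng: random.Random) -> bool:
    """Option 1: every trace has the same chance of being kept."""
    return rng.random() * 100 < percent


def sample_by_rules(trace: TraceSummary, rng: random.Random,
                    default_rate: float = 0.0) -> bool:
    """Option 2: the chance of being kept depends on the trace itself."""
    if trace.has_error:
        return True                      # 100% of traces with errors
    if trace.duration_s > 1.0:
        return rng.random() < 0.5        # 50% of traces slower than 1 second
    return rng.random() < default_rate   # assumption: everything else


rng = random.Random(0)  # seeded only to make the sketch repeatable
# 1,000 fast traces, 10 of which contain an error.
traces = [TraceSummary(has_error=(i % 100 == 0), duration_s=0.2)
          for i in range(1000)]

kept_by_percent = [t for t in traces if sample_by_percent(10, rng)]
kept_by_rules = [t for t in traces if sample_by_rules(t, rng)]

print("errors kept by percent:", sum(t.has_error for t in kept_by_percent))
print("errors kept by rules:  ", sum(t.has_error for t in kept_by_rules))
```

With the rules, all 10 error traces survive and nothing else does. With a flat 10%, roughly 100 traces survive, and whether any given error trace is among them is down to luck.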
Here we are getting into the complicated world of head and tail-based sampling. You can read more about it in this guide or keep reading for the short version.
Managing OpenTelemetry Cost with Aspecto
At Aspecto, we allow you to easily define your sampling rules without repeatedly changing your code, so you can cut your costs by sampling only the data you need. You can sample traces based on languages, libraries, routes, errors, and more.
Head-based sampling – set it up at the OpenTelemetry SDK level
Head-based sampling means making the decision to sample or not upfront, at the beginning of the trace. This is the most common way of doing sampling because of the simplicity, but since we don’t know everything in advance we’re forced to make arbitrary and limited sampling decisions.
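Here is a rough sketch of what an upfront decision can look like. It is plain Python rather than the OpenTelemetry SDK, and it is not the SDK's exact algorithm: it only shows the idea of deriving the decision from the trace ID, so every service that sees the same ID agrees without coordination. The `/health` path and the 10% ratio are hypothetical.

```python
import random

MAX_TRACE_ID = 2 ** 128  # OpenTelemetry trace IDs are 128 bits long


def head_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, before any span has finished."""
    return trace_id < ratio * MAX_TRACE_ID


def head_sample_request(trace_id: int, path: str) -> bool:
    """Hypothetical rule: keep 10% of health checks, 100% of everything else."""
    ratio = 0.1 if path == "/health" else 1.0
    return head_sample(trace_id, ratio)


rng = random.Random(42)  # seeded only to make the sketch repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

kept_health = sum(head_sample_request(t, "/health") for t in trace_ids)
kept_orders = sum(head_sample_request(t, "/orders") for t in trace_ids)
print(kept_health, kept_orders)
```

Notice what the function can see: the trace ID and the request path, nothing else. It has no idea whether the request will fail or take 20 seconds, which is exactly the limitation of deciding at the head.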
With Aspecto, you can quickly create a new rule. For example, here we sample only 10% of health-check requests where the service name is Wikipedia-service and the environment starts with prod.
You can add other attributes such as Request Path, agent, message broker, method, and more.
To test out sampling in Aspecto, create a new free-forever account and follow the documentation.
Tail-based sampling – set it up at the OpenTelemetry Collector level
Tail-based sampling means making the decision at the end of the entire workflow, on the remaining traces after head-based sampling. With tail-based sampling, you can create advanced rules to filter out traces based on any span property, including their results, attributes, and duration.
This is how you can do that in Aspecto. For example, create a rule to sample 100% of the traces whose http.status_code is in the 4xx range and whose duration is longer than 15 seconds.
You can then choose to add extra conditions to your rules, for example, based on errors or an attribute key-value pair (e.g., traces that contain an attribute userId equal to 12345).
Since tail-based sampling is performed at the Collector, there are two options for you to set it up:
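To show why tail-based rules are more expressive, here is a small plain-Python sketch of the decision. It is not Collector or Aspecto code: the `Span` class and helper names are made up, and treating the userId match as a separate rule (rather than an extra condition on the first one) is an assumption. The key difference from head-based sampling is that the function receives all the spans of a finished trace.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, minimal span: just what the rules below need."""
    trace_id: str
    start_s: float
    end_s: float
    attributes: dict = field(default_factory=dict)


def trace_duration(spans: list) -> float:
    return max(s.end_s for s in spans) - min(s.start_s for s in spans)


def has_4xx(spans: list) -> bool:
    return any(400 <= s.attributes.get("http.status_code", 0) <= 499
               for s in spans)


def tail_sample(spans: list) -> bool:
    """Decide once the whole trace is available."""
    if has_4xx(spans) and trace_duration(spans) > 15:
        return True   # 4xx status code and longer than 15 seconds
    if any(s.attributes.get("userId") == 12345 for s in spans):
        return True   # assumed separate rule: attribute key-value match
    return False      # assumption: drop traces that match no rule


spans = [
    Span("slow-404", 0.0, 16.0, {"http.status_code": 404}),
    Span("slow-404", 1.0, 15.5, {"db.system": "mongodb"}),
    Span("fast-404", 0.0, 0.2, {"http.status_code": 404}),
    Span("vip-user", 0.0, 0.1, {"http.status_code": 200, "userId": 12345}),
    Span("ok", 0.0, 0.3, {"http.status_code": 200}),
]

# Buffer spans per trace, then decide per trace.
traces = defaultdict(list)
for span in spans:
    traces[span.trace_id].append(span)

kept = sorted(tid for tid, group in traces.items() if tail_sample(group))
print(kept)  # ['slow-404', 'vip-user']
```

The buffering step is the price of this flexibility: something has to hold every span until the trace is complete, which is why this kind of sampling lives in a collector rather than in each service.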
- Use the Aspecto Collector: Spin it up in your environment or we can manage it for you in our backend. With this option, you will be able to remotely configure new tail-sampling rules via the Aspecto UI without code changes.
- Do it yourself: You can spin up the OpenTelemetry Collector in your environment, and then only the sampled traces will be sent to Aspecto.
For both head and tail sampling, you will have to do initial code configurations. From there, all rules and configurations can be done via the Aspecto UI.
Implementing OpenTelemetry Sampling
If you are approaching sampling implementation of any sort and want to get more hands-on, watch episode 4 of the OpenTelemetry Bootcamp on YouTube (it has chapters). We cover a ton in there, from how to calculate the cost to tips for production deployment, and more.
I hope this helped shed some light on why OpenTelemetry can be expensive, why sampling is even a thing, and the importance of getting it into the OTel discussion.
If you have any questions, feel free to reach out.