
Managing the cost of OpenTelemetry

The dark side of the OpenTelemetry moon


Right off the bat, you might find this blog post a bit strange. We are not used to managing the cost of our metrics and logs (and we surely do not need someone to write a blog post about it). So what is unique about OpenTelemetry that calls for a blog post about managing its cost?

What to Expect

Well, to begin with, the one thing we should all know is that OpenTelemetry can be expensive. 

Quick note: even though it can be costly, with the right adjustments (as we will see later), we can make the most of it while keeping the cost down – well worth it!

But what makes it potentially costly?

We can answer this question by looking at a few differences:

  • OpenTelemetry's automatic nature
  • Tracing verbosity
  • Tracing severity (or lack thereof)

OpenTelemetry can collect a few different types of signals: logs, metrics, and traces (with more to come, for example, code profiling).

How do we manage the cost of logs, metrics, and traces?

Metrics

1) Metrics provide a high-level view of our system's health and whether it behaves within our desired boundaries. They are great at showing you when behavior has changed. Metrics are cost-effective since they are a numeric summary/aggregation of what we measure.

2) Application-layer metrics (those that developers write manually) are custom coded – we control their rate.
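
For example, a hand-written counter in the OpenTelemetry JavaScript API only emits an aggregated number per export interval. The meter name, counter name, and attribute below are placeholders, and a metrics SDK/exporter is assumed to be configured elsewhere:

```typescript
import { metrics } from '@opentelemetry/api';

// Assumes a MeterProvider with an exporter was registered at startup.
// 'checkout-service' and 'orders_processed' are placeholder names.
const meter = metrics.getMeter('checkout-service');

const ordersProcessed = meter.createCounter('orders_processed', {
  description: 'Number of orders successfully processed',
});

export function recordOrder(paymentMethod: string): void {
  // One aggregated data point per export interval – cheap to store,
  // no matter how many times this function is called.
  ordersProcessed.add(1, { 'payment.method': paymentMethod });
}
```

However many times recordOrder is called, what gets exported stays a compact numeric summary (one data point per attribute combination).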

Logs

1) With logs, the cost starts to be more significant than with metrics. Since logs are essentially text messages for individual events rather than aggregations, as we have more events, logs (and costs) start to pile up.

2) Usually, we control how many logs we write and their verbosity. We can quite easily comment out highly verbose log records.

3) Using log severity to control the log rate is a great option. Most production environments would not log any debug messages but would log info and above.
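
As a simple illustration (not tied to any particular logging library), a severity threshold read from an assumed LOG_LEVEL environment variable is usually all it takes to keep debug records out of production:

```typescript
// A minimal sketch of severity-based filtering (not a real logging library).
type Level = 'debug' | 'info' | 'warn' | 'error';
const order: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };

// LOG_LEVEL is an assumed environment variable; default to 'info'.
const threshold: Level = (process.env.LOG_LEVEL as Level) ?? 'info';

function log(level: Level, message: string): void {
  if (order[level] < order[threshold]) return; // drop records below the threshold
  console.log(JSON.stringify({ level, message, ts: new Date().toISOString() }));
}

log('debug', 'cache lookup details'); // dropped when the threshold is info
log('info', 'order 42 processed');    // emitted
```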

So far, so good. Logs and metrics are good old pals of ours. We know how to manage them.

Before we move forward: if you are somewhat new to OpenTelemetry, here are the two terms you need to know for the next part:

Span – the most basic unit. A span represents an event in our system (e.g., an HTTP request or a database operation that spans over time). A span is usually the parent of another span, the child of one, or both. A trace is a tree of spans connected in a child/parent relationship.

Instrumentation – instrumentation libraries gather the data and generate spans based on the different libraries in our application (Kafka, Mongo, Gin, etc.). There are two types of instrumentation – manual and automatic.
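
To make those two terms concrete, here is a minimal manual-instrumentation sketch using the OpenTelemetry JavaScript API. The tracer and span names are placeholders, and it assumes a tracing SDK is already configured elsewhere:

```typescript
import { trace } from '@opentelemetry/api';

// Assumes a TracerProvider + exporter were registered at startup.
// 'orders-service' and the span names are placeholders.
const tracer = trace.getTracer('orders-service');

export async function handleOrder(orderId: string): Promise<void> {
  // Parent span covering the whole operation.
  await tracer.startActiveSpan('handle-order', async (parent) => {
    parent.setAttribute('order.id', orderId);

    // Child span – created while the parent is active, so it nests under it.
    await tracer.startActiveSpan('save-order-to-db', async (child) => {
      // ...the actual database call would go here...
      child.end();
    });

    parent.end();
  });
}
```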

Traces

Tracing is the troublemaker signal of the family (sorry, traces). 

1) Traces are very costly, as they are mostly generated automatically and are large in size.

2) Auto-instrumentations auto-generate spans: when your service receives an HTTP call, the instrumentation automatically creates a corresponding span. As developers, you don't need to write a single line of code to make it happen, which is a tremendous value in terms of adoption, but in terms of cost, it creates a firehose of spans (see the sketch after this list).

3) Spans don't have a severity level. A span can represent an error, but not a whole range of severities. That means you cannot choose to collect only spans that are "warn" and above, making it harder to filter out verbose spans.
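
To see where that firehose comes from, this is roughly what turning on automatic instrumentation looks like in Node.js. The package names are real, but the exporter endpoint is a placeholder and your setup may differ:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// One call wires up instrumentations for HTTP, Express, databases, and more.
// Every incoming request, outgoing call, and query now produces spans –
// with no per-span code written by us, and no severity level to filter on.
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```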

How to manage OpenTelemetry cost?

So OpenTelemetry automatically creates a considerable amount of spans with no severity. What can we do to manage its cost?

Sampling tracing data is the answer we are after. Instead of paying for every fish in the pond, we choose only the fascinating ones (first time I am using a fish analogy, I swear).

I will not go into the technicalities of how it works, but in general, you have two options:

1) I want to sample X percent of the telemetry data.

In this case, all data is equal. You pick X% out of your entire trace data. You would probably find out you are sampling the most common X% rather than the most insightful traces.
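
In the OpenTelemetry JavaScript SDK, this usually means a ratio-based head sampler wrapped in a parent-based one, so whole traces are kept or dropped together. A minimal sketch, assuming a Node.js setup and a 10% ratio chosen just for illustration:

```typescript
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Keep roughly 10% of traces, decided up front from the trace ID,
// and follow the parent's decision for child spans so traces stay intact.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

provider.register();
```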

2) I want to sample by rules.

For example, you want to sample 100% of traces with errors, or 50% of traces with latency above 1 second. This option requires more work on your end but yields better results.

Here we are getting into the complicated world of head- and tail-based sampling. You can read more about it in this short guide.
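
Rules like "all traces with errors" or "latency above 1 second" generally need a tail-based decision, taken after the whole trace has been collected (commonly in the OpenTelemetry Collector's tail sampling processor). Head-based rules can still trim obvious noise inside the SDK. Here is a sketch of a custom sampler under that assumption – the 'http.route' attribute, the '/healthz' value, and the 20% ratio are all illustrative, and exact exports can vary between SDK versions:

```typescript
import { Attributes, Context, Link, SpanKind } from '@opentelemetry/api';
import {
  Sampler,
  SamplingDecision,
  SamplingResult,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// Head-based rule sampler: the decision is made when the root span starts,
// so it can only use what is known at that moment (name, kind, attributes).
class RuleBasedSampler implements Sampler {
  // Keep ~20% of everything that doesn't match a rule (ratio is an example).
  private fallback: Sampler = new TraceIdRatioBasedSampler(0.2);

  shouldSample(
    context: Context,
    traceId: string,
    spanName: string,
    spanKind: SpanKind,
    attributes: Attributes,
    links: Link[],
  ): SamplingResult {
    // Drop noisy health checks entirely (the attribute name is an assumption).
    if (attributes['http.route'] === '/healthz') {
      return { decision: SamplingDecision.NOT_RECORD };
    }
    return this.fallback.shouldSample(
      context,
      traceId,
      spanName,
      spanKind,
      attributes,
      links,
    );
  }

  toString(): string {
    return 'RuleBasedSampler';
  }
}
```

You would then wrap it in a ParentBasedSampler and hand it to the tracer provider, just like the ratio sampler above. For the error and latency rules themselves, the decision has to happen after the trace is complete – which is exactly what tail-based sampling in the Collector is for.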

If you are approaching a sampling implementation of any sort and want to get more hands-on, watch episode 4 of the OpenTelemetry Bootcamp on YouTube (it has chapters). We cover a ton in there – from how to calculate the cost to tips for production deployment, and more.

I hope this helped shed some light on why OpenTelemetry can be expensive, why sampling is even a thing, and why it is important to bring it into the OTel discussion early.

If you have any questions, feel free to reach out.
