OpenTelemetry-go makes it hard to migrate to exponential histograms

github.com/prometheus/client_golang has a nice feature where a histogram can simultaneously be both an exponential (native) histogram and a classic (explicit bucket) one. Then, depending on what formats are accepted by the scraper, either both the native histogram version and the classical one are exposed, or just the classical one (text format). There has been an initiative to expose native histograms in text format, but it hasn’t yet solidified.

Unfortunately, in the OpenTelemetry standards, a histogram at any point can only have either explicit buckets or exponential buckets. From https://opentelemetry.io/docs/specs/otel/metrics/data-model/#model-details:

In this low-level metrics data model, a Timeseries is defined by an entity consisting of several metadata properties:

  • Metric name
  • Attributes (dimensions)
  • Value type of the point (integer, floating point, etc)
  • Unit of measurement

The primary data of each timeseries are ordered (timestamp, value) points, with one of the following value types:

  • Counter (Monotonic, Cumulative)
  • Gauge
  • Histogram
  • Exponential Histogram

Unfortunately, native (exponential) and classic (explicit bucket) histograms expose different metric names even though they belong to the same metric family. An explicit bucket histogram exposes these extra metrics for each histogram:

  • METRICNAME_bucket
  • METRICNAME_sum
  • METRICNAME_count

Whereas with a native histogram, you would only get METRICNAME. This means that when switching to exponential histograms, you will have to create what is called “a view” of each histogram that you want to convert. The code looks as follows:

package main

import (
	"go.opentelemetry.io/otel/sdk/instrumentation"
	"go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	exponentialView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "exponentiallatency",
			Aggregation: metric.AggregationBase2ExponentialHistogram{
				MaxSize:  160,
				MaxScale: 20,
			},
		},
	)
	explicitView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "explicit",
			Aggregation: metric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{0, 1, 2, 5, 10},
			},
		},
	)

	// The created views can then be registered with the OpenTelemetry
	// metric SDK using the WithView option.
	_ = metric.NewMeterProvider(
		metric.WithView(exponentialView, explicitView),
	)
}

(from https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric)

What is worse is that this API forces you to redefine the bucket boundaries in the view, even though they could be (and usually are) already defined near the histogram call, where the instrument is created.

So, it seems to me that the practical suggestion is to use just AggregationBase2ExponentialHistogram and then, after deploying these changes, rush to update all of your alerting/recording rules and dashboards.

Perils of /metrics data assertions

In the Thanos project the e2e tests are written in part using assertions on the metrics data. I have encountered a few challenges with those assertions that I wanted to share with you.

Always check for the maximum possible value

Writing tests on /metrics data means your testing harness continuously polls the /metrics endpoint to check whether some metric equals some value. If the metric takes some time to reach its maximum value, you might erroneously write an equality check against a smaller, intermediate value. Later on, this will lead to a flaky test.

To fix this, I’d suggest running the flaky test locally and writing an assertion for the bigger, final value. Then run the same test again and see what happens. Also, look at the logic inside your code – perhaps the maximum value could be calculated instead of hard-coded as a constant?

Check for specific metrics

By default, the end-to-end testing framework only accepts a metric name and sums up all matching series. I would encourage you to be maximally specific, because metric updates are not usually transactional. In other words, the user (in this case, the e2e testing framework) might see results that are in the middle of being updated. For instance, if the state is constantly fluctuating, the sum will change, and the sum of all matching series might never equal the desired value. Imagine that we have some code like this (pseudo-code):

foometric.set(123)
// <--- E2E testing framework request for /metrics page comes in here.
barmetric.set(5000)

If the metrics weren’t set previously, then the sum can be either 123 or 5123, depending on when the scrape lands. Hence, the best practice is to be as specific as possible with the label matchers. You can use this function to do that.