OpenTelemetry-go makes it hard to migrate to exponential histograms

github.com/prometheus/client_golang has a nice feature where a histogram can simultaneously be both an exponential (native) histogram and a classic (explicit bucket) one. Then, depending on which formats the scraper accepts, either both the native and the classic version are exposed (protobuf) or just the classic one (text format). There has been an initiative to expose native histograms in the text format as well, but it hasn’t solidified yet.
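
For illustration, here is a minimal sketch of that dual exposition (the metric name and the bucket factor are made up for the example):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Setting both Buckets and NativeHistogramBucketFactor makes this one
	// histogram act as a classic and a native histogram at the same time;
	// the negotiated scrape format then decides what actually gets exposed.
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:                        "http_request_duration_seconds",
		Help:                        "HTTP request latency.",
		Buckets:                     prometheus.DefBuckets,
		NativeHistogramBucketFactor: 1.1,
	})
	prometheus.MustRegister(latency)
	latency.Observe(0.42)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}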

Unfortunately, in the OpenTelemetry specification, a histogram can only have either explicit buckets or exponential buckets at any point in time. From https://opentelemetry.io/docs/specs/otel/metrics/data-model/#model-details:

In this low-level metrics data model, a Timeseries is defined by an entity consisting of several metadata properties:

  • Metric name
  • Attributes (dimensions)
  • Value type of the point (integer, floating point, etc)
  • Unit of measurement

The primary data of each timeseries are ordered (timestamp, value) points, with one of the following value types:

  • Counter (Monotonic, Cumulative)
  • Gauge
  • Histogram
  • Exponential Histogram

Unfortunately, native (exponential) and classic (explicit bucket) histograms expose different metric names even though they belong to the same metric family. An explicit bucket histogram exposes these extra metrics for each histogram:

  • METRICNAME_bucket
  • METRICNAME_sum
  • METRICNAME_count

Whereas with a native histogram, you would only get METRICNAME. This means that when switching to exponential histograms, you will have to create what is called a “view” for each histogram that you want to convert. The code looks as follows:

package main

import (
	"go.opentelemetry.io/otel/sdk/instrumentation"
	"go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	exponentialView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "exponentiallatency",
			Aggregation: metric.AggregationBase2ExponentialHistogram{
				MaxSize:  160,
				MaxScale: 20,
			},
		},
	)
	explicitView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "explicit",
			Aggregation: metric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{0, 1, 2, 5, 10},
			},
		},
	)

	// The created views can then be registered with the OpenTelemetry metric
	// SDK using the WithView option.
	_ = metric.NewMeterProvider(
		metric.WithView(exponentialView, explicitView),
	)
}

(from https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric)
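
For completeness, the provider would then typically be hooked up to a Prometheus scrape endpoint through the go.opentelemetry.io/otel/exporters/prometheus bridge; a rough sketch (the views come from the example above):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The exporter is a metric.Reader that feeds prometheus.DefaultRegisterer
	// by default, so the stock promhttp handler serves the resulting series.
	exporter, err := otelprom.New()
	if err != nil {
		panic(err)
	}

	_ = metric.NewMeterProvider(
		metric.WithReader(exporter),
		// metric.WithView(exponentialView, explicitView) would go here.
	)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}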

What is worse, this API forces you to redefine the bucket boundaries in the view, even though they can be (and usually are) already defined next to the histogram call, where the instrument is created.
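
To make the duplication concrete, here is a sketch of the instrument side, using the WithExplicitBucketBoundaries option from go.opentelemetry.io/otel/metric (the instrument and scope names match the view example above):

package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("http")

	// The boundaries are declared right here, next to the instrument...
	latency, err := meter.Float64Histogram(
		"latency",
		metric.WithExplicitBucketBoundaries(0, 1, 2, 5, 10),
	)
	if err != nil {
		panic(err)
	}
	latency.Record(context.Background(), 0.42)

	// ...yet a view that changes the aggregation has to spell them out
	// again, because views do not see the instrument-level configuration.
}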

So, it seems to me that the suggested path is to just use AggregationBase2ExponentialHistogram and then, after deploying these changes, rush to update all of your alerting/recording rules and dashboards: every query along the lines of histogram_quantile(0.9, rate(METRICNAME_bucket[5m])) has to become histogram_quantile(0.9, rate(METRICNAME[5m])) once the _bucket, _sum, and _count series disappear.

Jsonnet is not so bad

Okay, the jsonnet experience is not so bad nowadays – there is a pretty good LSP server implementation thanks to the team at Grafana, jsonnetfmt keeps the code style consistent across projects, and LLMs make it much easier to pick up any new language.

Also, since it is an interpreted language that simply yields JSON, it is very easy to prototype and test changes. This was especially apparent during the recent hackathon where our team used jsonnet to implement a “dynamic dashboard generator” in Grafana, and we won the prize. Since all Grafana dashboards are JSON documents, it is very easy to generate them with a language specifically geared towards that.

The problem with JSON, though, is that by itself it has no schema. Schemas are bolted on by the JSON Schema project. And even then, it is not ideal because it doesn’t conveniently express all the constraints you might need, e.g. if some value is X, then another value can only be between 0 and 42. Plus, jsonnet has no relation to JSON Schema except that both deal in JSON. Grafana also famously doesn’t have any (strict) JSON schema for dashboards. There has been some movement towards one, but as far as I know no formal specification exists yet.