OpenTelemetry-go makes it hard to migrate to exponential histograms

github.com/prometheus/client_golang has a nice feature where a histogram can simultaneously be both an exponential (native) histogram and a classic (explicit bucket) one. Then, depending on what formats are accepted by the scraper, either both the native histogram version and the classical one are exposed, or just the classical one (text format). There has been an initiative to expose native histograms in text format, but it hasn’t yet solidified.

Unfortunately, in the OpenTelemetry standards, a histogram at any point can only have either explicit buckets or exponential buckets. From https://opentelemetry.io/docs/specs/otel/metrics/data-model/#model-details:

In this low-level metrics data model, a Timeseries is defined by an entity consisting of several metadata properties:

  • Metric name
  • Attributes (dimensions)
  • Value type of the point (integer, floating point, etc)
  • Unit of measurement

The primary data of each timeseries are ordered (timestamp, value) points, with one of the following value types:

  • Counter (Monotonic, Cumulative)
  • Gauge
  • Histogram
  • Exponential Histogram

Unfortunately, native (exponential) and classic (explicit bucket) histograms expose different metric names even though they belong to the same metric family. An explicit bucket histogram exposes these extra metrics for each histogram:

  • METRICNAME_bucket
  • METRICNAME_sum
  • METRICNAME_count

Whereas with a native histogram, you would only get METRICNAME. This means that when switching to exponential histograms, you will have to create what is called “a view” of each histogram that you want to convert. The code looks as follows:

package main

import (
	"go.opentelemetry.io/otel/sdk/instrumentation"
	"go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	exponentialView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "exponentiallatency",
			Aggregation: metric.AggregationBase2ExponentialHistogram{
				MaxSize:  160,
				MaxScale: 20,
			},
		},
	)
	explicitView := metric.NewView(
		metric.Instrument{
			Name:  "latency",
			Scope: instrumentation.Scope{Name: "http"},
		},
		metric.Stream{
			Name: "explicit",
			Aggregation: metric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{0, 1, 2, 5, 10},
			},
		},
	)

	// The created views can then be registered with the OpenTelemetry metric
	// SDK using the WithView option.
	_ = metric.NewMeterProvider(
		metric.WithView(exponentialView, explicitView),
	)
}

(from https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric)

What is worse is that this API forces you to redefine the bucket boundaries in the view, even though they could be (and are) already defined next to the histogram's creation site.

So, it seems to me like the suggestion is to switch everything to AggregationBase2ExponentialHistogram and then, after deploying these changes, rush to update all of your alerting/recording rules and dashboards.

My gRPC Annoyances

We have been using gRPC in Thanos since the project's inception – it has served us well: it has a ton of useful functionality, it solves a lot of problems, and it is easy to use. However, I feel like some things are lacking, especially around performance, and they will most likely never be fixed (or, I should say, changed). The framework just does not solve 100% of the things that the Thanos project needs right now. Let's go through the list of my pet peeves.

Sophisticated compression

A huge part of the network traffic between servers in Thanos is spent sending sets of labels, essentially a map<string, string>. Typically, consecutive maps differ in only two values.

Also, we really need streaming to avoid having to buffer the whole response in memory beforehand. Strictly speaking, streaming might not be required because we buffer everything on the querier side either way right now, but there needs to be a way of building the message incrementally.

However, because gRPC is message-based, it is impossible to have a compression scheme that spans multiple messages. What I would like to have is some string table that is shared over the whole stream. This might also partly be because gRPC uses HTTP as its transport, where it is impossible to negotiate compression parameters "out of band". While researching this topic, I stumbled upon https://github.com/WICG/compression-dictionary-transport. Perhaps this will get through at some point and gRPC will be able to leverage that work.

Compression helps a little bit here but not a lot. The gRPC codec interface takes a []byte as input, meaning the input must be a contiguous array of bytes in memory. Generally speaking, with compression, repeated sequences of bytes are stored as references, so it would be possible to avoid allocating memory for those repeated sequences if the protobuf unmarshaling interface did not require a []byte and accepted an io.Reader instead. https://github.com/grpc/grpc-go/issues/499 misses a huge point – unmarshaling, I believe, could still accept an io.Reader instead of a []byte. The point about Marshal() is valid, though.

Fortunately, there is some recent movement to fix this: https://github.com/grpc/grpc-go/issues/6619.

Mirages of ownership

gRPC-Go presents an interesting conundrum to its users – when user-written code returns from a function that serves a remote procedure call, gRPC-Go takes the returned value and marshals it at some point in the future. This is done for performance reasons, but it also means that the user gives away ownership of the data: it must never be modified after returning. This can become a footgun when slices or maps are involved, because values of those types are references, making it easy to mutate them accidentally.

Here’s how the “hello world” example looks in grpc-go:

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	log.Printf("Received: %v", in.GetName())
	return &pb.HelloReply{Message: "Hello " + in.GetName()}, nil
}

In this case, grpc-go takes ownership of &pb.HelloReply{Message: "Hello " + in.GetName()}. This obviously presents no problems, but what if there were a slice or a map involved?

type server struct {
	names []string
	...
}

// Constantly running in the background.
func (s *server) updateNames() {
	for {
		for i := 0; i < len(s.names); i++ {
			s.names[i] = generateRandomName()
		}
		time.Sleep(1 * time.Second)
	}
}

// SayHello implements helloworld.GreeterServer.
func (s *server) SayHello(ctx context.Context, srv pb.HelloServer) error {
	return srv.SendMsg(&pb.HelloReply{Names: s.names})
}

Now this is bad, because grpc-go might be marshaling s.names while it is being updated in the background. You can find some more context here.

Of course, this is a contrived example, and you might wonder how this could even happen in practice. Well, I had a botched attempt at adding pooling to the marshaling side in Thanos: https://github.com/thanos-io/thanos/pull/4609/files. Fortunately, SendMsg now includes a helpful warning:

	// It is not safe to modify the message after calling SendMsg. Tracing
	// libraries and stats handlers may use the message lazily.

Hopefully, all of these problems will be fixed at some point. I am confident that the Codec interface will change for the better soon; the compression work will take longer. Together, these changes would let us really reduce the memory and CPU usage of Thanos.

Perhaps we will also look at other RPC frameworks in the future. dRPC is a recent project that piqued my interest. I have only dabbled with it for a few hours, so I don't have an opinion on it yet. That's something for future posts!