Everything You Need To Know About The Index Cache Size in Thanos

Just like most pieces of software nowadays, Thanos is no exception: there is some caching going on there. In particular, we will talk about the index cache and its size in Thanos Store. After a certain bug was fixed, a lot of users who were running with the default size of 200MiB ran into problems, because that limit started being enforced whereas before it was not.

I feel that this is the perfect opportunity to explain how it works and how to determine the appropriate size for your deployment.

Modus Operandi

Thanos Store, on a user’s request, needs to go into the configured remote storage and retrieve the data that it needs to fulfill that query. However, how does it know what samples to retrieve? The answer is index files. Just like the TSDB used in Prometheus, it needs the index files to know where to get the relevant data to execute a user’s Series() call.

There are two types of items stored in that cache: postings and series. You can find all of the detailed information here; however, let me sum it up in this post.

So, first of all, we need to find out which series contain data for a given label pair. This is what postings give us.

Now… what is the series data? If you have ever seen what the TSDB looks like on disk, you might have seen that there is a directory called chunks. That is where the actual series data lies… however, how do we know what is in there? That is where the series data in the index files comes in. It contains a bunch of information about where to find it, such as: the chunk count, references to the chunk data, the minimum and maximum time, et cetera.

Thus, to avoid constantly looking up the same data in the indices if we are refreshing the same dashboard in Grafana, an index cache was added to Thanos Store. It saves a ton of requests to the underlying remote storage.
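
To make this a bit more concrete, here is a deliberately simplified Go sketch of what such a byte-bounded cache conceptually does. It is not the actual Thanos implementation – the real one is LRU-based and does quite a bit more bookkeeping – and all of the names in it are made up for illustration:

package main

import "fmt"

// indexCache is a made-up, minimal stand-in for the idea behind the index
// cache in Thanos Store: postings and series entries are accounted for in
// bytes, and once the configured limit would be exceeded, old entries are
// evicted to make room.
type indexCache struct {
    maxSizeBytes uint64
    curSizeBytes uint64
    order        []string          // insertion order, used as a crude eviction policy
    items        map[string][]byte // a key could be e.g. "postings:<block>:<label pair>"
}

func newIndexCache(maxSizeBytes uint64) *indexCache {
    return &indexCache{maxSizeBytes: maxSizeBytes, items: map[string][]byte{}}
}

func (c *indexCache) add(key string, val []byte) {
    size := uint64(len(key) + len(val))
    if size > c.maxSizeBytes {
        return // the item alone is too big for the whole cache
    }
    // Evict until the new item fits. Not updating curSizeBytes on add is,
    // in essence, the bug that made the default limit toothless before 0.3.2.
    for c.curSizeBytes+size > c.maxSizeBytes {
        oldest := c.order[0]
        c.order = c.order[1:]
        c.curSizeBytes -= uint64(len(oldest) + len(c.items[oldest]))
        delete(c.items, oldest)
    }
    c.items[key] = val
    c.order = append(c.order, key)
    c.curSizeBytes += size
}

func (c *indexCache) get(key string) ([]byte, bool) {
    v, ok := c.items[key]
    return v, ok // a miss means another round trip to the remote object storage
}

func main() {
    c := newIndexCache(128) // a tiny limit, just for the example
    c.add("postings:blockA:job=node", []byte("series IDs 1,2,3"))
    c.add("series:blockA:2", []byte("chunk refs, min/max time"))
    _, hit := c.get("postings:blockA:job=node")
    fmt.Println("hit:", hit, "current size:", c.curSizeBytes, "bytes")
}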

How do we know that it is working, though? Let’s continue on to the next section…

Available metrics

  • thanos_store_index_cache_items_added_total – total number of items that were added to the index cache;
  • thanos_store_index_cache_items_evicted_total – total number of items that were evicted from the index cache;
  • thanos_store_index_cache_requests_total – total number of requests to the cache;
  • thanos_store_index_cache_items_overflowed_total – total number of items that could not be added to the cache because they were too big;
  • thanos_store_index_cache_hits_total – total number of times that the cache was hit;
  • thanos_store_index_cache_items – total number of items that are in the cache at the moment;
  • thanos_store_index_cache_items_size_bytes – total byte size of items in the cache;
  • thanos_store_index_cache_total_size_bytes – total byte size of keys and items in the cache;
  • thanos_store_index_cache_max_size_bytes – a constant metric which shows the maximum size of the cache;
  • thanos_store_index_cache_max_item_size_bytes – a constant metric which shows the maximum item size in the cache.

As you can see, that’s a lot to take in. But it is good news, since we can learn a lot about the current state of the cache at any time.

Before this bug was fixed in 0.3.2, you would have been able to observe that thanos_store_index_cache_items_evicted_total was almost always 0, because the current size of the index cache was not being increased when items were added. Thus, the only time we would have evicted anything from the cache was when a huge, internal limit was hit.

Obviously, this means that back in the day RAM usage could grow without bound, and users did not run into this problem because we were practically caching everything. That is not the case anymore.

Currently, for some users, the issue of a too-small index cache manifests as the number of goroutines growing into the tens of thousands when a request comes in. This happens because each request is handled in its own goroutine, and if the request asks for a relatively large amount of data that is not in the cache (thanos_store_index_cache_hits_total is relatively small compared to thanos_store_index_cache_requests_total), we need to retrieve a lot of postings and series data.

Determining the appropriate size

So, let’s get to the meat of the problem: if the default value of 200MiB is giving you problems then how do you select a value that is appropriate for your deployment?

Just like with all caches, we want it to be as hot as possible – that means almost every lookup should hit it. You should check whether, in your current deployment, thanos_store_index_cache_hits_total is only a bit lower than thanos_store_index_cache_requests_total. Depending on the number of requests coming in, the difference might be bigger or smaller, but the two should still be reasonably close. Different sources cite different numbers, but ideally the hit ratio should be around 90%; lower values like 50–60% are acceptable as well.
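
The check itself is just dividing the increase of one counter by the increase of the other over some window; in PromQL it would be something along the lines of rate(thanos_store_index_cache_hits_total[5m]) / rate(thanos_store_index_cache_requests_total[5m]). Here is the same arithmetic as a tiny Go sketch – the numbers are made up for illustration:

package main

import "fmt"

// hitRatio computes the hit ratio from the deltas of the two counters
// over some time window.
func hitRatio(hitsDelta, requestsDelta float64) float64 {
    if requestsDelta == 0 {
        return 1 // no requests in the window, nothing to worry about
    }
    return hitsDelta / requestsDelta
}

func main() {
    // Hypothetical deltas of thanos_store_index_cache_hits_total and
    // thanos_store_index_cache_requests_total over, say, the last 5 minutes.
    hits, requests := 9200.0, 10000.0

    ratio := hitRatio(hits, requests)
    fmt.Printf("index cache hit ratio: %.0f%%\n", ratio*100)

    switch {
    case ratio >= 0.9:
        fmt.Println("looks hot enough")
    case ratio >= 0.5:
        fmt.Println("acceptable, but a bigger index cache would not hurt")
    default:
        fmt.Println("too cold - the cache is most likely too small")
    }
}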

Theoretically, you could take the average size of your index files and figure out how many of them you would want to hold in memory. Then multiply the two and specify the result as --index-cache-size (in practice we will be able to hold even more series and postings data, since the index files contain other information as well).
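
As a back-of-the-envelope example of that calculation (all of the numbers below are purely hypothetical):

package main

import "fmt"

func main() {
    const (
        avgIndexFileBytes = 250 * 1024 * 1024 // say the average index file is ~250MiB
        filesToKeepHot    = 40                // how many blocks' indices you would like cached
    )

    suggested := int64(avgIndexFileBytes) * filesToKeepHot
    fmt.Printf("starting point for --index-cache-size: about %.0f GiB\n",
        float64(suggested)/(1<<30))
    // This overshoots a little on purpose: only postings and series data from
    // the index files actually end up in the cache, so the real need is lower.
}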

The next thing to look at is the difference between thanos_store_index_cache_items_added_total and thanos_store_index_cache_items_evicted_total over some kind of time window. Ideally, we should aim to avoid the situation where we are constantly adding and removing items from the cache. Otherwise, it leads to cache thrashing, and we might see that Thanos Store is not performing any kind of useful work and that the number of goroutines is constantly high (in the millions). Please note that the latter metric is only available from 0.4.

Another metric which could indicate problems is thanos_store_index_cache_items_overflowed_total. It should never be more than 0. Otherwise, it means that either we tried to add an item which by itself was too big for the cache, or we had to remove more than saneMaxIterations items from the cache, or we had removed everything and the item still could not fit. This mostly happens only when there is huge index cache pressure, and any value above 0 indicates problems. To fix it, you need to increase the index cache size.
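
The same kind of rough arithmetic works for the churn and overflow checks above – again, the deltas would come from the corresponding counters over some time window, and the values here are invented:

package main

import "fmt"

func main() {
    var (
        added      = 50000.0 // delta of thanos_store_index_cache_items_added_total
        evicted    = 48000.0 // delta of thanos_store_index_cache_items_evicted_total
        overflowed = 12.0    // delta of thanos_store_index_cache_items_overflowed_total
    )

    // If almost everything that gets added is evicted again shortly after,
    // the cache is thrashing and Thanos Store spends its time re-fetching
    // postings and series data instead of doing useful work.
    if added > 0 && evicted/added > 0.9 {
        fmt.Println("cache is churning - consider increasing --index-cache-size")
    }

    // Overflows should simply never happen.
    if overflowed > 0 {
        fmt.Println("items could not be added at all - increase --index-cache-size")
    }
}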

Finally, please take a look at the query timings of requests coming into your deployment. If it takes more than 10 seconds to open up a dashboard in Grafana with 24 hours of data, then that most likely points to a problem here as well.

Lastly, let me share some numbers. On one of my deployments, there are about 20 queries coming in every second. Obviously, it depends on the nature of those queries, but a 10GB index cache lasts for about a week before we hit the limit and have to start evicting items from it. With such a size, the node works very smoothly.

Choosing Maximum Concurrent Queries in Prometheus Smartly

The popular monitoring software Prometheus has recently introduced (from 2.6) new knobs which regulate how many queries can be executed concurrently.

There are even separate knobs for the different interfaces. Here they are with their default values:

  • --storage.remote.read-concurrent-limit=10
  • --query.max-concurrency=20

The latter is an upper-bound value to the former option. However, the question is:

How do you choose sensible values for them?

Well, I think I have the answer.

The number should be picked such that it does not exceed the number of threads of execution on your (virtual) machine. Ideally, it should be a bit lower, because when your machine encounters huge queries, it is (probably) also going to use the CPU for other operations such as sending packets over the network.
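
As a sketch of that rule of thumb (the headroom of 2 is my own arbitrary pick, not anything Prometheus prescribes), you could derive a starting point from the number of hardware threads the Go runtime sees on the machine:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Number of threads of execution available to the process.
    threads := runtime.NumCPU()

    // Leave a little headroom for scraping, parsing, committing new samples
    // and shipping bytes over the network.
    headroom := 2
    maxConcurrency := threads - headroom
    if maxConcurrency < 1 {
        maxConcurrency = 1
    }

    fmt.Printf("threads of execution: %d\n", threads)
    fmt.Printf("suggested --query.max-concurrency: %d\n", maxConcurrency)
}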

I recently noticed empirically that executing a “huge” (let’s not delve into the definition here) query makes my Prometheus Docker container start using just a bit more than 100% of the CPU. This gave me an idea for this blog post.

Go has a concept called goroutines, which are lightweight threads that are multiplexed onto “real” OS threads. Also, a single goroutine is an indivisible unit of work that can be scheduled on only one thread of execution at any time. So, the question becomes: is more than one goroutine spawned during the execution of a query?

Let’s delve into the code; we will go bottom-up and work our way upwards. Sorry if I miss some detail – you can find all of the exact information in the source code.

Prometheus has a type v1.API which handles the HTTP calls. The api.queryRange function gets spawned in a fresh goroutine which handles the request and returns the result. The API type itself has a member called QueryEngine which handles the lifetime of queries from beginning to end, and it is connected to a querier which queries the underlying TSDB.

In that function, a new range query is created with NewRangeQuery using that query engine, and then the Exec method is called on it, which actually executes the query. A context is passed to it which is used to limit the amount of time the query can take.

For better or worse, the Prometheus code has a lot of types. Thus, to avoid blowing this post out of proportion by just copying and pasting the source code, I will sum things up.

It trickles down to selecting blocks according to the specified time range. The blocks which are in that range are iterated over, a querier is created for each of them with NewBlockQuerier, and then they are joined into a bigger querier which is returned for evaluating the expression that was passed. For the write-ahead log, a segments querier is created which handles the queries that touch the WAL.

With remote read, it is a bit different. That mechanism employs what is called a fanout type which implements the queryable interface. In essence, it sends those queries to other Prometheus instances and merges the results. Here, more goroutines may be spawned, but they are not performing much active work – only sending a query and waiting for a result – thus we will not count them. The same principle of passing a context everywhere is used, which limits the amount of time it can take.

The comment in prometheus/tsdb/db.go:801 says:

// Querier returns a new querier over the data partition for the given time range.
// A goroutine must not handle more than one open Querier.

So, a single goroutine must not have more than one open Querier. That more or less answers the original question.
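
To illustrate that pattern – each query handled in its own goroutine, which opens its own querier for the requested time range and closes it when it is done – here is a deliberately simplified, self-contained sketch. The Querier interface below is a made-up stand-in, not the real storage.Querier:

package main

import (
    "fmt"
    "sync"
)

// Querier is a simplified stand-in: it is opened for one time range, used by
// exactly one goroutine, and then closed.
type Querier interface {
    Select(matcher string) []string
    Close() error
}

type dummyQuerier struct{ mint, maxt int64 }

func (q *dummyQuerier) Select(matcher string) []string {
    return []string{fmt.Sprintf("series for %q in [%d, %d]", matcher, q.mint, q.maxt)}
}

func (q *dummyQuerier) Close() error { return nil }

// openQuerier mimics the role of DB.Querier(mint, maxt): every caller gets its own querier.
func openQuerier(mint, maxt int64) Querier { return &dummyQuerier{mint, maxt} }

func main() {
    queries := []string{`up`, `rate(http_requests_total[5m])`, `go_goroutines`}

    var wg sync.WaitGroup
    for _, expr := range queries {
        wg.Add(1)
        go func(expr string) { // each query is handled in its own goroutine...
            defer wg.Done()
            q := openQuerier(0, 1000) // ...which opens exactly one querier...
            defer q.Close()           // ...and closes it once the query is done.
            fmt.Println(q.Select(expr))
        }(expr)
    }
    wg.Wait()
}
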
Just for the sake of clarity, let’s see how all the top-level types are connected: the main type DB has a member of type Head, which in turn has members of type *wal.WAL (new data that was persisted on disk) and *stripeSeries (new data that is still in RAM, with some optimizations to reduce lock contention).

Here is what the type architecture looks like:

(Figure: the type architecture – the relation between the types DB, Head, *wal.WAL, and *stripeSeries.)

Because the TSDB is append-only, queries can be executed concurrently without locking the series data which is already on disk, because it cannot change as long as compaction is off and certain methods like Delete() are not called. Clever usage of RWMutex permits it to work that way.

If compaction is on, the blocks are regularly compacted in a separate goroutine and reloaded, which is seamless. Also, all of the block types guarantee atomicity per block, so that case is protected against race conditions as well, and it takes only a minimal number of locks.

All in all, we have just seen that Prometheus really tries hard to avoid locking as much as possible, and TSDB queriers execute in the same goroutine as their users. This means that the maximum number of concurrently permitted queries should not exceed the number of threads of execution of the CPU, and ideally it should be even a bit lower, because some work has to be performed for adding new data – scraping, parsing, and committing the new metrics.

If in doubt, please always evaluate the current latencies of queries coming into your Prometheus machines, and check whether they have decreased or increased after making the changes. You could create a nice Grafana dashboard for that. Obviously, revert the changes if they have not helped you, but I am pretty sure that they will if, for example, you are using the default limits and your CPU has many more than 20 cores.