Everything You Need To Know About The Index Cache Size in Thanos

Just like most software nowadays, Thanos does some caching. In particular, we will talk about the index cache and its size in Thanos Store. After a certain bug was fixed, many users who were running with the default size of 200MiB ran into problems, because this limit started being enforced whereas it was not before.

I feel that this is the perfect opportunity to explain how the index cache works and how to determine an appropriate size for your deployment.

Modus Operandi

Thanos Store, on a user’s request, needs to go into the configured remote storage and retrieve the data required to fulfill that query. However, how does it know which samples to retrieve? The answer is index files. Just like the TSDB used in Prometheus, it needs the index files to know where to get the relevant data to execute a user’s Series() call.

There are two types of items stored in that cache: postings and series. You can find all of the detailed information here; however, let me sum it up in this post.

So, first of all, we need to find out which series contain a given label pair. This is what postings give us.

Now… what is the series data? If you have ever seen what the TSDB looks like on disk, you might have noticed that there is a directory called chunks. That is where the actual series data lies… however, how do we know what is in there? That is where the series data in the index files comes in. It contains a bunch of information about where to find the chunks, such as the chunk count, references to the data, the minimum and maximum time, et cetera.
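To make the two item types a bit more concrete, here is a minimal toy sketch of the lookup. It is not the real TSDB index format or Thanos code: the names SeriesEntry and select_series, the chunk reference strings, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SeriesEntry:
    """Toy stand-in for one series entry in an index file (hypothetical)."""
    labels: dict
    chunk_refs: list  # references into the chunk files
    min_time: int
    max_time: int


def select_series(postings, series, label_pairs):
    """Return the entries of all series that carry every given label pair."""
    matching_ids = None
    for pair in label_pairs:
        ids = set(postings.get(pair, ()))
        # Intersect the postings lists of all requested label pairs.
        matching_ids = ids if matching_ids is None else matching_ids & ids
    return [series[i] for i in sorted(matching_ids or ())]


# Postings: label pair -> IDs of the series that contain it.
postings = {
    ("job", "node"): [1, 2],
    ("instance", "a"): [1],
    ("instance", "b"): [2],
}
# Series: series ID -> where its chunks are and which time range they cover.
series = {
    1: SeriesEntry({"job": "node", "instance": "a"}, ["chunk-0001"], 0, 3600),
    2: SeriesEntry({"job": "node", "instance": "b"}, ["chunk-0002"], 0, 7200),
}

result = select_series(postings, series, [("job", "node"), ("instance", "b")])
```

Both dictionaries here play the role of what the index cache holds: the postings answer "which series?", and the series entries answer "where are their chunks?".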

Thus, to avoid constantly looking up the same data in the indices if we are refreshing the same dashboard in Grafana, an index cache was added to Thanos Store. It saves a ton of requests to the underlying remote storage.

How do we know that it is working, though? Let’s continue on to the next section…

Available metrics

  • thanos_store_index_cache_items_added_total – total number of items that were added to the index cache;
  • thanos_store_index_cache_items_evicted_total – total number of items that were evicted from the index cache;
  • thanos_store_index_cache_requests_total – total number of requests to the cache;
  • thanos_store_index_cache_items_overflowed_total – total number of items that could not be added to the cache because they were too big;
  • thanos_store_index_cache_hits_total – total number of times that the cache was hit;
  • thanos_store_index_cache_items – total number of items that are in the cache at the moment;
  • thanos_store_index_cache_items_size_bytes – total byte size of items in the cache;
  • thanos_store_index_cache_total_size_bytes – total byte size of keys and items in the cache;
  • thanos_store_index_cache_max_size_bytes – a constant metric which shows the maximum size of the cache;
  • thanos_store_index_cache_max_item_size_bytes – a constant metric which shows the maximum item size in the cache.

As you can see, that’s a lot to take in. It is good news, though, since it means we know a lot about the current state of the cache at any time.

Before this bug was fixed in 0.3.2, you would have been able to observe that thanos_store_index_cache_items_evicted_total was almost always 0 because the current size of the index cache was not being increased when adding items. Thus, the only time we would have evicted anything from the cache was when a huge internal limit was hit.

Obviously, this means that back in the day RAM usage was growing without bound, and users did not run into this problem because we were caching practically everything. That is not the case anymore.

Currently, for some users, an index cache that is too small manifests as the number of goroutines growing into the tens of thousands when a request comes in. This happens because each request is served in its own goroutine, and a request for a relatively large amount of data needs a lot of postings and series data to be retrieved when that data is not in the cache (that is, when thanos_store_index_cache_hits_total is small relative to thanos_store_index_cache_requests_total).

Determining the appropriate size

So, let’s get to the meat of the problem: if the default value of 200MiB is giving you problems then how do you select a value that is appropriate for your deployment?

Just like with all caches, we want it to be as hot as possible, which means that almost every request should hit it. You should check whether, in your current deployment, thanos_store_index_cache_hits_total is only a bit lower than thanos_store_index_cache_requests_total. Depending on the number of requests coming in, the difference might be bigger or smaller, but the two should still be close. Different sources give different numbers: ideally the hit ratio should be around 90%, but lower values like 50-60% are acceptable as well.
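As a rough sketch of that check, the hit ratio over a time window can be computed from the increase of the two counters. The function names and the sample counter values below are made up for illustration; in practice you would read the values from your monitoring system.

```python
def counter_increase(start_value, end_value):
    """Increase of a monotonically growing counter between two samples."""
    return end_value - start_value


def hit_ratio(hits_increase, requests_increase):
    """Fraction of cache requests that were hits; None if there was no traffic."""
    if requests_increase <= 0:
        return None
    return hits_increase / requests_increase


# Hypothetical samples taken at the start and at the end of a window.
hits = counter_increase(10_000, 19_000)      # thanos_store_index_cache_hits_total
requests = counter_increase(12_000, 22_000)  # thanos_store_index_cache_requests_total
ratio = hit_ratio(hits, requests)
```

Using the increase over a window rather than the raw totals matters, because the totals include everything since the process started and can hide a recent drop in the hit ratio.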

Theoretically, you could take the average size of your index files and figure out how many of them you would want to hold in memory. Then multiply those two numbers and specify the result as --index-cache-size. In practice, the cache will cover even more index files than that, since it stores only postings and series data while the index files contain other information as well.
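The arithmetic is simple enough to sketch. The file sizes and the number of files to hold are invented numbers, and estimate_index_cache_size is just a name I picked for this example:

```python
MIB = 1024 * 1024


def estimate_index_cache_size(index_file_sizes_bytes, files_to_hold):
    """Average index file size multiplied by how many files should fit."""
    average = sum(index_file_sizes_bytes) / len(index_file_sizes_bytes)
    return int(average * files_to_hold)


# Hypothetical index file sizes of the blocks in the bucket.
sizes = [100 * MIB, 300 * MIB]
estimate_bytes = estimate_index_cache_size(sizes, files_to_hold=10)
estimate_mib = estimate_bytes // MIB
```

Treat the result as an upper-bound starting point and then adjust it based on the metrics described in this section.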

The next thing to look at is the difference between thanos_store_index_cache_items_added_total and thanos_store_index_cache_items_evicted_total over some time window. Ideally, we should avoid the situation where we are constantly adding items to the cache and evicting them again. Otherwise, we end up with cache thrashing: Thanos Store might not be doing any useful work while the number of goroutines stays constantly high (in the millions). Please note that the latter metric is only available from 0.4.
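One way to put a number on this is the ratio of evictions to additions over the same window. The sketch below assumes you already have the counter increases; the 0.9 threshold is an arbitrary value I chose for the example, not something that comes from Thanos.

```python
def eviction_ratio(added_increase, evicted_increase):
    """Evictions per added item in a window; 0.0 if nothing was added."""
    if added_increase <= 0:
        return 0.0
    return evicted_increase / added_increase


def looks_like_thrashing(added_increase, evicted_increase, threshold=0.9):
    """Hypothetical heuristic: almost every added item pushed another one out."""
    return eviction_ratio(added_increase, evicted_increase) >= threshold


calm = looks_like_thrashing(added_increase=1000, evicted_increase=50)
busy = looks_like_thrashing(added_increase=1000, evicted_increase=980)
```

A ratio close to 1 over a long window suggests that the cache is full and items rarely live long enough to be reused.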

Another metric which could indicate problems is thanos_store_index_cache_items_overflowed_total. It should never be more than 0. A non-zero value means that either we tried to add an item which by itself was too big for the cache, or we had to remove more than saneMaxIterations items from the cache, or we removed everything and the item still could not fit. This mostly happens only when there is huge pressure on the index cache. To fix it, you need to increase the index cache size.
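To illustrate how eviction and overflow relate to the counters above, here is a simplified size-bounded LRU cache. It is only a sketch under my own assumptions, not the actual Thanos implementation: the class and attribute names are made up, and it leaves out details such as the saneMaxIterations limit and the accounting of key sizes.

```python
from collections import OrderedDict


class SizeBoundedLRU:
    """Toy LRU cache limited by total byte size, with basic counters."""

    def __init__(self, max_size_bytes, max_item_size_bytes=None):
        self.max_size_bytes = max_size_bytes
        # A single item can never be allowed to exceed the whole cache.
        self.max_item_size_bytes = min(
            max_item_size_bytes or max_size_bytes, max_size_bytes
        )
        self.items = OrderedDict()
        self.current_size = 0
        self.requests = self.hits = 0
        self.added = self.evicted = self.overflowed = 0

    def get(self, key):
        self.requests += 1
        if key not in self.items:
            return None
        self.hits += 1
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        size = len(value)
        if size > self.max_item_size_bytes:
            self.overflowed += 1  # too big to ever fit
            return False
        if key in self.items:
            self.current_size -= len(self.items.pop(key))
        # Evict least recently used items until the new one fits.
        while self.current_size + size > self.max_size_bytes:
            _, old = self.items.popitem(last=False)
            self.current_size -= len(old)
            self.evicted += 1
        self.items[key] = value
        self.current_size += size
        self.added += 1
        return True


cache = SizeBoundedLRU(max_size_bytes=10)
cache.set("a", b"12345")
cache.set("b", b"12345")
cache.get("a")                         # "a" becomes the most recently used item
cache.set("c", b"123")                 # evicts "b", the least recently used item
too_big = cache.set("d", b"x" * 11)    # larger than the whole cache
```

The important part is that the current size is updated on every add; without that, nothing is ever evicted, which is the kind of behavior described for versions before 0.3.2.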

Also, please take a look at the query timings of requests coming into your deployment. If it takes more than 10 seconds to open a dashboard in Grafana with 24 hours of data, then that most likely indicates a problem with the index cache as well.

Lastly, let me share some numbers. On one of my deployments, there are about 20 queries coming in every second. Obviously, it depends on the nature of those queries, but with an index cache of 10GB it takes about a week before we hit the limit and have to start evicting items. With such a size, the node works very smoothly.

Lessons Learned From Trying To Validate a Software Business Idea for the First Time

The landing page of date4gamers

Like almost everyone, I also dream about starting my own business so that I could be free from the shackles of working for someone else and be my own boss. It could also potentially become a source of passive income.

As such, I have started reading some literature and sites like Indiehackers to learn about how others start their own software businesses. After all, software is the thing that I am most skilled in and so I ought to connect that with the other things which are involved in having a successful business to start my own software company.

This will be a post about my attempt to validate my first product idea. The whole purpose of validation is to check whether your idea is viable, i.e. whether it solves actual problems that people have and whether it is feasible, before actually starting to build it.

Idea

I thought there was a place in the market for a dating site which would connect two different things – gaming and dating. The dating site would have provided a way to add more info about yourself besides games. Originally, it should have only supported Steam so that you could, essentially, find people around you who are into the same games.

Furthermore, it would have had Tinder-style dating: essentially, a “minimalistic” user interface through which one could be matched with other people who play the same type of games and/or play for a similar amount of time.

Timeline

Initial problems

Having or making a dating site already entails a lot of issues:

  • the privacy of its users as per the GDPR and what the users expect – the ability to request information about yourself that you have in the system, the ability to delete your own account, and so on;
  • protection against harassment and other dangers.

Thus, if one were to make a prototype dating site, it would take a lot more time to bring it up to the level that is necessary for any website of that kind. That’s why I chose to make a landing page first.

Landing page

I made the website with simple static HTML and JS, using the Bulma CSS framework. I used this template as a reference. Let me confess: at first I tried to make a landing page without any kind of CSS framework, but in $CURRENT_YEAR it is nigh impossible to do that and have the website scale effortlessly to all kinds of different devices. I had some kind of prototype version that used pure CSS, but when I opened it on my Samsung phone, it rendered horribly.

The value proposition to potential users should be clear from the landing page, but that was kind of hard to do in my case. I agree that I could have done a better job: it is hard to understand how my website would differ from others judging just from that landing page. On the other hand, I think that it wouldn’t have made much difference, because in hindsight we know that this is an oversaturated market, that it is hard to achieve a breakthrough, and that this is not a problem that the majority of people who use dating sites have.

Also, you can tell from the design that I am not the best at it. My brain is trained to care much more about the functional properties of things than about design (ease of use, understanding, attracting users). I still need to improve a lot on this. That’s why I am thinking that for the next attempt I will create a prototype which does not have a lot of user interface elements and which is mostly a service that provides value to its users.

Facebook woes

It has never passed the “reviewing” state

At first I wanted to make my campaign on Facebook but funnily enough, they do not even accept advertisement campaigns which have anything to do with dating. This is most certainly related to my points before – it is hard to make a good-enough dating site. Even a prototype.

Also, Facebook’s advertisement campaigns are a bit of a pain in the ass since you have to create an associated page in their system with the ad – probably because people can see which page has released that by clicking on the burger menu.

After all of this, I decided to go with Google Ads.

How did it go

The data of visitors to date4gamers over the advertisement campaign’s length

I spent 20 euros on this advertisement campaign and got around 630 users, as you can see in this picture. Only 2 users signed up to the mailing list, which means that I got a minuscule conversion rate of about 0.3%.
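For completeness, here is the arithmetic behind those numbers as a tiny sketch. The inputs are the figures from this campaign; the function and variable names are just ones I picked for the example.

```python
def conversion_rate(signups, visitors):
    """Share of visitors who signed up."""
    return signups / visitors


spend_eur = 20
visitors = 630
signups = 2

rate_percent = round(conversion_rate(signups, visitors) * 100, 2)
cost_per_visitor_eur = round(spend_eur / visitors, 3)
cost_per_signup_eur = spend_eur / signups
```

Looking at the cost per signup rather than only the conversion rate makes it easier to compare campaigns with different budgets.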

This indeed spells out a very negative response to the landing page and the whole idea. However, perhaps my campaign was not as effective since it seems like the majority of people came from countries where English is not an official language.

Locations of people who saw/clicked on my ad over a period of a few days

Funnily enough, the people who registered for the mailing list are from India and Saudi Arabia. Perhaps this can be associated with the state of society in those countries, i.e. the repression of women’s rights and the general gender imbalance there? I don’t actually know, but I think that from this data alone we can tell that the market for this kind of thing is simply not big enough.

Conclusion

Any kind of product idea that you might have when presented to others should immediately attract an immense amount of potential clients. If not, then it’s most likely not worth doing like in my case. Also, ideally you would have some kind of prototype to show to users so that you could attract them even more. A picture is worth a thousand words but a working prototype (the MVP, maybe even) is worth a thousand pictures because it allows the users to get a feel of it and make their own opinion about it.

Certain types of ideas, such as dating sites, are very risky because they are associated with scammers who use sex as bait to get others to visit their website and hand over their bank account details.

Ahead of time, it might be hard to tell where your potential customers are if you are targeting a wide audience. That is why ideally you should work with concrete people who have specific problems that you can try to solve.

And I will try to soak up all of these lessons for the next side project attempt that I am going to do in the near future, as should you.