Introducing SAM: Similar Alerts Manager (My Side-Project)

 _______  _______  _______ 
(  ____ \(  ___  )(       )
| (    \/| (   ) || () () |   SIMILAR
| (_____ | (___) || || || |
(_____  )|  ___  || |(_)| |   ALERTS
      ) || (   ) || |   | |
/\____) || )   ( || )   ( |   MANAGER
\_______)|/     \||/     \|

(sorry, I do not have a professional designer at my disposal)

Why?

At the moment, Prometheus only supports a rudimentary way to look up which alerts have fired in the past, and there is no way to “persist” that information: there is a synthetic metric called ALERTS which shows which alerts were firing at any given time. That is not enough to tell globally which alerts have fired once Prometheus has restarted. You could use a full-fledged solution like Thanos to solve this problem, but that is very cumbersome if you only want this one small feature and…

Furthermore, it is hard to tell how different alerts are related to each other. Right now, alerts are differentiated by their label sets. Yes, we could look at a graph of the different ALERTS values, but it would take a lot of squinting and deducing to see how the alerts are related: some of them may have fired repeatedly, so at different points in time they could be related to different alerts, and so on. Plus, if you are using something like Thanos and different Prometheus instances have the same alert rules, you will end up with almost identical metrics in that graph, which makes things even harder.

Thus, something was needed that would look at those historical alerts, persist the data, and aggregate it so that it would be easy to look up similar alerts. This is where the similar alerts manager comes in.

What?

Similar alerts manager, or SAM in short, is a daemon which sits in the background, periodically retrieves new alert information, parses it, saves it into a cache, and provides an HTTP API which lets users access this information.

It is a side project and thus it is not very polished, but it does its job. It also sits somewhere between having nothing in this regard and using a full-fledged solution provided by a start-up like SignifAI.

New alert information is retrieved from an ElasticSearch cluster that is specified by the user. New alerts are added there via alertmanager2es, which hooks into AlertManager. We could retrieve alerts directly from AlertManager by implementing an HTTP server ourselves, but I feel that pushing everything to an ElasticSearch cluster makes the data easier to discover through other means, such as a Kibana instance.

Related alerts are tracked by always keeping a stack of hashes of the label sets of the firing alerts: whenever a new firing alert comes in, all of the currently firing alerts are marked as related to it.
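
To make the idea more concrete, here is a minimal sketch of how such tracking could look in Go. It is not SAM’s actual code: the type and function names are made up, and a set is used instead of a literal stack.

package tracker

// LabelSetHash is the hash of an alert's label set.
type LabelSetHash uint64

// Tracker keeps the label-set hashes of currently firing alerts and
// remembers which alerts were firing together.
type Tracker struct {
    firing  map[LabelSetHash]struct{}
    related map[LabelSetHash][]LabelSetHash
}

func New() *Tracker {
    return &Tracker{
        firing:  make(map[LabelSetHash]struct{}),
        related: make(map[LabelSetHash][]LabelSetHash),
    }
}

// Observe records a new firing alert and relates every currently firing
// alert to it.
func (t *Tracker) Observe(h LabelSetHash) {
    for other := range t.firing {
        if other != h {
            t.related[h] = append(t.related[h], other)
        }
    }
    t.firing[h] = struct{}{}
}

// Resolve removes an alert from the firing set once it stops firing.
func (t *Tracker) Resolve(h LabelSetHash) {
    delete(t.firing, h)
}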

For the persistence layer, I chose to try out Redis, which I had never used before. It is indeed very elegant and the command model fits the use-case very nicely. However, in the current version only a single string with all of the information is saved into Redis; in the future, it might be reworked to use proper hash-map commands instead of doing it the brain-dead way it is done right now.
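
For illustration only, here is roughly what the two approaches look like with the go-redis client (the context-aware v8 API is assumed; the key names are made up and are not SAM’s actual schema):

package main

import (
    "context"
    "encoding/json"
    "log"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})

    related := map[string][]string{"alert-hash-1": {"alert-hash-2"}}
    payload, _ := json.Marshal(related)

    // Current, brain-dead approach: the whole state serialized into one string.
    if err := rdb.Set(ctx, "sam:state", payload, 0).Err(); err != nil {
        log.Fatal(err)
    }

    // Possible future approach: one hash field per alert, so a single alert
    // can be read or updated without deserializing everything else.
    if err := rdb.HSet(ctx, "sam:related", "alert-hash-1", "alert-hash-2").Err(); err != nil {
        log.Fatal(err)
    }
}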

I chose the Go programming language for this project as I feel that it is easy to produce understandable, concurrent code with it and that it is easy to become productive in it. Maybe for the next projects I will choose something else like Elixir or Rust to try new things 🙂 I know Go is controversial in some regards, such as error handling and generics, but I still like it as it feels like a spiritual successor to the C programming language.

Here is what SAM’s architecture looks like:

(SAM architecture diagram)

Lessons learned

Honestly, the first lesson that I learned is that for high-level languages such as Go there are a lot of frameworks and libraries that you should really try out instead of reinventing the wheel. It is amazing how much good free software is out there, and you should not be afraid to reuse it in your side projects.

Secondly, Go’s visibility rules are sometimes a pain. For example, I want to have a separate package which is responsible for the cache-related functions. However, because it needs to read all of the state’s members, I cannot make them private. We could write a custom marshaler for that type, but that is painful, so the proper fix is to move the saving of the alerts’ information into a completely separate part and put it into a hash-map in Redis. That way, we can hide the alert information and reduce RAM usage because we would not have to store everything.
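
A tiny, hypothetical illustration of the problem (these are not SAM’s actual types):

package state

// State holds what the daemon knows about alerts. To let a separate cache
// package (or encoding/json) read these fields, they have to be exported,
// even though nothing outside should really touch them; unexported fields
// would simply be skipped when marshaling from another package.
type State struct {
    FiringAlerts  []uint64
    RelatedAlerts map[uint64][]uint64
}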

Thirdly, Travis CI is a pretty cool tool. I feel like it sits somewhere between Drone CI and a full-blown system that lets you do custom things, like Jenkins or TeamCity. Travis already covers most of the use-cases with ready-made templates for projects in different programming languages. For example, for Go you may only need to specify language: go and that will be it; everything else is handled for you.

Lastly, ElasticSearch can be tricky sometimes because it lets the user specify the refresh interval, i.e. the time between updates of the “view” of the data. This means that you can, theoretically, push new data and not see it in Kibana immediately. This is controlled by the refresh_interval option. The lesson is that ElasticSearch is quite advanced software, and sometimes there are knobs you never even suspected existed.
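
If you run into this, the refresh interval can be changed per index through the _settings endpoint. Here is a minimal sketch with plain net/http; the index name is made up, and 1s is the ElasticSearch default:

package main

import (
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Lower the refresh interval of a (made-up) index so that newly pushed
    // alerts become searchable sooner, at the cost of some indexing throughput.
    body := strings.NewReader(`{"index": {"refresh_interval": "1s"}}`)
    req, err := http.NewRequest(http.MethodPut,
        "http://127.0.0.1:9200/alertmanager-2018.10/_settings", body)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}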

What now?

Even though SAM works right now, it needs a lot of polish and improvements. How much attention it gets will, of course, depend on whether other people find it useful. I will fix the issues mentioned above soon.

Besides that, SAM is already usable and you can try it out. Just grab it from Docker Hub by running these commands:

docker pull stag1e/sam
docker run --rm -it -p 9888:9888 stag1e/sam --elasticsearch 'http://127.0.0.1:1234' --redis '127.0.0.1:3333'

You must specify the addresses of Redis and ElasticSearch with these options:

  • -l / --elasticsearch: the URL to the ElasticSearch server
  • -r / --redis: IP and port pair of the Redis server

You can use docker-compose to automatically prepare a simple deployment for testing purposes:

cd docker/
docker-compose -f docker-compose-dev.yml up -d

That will set up the following things in a simple configuration:

  • ElasticSearch
  • Kibana
  • AlertManager
  • Alertmanager2es
  • Redis

With that, you will be able to try out SAM. Afterwards, add the index template by running ./scripts/add_es_template.sh and then you can run SAM. For brevity, I will not repeat here the instructions that are already available in the repository itself.

As always, pull requests and bug reports are welcome! Thank you for reading and happy hacking!

Capacity planning of Prometheus 2.4.2 + Thanos Sidecar 0.1.0

As you can see, capacity planning is at the top of the Ops section.

Intro

Having a monitoring infrastructure is one of the tenets of the DevOps world. And it seems that Prometheus, together with all of its ecosystem integrations such as Thanos or Uber’s M3DB, is slowly taking over that world. They give you a lot of improvements over the old Graphite set-ups, such as increased reliability: Prometheus talks HTTP over TCP and uses a pull-based model, so you explicitly know when something goes wrong instead of silently losing UDP packets, and so on.

Thus, with all of the new users adopting it, a question begs to be asked: how do we properly plan our system’s capacity so that it can withstand all of the possible queries? This is, in general, very important in the DevOps world.

I learned a lot about doing that with Prometheus and Thanos Sidecar over the past few months so let me share with you my insights. Hopefully, you will find this useful.

All of the following thoughts are based on Prometheus 2.4.2 and Thanos 0.1.0. Things might, and most likely will, change in future versions.

I will not mention the things that matter less in practice, such as disk throughput. The majority of modern hardware should be good enough for Prometheus deployments of any size.

Most important components for Prometheus + Thanos Sidecar

CPU

The CPU usage is deeply impacted by the actual content of the PromQL queries that are being executed. To be more exact, what matters is which aggregation operators or math functions (if any) you are using in the queries. Obviously, functions such as time() do not cost a lot since they only return the Unix timestamp of the current time.

Naturally, it follows that functions which take a range vector use more CPU time than those which take an instant vector, as you (usually) have to iterate over more than one value. That said, even functions that take instant vectors can be costly, such as sort(). Also, because it is hard to intercept PromQL queries by their content as they come in, it does not make sense to talk much about the individual functions, but things like holt_winters() and quantile_over_time() indeed take the most time.

This ties into another piece of Prometheus functionality: alerting rules. They are periodically evaluated against the data and transition from pending to firing. If you do a lot of number crunching in different expressions, you can move the common part into what is called a “recording rule”. Then you can use the resulting metric name in all of the expressions where you need that value, which avoids recalculating the same thing over and over.

As for Thanos Sidecar, this does not impact it whatsoever since it only passes the (deduplicated) time series data on to Thanos Query, which actually evaluates the query and computes the final results.

This, in practice, means that for a separate Prometheus + Thanos Sidecar deployment, it does not matter at all what actual queries the user is sending besides the number of time series that they return.

RAM

Roughly, the RAM usage is equivalent to the sum of:

  • the resident TSDB size on disk, because Prometheus internally uses mmap(2), so all of the data that is on disk may end up allocated in memory too (applies only to Prometheus). Note that this also covers the ingestion rate of new metrics, because all new metrics data is added to the WAL, which works on the same principle as the rest of the TSDB;
  • the buffers used for the responses and requests (applies to both Thanos Sidecar and Prometheus).

The first item is fairly static whereas the second one is more dynamic. This is where the Prometheus options for limiting the number of concurrent requests, and the number of samples they may return through the different interfaces, come into play. Consider lowering them further if you have a smaller machine, after you stress test it and look at the results. The most important options are:

  • --storage.remote.read-concurrent-limit
  • --storage.remote.read-sample-limit
  • --query.max-concurrency
  • --query.max-samples

Stress testing existing Prometheus set-ups

Tools such as prombench and thanosbench already exist which benchmark Prometheus and Thanos respectively. However, they are written with the assumption that the user has access to some kind of Kubernetes cluster where everything will be performed. Unfortunately, that is not always the case.

To stress test existing setups, you might use features such as the Consul service discovery support in Prometheus. Then you push some fixed amount of data into it and query it through Thanos Sidecar. That way, all of the parts of the system will be tested: the remote read interface, ingestion, and so on.

It might look like this:

This is a simplified view, obviously. Consider using Consul service discovery or some other mechanism to discover the machine that is running the tests. It will make your life much easier.
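
The querying side of such a test can stay very simple. Here is a sketch of hitting the query_range endpoint (exposed by both Prometheus and Thanos Query) with plain net/http; the address, metric name, and step are made up for the example:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strconv"
    "time"
)

func main() {
    end := time.Now()
    start := end.Add(-15 * time.Minute)

    // Query the synthetic test data over a fixed range.
    params := url.Values{}
    params.Set("query", "stress_test_metric")
    params.Set("start", strconv.FormatInt(start.Unix(), 10))
    params.Set("end", strconv.FormatInt(end.Unix(), 10))
    params.Set("step", "15s")

    resp, err := http.Get("http://127.0.0.1:10902/api/v1/query_range?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    data, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, len(data), "bytes")
}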

Re-use pushgateways

The race conditions between the scrapes and the HTTP server are inherent, so re-use the Pushgateway to make life easier. Yes, starting your own HTTP server reduces the likelihood of a race, but it is still possible, so it does not solve the problem completely. On the other hand, it is more efficient because you only have one copy of the data on the wire, but it makes your code more cumbersome, so decide for yourself. I recommend starting with a Pushgateway and only switching to running your own HTTP server with the metrics if that becomes unbearable.
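
A sketch of the Pushgateway route using client_golang’s push package (the metric, job name, and address are made up):

package main

import (
    "log"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func main() {
    // One gauge per synthetic time series; the label is made up for the example.
    g := prometheus.NewGaugeVec(
        prometheus.GaugeOpts{Name: "stress_test_metric", Help: "Synthetic test data."},
        []string{"series"},
    )
    g.WithLabelValues("0").Set(42)

    // Push the current values to a Pushgateway; Prometheus then scrapes the
    // Pushgateway on its usual schedule, so our program does not have to keep
    // an HTTP server alive at exactly the right moments.
    if err := push.New("http://127.0.0.1:9091", "stress_test").
        Collector(g).
        Push(); err != nil {
        log.Fatalf("push failed: %v", err)
    }
}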

If you are writing your own stress testing program in Go, consider using the net/http package, because it does the job and you most likely do not need anything fancier.
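
And if you do decide to serve the metrics yourself, net/http plus the promhttp handler is about all that is needed (the port is arbitrary):

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Expose the default registry on /metrics; Prometheus scrapes this directly.
    http.Handle("/metrics", promhttp.Handler())
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}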

Use a sliding window

Use a sliding window to be able to limit the maximum number of samples per query.

As the maximum number of samples returned per query has the biggest impact, you must limit the number of returned samples somehow. After a bit of thinking and testing, it seems to me that a “sliding window” technique generally gives the best results: the code looks clean and is understandable. The method is essentially this:

  • keep pushing metrics from time X;
  • at each point Y, limit the query to the time range from X to Y. If needed, push the beginning of the “time window” forward so that only a fixed number of samples is covered.

This is all made possible by knowing the scrape interval, so add it as a parameter to your program. Knowing it, you can divide the amount of time that has passed since the beginning by the scrape interval and tell how many samples have been pushed for each of the Z time series.
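
A minimal sketch of that arithmetic (names are illustrative):

package main

import (
    "fmt"
    "time"
)

// windowStart returns the earliest timestamp a query should cover so that at
// most maxSamples samples per series fall into the window, assuming one sample
// per scrapeInterval has been pushed since start.
func windowStart(start, now time.Time, scrapeInterval time.Duration, maxSamples int) time.Time {
    pushed := int(now.Sub(start)/scrapeInterval) + 1 // samples per series so far
    if pushed <= maxSamples {
        return start
    }
    return now.Add(-time.Duration(maxSamples-1) * scrapeInterval)
}

func main() {
    start := time.Now().Add(-1 * time.Hour)
    from := windowStart(start, time.Now(), 15*time.Second, 120)
    fmt.Println("query from", from.Format(time.RFC3339), "to now")
}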

Divide the work

If you are feeling extra fancy, or you do not have a single machine with enough threads of execution for all the work, consider dividing the work across several nodes.

For this use-case, you could use libraries such as SCOOP. I do not have any particular recommendations here as it depends on what language you are writing your capacity planning (stress testing) program in.

Check the results

Make it an option whether or not to check the results of all of the queries that your program is making. The program might end up dealing with a lot of data, and it might not make sense to verify everything that is returned, since that can easily eat a lot of CPU time and the quality of your measurements will drop drastically.

Imagine having 20 concurrent queriers, each of them dealing with 500,000 samples: that is a lot of data to verify, especially in bigger organizations. If checking is disabled then, honestly, just verify that you got the required number of time series and the right number of samples, but do not compare the exact values if the user explicitly disabled it.
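
A sketch of what that cheap check could look like once the query_range response has been decoded (the types and names are made up):

package main

import "fmt"

// series is a minimal stand-in for one decoded time series from a
// query_range response.
type series struct {
    Labels  map[string]string
    Samples []float64
}

// checkCounts verifies only the number of time series and the number of
// samples per series instead of comparing every returned value.
func checkCounts(result []series, wantSeries, wantSamples int) error {
    if len(result) != wantSeries {
        return fmt.Errorf("got %d series, want %d", len(result), wantSeries)
    }
    for _, s := range result {
        if len(s.Samples) != wantSamples {
            return fmt.Errorf("series %v: got %d samples, want %d",
                s.Labels, len(s.Samples), wantSamples)
        }
    }
    return nil
}

func main() {
    result := []series{{Samples: make([]float64, 120)}}
    fmt.Println("check result:", checkCounts(result, 1, 120))
}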

Miscellaneous pain points

Thanos Store sample limit

This is a bit unrelated, but the Thanos Store component of the whole Thanos system does not, unfortunately, provide a way to limit the number of samples returned per query. This means that, in practice, a single query requesting a lot of data could knock down your whole system.

For example, imagine someone sending a query like {__name__=~".*"}. Technically, this would retrieve all of the data, since each metric has a special label called __name__ which contains the name of the metric itself, so the regex matches everything. This goes against the whole long-term storage approach, where each query is supposed to touch only a small piece of the data in the backing storage and not the whole database.

Hopefully, this will be remediated in the near future, or maybe I will work on it myself if I find the time.

Remote read API inefficiency

The remote read API, which the Thanos project relies on, is highly inefficient. It is a one-shot API, i.e. the whole response is prepared at once and sent back to the client. This means that you will essentially have a couple of buffers with the same data. At a minimum, you will need three: one for the query results, one for the results in the API response format, and one for that response compressed with snappy.

Luckily, this will change in a future version of Prometheus. A lot of work is being done in this area since the pain is felt by a lot of people. The Thanos folks are helping to drive this process forward too: they even discussed it at the Prometheus developer summit.

As always, happy hacking! Let me know if you have any comments or if you have spotted any issues.