 _______  _______  _______
(  ____ \(  ___  )(       )
| (    \/| (   ) || () () |   SIMILAR
| (_____ | (___) || || || |
(_____  )|  ___  || |(_)| |   ALERTS
      ) || (   ) || |   | |
/\____) || )   ( || )   ( |   MANAGER
\_______)|/     \||/     \|
(sorry, I do not have a professional designer at my disposal)
Why?
At the moment, Prometheus only supports a rudimentary way to look up what alerts have been firing in the past, and there is no way to “persist” that information: there is a synthetic metric called ALERTS which shows what alerts were firing, but that is not enough to tell globally what alerts have been firing after Prometheus has restarted. You could use a full-fledged solution like Thanos to solve this problem, however that is very cumbersome if you only want this small feature.
Furthermore, it is hard to tell how different alerts are related to each other. Right now, alerts are differentiated by their label sets. Yes, we could look at a graph of different ALERTS values, but it would require a lot of squinting and deducing to see how different alerts are related: some of them might have fired repeatedly in the past, so at different points in time they could be related to different alerts, and so on. Plus, if you are using something like Thanos and different Prometheus instances have the same alert rules, then you might have almost identical metrics in that graph, which makes things even harder.
Thus, something was needed which would look at those historical alerts, persist the data, and aggregate it so that it would be easy to look up similar alerts. This is where the similar alerts manager comes in.
What?
Similar Alerts Manager, or SAM for short, is a daemon which sits in the background, periodically retrieves new alert information, parses it, saves it into a cache, and provides an HTTP API which lets users access this information.
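To make the moving parts a bit more concrete, here is a minimal sketch in Go of what such a daemon loop could look like. Everything in it (the fetchAlerts stub, the in-memory cache, the /alerts endpoint) is illustrative and does not necessarily match SAM’s actual code:

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
	"time"
)

// Alert is a simplified, illustrative representation of one alert.
type Alert struct {
	Labels map[string]string `json:"labels"`
	Status string            `json:"status"`
}

var (
	mu    sync.RWMutex
	cache []Alert // illustrative in-memory cache guarded by a mutex
)

// fetchAlerts stands in for the real retrieval from Elasticsearch.
func fetchAlerts() ([]Alert, error) {
	// ... query Elasticsearch and parse the hits ...
	return nil, nil
}

func main() {
	// Periodically retrieve and cache new alert information.
	go func() {
		for range time.Tick(30 * time.Second) {
			alerts, err := fetchAlerts()
			if err != nil {
				log.Printf("failed to fetch alerts: %v", err)
				continue
			}
			mu.Lock()
			cache = append(cache, alerts...)
			mu.Unlock()
		}
	}()

	// Expose the cached information over HTTP.
	http.HandleFunc("/alerts", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		_ = json.NewEncoder(w).Encode(cache)
	})
	log.Fatal(http.ListenAndServe(":9888", nil))
}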
It is a side project and thus not very polished; however, it does its job. It sits somewhere between having nothing in this regard and using a fully fledged solution provided by a start-up like SignifAI.
New alert information is retrieved from an Elasticsearch cluster that is specified by the user. Alerts are added there via alertmanager2es, which hooks into AlertManager. We could retrieve alerts directly from AlertManager by implementing an HTTP server ourselves, but I feel that pushing everything to an Elasticsearch cluster makes the data easier to discover through other means, such as a Kibana instance.
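Just to illustrate the retrieval side, a search against the index that alertmanager2es writes to could look roughly like the following sketch; the index pattern alertmanager-* and the @timestamp field are assumptions for the example, not necessarily what SAM actually queries:

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Ask Elasticsearch for documents indexed in the last five minutes.
	// The index pattern and timestamp field below are assumptions; check
	// what alertmanager2es writes in your setup.
	query := `{
	  "query": {"range": {"@timestamp": {"gt": "now-5m"}}},
	  "sort": [{"@timestamp": {"order": "asc"}}]
	}`
	resp, err := http.Post(
		"http://127.0.0.1:9200/alertmanager-*/_search",
		"application/json",
		strings.NewReader(query),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // raw JSON hits, to be parsed into alerts
}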
Related alerts are found by always keeping a stack of hashes of the label sets of the firing alerts: whenever a new alert comes in and it is firing, all of the currently firing alerts are related to it.
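A minimal sketch of that idea, with the hashing and bookkeeping simplified for illustration (a plain set is used below instead of a real stack), could look like this:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels produces a deterministic hash of a label set by hashing
// the sorted key=value pairs.
func hashLabels(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k + "=" + labels[k] + ";"))
	}
	return h.Sum64()
}

func main() {
	firing := map[uint64]bool{}      // hashes of currently firing alerts
	related := map[uint64][]uint64{} // new alert hash -> alerts firing at that moment

	observe := func(labels map[string]string, isFiring bool) {
		h := hashLabels(labels)
		if !isFiring {
			delete(firing, h)
			return
		}
		for other := range firing {
			if other != h {
				related[h] = append(related[h], other)
			}
		}
		firing[h] = true
	}

	observe(map[string]string{"alertname": "HighLatency"}, true)
	observe(map[string]string{"alertname": "DiskFull"}, true)
	fmt.Println(related) // DiskFull ends up related to HighLatency
}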
For the persistence layer, I have chosen to try out Redis, which I had never used before. It is indeed very elegant, and the command model fits the use-case very nicely. However, in the current version only a single string with all of the information is saved into Redis; in the future it might be reworked so that proper hash-map commands are used instead of doing it the brain-dead way it is done right now.
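As a rough illustration of the difference between the two approaches, here is a sketch using the go-redis client purely as an example; the key names are made up and are not SAM’s actual layout:

package main

import (
	"context"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:3333"})

	// Current, "brain-dead" approach: one big string holding all state.
	blob := `{"alerts": [...]}` // imagine the whole state serialized as JSON
	if err := rdb.Set(ctx, "sam:state", blob, 0).Err(); err != nil {
		panic(err)
	}

	// Possible future approach: a hash per alert, so individual fields
	// can be read and updated without deserializing everything.
	err := rdb.HSet(ctx, "sam:alert:1a2b3c",
		"labels", `{"alertname":"HighLatency"}`,
		"related", "4d5e6f,7a8b9c",
	).Err()
	if err != nil {
		panic(err)
	}
}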
I have chosen the Go programming language for this project as I feel that it makes it easy to produce understandable, concurrent code, and it is easy to become productive with it. Maybe for the next projects I will choose something else like Elixir or Rust to try new things 🙂 I know that Go is controversial in some regards, such as error handling and generics, but I still like it as it feels like a spiritual successor to the C programming language.
Here is what SAM’s architecture looks like:
Lessons learned
Honestly, the first lesson I have learned is that for high-level languages such as Go there are a lot of frameworks and libraries that you should really try out instead of reinventing the wheel. It is really amazing how much good free software is out there, and you should not be afraid to reuse other people’s work in your side projects.
Secondly, Go’s visibility rules are sometimes a pain. For example, I want to have a separate package which is responsible for the cache-related functions. However, because it needs to read all of the state’s members, I cannot make them private. We could write a custom marshaler for that type, but that is painful, so the proper fix here is to move the saving of alerts’ information into a completely separate part and put it into a hash-map in Redis. That way, we can hide the alert information and reduce RAM usage because we would not have to store everything.
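A toy example of that visibility problem, with made-up type and field names, shows why the fields end up exported: a struct whose fields are unexported cannot be marshaled from another package (encoding/json here), so it serializes to an empty object.

package main

import (
	"encoding/json"
	"fmt"
)

// With unexported fields, encoding/json (a different package) cannot see
// them, so the struct marshals to "{}".
type hiddenAlert struct {
	labelsHash string
	related    []string
}

// Exporting the fields makes cross-package marshaling work, at the cost
// of exposing them to every other package too.
type Alert struct {
	LabelsHash string   `json:"labelsHash"`
	Related    []string `json:"related"`
}

func main() {
	h := hiddenAlert{labelsHash: "1a2b", related: []string{"3c4d"}}
	a := Alert{LabelsHash: "1a2b", Related: []string{"3c4d"}}

	hb, _ := json.Marshal(h)
	ab, _ := json.Marshal(a)
	fmt.Println(string(hb)) // {}
	fmt.Println(string(ab)) // {"labelsHash":"1a2b","related":["3c4d"]}
}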
Thirdly, Travis is a pretty cool tool. I feel like it sits somewhere between Drone CI and a full-blown system that lets you do custom things, like Jenkins or TeamCity. Travis already covers most use-cases by providing ready-made templates for projects in different programming languages. For example, for Go you may only need to specify language: go and that will be it; everything else will be handled for you.
Lastly, Elasticsearch can be tricky sometimes because it lets the user specify the refresh interval, i.e. the time between updates of the “view” of the data. This means that you can, theoretically, push new data and not see it in Kibana immediately. This is controlled by the option refresh_interval. Thus, the lesson is that Elasticsearch is quite advanced software, and there might be knobs that you never even thought existed in the first place.
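For example, the refresh interval of an index can be tweaked through the index settings API; here is a small sketch of doing that from Go, where the index name is just a placeholder:

package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// "alertmanager-2019.01" is a placeholder index name.
	body := `{"index": {"refresh_interval": "5s"}}`
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:9200/alertmanager-2019.01/_settings",
		strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 200 OK if the index exists
}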
What now?
Even though SAM works right now, it needs a lot of polish and improvements. How much attention it gets will, of course, depend on whether other people find it useful. I will fix the issues mentioned above soon.
Besides that, SAM is already usable and you can try it out. Just grab it from Docker Hub by running these commands:
docker pull stag1e/sam
docker run --rm -it -p 9888:9888 stag1e/sam --elasticsearch 'http://127.0.0.1:1234' --redis '127.0.0.1:3333'
You must specify the addresses of Redis and Elasticsearch with these options:
- -l / --elasticsearch: the URL of the Elasticsearch server
- -r / --redis: IP and port pair of the Redis server
You can use docker-compose to automatically prepare a simple deployment for testing purposes:
cd docker/
docker-compose -f docker-compose-dev.yml up -d
That will set up the following things in a simple configuration so that you can try out SAM:
- ElasticSearch
- Kibana
- AlertManager
- Alertmanager2es
- Redis
Afterwards, add the index template by running ./scripts/add_es_template.sh and then you can run SAM. For brevity, I will not repeat here the instructions that are already available in the repository.
As always, pull requests and bug reports are welcome! Thank you for reading and happy hacking!