Apply The Same Principle to Monitoring Alerts as to Software Bugs

It is almost a mantra in the quality assurance world: whenever you find a new error in your software product, add a test case to your unit, integration, or other test suites that checks for exactly that case. Over time I have noticed that the same principle should be applied to monitoring.

Consider adding a new alerting rule whenever you run into anomalous behavior and can afterwards identify the metrics that indicated it. Treat these rules as tests, only “real-time” ones: they provide not just more confidence in your software but also more transparency. After all, tests do not show that there are no bugs in your program, only that at least the defined use cases work:

After a while (and I am talking about only a few months of evolution) you will have a collection of tests which do, in fact, prove the absence of many bugs. Not all of course, but any relevant ones will be covered. — DonWells

It would be nice if we could add all of those alerting rules at the beginning, but unfortunately that is not possible: the number of connections between different metrics grows exponentially, just like the number of states your program might be in. However, not all is lost, because we can at least add “simple” alerting rules which clearly show that something is wrong, e.g. the number of responses with HTTP code 500 has increased over the last 5 minutes.
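For example, such a rule might look roughly like this in Prometheus’ rule file syntax (the metric name http_requests_total, its code label, and the threshold are assumptions; substitute whatever your application actually exports):

groups:
  - name: availability
    rules:
      - alert: HighServerErrorRate
        # The metric name, label, and threshold below are placeholders.
        expr: sum(rate(http_requests_total{code="500"}[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 500 responses have increased over the last 5 minutes"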

It seems to me that in the future we might get such alerting rules bundled together with the software itself. One caveat is that the syntax of alerting rules differs between almost every monitoring system. Hopefully, something like the OpenMetrics project will be able to change the status quo. After all, metrics are just floating-point values with timestamps and labels (a hash map of string pairs) attached to them. It should not be hard to invent a new, platform-agnostic syntax for these things.
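To make that concrete, here is a minimal sketch in Go (the type and field names are my own invention, not part of any standard) of what a single metric sample boils down to:

package metrics

import "time"

// Sample is a hypothetical, platform-agnostic representation of one
// metric sample: a floating-point value plus a timestamp and labels.
type Sample struct {
	Name      string            // e.g. "http_requests_total"
	Labels    map[string]string // e.g. {"code": "500", "instance": "10.0.0.1:8080"}
	Value     float64
	Timestamp time.Time
}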

It looks like that is already happening bit by bit. Developers and users are starting to use things like jsonnet to automate the generation of alerting rules and dashboards. For example, there is an initiative called “mixins”: essentially small archived packages of jsonnet scripts and their supporting libraries, so that you can install a mixin for your software. It is still in beta, but the future is looking bright.

Introducing SAM: Similar Alerts Manager (My Side-Project)

 _______  _______  _______ 
(  ____ \(  ___  )(       )
| (    \/| (   ) || () () |   SIMILAR
| (_____ | (___) || || || |
(_____  )|  ___  || |(_)| |   ALERTS
      ) || (   ) || |   | |
/\____) || )   ( || )   ( |   MANAGER
\_______)|/     \||/     \|

(sorry, I do not have a professional designer at my disposal)

Why?

At the moment, Prometheus only supports a rudimentary way to look up which alerts have fired in the past, and there is no way to “persist” that information: there is a synthetic metric called ALERTS which shows which alerts were firing at a given time, but that is not enough to tell globally which alerts have fired once Prometheus has restarted. You could use a full-fledged solution like Thanos to solve this problem, but that is very cumbersome if you only want this one small feature and…

Furthermore, it is hard to tell how different alerts relate to each other. Right now, alerts are differentiated by their label sets. Yes, we could look at a graph of the different ALERTS series, but it would take a lot of squinting and deducing to see how alerts are related: some of them fire repeatedly, so at different points in time they could be related to different alerts, and so on. Plus, if you are using something like Thanos and different Prometheus instances have the same alerting rules, you might end up with almost identical series in that graph, which makes things even harder.

Thus, something was needed that would look at those historical alerts, persist the data, and aggregate it so that similar alerts are easy to look up. This is where the similar alerts manager comes in.

What?

Similar alerts manager, or SAM for short, is a daemon which sits in the background, periodically retrieves new alert information, parses it, saves it into a cache, and provides an HTTP API which lets users access this information.
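In rough terms, the daemon’s core loop looks like the sketch below. This is a simplified illustration, not the actual code from the repository; the function and type names are made up, and the fetch/persist callbacks stand in for the ElasticSearch and Redis pieces described further down.

package sam

import (
	"context"
	"log"
	"time"
)

// Alert stands in for the parsed alert data; the real structure is richer.
type Alert struct {
	Labels map[string]string
	Firing bool
}

// runLoop periodically fetches new alert information and persists it.
func runLoop(ctx context.Context, interval time.Duration,
	fetch func(context.Context) ([]Alert, error),
	persist func(context.Context, []Alert) error) {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			alerts, err := fetch(ctx)
			if err != nil {
				log.Printf("fetching alerts: %v", err)
				continue
			}
			if err := persist(ctx, alerts); err != nil {
				log.Printf("persisting alerts: %v", err)
			}
		}
	}
}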

It is a side project and thus not very polished, but it does its job. It sits somewhere between having nothing in this regard and using a fully fledged solution provided by a start-up like SignifAI.

New alert information is retrieved from an ElasticSearch cluster specified by the user; the alerts get there via alertmanager2es, which hooks into AlertManager. We could retrieve alerts directly from AlertManager by implementing an HTTP server ourselves, but I feel that pushing everything into an ElasticSearch cluster makes the data easier to explore through other means, such as a Kibana instance.
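As a rough illustration, fetching recent alert documents boils down to a plain search request against the cluster. The index pattern and the @timestamp field below are assumptions for the sake of the example, not necessarily what alertmanager2es or SAM actually use:

package sam

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// fetchRecentAlerts runs a search query against an ElasticSearch index
// and decodes the raw response. Index and field names are placeholders.
func fetchRecentAlerts(ctx context.Context, esURL string) (map[string]interface{}, error) {
	query := map[string]interface{}{
		"query": map[string]interface{}{
			"range": map[string]interface{}{
				"@timestamp": map[string]interface{}{"gte": "now-5m"},
			},
		},
	}
	body, err := json.Marshal(query)
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		esURL+"/alertmanager-2*/_search", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	var result map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return nil, err
	}
	return result, nil
}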

Related alerts are computed by always keeping a stack of hashes of the label sets of the currently firing alerts; whenever a new alert comes in and it is firing, all of the currently firing alerts are marked as related to it.
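A minimal sketch of that idea (my own simplification, not the actual code from the repository) could look like this:

package sam

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashLabels produces a stable hash of a label set by hashing the
// key=value pairs in sorted key order.
func hashLabels(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, labels[k])
	}
	return h.Sum64()
}

// Tracker keeps the hashes of currently firing alerts and records which
// alerts were firing together.
type Tracker struct {
	firing  map[uint64]struct{} // hashes of currently firing alerts
	related map[uint64][]uint64 // alert hash -> hashes it fired together with
}

func NewTracker() *Tracker {
	return &Tracker{
		firing:  make(map[uint64]struct{}),
		related: make(map[uint64][]uint64),
	}
}

// Observe processes one incoming alert: if it is firing, every alert
// that is currently firing becomes related to it.
func (t *Tracker) Observe(labels map[string]string, firing bool) {
	h := hashLabels(labels)
	if !firing {
		delete(t.firing, h)
		return
	}
	for other := range t.firing {
		if other != h {
			t.related[h] = append(t.related[h], other)
		}
	}
	t.firing[h] = struct{}{}
}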

For the persistence layer I have chosen to try out Redis, which I had never used before. It is indeed very elegant, and its command model fits the use-case very nicely. However, in the current version only a single string containing all of the information is saved into Redis; in the future it might be reworked so that proper hash-map commands are used instead of doing it the brain-dead way it is done right now.
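To illustrate the difference, here is a sketch using the go-redis client (just one possible library, and the key names are made up; I am not claiming this is exactly how SAM does it):

package sam

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/go-redis/redis/v8"
)

// persistAsString mirrors the current approach: everything is marshalled
// into one JSON blob and stored under a single key.
func persistAsString(ctx context.Context, rdb *redis.Client, related map[uint64][]uint64) error {
	blob, err := json.Marshal(related)
	if err != nil {
		return err
	}
	return rdb.Set(ctx, "sam:related", blob, 0).Err()
}

// persistAsHash sketches the hash-map direction: one field per alert, so
// individual entries can be read or updated without touching the rest.
func persistAsHash(ctx context.Context, rdb *redis.Client, related map[uint64][]uint64) error {
	for alert, others := range related {
		val, err := json.Marshal(others)
		if err != nil {
			return err
		}
		if err := rdb.HSet(ctx, "sam:related", fmt.Sprint(alert), val).Err(); err != nil {
			return err
		}
	}
	return nil
}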

I chose the Go programming language for this project because I feel it makes it easy to produce understandable, concurrent code, and it is easy to become productive with it. Maybe for the next project I will choose something else like Elixir or Rust to try new things 🙂 I know that Go is controversial in some regards, such as error handling and generics, but I still like it; it feels like a spiritual successor to the C programming language.

Here is what SAM’s architecture looks like:

(SAM architecture diagram)

Lessons learned

Honestly, the first lesson I learned is that for high-level languages such as Go there are a lot of libraries and frameworks you should really try out instead of reinventing the wheel. It is amazing how much good free software is out there, and you should not be afraid to reuse it in your side projects.

Secondly, Go’s visibility rules are sometimes a pain. For example, I want a separate package that is responsible for the cache-related functions, but because it needs to read all of the state’s members, I cannot make them unexported. We could write a custom marshaler for that type, but that is painful, so the proper fix is to move the saving of alert information into a completely separate component and store it in a hash map in Redis. That way we can keep the alert information hidden and reduce RAM usage, because we would not have to hold everything in memory.
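For context, the custom-marshaler workaround I mean is roughly this (a hypothetical type, just to illustrate why it is painful to maintain):

package state

import "encoding/json"

// state keeps its fields unexported, so other packages, such as a cache
// layer, cannot read them directly.
type state struct {
	relatedAlerts map[string][]string
	lastUpdated   int64
}

// MarshalJSON exposes the unexported fields explicitly; it works, but it
// has to be kept in sync with the struct by hand.
func (s state) MarshalJSON() ([]byte, error) {
	return json.Marshal(struct {
		RelatedAlerts map[string][]string `json:"related_alerts"`
		LastUpdated   int64               `json:"last_updated"`
	}{
		RelatedAlerts: s.relatedAlerts,
		LastUpdated:   s.lastUpdated,
	})
}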

Thirdly, Travis CI is a pretty cool tool. I feel it sits somewhere between Drone CI and a full-blown, endlessly customizable system like Jenkins or TeamCity. Travis already covers most use-cases with ready-made templates for projects in different programming languages. For Go, for example, you may only need to specify language: go and that is it; everything else is handled for you.

Lastly, ElasticSearch can be tricky sometimes because it lets the user specify the refresh interval, i.e. how often the searchable “view” of the data is updated. This means that you can, in principle, push new data and not see it in Kibana immediately. It is controlled by the index setting refresh_interval. The lesson is that ElasticSearch is quite advanced software, and there may be knobs you never even suspected existed.
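If you ever need to tune it, it is a per-index setting; something along these lines sets it explicitly (the index pattern here is just a placeholder):

curl -X PUT -H 'Content-Type: application/json' \
  'http://127.0.0.1:9200/alertmanager-2*/_settings' \
  -d '{"index": {"refresh_interval": "5s"}}'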

What now?

SAM works right now, but it needs a lot of polish and improvements. How much attention it gets will, of course, depend on how many other people find it useful. I will fix the issues mentioned above soon.

Besides that, SAM is already usable and you can try it out. Just grab it from the Docker Hub by running these commands:

docker pull stag1e/sam
docker run --rm -it -p 9888:9888 stag1e/sam --elasticsearch 'http://127.0.0.1:1234' --redis '127.0.0.1:3333'

You must specify the addresses of Redis and ElasticSearch with these options:

  • -l / --elasticsearch: the URL of the ElasticSearch server
  • -r / --redis: the IP and port pair of the Redis server

You can use docker-compose to automatically prepare a simple deployment for testing purposes:

cd docker/
docker-compose -f docker-compose-dev.yml up -d

That will set up the following things in a simple configuration so that you can try out SAM:

  • ElasticSearch
  • Kibana
  • AlertManager
  • Alertmanager2es
  • Redis

Afterwards, add the index template by running ./scripts/add_es_template.sh, and then you can run SAM. For brevity, I will not repeat here the instructions that are already available in the repository itself.

As always, pull requests and bug reports are welcome! Thank you for reading and happy hacking!