It is almost a mantra in the quality assurance world: whenever you find a new error in your software product, add a test case that checks for exactly that error to your unit, integration, or other tests. Over time I have noticed that the same principle should be applied to monitoring.
Consider adding new alerting rules whenever you run into anomalous behavior and can see, in hindsight, which metrics indicated it. Treat these rules as tests, but “real-time” ones, which provide not just more confidence in your software but also more transparency. After all, tests do not show that there are no bugs in your program, only that the use cases you have defined work:
After a while (and I am talking about only a few months of evolution) you will have a collection of tests which do, in fact, prove the absence of many bugs. Not all of course, but any relevant ones will be covered. — DonWells
It would be nice if we could add all of those alerting rules at the beginning, but unfortunately that is not possible: the number of possible relationships between metrics grows exponentially, just like the number of states your program might be in. However, not all is lost, because we can at least add “simple” alerting rules which clearly show that something is going wrong, e.g. the number of responses with HTTP code 500 has increased over the last 5 minutes.
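To make the HTTP 500 example concrete, here is a minimal Python sketch of such a “simple” rule. It is only an illustration under some assumptions of mine: the samples come from a cumulative counter of 500 responses, they are sorted by timestamp, the counter never resets, and the names `increase`, `http_500_alert` and the threshold of 10 are made up for this post rather than taken from any monitoring system.

```python
from typing import List, Sequence, Tuple

# One sample: (unix timestamp in seconds, cumulative count of HTTP 500 responses).
Sample = Tuple[float, float]


def increase(samples: Sequence[Sample], now: float, window: float = 300.0) -> float:
    """Return how much the counter grew inside [now - window, now].

    Assumes samples are sorted by timestamp and the counter never resets.
    """
    in_window: List[float] = [
        value for ts, value in samples if now - window <= ts <= now
    ]
    if len(in_window) < 2:
        # Not enough data points to say anything about growth.
        return 0.0
    return in_window[-1] - in_window[0]


def http_500_alert(
    samples: Sequence[Sample],
    now: float,
    threshold: float = 10.0,  # hypothetical threshold, tune per service
    window: float = 300.0,    # 5 minutes
) -> bool:
    """Fire when the counter grew by more than `threshold` within the window."""
    return increase(samples, now, window) > threshold


noisy = [(0.0, 100.0), (60.0, 101.0), (120.0, 103.0), (240.0, 120.0), (300.0, 125.0)]
quiet = [(0.0, 100.0), (300.0, 102.0)]

noisy_fires = http_500_alert(noisy, now=300.0)
quiet_fires = http_500_alert(quiet, now=300.0)
```

With the sample data above, the first series grew by 25 errors in five minutes and would fire, while the second grew by only 2 and would stay silent. A real monitoring system would also have to handle counter resets and missing scrapes, which this sketch deliberately ignores.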
It seems to me that in the future we might get alerting rules of that kind bundled together with the software. One caveat is that the syntax of alerting rules is different for almost every monitoring system. Hopefully, something like the OpenMetrics project will be able to change the status quo. After all, metrics are just floating-point values with timestamps and labels (a hash map of string pairs) attached to them. It should not be hard to invent a new, platform-agnostic syntax for these things.
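To show how small that data model really is, here is a rough Python sketch of a metric sample as described above: a float, a timestamp, and a map of string labels. The class name `MetricSample`, the separate `name` field, and the `matches` helper are my own assumptions for illustration; they are not taken from OpenMetrics or any particular monitoring system.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MetricSample:
    """A single observation: a float value with a timestamp and string labels."""

    name: str         # assumption: the metric name is kept as its own field
    value: float      # the measured value
    timestamp: float  # unix time in seconds
    labels: Dict[str, str] = field(default_factory=dict)


def matches(sample: MetricSample, selector: Dict[str, str]) -> bool:
    """Return True if every key/value pair in `selector` is present in the labels."""
    return all(sample.labels.get(key) == value for key, value in selector.items())


sample = MetricSample(
    name="http_responses_total",
    value=42.0,
    timestamp=1_700_000_000.0,
    labels={"code": "500", "method": "GET"},
)

is_server_error = matches(sample, {"code": "500"})
is_ok_response = matches(sample, {"code": "200"})
```

Any platform-agnostic rule syntax would, in the end, boil down to selecting samples by labels like this and then comparing their values against some condition.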
It looks like that is already happening bit by bit. Developers and users are starting to use things like jsonnet to automate the generation of alerting rules and dashboards. For example, there is an initiative called “mix-ins”, which are essentially small archived packages of jsonnet scripts and their supporting libraries, so that you can install a “mix-in” for your software. It is still in beta, but the future is looking bright.