2018 in books

Intro

2018 is coming to a close, so I thought it would be a good idea to once again look back on the books I have read this year and share the ones I liked the most with my readers. I will mark the books I liked the most in bold, like Aaron Swartz did in his lists, and I will expand a bit under the books that I liked the most or that are somewhat controversial.

Perhaps this will inspire someone, or they will recommend similar books to me that I definitely must read. As I have written before, reading (programming) books is a really rewarding hobby since it helps you grow holistically as a person. You can find my 2016 reads here.

2018 list

I have gone through a total of 14 books in 2018 (+4 compared to last year’s count; I guess that is because I spent a lot of 2017 on my education). They add up to 4843 pages in total, which is not bad.

  • “The Pragmatic Programmer” by Andrew Hunt and David Thomas;

A pretty important book in its own regard. It is not opinionated; instead it walks through various lessons that the authors have learned over decades of experience in the software field. It is one of those books that you can read before falling asleep. The exercises are fairly simple, but they make you think about what you just read.

  • “To Kill a Mockingbird” by Harper Lee;

A very important piece on topics such as racism and discrimination; totally immersive writing that makes you feel like you are there and hooks you into reading more and more. I read the Lithuanian version for a change. Atticus Finch is really a hero. My review as posted on Goodreads:

Honestly, 5/5. No comments. No wonder Atticus’ story has inspired thousands of attorneys, and British librarians have rated it above even the Bible as a book to read before you die.

  • “Surely You’re Joking, Mr. Feynman!” by Richard P. Feynman;

Adventures of a curious character, as the book says. Indeed, popular scientists like Feynman can sometimes seem as if they are no longer ordinary human beings, as if they have transcended us. However, books like this give us a glimpse into such a life, and we can see that, after all, we are not so different and that we too can achieve such things. In general, it is a very inspiring book. Some people complain about the egoistic tone in places, but if you look past it, you can definitely find a very captivating story.

  • “Italian short stories for beginners” by Olly Richards;
  • “Murder On The Orient Express” by Agatha Christie;
  • “Algorithms to Live By” by Brian Christian;
  • “Inside the Nudge Unit” by David Halpern;

Essentially it is a book about bringing the scientific method back into governmental decisions. It is very interesting to read about, and before this I never knew that such units even existed. They mostly applied a form of A/B testing to governmental communication to improve its efficacy. And, surprisingly enough, it improved a lot while barely any money was spent on it. It shows that sometimes we do not have to look very deep to find issues that are not hard to fix and that would bring a lot of benefits once fixed.

  • “The Soul of a New Machine” by Tracy Kidder;
  • “Rich Dad, Poor Dad” by Robert T. Kiyosaki;

Essentially this book is about a change of mindset: you should view your money as an asset, not as a liability. I loved this since I, like the majority of people, get stuck in the rut of life where we can only see short-term goals and cannot wait until the next pay-check. This book teaches you how to change that thinking and could be a good starting point for acquiring more and more assets which, in turn, generate more and more money. The main idea is to make money work for you instead of working for money.

  • “Meditations” by Marcus Aurelius;
  • “Dataclysm” by Christian Rudder;
  • “Code Complete” by Steve McConnell;

You can find the top list of things that I have learned from this book here. I wrote that blog post at the beginning of 2018… that is how good the book was 🙂 I feel like this is a gentle step forward after “The Pragmatic Programmer”. Afterward, I would recommend everyone to delve into a book specific to some programming language or paradigm to learn the nitty-gritty details. Or you could read through an algorithms book and do its exercises.

  • “Never Split the Difference” by Chris Voss;

The stories, and the lessons drawn from them, of a former FBI negotiator. You might think you know human psychology, but our real selves show under stressful, life-or-death situations such as those that were a daily encounter for Chris Voss. The book contains a lot of golden advice for negotiation and for day-to-day life in general, because we actually negotiate all the time even if we do not always recognize it. My Goodreads review:

Everything you wanted to know about negotiation. The author writes in a very clear, lucid style. Reminded me of Richard Dawkins’ style of writing. This book will introduce you to the concepts of labelling, mirroring, all kinds of leverage that you could use to your advantage, and of course – the black swans. If you always thought that negotiation is something only people in the FBI and other agencies do – you are wrong. It is worth everyone’s time to pick up this book because negotiation happens all the time in our lives. As the author puts it (I’m paraphrasing here) – conflicts happen every day and you cannot avoid them. So stop thinking of your partner as your adversary and think of them as your counterpart. The adversary is always the problem or idea being discussed. Very awesome book.

  • “The Power of Now” by Eckhart Tolle

A controversial book. It kind of reminds me of Stoicism. The point that nothing exists besides the now appeals to me and I think it has a kernel of truth. Ignore the religious stuff while reading it. My review from Goodreads:

I think that the mindfulness parts and the idea that nothing exists besides what is now have a kernel of truth to them. Practicing such a view of life definitely helps one stay calmer and see things as they are. In my opinion, this is related to the Stoic view of life, which says that the only things you can influence are yourself and your reactions. However, I did not like some of the explanations that involved “God” or “the Lord” himself because, to be frank, they just do not make any logical sense and because “God” is dead. On the other hand, just like the author says: “words only convey some kind of meaning, they themselves are worthless and completely made up” (paraphrasing).

To read in 2019

One of the first books that I want to read in 2019 is “The Design of Everyday Things” by Donald Norman. I feel that it is useful for programmers, and for everyone else involved in building products, to read a bit of literature about usability and user experience. I have seen it recommended in a lot of lists.

Afterward, I will try to read something related to algorithms, like “Introduction to Algorithms” (CLRS). I feel that algorithms and estimating the complexity of our programs are very important, so I will brush up on that.

Finally, I want to read something related to machine learning or artificial intelligence. These fields are very fascinating to me and, honestly, I do not have much knowledge of them. We will see which books I actually choose.

Capacity planning of Prometheus 2.4.2 + Thanos Sidecar 0.1.0

As you can see, capacity planning is at the top of the Ops section.

Intro

Having monitoring infrastructure is one of the tenets of the DevOps world. And it seems that Prometheus, together with adjacent integrations such as Thanos or Uber’s M3DB, is slowly taking over that world. They give you a lot of improvements over the old Graphite set-ups, such as increased reliability: Prometheus talks HTTP, which runs on TCP, so you explicitly know when something goes wrong, unlike with UDP; it uses a pull-based model, so you also know when a target cannot be scraped; and so on.

Thus, with all of the new users adopting it, a question begs to be asked: how do we properly plan our system’s capacity so that it can withstand all of the possible queries? This is, in general, very important in the DevOps world.

I learned a lot about doing that with Prometheus and Thanos Sidecar over the past few months so let me share with you my insights. Hopefully, you will find this useful.

All of the following thoughts are based on Prometheus 2.4.2 and Thanos 0.1.0. This might, and most likely will, change in future versions.

I will not mention things that are less important, such as disk throughput. The majority of modern hardware should be good enough for Prometheus deployments of any size.

Most important components for Prometheus + Thanos Sidecar

CPU

CPU usage is deeply impacted by the actual content of the PromQL queries being executed. To be more exact, what matters is what kind of aggregation operators or math functions (if any) you are using in the queries. Obviously, functions such as time() do not cost a lot since they only return the Unix timestamp of the current time.

Naturally, it follows that functions which take a range vector use more CPU time than those which take an instant vector, since you (usually) have to iterate over more than one value. Even functions that take instant vectors can be costly, such as sort(). Also, because it is hard to intercept PromQL queries by their content as they come in, it does not make sense to talk much about the individual functions, but things like holt_winters() and quantile_over_time() indeed take the most time.
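For illustration, here is a hedged comparison (the metric names are just placeholders, not anything from a real set-up):

    # Cheap: an instant vector selector plus a simple aggregation – only the
    # latest sample of each matching series has to be touched.
    sum(node_cpu_seconds_total)

    # Expensive: a range vector feeding a heavy function – every sample from
    # the last hour of every matching series has to be loaded and processed.
    quantile_over_time(0.99, http_request_duration_seconds[1h])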

Query cost also ties into another piece of Prometheus functionality: alerting rules. They are periodically evaluated against the data, and alerts transition from pending to firing. If you do a lot of number crunching in several different expressions, you can move the common part into what is called a “recording rule”. Then you can use the resulting metric name in all of the expressions that need that value, which avoids recalculating the same thing over and over.
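As a sketch, a recording rule could look roughly like this (the group name, metric names, and interval are made up for illustration):

    groups:
      - name: example-recording-rules
        interval: 1m
        rules:
          # Precompute the per-instance 5xx rate once per evaluation...
          - record: instance:http_requests_5xx:rate5m
            expr: sum by (instance) (rate(http_requests_total{code=~"5.."}[5m]))

Alerting rules and dashboards can then refer to instance:http_requests_5xx:rate5m instead of re-evaluating the rate() expression every time.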

As for Thanos Sidecar, query evaluation does not impact it whatsoever, since it only passes the time series data on to Thanos Query, which deduplicates it and actually computes the final result when evaluating the query.

In practice, this means that for a separate Prometheus + Thanos Sidecar deployment, it does not matter what actual queries the users are sending beyond the number of time series that they return.

RAM

Roughly, the RAM usage is equivalent to the sum of:

  • the resident on-disk TSDB size, because Prometheus internally uses mmap(2), so it is possible that all of the data on disk ends up mapped into memory as well (applies only to Prometheus). Note that this also covers the ingestion rate of new metrics, because all new metric data is appended to the WAL, which uses the same mechanism as the rest of the TSDB;
  • and buffers used for the responses and requests (applies to both Thanos Sidecar and Prometheus).

The first item is a more static size whereas the second one is more dynamic. The Prometheus options for limiting the number of concurrent requests, and the number of samples they may return through the different interfaces, come into play here. Consider limiting them even further on a smaller machine, after you have stress tested it and looked at the results. The most important options are:

  • --storage.remote.read-concurrent-limit
  • --storage.remote.read-sample-limit
  • --query.max-concurrency
  • --query.max-samples
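For example, a smaller machine could be started with something along these lines (the numbers are placeholders to show the idea, not recommendations):

    prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --storage.remote.read-concurrent-limit=5 \
      --storage.remote.read-sample-limit=5000000 \
      --query.max-concurrency=10 \
      --query.max-samples=10000000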

Stress testing existing Prometheus set-ups

Tools such as prombench and thanosbench already exist to benchmark the Prometheus and Thanos software respectively. However, they are written with the assumption that the user has access to some kind of Kubernetes cluster where everything will run. Unfortunately, that is not always the case.

To stress test existing setups, you might use features such as Consul service discovery for Prometheus. Then you push a fixed amount of data into it and query it back through Thanos Sidecar. That way all of the parts of the system get tested – the remote read interface, ingestion, and so on.

It might look like this:

This is a simplified view, obviously. Consider using Consul service discovery or some other mechanism to discover the machine that is running the tests. It will make your life much easier.
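A hedged sketch of the Prometheus side, assuming the load generators register themselves in Consul under a made-up service name stress-test:

    scrape_configs:
      - job_name: 'stress-test'
        scrape_interval: 15s
        consul_sd_configs:
          - server: 'consul.example.com:8500'   # assumed Consul address
            services: ['stress-test']           # discover only the load generators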

Re-use pushgateways

The race condition between the scrapes and the HTTP server is inherent, so re-use the Pushgateway to make life easier. Yes, starting your own HTTP server reduces the likelihood of a race, but it is still possible, so it does not solve the problem completely. On the other hand, running your own server is more efficient because the data only has to cross the wire once, but it makes your code more cumbersome, so decide for yourself. I recommend starting with a Pushgateway and only switching to running an HTTP server that exposes the metrics yourself if that becomes unbearable.
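Here is a minimal sketch of pushing a generated sample to a Pushgateway with the client_golang push package (the Pushgateway URL, job name, and metric are made up):

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // A throw-away gauge standing in for the generated load.
        g := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "stress_test_value",
            Help: "Synthetic value pushed by the stress tester.",
        })
        g.Set(42)

        // Push it under a job name; Prometheus then scrapes the Pushgateway.
        if err := push.New("http://pushgateway:9091", "stress_test").
            Collector(g).
            Grouping("instance", "generator-1").
            Push(); err != nil {
            log.Fatalf("push failed: %v", err)
        }
    }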

If you are making your own stress testing program in Golang, consider using the net/http package because it does the job and you most likely do not need anything fancy.

Use a sliding window

Use a sliding window to be able to cap the maximum number of samples per query.

As the maximum number of samples returned per query makes the biggest impact, you must somehow limit the number of samples returned. After a bit of thinking and testing, it seems to me that a “sliding window” technique generally gives the best results – the code looks clean and is understandable. The method is essentially this:

  • continue pushing metrics from time X;
  • at each point Y, limit the query to the time range from X to Y. If needed, move the beginning of the “time window” forward so that only a fixed number of samples is covered.

This is all made possible by knowing the scrape interval, so add it as a parameter to your program. Knowing it, you can divide the amount of time that has passed since the beginning by the interval and work out how many samples of the Z time series have been pushed.
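A rough sketch of that calculation under these assumptions (the function and parameter names are made up):

    package stress

    import "time"

    // windowStart returns the start of the query window ending at "now" such that
    // at most maxSamples samples per series are covered, given that pushing began
    // at "begin" with one sample per scrapeInterval.
    func windowStart(begin, now time.Time, scrapeInterval time.Duration, maxSamples int) time.Time {
        // How many samples per series have been pushed so far.
        pushed := int(now.Sub(begin)/scrapeInterval) + 1
        if pushed <= maxSamples {
            // Everything still fits; keep the window anchored at the start.
            return begin
        }
        // Slide the window forward so only the last maxSamples samples are covered.
        return now.Add(-time.Duration(maxSamples-1) * scrapeInterval)
    }

The total number of samples a query then touches is roughly this per-series count multiplied by the Z time series being pushed.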

Divide the work

If you are feeling extra fancy, or you do not have a single machine with enough threads of execution for all the work, consider dividing the work over a number of nodes.

For this use-case, you could use libraries such as SCOOP. I do not have any particular recommendations here as it depends on what language you are writing your capacity planning (stress testing) program in.

Check the results

Make it an option whether or not to check the results of all the queries that your program makes. The program can end up dealing with a lot of data, and it might not make sense to verify everything returned, since doing so can easily eat a lot of CPU time and thus the quality of your data will drop drastically.

Imagine having 20 concurrent queriers, each of them dealing with 500,000 samples – you would have to handle a lot of data, and this is especially true for bigger organizations. If exact checking is disabled then, honestly, just check whether you got the required number of time series and the correct number of samples; do not compare the exact data if the user has explicitly disabled it.
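A hypothetical sketch of that switch (the names are made up; a real program would wire verify into its flag parsing and querier loop):

    package stress

    import "log"

    // checkResult always performs the cheap count check; the expensive
    // value-by-value comparison runs only when verify is enabled.
    func checkResult(gotSeries, gotSamples, wantSeries, wantSamples int, got, want []float64, verify bool) bool {
        // Cheap check: did we get the expected number of series and samples?
        if gotSeries != wantSeries || gotSamples != wantSamples {
            log.Printf("count mismatch: got %d series / %d samples, want %d / %d",
                gotSeries, gotSamples, wantSeries, wantSamples)
            return false
        }
        if !verify {
            return true // the user opted out of exact verification
        }
        // Expensive check: compare every returned value against what was pushed.
        for i := range got {
            if i >= len(want) || got[i] != want[i] {
                log.Printf("sample %d does not match the pushed data", i)
                return false
            }
        }
        return true
    }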

Miscellaneous pain points

Thanos Store sample limit

This is a bit unrelated, but the Thanos Store component of the Thanos system unfortunately does not provide a way to limit the number of samples returned per query. In practice this means that a single query could knock down your whole system if it requests a lot of data.

For example, imagine someone sending a query like {__name__=~".*"}. Technically, this would retrieve all of the data, because each metric has a special label called __name__ which contains the name of the metric itself, so the regex matches everything. This runs counter to the whole long-term storage approach, where each query is supposed to touch only a small piece of the data in the backing storage and not the whole database.

Hopefully, this will be remedied in the near future, or maybe I will work on it myself if I have the time.

Remote read API inefficiency

The remote read API, which the Thanos project uses, is highly inefficient. It is a one-shot API, i.e. the whole response is prepared at once and sent back to the client. This means that you end up with a couple of buffers holding essentially the same data. At the minimum, you need three: one for the query results, one for the results encoded in the API response format, and one for that response compressed with snappy.

Luckily, this will change in a Prometheus version in the near future. A lot of work is being done in this area since the pain is felt by a lot of people. The Thanos folks are helping to drive this process forward too – they even had a discussion about it at the Prometheus developer summit.

As always, happy hacking! Let me know if you have any comments or if you have spotted any issues.