ABCs of Logstash capacity planning

Recently I had to do some capacity planning for Logstash, the relatively popular piece of software that stands for the L in the ELK (Elasticsearch, Logstash, Kibana) stack, so I thought I should share what I have learned. While researching, I found an appalling lack of information on this topic, so I hope this article helps fill that void on the Internet at least a little.

Dimensions

Logstash is, in essence, software that lets users send data to it, parses that data, and then ships it off to one or more defined outputs. Thus, we have to consider capacity planning from these perspectives:

  • CPU is needed to parse the incoming data as fast as possible
  • RAM is needed to hold the data while it is being parsed and sent off
  • Storage is needed for extra resilience and temporary buffering

Planning

Before ingestion

Typically, the inputs defined in Logstash pipelines are constantly listening for, say, new connections through which users send their data.

That is not ideal – we want to gracefully shed excessive load and tell our users to send their data again once our Logstash instances are no longer overloaded. This approach is even outlined in Google’s popular SRE book.

Logstash provides this capability through what is called a “persistent queue”. In every pipeline that has this mechanism enabled, ingested data is first stored on disk, up to a certain limit, and only then handed off to the workers for processing.

Enable this in your configuration:

queue.type: persisted

And then consider playing around with these options:

queue.max_bytes: 1024mb
queue.page_capacity: 64mb

The block above shows the default values, but you should tune them according to your needs. Obviously, setting queue.max_bytes higher will lead to higher disk usage. You can find more information in the Logstash documentation on persistent queues.

After ingestion

Now the data has been ingested – how can we control the process further? This is where the filter plugins come in.

Mainly, these filter plugins are used for that:

  • throttle, which, as the name suggests, throttles incoming events by marking those that cross configured thresholds with a certain tag
  • drop, which can be used to drop events entirely
  • truncate, which cuts the data down to a configured number of bytes
  • prune, which lets you remove certain fields, e.g. by whitelisting or blacklisting their names

You can combine the throttle and drop plugins to limit the number of messages flowing through a pipeline – a rough sketch is shown below, and tuning it for real traffic is left as an exercise to the reader.
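Assuming the events carry a host field and that roughly 1000 events per minute per host is an acceptable rate (both assumptions are purely illustrative), a minimal filter block could look like this:

filter {
  # Tag events from a host once it sends more than ~1000 events within a minute.
  throttle {
    key         => "%{host}"
    after_count => 1000
    period      => "60"
    add_tag     => "throttled"
  }
  # Drop everything that got the tag, effectively rate-limiting the pipeline.
  if "throttled" in [tags] {
    drop { }
  }
}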

Always keep an eye on the latency and CPU usage metrics. You do not want to add too many actions to your pipelines, as each one increases the usage of those resources.

If your I/O is fast enough, i.e. the CPU is not spending much time in iowait, then under full load the load average reported by the uptime command should be close to the number of execution threads in your CPU(s). Also consider adding an alert on the time it takes for an event to pass through the whole pipeline.

RAM usage

The upper limit on Logstash’s RAM usage is, obviously, whatever heap limits are set on the Java virtual machine it runs on via the -Xmx and -Xms parameters. There is plenty of material on Stack Overflow explaining these two flags if you need a refresher.
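In recent Logstash releases these flags usually live in the config/jvm.options file rather than being passed by hand; the stock file sets both to the same value, roughly like this (1g is the shipped default, so adjust to taste):

-Xms1g
-Xmx1g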

However, from Logstash’s point of view, RAM usage depends on the size of the buffers used for ingestion, the number of workers, the batch size, and the size of the fields in the messages.

The size of the buffers depends on the input plugin you are using. For example, the kafka input lets you choose a custom buffer size: increase the receive_buffer_bytes option if you want a bigger buffer.
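As a hedged sketch, a kafka input with a larger receive buffer might look like the block below; the broker address, topic name, and buffer size are made-up values for illustration only:

input {
  kafka {
    bootstrap_servers    => "kafka01:9092"
    topics               => ["app-logs"]
    # Roughly a 1 MB TCP receive buffer instead of the plugin default.
    receive_buffer_bytes => 1048576
  }
}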

The number of workers is tuned via the pipeline.workers parameter. By default it equals the number of CPU cores; if you are using Puppet, it should probably be set from the processors['count'] fact (the number of execution threads).

The batch parameters are tuned via the pipeline.batch.* options. You ought to increase their values if your individual CPU threads are relatively fast. If you raise these options, you might also want to raise the -Xmx JVM option described before, since bigger batches mean more in-flight events held in memory at once.
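Putting the worker and batch knobs together, the relevant part of logstash.yml might look like this – the worker count assumes an eight-thread machine, and the batch values shown are the defaults:

pipeline.workers: 8
pipeline.batch.size: 125
pipeline.batch.delay: 50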

Finally, the last thing impacting RAM usage is the number of fields in each event and their size. It is hard to know in advance how big they might be, but you can cap their length with the truncate plugin described before.

You will probably end up using it either way, since the output is typically elasticsearch and Elasticsearch has intrinsic limits on the size of each document. However, feel free to be stricter on certain fields!
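As an example, here is a minimal truncate block – the field name and the 8 KB limit are arbitrary placeholders to adapt to your own data:

filter {
  # Cap a potentially huge field so that single events cannot blow up memory.
  truncate {
    fields       => ["message"]
    length_bytes => 8192
  }
}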

Conclusion

Logstash provides straightforward ways of tuning its resource usage, and here I have tried to present the most commonly used options in one place. Tune them according to your resiliency demands and the capacity you have available along the dimensions mentioned at the start.

As always, let me know if you have enjoyed it or if I have missed something, or made a mistake! Thanks for reading ❤️.

What I would love to see in the Puppet configuration management software

We can make puppets prettier and more functional

I have been using Puppet off and on for almost a couple of years now and have watched it transform from a domain-specific language with lots of warts into a pretty and functional one. However, not everything is fine and dandy; I feel that in some ways it could still be improved a lot.

Namespace support for modules in environments

Puppet has reasonably good support for namespaces in code through the special :: delimiter, which has a magical meaning: when Puppet sees it in the name of a class or a define, it expects to find the code one directory level deeper. From now on I will only talk about classes, but other constructs work similarly from the parser’s point of view. However, at the top level, all class names (the part before the first ::, if there is one) live in the same, top-most scope. In a very big environment you can easily end up with two different modules – competing implementations that configure the same piece of software – having identical names.

Questions on Stack Overflow show that this problem is not uncommon at all. The only “solution” you have right now is to fork the module into your own repository and rename all of its classes to something like foo_nginx or bar_sssd.

Puppet already expects a fairly rigid structure for your environment, so it probably would not be painful to add another special separator to the include or contain statements. For example, it could look like this: modulename/nginx::vhost. Such syntax would follow the same naming rules as Puppetfiles.
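To make the idea concrete, a manifest using the proposed separator might read like the lines below; to be clear, this is hypothetical syntax sketching the proposal, not anything Puppet accepts today:

# Hypothetical syntax only – not valid Puppet as of this writing.
# Pull in the vhost class from one specific module, even if another
# module in the environment also ships a top-level "nginx" namespace.
include modulename/nginx::vhost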

Better tooling

Puppet could use some better tooling. The server implementation ships with JRuby, an implementation of Ruby that runs on the JVM. I am not completely sure why it was chosen – the JVM generally feels sluggish – but there is nothing inherently bad about that. The real issue is that RSpec, the de facto most popular Ruby testing framework, is not entirely thread-safe; the upstream issue about this has been open for a few years by now. I hope it does not turn into something like MySQL’s bug #11472, which is more than ten years old now!

The problem is that if you want to test your Puppet code, you have to use the same JRuby, because some things behave slightly differently in it – especially anything that calls into C libraries, e.g. openssl. Combined with RSpec’s lack of thread safety, this means all of your puppet-rspec tests have to run sequentially, one it block after another!

On the same note, the popular r10k environment deployment tool performs all of its actions sequentially. That makes deploying new environments take a very, very long time – on the order of minutes. Fortunately, smart people have reimplemented it in Go so that deployment happens concurrently: g10k.

There is also a complementary r10k webhook, which adds batteries to r10k and triggers deployment of new environments when new commits land in a repository, for example. Unfortunately, it has plenty of problems as well, such as:

  • new environments are exposed to the Puppet server before they are fully deployed, which can lead to spurious errors;
  • the operations happen synchronously, so the request can be cancelled from the client’s side (e.g. GitHub) if it takes too long – something that can easily happen in big environments.

The community could write replacements for these tools that solve these problems elegantly; still, it would be nice if they came from Puppet Labs themselves.

The movement towards “immutable” infrastructure

Last but not least, let’s talk about the movement towards “immutable infrastructure” within the wider DevOps movement, with tools like Terraform riding the wave. Configuration management only ensures that certain actions are performed, which leads to a certain configuration; it does not check and revert all of the previous actions that were done by hand. This is where “immutable infrastructure” comes in.

But… do tools like Puppet and Salt still have a place in the modern IT world when we have things like Kubernetes? I would say the answer is yes – something still needs to stand up the machines and images that run those Kubernetes clusters. Even if we are only standing up stateless machines from images built elsewhere, those images still need to be built repeatably and fast.

This is where software like Packer comes into play. It supports lots of different provisioners, one of which is Puppet. Thus, as time goes on, we will still have competition and innovation in this space.

I have included this section because I feel that configuration management does not encourage “immutable infrastructure” enough. Sure, we have more automation on top, such as Spinnaker, which helps with tearing everything down and bringing it all back up again, but I feel this has not been emphasized nearly enough.

As time goes on, the state of your machines will inevitably diverge from what you have in your code. Of course, Puppet does not go that deep, but I feel such tools should become more sophisticated and somehow nudge their users to at least periodically rebuild everything from scratch from the code in their repositories. Nightly tests of Puppet profiles/roles and chaos-monkey-style experiments are one possible answer, but not everyone does that. Perhaps it would be cool for tools such as Packer to gain support for overlay filesystems: the underlying, read-only filesystem would be prepared by a provisioner like Puppet, and all of the mutable changes would land in an “overlay” on top of it. Time will tell.

Thanks for reading and let me know your thoughts!