Running systemd Units Only If Certain CPU Features Are Available

systemd’s logo (C) https://brand.systemd.io/

Do you have some kind of application that should only be executed when the user’s CPU has a certain feature? Or maybe your application has horrible performance unless certain instructions are available? As you may already know, the Linux kernel exposes this information via /proc/cpuinfo:

processor       : 15
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 96
model name      : AMD Ryzen 7 PRO 4750U with Radeon Graphics                                                                                                                                                                                  
...
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Where does this data come from? One source of information is the cpuid instruction. On the x86 architecture, systemd v248 and later can start or skip a unit based on the features exposed by the CPUID instruction. Enter ConditionCPUFeature. On CPUs that do not have the CPUID instruction, such as ARM CPUs, systemd assumes that the CPU implements no features, so any ConditionCPUFeature condition will fail.
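Setting ConditionCPUFeature aside for a moment, you can approximate the same check from a shell by grepping the kernel-reported flags. This is only a sketch: it reads /proc/cpuinfo rather than issuing CPUID directly, and it only looks at the first logical CPU:

```shell
#!/bin/zsh
# Rough shell approximation of ConditionCPUFeature=sse4_2:
# check whether the kernel reports the flag for the first logical CPU.
if grep -m1 '^flags' /proc/cpuinfo | grep -qw 'sse4_2'; then
    echo "sse4_2 available"
else
    echo "sse4_2 not available"
fi
```

Note that systemd itself queries CPUID directly, so it works even before /proc is mounted; the grep above is merely the userspace shortcut.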

The feature strings that systemd understands can be found here. So, for example, to ensure that your service only runs when SSE 4.2 is available, add this to your unit file:

[Unit]
ConditionCPUFeature=sse4_2

To negate this condition, put an exclamation mark in front of the feature string. The same example, inverted, looks like this:

[Unit]
ConditionCPUFeature=!sse4_2

This ensures that the unit runs only if SSE 4.2 is not available.

Finally, it is worth mentioning that this check assumes a homogeneous system, i.e. one where all of the available CPU cores implement the same CPU features. As far as I can tell, the Intel CPU manuals and other vendor manuals typically make the same assumption, because it holds for most systems.

This could be improved in the future by checking the CPUID flags of every available CPU. On Linux this could be implemented by reading /dev/cpu/*/cpuid, or by explicitly scheduling a process on each CPU in turn and reading the CPU flags there. Perhaps systemd could even provide a feature where a unit is scheduled only on CPUs that implement one or more requested features. If you are interested in knowing more, this article provides good information.
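On Linux, /proc/cpuinfo already reports a flags line per logical CPU, so a rough heterogeneity check can be sketched in shell without touching /dev/cpu/*/cpuid. This is just a sketch for x86 (ARM kernels use a "Features" line instead), and a real check would compare CPUID leaves directly:

```shell
#!/bin/zsh
# Count the distinct "flags" lines across all logical CPUs in
# /proc/cpuinfo; more than one distinct set would indicate a
# heterogeneous system.
distinct=$(grep '^flags' /proc/cpuinfo | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
    echo "all CPUs report identical feature flags"
else
    echo "found ${distinct} distinct flag sets"
fi
```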

I enjoyed working on this feature in my free time because I wanted to learn more about how these things work. I knew a bit about CPUID before, but I had never delved deep into it. Implementing this taught me a lot about how CPU feature identification actually works.

Surprising Optimization When Appending Elements to an Array in ZSH

Recently I ran into a surprising ZSH performance bottleneck while editing my .zshrc. Because it gets loaded every time a new shell comes up, I noticed the issue pretty quickly and started digging into the problem. It turns out this surprising behavior concerns appending new elements to an array. The file I was reading was not huge, only about 6000 elements or so, but it still had a significant impact on how fast the Z shell was able to start up.

Imagine a simple loop such as this:

#!/bin/zsh
while read line; do
  echo "${line}"
done < ~/somefile

And ~/somefile has been prepared with:

#!/bin/zsh
# Fill ~/somefile with random 32-character alphanumeric lines
for i in $(seq 1 8873); do
    UUID=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 32 | head -n 1)
    echo "${UUID}" >> ~/somefile
done

Then, let’s read the same file into an array with two different scripts. The first:

#!/bin/zsh
arr=()
while read line; do
    arr=(${arr} ${line})
done < ~/somefile

And the second:

#!/bin/zsh
arr=()
while read line; do
    arr+=(${line})
done < ~/somefile

Can you guess which one is faster? If your answer is the second one, you are correct. At least on my machine:

$ time ~/testcase_fast.sh
~/testcase_fast.sh  0,17s user 0,11s system 98% cpu 0,279 total
$ time ~/testcase.sh
~/testcase.sh  15,68s user 0,95s system 98% cpu 16,881 total

A simple ltrace(1) comparison reveals where the time is actually spent:

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 34.49   17.657872          60    292810 read  
 25.07   12.833696          43    292815 mbrtowc
  8.41    4.304812          60     71076 sigprocmask
  7.80    3.994101          43     91362 strlen     
  6.99    3.580468          43     82299 strcpy
  4.04    2.066031          43     47979 malloc
  3.78    1.936972          42     45209 free
...

And the “slow” version (not a full run here, since it takes a very long time, but the picture is clear):

% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 28.21  177.575768          44   4006596 strlen     
 28.11  176.938217          44   4000420 strcpy
 14.16   89.127826          44   2002599 malloc
 14.01   88.177566          44   1996322 strchr
 13.99   88.058532          44   1999451 free  
  0.62    3.915208          59     65835 read   
  0.45    2.844029          43     65841 mbrtowc  
...

My original assumption was that ZSH would “understand” that I am simply adding a new element at the end and thus avoid copying. To be honest, I had not even suspected this part when I started investigating. However, clearly, that’s the culprit: arr=(${arr} ${line}) builds a brand-new array from all of the existing elements on every iteration, so the Z shell constantly allocates, copies, and frees memory. Reading n lines this way costs on the order of n² element copies, and that work accounts for more than 50% of the runtime. 😱
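The quadratic copy can be demonstrated with a toy benchmark. The loop bodies below mirror the two scripts above, just driven by a counter instead of a file (the function names and element count are my own invention for illustration):

```shell
#!/bin/zsh
# Compare rebuilding the array on every iteration vs. appending with +=.
n=2000

rebuild() {
  arr=()
  for ((i = 0; i < n; i++)); do
    arr=("${arr[@]}" "x$i")   # copies all i existing elements each time
  done
}

append() {
  arr=()
  for ((i = 0; i < n; i++)); do
    arr+=("x$i")              # appends a single element
  done
}

time rebuild
time append
```

Even at a couple of thousand elements, the rebuild variant is visibly slower, and the gap widens quadratically as n grows.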

Truth be told, I originally copied this code from StackOverflow and adapted it to my own needs:

https://stackoverflow.com/questions/54309712/zsh-doesnt-autocomplete-correctly-my-ssh-command

So, this is yet another reminder not to make too many assumptions about how something works, and that snippets of code on the Internet are not necessarily of the best quality.

Strictly speaking, we probably couldn’t even say that the slower version appends to the array, because the += operator is not used. On the other hand, the effect is the same, so I think it is OK to use the word append here in the general sense.

All in all, you should always use += when appending elements to an array! I hope this helps! Let me know if you have any comments or suggestions!