Turbo-charging git-diff and making reviews easier

Do you work with git repositories that see a lot of code churn? For example, do you have a repository of Puppet hierarchical data where many pull requests merely move code between layers of hieradata? Ordinarily git-diff(1) only distinguishes added lines from removed ones, by default in green and red (the colours are configurable). At the end of 2017, however, git-diff(1) gained a feature that highlights text which was merely moved in different colours. That was exactly my situation, and the feature was a godsend: I do not want to spend time re-checking the syntax of hieradata that has only been moved. I am sharing it here in the hope that it helps you too, and perhaps someone has even better tips for making reviews more effective.

This is what a diff ordinarily looks like in git-diff(1):

With diff.colorMoved set to the default zebra mode, git-diff(1) shows something like this:

And with the dimmed_zebra mode:

This feature is controlled by the --color-moved command-line option or the diff.colorMoved setting in the gitconfig file. If you enable it without naming a mode, git currently uses the zebra mode that you saw in the screenshots above. In zebra mode git only treats a block of text as moved if it contains at least 20 alphanumeric characters, so short lines are not flagged by accident. The screenshots illustrate that as well. Adjacent moved blocks are painted in alternating colours, which is where the name zebra comes from.

Other modes exist as well, among them plain and dimmed_zebra (spelled dimmed-zebra in newer git releases, which keep dimmed_zebra as a deprecated synonym). The plain mode is not very useful for our use case: it colours any line that was removed in one place and added in another, so it cannot tell you whether a whole block was moved intact or its lines were permuted along the way. dimmed_zebra is more interesting: it works like zebra but dims the uninteresting parts of the moved blocks, so only the lines where two adjacent moved blocks border each other stand out. Try it out and see whether it is useful to you. In my opinion, this is the best mode.
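To make the setting permanent, something like the following could go into ~/.gitconfig. This is only a sketch: the mode is the zebra mode described above, and the four colour values are arbitrary picks for illustration, not recommendations.

```ini
# Highlight moved text in every diff (zebra is the mode discussed above).
[diff]
	colorMoved = zebra

# Optional: colours for moved lines. The values are illustrative choices only.
[color "diff"]
	oldMoved = magenta
	newMoved = cyan
	oldMovedAlternative = blue
	newMovedAlternative = yellow
```

For a one-off review you can leave the configuration alone and run git diff --color-moved=zebra instead.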

Finally, let me introduce another project that makes git-diff(1) output even more readable. It adds extra highlighting and otherwise tidies up the changed lines to make it clearer what is happening. It is called diff-so-fancy. You can start using it simply by running this command:

git config --global core.pager "diff-so-fancy | less --tabs=4 -RFX"

Obviously, the diff-so-fancy binary needs to be available in $PATH. This is what the end result looks like when applied to our previous diff:

Happy hacking! Find more information here:

scanf(3) is a newbie trap, use fgets(3) with sscanf(3) instead

In case you did not know, scanf(3) and fgets(3) can both be used to read from standard input (fgets(3) reads from whichever stream you pass it). scanf(3) interprets the input according to a format and may store the results in the arguments you supply, whereas fgets(3) simply copies a line into a buffer. scanf(3) is commonly recommended to beginner C programmers for reading and parsing user input. However, it should be avoided, mainly because it is very error-prone and newcomers find it difficult to understand how it actually works. Use sscanf(3) in combination with fgets(3) instead. That way you can easily tell what state the standard input stream is in after the calls, and you get other benefits too. Let me give some examples and explain more.

The following two examples are functionally the same: both read one number from standard input and print it. However, the first one uses scanf(3) and the second one uses fgets(3) with sscanf(3). Here they are:

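/* Version 1: scanf(3) reads and parses in a single step. */
#include <stdio.h>

int main(void)
{
    int ret, num;

    ret = scanf("%d", &num);
    if (ret == 1)    /* exactly one item was converted */
        printf("%d\n", num);
    return 0;
}

/* Version 2: fgets(3) reads a whole line, then sscanf(3) parses it. */
#include <stdio.h>

int main(void)
{
    char buf[100];

    /* fgets returns NULL on end-of-file or on a read error. */
    if (fgets(buf, sizeof buf, stdin) != NULL) {
        int num, ret;

        ret = sscanf(buf, "%d", &num);
        if (ret == 1)
            printf("%d\n", num);
    }

    return 0;
}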

You can try them out for yourself. The best way to illustrate the differences, I think, is to go through the kinds of input a user might provide to these programs and see how each one behaves:

  • Valid input. Let us say the user entered “123\n”. scanf would read “123” from stdin and leave the “\n” in the stream, whereas fgets would consume the “\n” as well. This is where the first problem occurs: new C programmers tend to think that scanf reads the “\n” too, or they do not realise that it is still there. With interactive input it almost always is there: as far as I know, most terminals wait for the user to press “Enter” before passing the input to the program in the foreground, so every entry ends in “\n” (input piped from a file need not). fgets, on the other hand, leaves stdin in a predictable state: if a ‘\n’ was in stdin, it has been consumed, unless your buffer was too small. That is easy to check: see whether the last character in your buffer is ‘\n’. If it is not, and there is still something to read from stdin, you know that you need to read more characters (digits) to get the full number.
  • Valid input with some whitespace at the beginning. scanf automagically skips over it, whereas fgets gives you the opportunity to see what kind of whitespace preceded the actual number. Buffer size becomes an important question in this case, so, as always, keep reading until you have the full line. Whether this is useful depends on your case, but in general fgets makes the process more transparent.
  • Invalid input, that is, input that does not match the format specifier. scanf stops at the first character that does not match, but whatever it consumed before that point (leading whitespace, or a sign that turns out not to be followed by a digit) is not put back. It is gone into the dark abyss unless you stored it in a buffer somewhere. If I remember correctly, only a single character of push-back is guaranteed, although maybe some Unix lets you read more back. The rest of the invalid input stays in stdin, where the next call will trip over it again. This is confusing because you do not know exactly how much was read from stdin, and it gets worse when two or more scanf calls follow one another, since it is hard to know what the preceding calls left behind. The fgets and sscanf combination tells you exactly what was read from stdin.
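The three cases above can be folded into one small helper. What follows is a minimal sketch, not a definitive implementation: the name read_int_line, its return codes and the 100-byte buffer are assumptions made for this example. It reads one line with fgets, checks for the ‘\n’ as described in the first bullet, and only then parses with sscanf.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line from `in` and parse an int from it.
 * Returns 1 on success, 0 if the line held no usable number,
 * and -1 on end-of-file or a read error. */
int read_int_line(FILE *in, int *out)
{
    char buf[100];

    if (fgets(buf, sizeof buf, in) == NULL)
        return -1;

    /* No '\n' in the buffer: either the line did not fit or the input
     * ended without one. Discard the rest of the line so that the next
     * call starts at a line boundary, and reject lines that did not fit. */
    if (strchr(buf, '\n') == NULL) {
        int c, overlong = 0;

        while ((c = getc(in)) != EOF && c != '\n')
            overlong = 1;
        if (overlong)
            return 0;
    }

    /* sscanf skips leading whitespace for %d, just like scanf does.
     * Values outside the range of int are not detected in this sketch;
     * strtol(3) would be the tool for that. */
    return sscanf(buf, "%d", out) == 1 ? 1 : 0;
}
```

Given the input “  123\nabc\n42\n”, three successive calls return 1 (storing 123), 0 and 1 (storing 42). A loop calling scanf("%d", ...) on the same input would get stuck on “abc”, because the unmatched characters stay in stdin and fail again on every retry.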

In general, my recommendation is this: avoid scanf unless you have absolute control over what the standard input will contain and how it will be formatted. Besides all of the aforementioned problems, scanf also presents security issues. For example, the format "%s" without a field width lets the user write a string of any length into the char * you supply. A malicious user can easily use this to overrun the buffer and overwrite adjacent memory. I hope you will take this into account the next time you write code like this.
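The "%s" problem can be avoided by giving the conversion a maximum field width. Here is a hedged sketch of that idea; the helper name read_name and the 15-character limit are made up for this example. The width counts the characters stored, so the destination needs one extra byte for the terminating ‘\0’.

```c
#include <stdio.h>

#define NAME_MAX_LEN 15   /* hypothetical limit chosen for this example */

/* Hypothetical helper: read one line from `in` and copy at most
 * NAME_MAX_LEN characters of its first word into `name`.
 * Returns 1 on success, 0 if the line held no word, -1 on EOF or error. */
int read_name(FILE *in, char name[NAME_MAX_LEN + 1])
{
    char buf[100];

    if (fgets(buf, sizeof buf, in) == NULL)
        return -1;

    /* The width must be written literally in the format string and must
     * match NAME_MAX_LEN: %15s stores at most 15 characters plus '\0'.
     * Lines longer than buf are not drained in this sketch. */
    return sscanf(buf, "%15s", name) == 1 ? 1 : 0;
}
```

With the input line “abcdefghijklmnopqrstuvwxyz\n” the helper stores only “abcdefghijklmno” and returns 1; the rest of the word is silently dropped rather than written past the end of the array.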