scanf(3) is a newbie trap, use fgets(3) with sscanf(3) instead

In case you did not know, scanf(3) and fgets(3) are both functions for reading something from standard input and doing something with the result – in the former case the input is interpreted and the results may be stored in the specified arguments, and in the latter case the input is simply placed into a buffer. The first one is commonly recommended to beginner C programmers for reading input from the user and parsing it. However, it should be avoided, mainly because it is very error-prone and it is difficult for newcomers to understand how it actually works. Instead of scanf(3), sscanf(3) should be used in combination with fgets(3). That way, you can easily tell what state your standard input stream is in after using those functions, and you get other benefits as well. Let me give some examples and explain more.

The following two examples are functionally equivalent – they both read a number from standard input and print it. The first one uses scanf(3) and the second one uses fgets(3) together with sscanf(3). Here they are:

#include <stdio.h>

int main(int argc, char *argv[])
{
    int ret, num;

    /* scanf() reads from stdin and parses it in a single step. */
    ret = scanf("%d", &num);
    if (ret == 1)
        printf("%d\n", num);
    return 0;
}

And here is the version that uses fgets(3) together with sscanf(3):

#include <stdio.h>

int main(int argc, char *argv[])
{
    char buf[100];

    /* Read one full line (including the '\n', if it fits). */
    if (fgets(buf, sizeof(buf), stdin) != NULL) {
        int num, ret;

        /* Parse the number out of the buffer we already hold. */
        ret = sscanf(buf, "%d", &num);
        if (ret == 1)
            printf("%d\n", num);
    }

    return 0;
}

You can try them out for yourself. Let us go through the differences. I think the best way to illustrate them is to look at the different kinds of input the user may provide to these programs and see how each one behaves:

  • Valid input. Let us say the user entered “123\n”. scanf would read “123” from stdin and leave the “\n” in the stream, whereas fgets would consume the “\n” as well. This is where the first problem occurs: new C programmers tend to think that scanf reads the “\n” too, or they do not realise that it is still there. This might not pose an issue if your stdin is not line buffered (e.g. when piping a file into a program), but most of the time it is – as far as I know, most terminals wait for the user to press “Enter” before sending the input to the program in the foreground. fgets, on the other hand, leaves stdin in a predictable state: you always know that if a ‘\n’ was there, it was consumed (unless your buffer is too small). And it is very easy to check – just look at whether the last character in your buffer is ‘\n’. You also know that if the last character is not ‘\n’ and there is still something to read from stdin, you need to read more characters (digits) to get the full number (see the sketch after this list).
  • Valid input but with some whitespace at the beginning. scanf automagically skips over it, whereas fgets gives you the opportunity to see what kind of whitespace preceded the actual number. Buffer size becomes an important question in this case; as always, keep reading until you have the full line. This may or may not be useful depending on your case, but in general fgets makes this step more transparent.
  • Invalid input, i.e. input that does not match the format specifier. scanf might consume a character from stdin, and unless it was stored in a buffer somewhere it is gone into the dark abyss – if I remember correctly it is lost once read, though maybe on some Unix you can get it back. This confuses the user, because it is hard to tell exactly how much was read from stdin. The fgets/sscanf combination lets you know exactly what was read. The confusion only grows when two or more calls to scanf are used one after the other, because it is hard to know what is left in stdin after the preceding calls.
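
Here is a minimal sketch that ties the cases above together: it reads one line, checks whether the whole line actually fit into the buffer, and only then hands it to sscanf(3). The buffer size of 100 and the error messages are arbitrary choices of mine, not anything mandated by the functions themselves.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[100];
    int num;

    if (fgets(buf, sizeof(buf), stdin) == NULL)
        return 1;                       /* EOF or read error */

    /* No '\n' in the buffer means the line did not fit (or the input
     * ended without a newline); anything left over is still in stdin. */
    if (strchr(buf, '\n') == NULL)
        fprintf(stderr, "line longer than buffer, the rest is still in stdin\n");

    /* sscanf skips leading whitespace just like scanf would, but here
     * we could inspect buf first if that whitespace mattered to us. */
    if (sscanf(buf, "%d", &num) == 1)
        printf("%d\n", num);
    else
        fprintf(stderr, "no number found\n");

    return 0;
}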

In general, my recommendation is this: avoid scanf unless you can absolutely control what the standard input is going to be and what its format will be. Besides all of the aforementioned problems, scanf also presents some security issues. For example, the format "%s" lets the user feed a string of any length into the specified char *. A malicious user can easily use this to overrun the buffer and write arbitrary data to memory. I hope you will take this into account the next time you write code like this.
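
To make that last point concrete, here is a small sketch of the difference; the buffer size of 32 is my own arbitrary choice. An unbounded "%s" lets the user write past the end of the buffer, while an explicit field width ("%31s", leaving one byte for the terminating '\0') keeps the conversion inside it:

#include <stdio.h>

int main(void)
{
    char name[32];

    /* Unsafe: scanf("%s", name); a word longer than 31 characters
     * would overrun the buffer. */

    /* Bounded: at most 31 characters are stored, plus the '\0'. */
    if (scanf("%31s", name) == 1)
        printf("%s\n", name);

    return 0;
}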

No, programming languages do not have speeds

I see, time and time again, the myth that programming languages have varying speeds peddled in discussions between programmers on IRC channels, forums, and so on. Some people think that, for example, C is much, much faster than Python. And a few of them go even further – they argue that C is a language that is “close to the metal”. Well, let me tell you. The code that you write in C actually targets an abstract machine, and a compiler eats that code and spits out assembly which roughly does what the C language standard says it should. Why roughly? Simply because that abstract machine is loosely specified and leaves a lot of behaviour undefined. This contradicts the original argument that C is “close to the metal”. How could real hardware be that loosely specified? Could it be that if you executed a certain instruction one time it would do something completely different, something not mentioned at all in the CPU manual? Well, the answer is obviously no. Nobody would use such computers.
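
As a small illustration of what “a lot of undefined behaviour” means in practice (the values here are just an example I picked): signed integer overflow is undefined in the C abstract machine, so a compiler is allowed to assume it never happens and may optimise the check below away entirely.

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int x = INT_MAX;
    int y = x + 1;      /* undefined behaviour: signed overflow */

    /* The standard does not say what y holds now; an optimising
     * compiler may simply assume y > x and drop this branch. */
    if (y < x)
        printf("wrapped around\n");
    else
        printf("no visible wrap\n");

    return 0;
}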

In general, a language (whether spoken or a programming language) is just a combination of a lexicon and a syntax – in other words, it is just a way for us to express our thoughts so that they are understood by other people or by computers, whether by a brain interpreting the sound waves that reach the eardrums or by a machine interpreting the logical structure of text. At the end of the parsing pipeline, a computer really just executes a *lot* of “primitive” instructions at the CPU level (for brevity, let us say that only the CPU executes instructions and makes the significant decisions). And those instructions do not all take the same amount of time to execute. So how could one language or another be faster? It mostly *depends* on the translation part – how the source text is turned into those instructions. Obviously, CPUs have these things called caches and so on which might influence results, but the former part still remains the most important one.

Certainly, all of those abstractions do not come without a cost, but the bottom line is that the languages themselves do not define how fast they are. Instead, what we should be talking about, at the very least, are their implementations. I am pretty sure that when people have those discussions, they do not have some particular fastest or slowest implementation in mind. So, if we are talking about speeds, we ought to compare the speeds of functionally equivalent, compiled programs on specific implementations. Even then it is problematic, because we need to agree on the definition of “speed”.

The number of CPU instructions that are executed? Well, some instructions are faster, some are slower. Just because there are more of them does not mean that the program is slower.

The number of source code lines? The sheer size of a program's source does not translate directly into how big the resulting executable is. Also, see the previous paragraph.

Memory usage? Even if some program allocates, let us say, 2 GB of RAM, that still does not mean it is slower. It might calculate the answer more quickly regardless of how much RAM it needs.

All in all, ideally we would be talking only about objective things. However, I think this is a utopia in general, and in this case in particular, because the benchmarking software still runs on some kind of operating system, on specific hardware, and so on. Perhaps there is no need to go into quite so much detail about objectivity when comparing programming languages, but we should at least look at the tip of the iceberg.

That tip is the specific compiler and its version. The benchmarking method and/or software should also be included. You could even include your definitions of words such as “speed”, so that everyone is talking about the same thing when using these (sometimes) convoluted terms. So, please, let us all do our part and make our corner a little bit more objective, instead of spreading anecdata and encouraging cargo-cult behaviour (people blindly switching from one language to another because they think it will make their programs magically faster) by saying that programming language X is faster than Y.