Ancient specification “bug”: you cannot use colons in your username with the HTTP basic authentication method

Out there, in the wild exists a lot of different authentication schemes or methods however one of them which is relatively popular because it’s implemented in most popular browsers at the moment, has this one peculiar “bug” in its specification – you cannot use a colon (‘:’) in the username field. If you have ever seen a window such as this:

HTTP Basic Auth window on Chrome

Then that website is probably using HTTP basic auth and you must not use a colon in your username on that site because simply you would not be able to do that.

Why, you might ask? Well, simply because in that authentication scheme a colon is used to separate the username from the password. If you used a colon in your username, the HTTP server would not be able to discern between the username and the password because it is transmitted to it in this format with this scheme: username:password.

This scheme is defined in RFC 7617 and RFC 2617. As it says in the RFCs themselves:

Furthermore, a user-id containing a colon character is invalid, as the first colon in a user-pass string separates user-id and password from one another; text after the first colon is part of the password. User-ids containing colons cannot be encoded in user-pass strings.

That is an excerpt from RFC 7617. This is from RFC 2617:

To receive authorization, the client sends the userid and password, separated by a single colon (":") character, within a base64 encoded string in the credentials.

As you can see for yourself, the older version of this RFC (2617 is from June, 1999 whereas 7617 is from September, 2015) does not explicitly state that it is impossible to use a colon with this scheme however it is implicitly stated.

You might be surprised but a lot of software gets this wrong. For example, I recently looked into using ml2grow/GAEPyPI for running a simple PyPI on Google App Engine to reduce the costs. The code is all dandy and nice however the username and password parsing is a bit broken. It all happens here:

(username, password) = base64.b64decode(auth_header.split(' ')[1]).split(':')

As you can see, this code breaks a bit when .split(':') returns more than two results – when the username or password field contains a colon itself. This could be mitigated by using the first result as the username, and by concatenating all following results into a single string which would be used as the password.  I will open a pull request soon to fix this issue. There are probably many more examples such as this.

As far as I know, we can only postulate about why this decision was made. My first thought was that maybe because Internet and computers were not so fast back in June, 1999, the people who made RFC 2617 decided to put it all into one field. This could have been easily remediated by having two separate fields for username and password. Perhaps this would have been too costly? I do not know.

Do you know of any other historical “bugs” in widely used specifications nowadays? Also, maybe you know what might be the reasons why this RFC was made in this way? Please let everyone know in the comments section down below. Thanks for reading and happy hacking!

scanf(3) is a newbie trap, use fgets(3) with sscanf(3) instead

If you have not known before, scanf(3) and fgets(3) are both functions intended for reading something from standard input and doing something with the result – in the former case it is interpreted and the results might possibly be stored in specified arguments and in the latter case the result is simply put into a buffer. The first one is commonly recommended to beginner C programmers for reading something from the user and parsing it. However, it should be avoided mainly because it is very error-prone and it is difficult to understand how it actually works for new people. Instead of scanf(3), sscanf(3) should be used in combination with fgets(3). That way, you can easily guess in what state your standard input stream is after using those functions and you get other benefits. Let me give some examples and explain more.

The following two examples are functionally the same – they both read two numbers from standard input and print them. However, the first one uses scanf(3) and the second one uses fgets(3). Here they are:

#include <stdio.h>

int main(int argc, char *argv[])
{
    int ret, num;

    ret = scanf("%d", &num);
    if (ret == 1)
        printf("%d\n", num);
    return 0;
}
#include <stdio.h>

int main(int argc, char *argv[])
{
    char buf[100];

    if (buf == fgets(buf, 100, stdin)) {
        int num, ret;

        ret = sscanf(buf, "%d", &num);
        if (ret == 1)
            printf("%d\n", num);
    }
    
    return 0;
}

You can try them out for yourself. Let us go through the list of differences. I think the best way to illustrate them is to go through a list of different kinds of inputs that the user maybe provide to these programs and see what are the differences:

  • Valid input. Let us say the user entered “123\n”. scanf would read “123” from stdin and it would leave the “\n” in the stream. However, fgets would eat the “\n” as well. This is where the first problem occurs: new C programmers tend to think that scanf would read the “\n” as well or they do not realise that it is still there. This might not pose an issue if your stdin is not line buffered (e.g. when piping a file into a program) however most of the time it is – as far as I know most of the terminals wait for the user to press “Enter” before sending the user’s input to a program in the foreground. On the other hand, fgets would leave stdin in a predictable state – you will always know that if ‘\n’ existed in stdin then it was eaten (unless your buffer is too small). And it is very easy to check this hypothesis – just check if the last character in your buffer is ‘\n’. Also, you know that you need to read more characters (digits) from stdin to get the full number if the last character is not ‘\n’ and you still have something to read from stdin.
  • Valid input but with some whitespace at the beginning. scanfautomagically skips over it but however fgets presents an opportunity for you to see what kind of whitespace was at the beginning, before the actual number. Buffer size becomes an important question in this case. As always, read until you got the full line. This might or might not be useful depending on your case. In general, fgets in this case introduces more transparency in the process.
  • Invalid input. This is the case when it does not match the format specifier, characters are read from stdin but they are not brought back. scanfmight accidentally eat a character from stdin and it would be gone into the dark abyss unless it was stored in a buffer somewhere. If I remember correctly, it disappears as long as you read it from stdin but maybe on some Unix you can read it back again. This raises a confusion for the user because the user does not know how much was read from stdin exactly. The other function combination lets you know exactly what was read from stdin. This might cause even more confusion when two or more pairs of scanf are used one after the other because it is hard to know what is left in stdin after the preceding calls to scanf.

In general, my recommendation is this: avoid using scanf unless you can absolutely control what the standard input is going to be and what will be its format. Also, besides all of the aforementioned problems, scanf does present some security issues. For example, the format "%s" lets the user input any length string into a specified char *. A malicious user can easily use this to over-run the buffer and write any arbitrary data to memory. I hope you will take this into account the next time you will write code such as this.