Giedrius Statkevičius

I ran into some interesting behavior the other day when trying to use an HTTP and/or HTTPS proxy on a GNU/Linux machine. cURL and friends read the environment variables http_proxy and https_proxy. So, you might think, if you want “global” proxy settings, you just have to create a new executable file in /etc/profile.d that is only writable by root, with content like this:

#!/bin/sh
export http_proxy="http://1.2.3.4"
export https_proxy="$http_proxy"

However, you might notice confusing behavior when you use hostnames like localhost to refer to your own machine. Unfortunately, the proxy setting takes priority even if the hostname is static (i.e. it is in /etc/hosts) and resolves to a local IP like 127.0.0.1. Your proxy might be a bit stupid and resolve localhost to its own (or even a different) IP, and then you get burned because you meant your own computer, not the proxy. This is especially confusing when the output looks very similar to what you expect, e.g. when you are setting up a simple web server on your machine and the proxy responds with a placeholder page. I confess that I fell into this trap, but I will not anymore.

To solve this, you have to use another environment variable called no_proxy. It contains a comma-separated list of hosts that will not go through the proxy. So, you have to append another setting to the /etc/profile.d/proxy.sh file that you created before. In the end, it will look something like this:

#!/bin/sh
export http_proxy="http://1.2.3.4"
export https_proxy="$http_proxy"
export no_proxy="localhost"
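As a quick sanity check, a small helper can tell you whether a given hostname is actually listed in no_proxy. This is only a minimal sketch: in_no_proxy is a hypothetical function name, not something cURL provides, and it only tests for an exact, comma-separated match, which may be stricter than the matching rules your tools really apply.

```shell
#!/bin/sh
# Hypothetical helper, not part of cURL: succeeds if the hostname in $1
# appears verbatim as one of the comma-separated entries of $no_proxy.
in_no_proxy() {
    case ",${no_proxy}," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example values, chosen only for illustration.
no_proxy="localhost,127.0.0.1"
if in_no_proxy localhost; then
    echo "localhost bypasses the proxy"
else
    echo "localhost would be sent to the proxy"
fi
```

Wrapping the list in commas on both sides lets a single case pattern match the first, last, and middle entries alike.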

This solves our problem, but it comes with a number of caveats as well:

You have to list all of the hostnames that should not use the proxy, and that becomes susceptible to human error when the list grows. This can be partially solved by using a configuration management system like Puppet so that you can be sure that you made no mistakes.
You have to maintain /etc/hosts in tandem with /etc/profile.d/proxy.sh to make sure that the same entries are in both files. Again, configuration management tools like Puppet or Chef can solve this easily, but the setup is still error-prone whenever either file is edited by hand, so they are not a complete answer. And we saw how even big companies like “Amazon” sometimes mess this up.
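If you do not have a configuration management system at hand, one way to keep the two files from drifting apart is to derive no_proxy from /etc/hosts instead of typing the list twice. The following is a rough sketch under the assumption that every hostname in your hosts file should bypass the proxy. hosts_to_no_proxy is a hypothetical helper name, and it takes the path of the hosts file as an optional argument so that it can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list of every hostname
# found in a hosts file (default: /etc/hosts), without duplicates.
hosts_to_no_proxy() {
    awk '
        { sub(/#.*/, "") }    # drop comments, including trailing ones
        NF >= 2 {
            # field 1 is the IP address, the rest are hostnames and aliases
            for (i = 2; i <= NF; i++)
                if (!seen[$i]++)
                    out = out (out == "" ? "" : ",") $i
        }
        END { print out }
    ' "${1:-/etc/hosts}"
}

# Possible use inside /etc/profile.d/proxy.sh (an assumption, adapt as needed):
#   export no_proxy="$(hosts_to_no_proxy)"
```

The list is then rebuilt at every login, so a new entry in /etc/hosts is picked up without touching proxy.sh.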

In conclusion, you have to be careful with /etc/hosts and proxies on GNU/Linux, especially with localhost. Do not forget to add at least localhost to the no_proxy environment variable if you are going to set a “global” proxy setting. As far as I know, there is no easy way to change this behavior of cURL and friends so that they would look up /etc/hosts first before sending a request to the proxy. Personally, I think some kind of option should be provided to change this default behavior and alleviate the problems that I mentioned before.

Read the fine manual

Time and time again I see people who follow random online tutorials and then become dazed and confused when something does not work. “Why does this not work? This tutorial shows that it should work.” I see similar questions occasionally in various forms on forums and IRC. I think people do not realize that there is a hierarchy of trustworthiness among information sources. We should be conscious of that hierarchy when looking for information and remember it when we notice that something is not correct or up-to-date.

In my opinion, the field of history has already nailed this down. Historians distinguish between primary and secondary sources of information. Primary sources provide direct evidence about an event, object, person, or work of art. Secondary sources are similar, but they discuss and analyze the primary sources [1]. It seems to me that we can draw a parallel between this and the information sources that we use to study programming. However, instead of a simple distinction between primary and secondary sources, a hierarchy is more suitable because we are researching a thing that we have in front of us right now and we can experiment with it. The only question that remains is: what does the hierarchy look like?

First, let’s think about what kinds of sources we have in programming before building the hierarchy. Personally, I can list these items:

Zeros and ones representing byte code

The actual machine code in the executable or the file that you are examining. This can be considered the primary source in programming. What is in there is actually executed on your machine, so you know that it cannot lie. However, it is very hard to decipher and not very informative. Thus, even though it is the most trustworthy, it is very unfriendly to a person who is trying to learn.
The source code that was compiled to make the executable or file. In terms of trustworthiness it is almost as good as machine code, and it is a very good source to learn from because source code is written for humans and lets you understand everything relatively easily. The only downside is that you have to know that the executable/file that you have was actually made from that source code. Projects such as reproducible builds [2] help with that, but they are not available everywhere, and you still have to be sure that the source code corresponds to that executable.
Empirical observation of what system calls the executable makes, what kind of options are available, what the output of various commands is, and so on. This source of information tells you what is apparently available to you as a user, but you cannot be sure about what exactly happens in all cases, so it is not as trustworthy. Also, by using this source of information you cannot know exactly which options and commands are available. What if there is a hidden feature or something that is not documented in the output?
Standards. Now we are entering the zone where we are not even talking about the actual file/program on your computer. Standards are much more trustworthy than the next item because they are usually governed and released by a rigorous organization such as ISO [3] or ANSI [4]. Also, a lot of deliberation and work goes into making sure that everything is correct, orderly, understandable, and free of contradictions. On the other hand, they are not as easy to use as the next items because most of the time you have to pay to get the standard. They also usually use more technical parlance than the next item.
Documentation released by the manufacturer or vendor. The quality of information released by the original makers tends to vary a lot. However, it is usually well structured and easily understandable, so it is not hard to skim and find the relevant information that you are searching for.
Books. This source of information tends to be researched more than the item that comes after it in the list. This is because a book cannot be changed after it is released. Also, most assertions in books need to be backed up by quotes or citations. However, because a book is not made by the original company or group of people that made the executable/file, it is less trustworthy than the previous item. What is more, the topics of book chapters tend to be more abstract than those of manuals, so it is sometimes harder to find the information that you are looking for than in official manuals.
Community tutorials, forums, wiki pages, articles. These are the least trustworthy because of the anonymous nature of the Internet. Anyone can write anything, and you are never sure whether what is written was researched well. There is a reason why no one uses web pages as serious sources of information in academia. On the other hand, this source is very accessible because almost everyone has a mobile phone or a laptop with an internet connection nowadays.

We can produce this picture after listing the items:

Hierarchy of Information

My point is that everyone should always keep this in mind. Also, if you now see someone making the same mistake I mentioned at the beginning, you can refer them to this article or this hierarchy. I hope this was useful. Please comment if you disagree with anything mentioned in this post or if you want to discuss it.
