Jeez I cannot believe I wrote this circa 2001
Seriously now: I live in Greece, and the Net has made many inroads into our daily life here, just like it did when I was in the US. There is plenty of analysis of the net in the US, but there is a curious lack of high-level statistics on the status of the net in Greece, and in Europe in general. E.U. citizens should have a means of comparing their economies, the new net-economies, so that they can have a bird's-eye view of the union. There is Netcraft of course, and they are doing a fine job, but they do not offer a localized, per-gTLD general view.
But enough about the concept; let's get on with the methodology. Consider this an entry-level paper on applied Computer Science. The steps presented within are valid for most kinds of computing problems.
Problem definition, tool selection
The first step, always, is to define the problem. We need to find all the domains within a gTLD and retrieve the metadata of their web sites. By metadata I mean the web server, its IP, and its connectivity status. This means we need a program to retrieve web pages.
Decide on the software, and do not reinvent the wheel. Research the net for a thin web client's source. Wget would probably work, as would all the other fetching programs, but we need the absolute best in speed. I started working with the LWP package of Perl. It is a masterful piece of work, but it is geared towards stability, not speed. You can find some of this work in the files myscan.pl and myscan2.pl.
The problem I had with Perl is that while it is fast, it is not a speed demon, and the signals were often getting confused. It might be that I am simply not that good a coder in Perl. In any case, after using Perl for some time I decided to have a go with some C source. Back in school I wrote a little proggy that would fetch web pages from the net; it even had hooks for fetching from FTP and Gopher sites. So I retrieved it from the computing mothballs and started playing with it, hacking on the original code. The original program would fetch a whole page into a temp file. While this is useful, when scanning two hundred thousand domains it is a nightmare.
The first optimization was to retrieve only the HTTP headers from the web server instead of doing a "GET /" request. This lessened the load by a few orders of magnitude. The second important step was to do away with the temporary file, lessening the load on the system as well.
We have trimmed all the fat; now we must see how to improve performance. The nature of the problem is parallel, so breaking it up into pieces is natural. So how do you decide how many pieces to break it into? Easy: as many parallel processes as are needed to fully utilize your internet connection without saturating it. I decided on 32 parallel processes, each taking a part of the domain file. It is a good Magic Number.
Ogle the Results
Then I made some runs on the .GR TLD and studied the results. It always pays to study the first results and watch their behaviour. I noticed that domains that map to the same IP address, what we call virtual hosts, also report the same web server. Therefore, if you connect to just one of the IPs and retrieve the server data, you need not connect to any of the other domains that map to the same IP. The results will be identical, since all the domains are served by the same server.
Here the problem splits in two. My first attempt to remember IPs and results with C hash tables failed miserably. I had misunderstood the programming interface; but more on this lesson later.
Simple yet elegant solution
So now I have a huge data set of domain names which resolve to IPs in a fashion that seems random. Let us study our data set, and how we derive it, a little bit more. There are a number of scripts that massage the data before it becomes suitable for scanning. So I went back to the original data set and looked at it a bit more.
The IPs of domains that are registered with the same ISP use the same name servers and tend to form clusters. Let me explain this a bit further. In Greece, otenet.gr is an ISP, and its name servers are ns1.otenet.gr and ns2.otenet.gr. The domains registered with and served by otenet use those two name servers. The IPs these domains resolve to are usually limited to the IPs of otenet's computers. These IPs appear in more or less contiguous blocks of data, with the occasional errant domain that is not hosted by otenet.
So instead of sorting alphabetically by domain name, I started sorting by DNS server. In this fashion I get a lot of localized hits to the same IPs. The currently publicly available source for the masscrawler utilizes this observation. The solution is to use a few simple variables to save the last IP and headers seen. If the current domain resolves to the same IP as the one before, we are done. This little trick cut the time of the scan in half.
The lesson I learned is to study my data after I have defined the problem. It is quite common in the computer industry to ignore the nature of the data, but Computer Science's truly great algorithms know their data. So heed your schooling; it is the theoretical foundation of the engineer-scientist that makes him/her useful.
Time to backtrack. The above trick made the scanner fast, but we can still save a lot of time by utilizing hash tables. There is a saying in the scientific community that problems become trivial and obvious once they are explained. So I started to code probe programs using GNU C's hash tables, and lo and behold, things did indeed become trivial. The current hash table implementation does not copy any data into the hash tables, only pointers to existing strings within the program's data set.
Indeed, the hash table is an index system for remembering the places you have visited. To remember them you have to store them, and once stored you can retrieve them. The current alpha code for the masscrawler has a working hash table implementation which speeds things up further.
A note of interest here. The fact that hash tables reduce the number of connections does not diminish the value of the observation about data localization. Using both data localization and hash tables makes Xray really fast indeed. Come to think of it, it was fortunate that I did not have a programming engine that would help me brute-force the problem. I was forced to consider all aspects of it so I could save computer cycles.
A little detail. There is an alarm inside the code which says that if the domain has not resolved, or we were unable to connect to it and retrieve the headers within the timeout period, it should be considered inactive. A distinction is made between unresolved and irresponsive domains. Perl tended to confuse alarms when dealing with forked processes, and this was one of the reasons I gave up on it.
This alarm timeout is another magic number. It should be a prime number, or else the alarms will tend to arrive at all the parallel processes at the same time, forming signal storms. Check with your current manual of the universe; it tends to work in streaks, you know, bad streak, lucky streak, etc. The timeout value should be large enough to allow relatively decent ISPs to respond, but not so large that it bogs down the scanning process. My favorite numbers are 11, 13 and 17 for fast, medium and slow responses respectively. Adjust in the masscrawler code according to taste.
Yes, I know it should have been a command line argument; see the research grant section below.
The current parallelization is done by splitting the data set into 32 parts and feeding each part to a different process. Natural data proximity makes domains that map to the same IP appear relatively close together in the partial data sets.
This assumption holds often, but not always. We can save connections by maintaining a shared-memory IP/Server hash table data space. This way we will connect to each IP only once. A semaphore will be enough to control which process accesses the hash table to enter data. The savings are significant; for example, in the .AT TLD there are approximately 200,000 domains but only about 8,000 servers. We can save an order of magnitude of processing.
To further this step, we can use a database like Progress as a global store, which will also give us a better view of the data retrieved. Historical data and data mining are always easier with an SQL interface, and the overhead will be minimal compared to the cost of the internet connection.
Another problem is that the load on the name resolver is quite considerable while running the scan. Subsequent scans tend to resolve the same domain names to the same IP addresses; the cases where a domain changed ISP, and therefore IP, are limited. This observation allows us to offload DNS processing to an asynchronous process after the first run. We need at least one run to get all the IPs; after that we have them in a local database. In future runs an asynchronous process resolves the domains and flags the ones that changed, and we only need to scan those. We can save another two orders of magnitude of processing.
It seems to me that this section could form a research grant application.
The steps above are not always as clear-cut as they appear. Only hindsight helps us review them clearly, although the methodology is indeed valid. In real life the defining/coding/debugging cycle is continuous, and the parts mesh into and out of each other. As long as we maintain a grip on which part or step is which, we can maintain control of the project.
I use Xray as a podium to present the scientific method for implementing projects because it is small in definition and easy to comprehend, code, apply, and optimize.
It is a perfect textbook exercise for Computer Scientists.
For me it was a way to review and re-hone my coding abilities now that I am in the process of moving from being a coder to a project or IT manager. There are a lot of philosophical issues hidden deep inside parts of the code, and in the need for such packages, but these are best reviewed in a workshop or in classes somewhere, like an open university.
I write in English, although my native language is Greek, because I want to share my ideas with the rest of the world, and because my formal computer education took place in the United States. After all, today's lingua franca is English. :-)