CLASSIFICATION OF WEB SERVERS BY GEOGRAPHICAL LOCATION

Darcy G. Benoit, André Trudel, Lehe Cai
Jodrey School of Computer Science
Acadia University
Wolfville, Nova Scotia, Canada B4P 2R6
+1 (902) 585-1390
{darcy.benoit, andre.trudel, 079884c}@acadiau.ca

ABSTRACT
Using the data gathered from a recent Web census, we determine the geographic location of every Web server that is publicly accessible from the Internet. Counts of Web servers by country and network class are presented and discussed.

KEYWORDS
Measurement, World Wide Web, Census, Classification by country

1. INTRODUCTION
The typical user views the World Wide Web as a vast collection of pages of information. As long as there is an Internet connection and a browser, the Web pages can be viewed at any time and from anywhere. This omnipresence gives the user the impression that the information is literally at their fingertips. The reality is that the user is often accessing international networks to download pages from servers located in faraway countries.

The research described in this paper focuses on determining the geographical location of Web servers on the Internet. Geographical information is interesting because it tells us where the server physically resides as opposed to the locality of the information stored on the Web server itself. For example, a US company located in New York may have its Web site hosted on a server physically located in Panama. The visitor has the false impression that they are visiting a site in New York.

A Web server's physical location also has legal implications. The perceived location of the Web site may affect the actions of its users. For example, if a user believes that they are posting on a message board in the US when, in reality, the Web server for the message board is in another country, then it is not the free speech laws of the US that apply to the posting, but the laws of the country where the Web server is located.
Also, gambling laws are circumvented by having servers in countries where gambling is legal. These servers offer gambling websites targeted at citizens of other countries whose laws prohibit online gambling.

The number of Web servers per country is an important measure of a country's presence and impact on the Web. Lastly, geographic information can help us determine the impact a natural disaster may have on the Web.

In this paper, we begin by describing our Web census project and how we collect information on Web servers. Next, we explain how we determined the geographical location of each Web server that we found during the census. We follow with an overview and analysis of the geographic information based both on network class and overall numbers. We then list the top 20 countries by server count, and further review that information based on network class. We also review the bottom 20 countries by server count.

2. WEB CENSUS PROJECT
In the past, estimation techniques were used to estimate the size of the Web (e.g., [1][2][3][4][5][6][7][8]). Regardless of the technique used, an estimate was generated and it was impossible to measure its accuracy. Details and drawbacks of each estimation technique are summarized in [8].

Valid measures of the size of the Web include the number of web pages, web sites or web servers. We measure the size of the Web by counting the number of web servers. More specifically, we count the number of Web servers that have a publicly available Web page on port 80 (the default Web server port). Note that we do not count Web servers running on other ports, running behind a firewall, or not visible to the general public. Our web census stores and counts any web server that responds to our web request, including those that respond with redirections or error codes. The return code is stored along with a copy of the web page returned. Our approach allows us to count physical web server machines as opposed to counting domain names, where several domain names may point to the same physical server.

To census the Web, we check every possible IPv4 address for the presence of a Web server. There are 4.3 billion IP addresses in the IPv4 address space, but only 3.7 billion are valid web addresses. The rest of the addresses have been allocated for private networks, loopback, multicasts and reserved networks.

In the fall of 2006, we performed the world's first Web census [3][9] that queried every IP address for the presence of a Web server. When a Web census terminates, we begin a new one in order to gather historical information.
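The census step described above (skip addresses that cannot host a public server, then request a page on port 80 and keep whatever return code comes back) can be sketched as follows. This is a minimal illustration, not the census software itself: the function names and the timeout are hypothetical, and the exclusion flags are those of Python's `ipaddress` module, which may not match the exact list of excluded networks behind the 3.7 billion figure.

```python
import http.client
import ipaddress
from typing import Optional, Tuple


def is_census_candidate(address: str) -> bool:
    """Return True if the IPv4 address is worth probing.

    Assumption: Python's own private/loopback/multicast/reserved flags
    stand in for the excluded networks mentioned in the text.
    """
    ip = ipaddress.IPv4Address(address)
    return not (ip.is_private or ip.is_loopback
                or ip.is_multicast or ip.is_reserved)


def probe_port_80(address: str, timeout: float = 5.0) -> Optional[Tuple[int, bytes]]:
    """Request the root page on port 80.

    Returns (status code, page body) for any HTTP response, including
    redirections and error codes, or None if nothing answered.
    """
    conn = http.client.HTTPConnection(address, 80, timeout=timeout)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        return response.status, response.read()
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

Only `is_census_candidate` can be exercised offline; `probe_port_80` needs network access and should be pointed only at hosts one is permitted to query.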
The data for this paper comes from our latest Web census, which found a total of 24,234,406 Web servers between late September 2007 and early March 2008. For each Web server found, we stored a copy of the home page (text only, no images), the IP address of the home page, the amount of time that it took for the Web server to respond to the request for a Web page, the amount of time needed to download the home page, the client machine responsible for the request and the date that the request occurred.

We found approximately 7.7 million Web servers in the Class C range, 4.8 million in the Class B range, and 11.7 million in Class A. Although the introduction of CIDR in 1993 should have resulted in a more uniform allocation of IP addresses, this is not reflected in the density of web servers. Class C represents only 14% of valid IP addresses, but accounts for 32% of the servers found. Both Class B and Class A networks have a lower density of web servers than Class C. This can be attributed to the over-allocation of IP addresses to companies and institutions before CIDR, which resulted in many unused IP addresses. The comparison of allocation of IP addresses to the allocation of servers is found in Figure 1.

Figure 1 - Distribution by Traditional Network Class

3. GEOGRAPHICAL LOCATION
One way to determine the physical location of a web server is to use online conversion tools such as Geo IP Tool [10] or hostip.info [11]. These websites determine the geographical location for individual IP addresses or host names. While these websites might be useful for individual address lookups, it is not feasible to flood them with over 24 million requests. In order to support the large number of queries that we needed to make, we required a downloadable system or database.

MaxMind [12] provides two downloadable databases that can be used for converting IP addresses to their geographic location. The first is a commercial database called GeoIP Country, which boasts a 99.8% accuracy rate and is updated weekly. Although the cost for GeoIP Country is not high, we opted to use the alternate GeoLite Country database. The GeoLite Country database is updated monthly, boasts a 99.3% accuracy rate and is available for free. We used the March 2008 version of the GeoLite Country database.

The GeoLite Country database uses a single integer to represent each IP address. The conversion from IP address to integer is straightforward. Assume we have a standard IP address of the form A.B.C.D, where each of A, B, C and D is between 0 and 255. The IP number associated with the IP address is calculated using:

$$\text{IP number} = 256^3 \cdot A + 256^2 \cdot B + 256^1 \cdot C + 256^0 \cdot D$$

For example, the IP address 131.152.25.1 is converted to the following IP number:

$$256^3 \cdot 131 + 256^2 \cdot 152 + 256^1 \cdot 25 + 256^0 \cdot 1 = 2{,}207{,}783{,}169$$

This calculation was performed for each of the roughly 24 million IP addresses retrieved in our Web census.
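The conversion formula can be expressed directly in code. The sketch below is illustrative only; the function names `ip_to_number` and `number_to_ip` are hypothetical and are not part of the GeoLite Country distribution.

```python
def ip_to_number(address: str) -> int:
    """Convert a dotted IPv4 address A.B.C.D to its IP number."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    a, b, c, d = octets
    # Same weighting as the formula in the text: 256^3, 256^2, 256^1, 256^0.
    return (256 ** 3) * a + (256 ** 2) * b + 256 * c + d


def number_to_ip(number: int) -> str:
    """Inverse conversion, useful for checking results by hand."""
    if not 0 <= number < 256 ** 4:
        raise ValueError(f"IP number out of range: {number}")
    octets = [(number // 256 ** power) % 256 for power in (3, 2, 1, 0)]
    return ".".join(str(octet) for octet in octets)


# The worked example from the text.
print(ip_to_number("131.152.25.1"))  # 2207783169
```

Because the weights are powers of 256, the IP number preserves the ordering of addresses, which is what makes a table of numeric ranges usable for lookups.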
Once the IP addresses were converted to IP numbers, the next task was to look up each IP number in the GeoLite Country database. Each entry in the GeoLite Country database consists of a start IP number S, an end IP number E, and a country code CC. All IP numbers in the range [S,E] are located in country CC. The country codes are standard two-letter ISO 3166-1 codes [13]. For example, the code for Japan is JP.

4. RESULTS

4.1 Overall Results
A global geographic view of the results is shown in Figure 2. The US and China rank first and second, respectively, in number of servers. They are each assigned their own color. Diagonal stripes are assigned to countries (e.g., Canada) with half a million to one million servers. There are three other colors, which depend on server density. The lightest color is assigned to countries with fewer than 10,000 Web servers.

It is not surprising that the country with the largest number of Web servers is the United States. The US has 11,854,755 servers, which represents 49% of the servers found worldwide. A summary graph is shown in Figure 3. The x-axis, from left to right, has all 249 countries, sorted by decreasing number of servers. Note that the US is omitted from Figure 3 in order to preserve detail.
Figure 2 - Global View of Web Servers by Country

Figure 3 - Web Server Distribution for all Countries (Except US)

We were successful in identifying the location of all but 0.09% of the Web servers from the census. A summary of the servers that could not be mapped to a country appears in Table 1.

Table 1 - Web Servers that could not be Mapped to a Country

REASON               COUNT
Unknown              3,730
Satellite Provider   7,721
Anonymous Proxy     11,137
TOTAL               22,588
% of WEB SERVERS     0.09%
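The per-country counts in this section come from looking up each IP number in the table of (S, E, CC) ranges described in Section 3 and tallying the resulting country codes. One way this could be done is sketched below, assuming the ranges do not overlap. The range values and the function names are invented for illustration and are not taken from the GeoLite Country database; an IP number that falls in no range is counted under `None`.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Optional, Tuple

Range = Tuple[int, int, str]  # (start IP number S, end IP number E, country code CC)

# Invented toy ranges, NOT real GeoLite data. Note the gap between 299 and 400.
TOY_RANGES: List[Range] = [
    (100, 199, "JP"),
    (200, 299, "CA"),
    (400, 499, "US"),
]


def lookup_country(ip_number: int, ranges: List[Range],
                   starts: List[int]) -> Optional[str]:
    """Return CC for the range [S, E] containing ip_number, else None.

    `ranges` must be sorted by S and `starts` must hold the S values.
    """
    index = bisect_right(starts, ip_number) - 1  # last range with S <= ip_number
    if index >= 0:
        start, end, country = ranges[index]
        if start <= ip_number <= end:
            return country
    return None


def tally_by_country(ip_numbers: List[int], ranges: List[Range]) -> Counter:
    """Count servers per country code; unmapped servers are counted under None."""
    ordered = sorted(ranges)
    starts = [entry[0] for entry in ordered]
    return Counter(lookup_country(n, ordered, starts) for n in ip_numbers)


counts = tally_by_country([100, 150, 199, 200, 350, 450], TOY_RANGES)
print(counts.most_common())
```

A binary search keeps each lookup logarithmic in the number of ranges, which matters when the same table is consulted some 24 million times.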
4.2 Top 20 Countries
The top 20 countries ranked by server count are plotted in Figure 4 (note that the US is omitted to preserve detail). The graph shows a steady decline.

Figure 4 - Web Server Count by Country (without the US)

As previously mentioned, the US is in the top position. It should be noted that the database used (GeoLite Country) considers all AOL users as belonging to the United States, so the United States server count is inflated by the number of AOL users not in the United States that are running publicly accessible Web servers. Given that the number of US servers is more than 10 times that of the country in second position (China), we do not consider the number of AOL users significant enough to adjust the rankings. The United States has significantly more Web servers than any other country on the list, and accounts for almost half of all of the Web servers that we found.

The country in the second position is China. One can argue that the number of servers for China is both surprisingly high and surprisingly low. China has a low per capita web server ratio with approximately one server for every 1300 citizens, while the United States has a per capita ratio of one server for every 26 citizens. It is possible that access to web technology, broadband and know-how may be limited to only a subset of the Chinese population. It is also possible that Chinese users may host websites that are not physically located in China.

On the other hand, the number of servers could be considered high for a country with a government that is sometimes accused of curtailing Internet freedom. According to [14], China has strict laws associated with running websites, resulting in ISPs and web site managers being held legally responsible for the material posted on their websites. As a result, the amount of self-censorship and risk needed to run a web server in China may account for the rather low number of servers located in the country.
One interesting entry in Figure 4 is Turkey at position 7. This country is usually not thought of as an international Internet leader, but it has a per capita web server ratio of one server for every 135 residents.
4.3 Bottom 20 Countries
The 20 countries with the lowest number of Web servers are shown in Table 2. The countries ranked from 238 to 249 have no servers. Possible reasons for the absence of Web servers include a lack of Internet access, a lack of broadband, hosting of Web servers in other locations (particularly for the smaller islands, where hosting Web servers in a more populated and easily reachable city would provide better uptime), Web servers missed by either our census or the GeoLite database, or the classification of these Web servers as AOL or satellite rather than under the countries in question. One such example is Christmas Island with 1,402 residents. Its tourism website is run by a hosting service in the United States.

In other cases, the Web servers in the country in question may not be accessible by our computers in Canada. This could result from tight Internet restrictions in the country in question or simply a lack of telecommunication infrastructure. Note that a lack of Web servers does not imply that the country does not have Web access.

An interesting entry in Table 2 is North Korea in position 244 with 0 servers. The country likely has many Web servers that are firewalled by the government to prevent international access. It is also interesting to note that the country's official Web page is located on a server in Spain.

Table 2 - Bottom 20 Countries by Web Server Count

Order  Count  Country Name
230    7      Wallis and Futuna
231    6      Mayotte
232    5      Antarctica
233    4      Comoros
234    3      Bouvet Island
235    3      Guernsey
236    3      Niue
237    1      Tokelau
238    0      Cocos (Keeling) Islands
239    0      Christmas Island
240    0      Western Sahara
241    0      French Guiana
242    0      South Georgia and the South Sandwich Islands
243    0      Heard Island and McDonald Islands
244    0      Korea (North)
245    0      Pitcairn
246    0      Saint Helena
247    0      Svalbard and Jan Mayen
248    0      French Southern Territories
249    0      Timor-Leste
4.4 Countries by Class
Given the overwhelming number of Web servers located in the United States, it is no surprise that the top country in Class A networks (Table 3, columns 2 and 3), Class B networks (Table 3, columns 4 and 5) and Class C networks (Table 3, columns 6 and 7) is the United States. What becomes interesting is when we look at countries that appear high in some network classes and low in others. For example, China is ranked #4 in Class A networks and #2 in Class C networks, but does not appear in the top 20 in Class B networks. In a similar fashion, Poland appears as #10 in Class A networks, but does not appear in either the Class B or Class C top 20 lists. Canada ranks #8 in both Class A and Class C, but ranks #2 in Class B networks.

Such differences in ranking based on network class are due to how IP addresses are allocated to different countries and how those IP addresses are used in those countries. Many IP addresses have been allocated to the United States and Canada across all three network classes, but Canada appears to have and use a higher proportion of its Class B networks. This may be due to the number of higher education institutions in Canada and the fact that many of these institutions were allocated Class B networks. In the case of Poland, the number of servers found in Class B is very low (5,721) and the number of servers found in Class C is moderate (57,020). This suggests that the majority of Poland's IP addresses were assigned in the Class A range, which results in the large number of servers found in that range in Poland (224,108). The distribution of IP addresses to countries will have a significant effect on our top 20 rankings by traditional network class.

An interesting entry in Table 3 (position 17, column 5) is the Dominican Republic with 18,399 Class B servers. This is a small developing country.
Table 3 - Top 20 Countries by Class

       Class A                        Class B                        Class C
Order  # Servers  Country             # Servers  Country             # Servers  Country
1      5952346    United States       3210993    United States       2691416    United States
2      508380     Turkey              277021     Canada              589610     China
3      468279     Germany             191596     Japan               529048     Japan
4      427579     China               121522     Israel              349091     Brazil
5      411114     United Kingdom      108687     Brazil              334657     Germany
6      376065     Italy               103144     Australia           267926     United Kingdom
7      262369     Spain               92735      Mexico              249216     Korea
8      255098     Canada              82025      Colombia            236977     Canada
9      243028     Japan               69527      Peru                194461     Australia
10     224108     Poland              69069      France              144956     Taiwan
11     219786     India               62460      Taiwan              128372     Netherlands
12     215384     Netherlands         56379      Italy               108669     France
13     188382     France              53221      Germany             100352     Peru
14     174470     Russian Federation  31592      Argentina           98121      Russian Federation
15     134225     Thailand            24765      United Kingdom      91672      Italy
16     130398     Taiwan              21067      Korea               80072      Spain
17     119088     Korea               18399      Dominican Republic  74393      Hong Kong
18     92704      Romania             17359      Netherlands         67704      Costa Rica
19     81545      Denmark             11753      Spain               66696      Thailand
20     73937      Belgium             9905       Venezuela           62916      Colombia
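The per-class rankings in Table 3 rest on assigning each server address to a traditional network class, which depends only on the first octet of the address. A minimal sketch of that grouping follows; the function names are hypothetical and the sample records are invented, so the output does not reproduce any figure from the table.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def network_class(address: str) -> str:
    """Return the traditional (pre-CIDR) class of a dotted IPv4 address."""
    first_octet = int(address.split(".")[0])
    if first_octet < 128:
        return "A"   # leading bit 0
    if first_octet < 192:
        return "B"   # leading bits 10
    if first_octet < 224:
        return "C"   # leading bits 110
    if first_octet < 240:
        return "D"   # multicast
    return "E"       # reserved


def top_countries_by_class(servers: Iterable[Tuple[str, str]],
                           limit: int = 20) -> Dict[str, List[Tuple[str, int]]]:
    """Rank countries by server count within each class.

    `servers` yields (IP address, country code) pairs.
    """
    per_class: Dict[str, Counter] = defaultdict(Counter)
    for address, country in servers:
        per_class[network_class(address)][country] += 1
    return {cls: counter.most_common(limit) for cls, counter in per_class.items()}


# Invented sample records for illustration only.
sample = [
    ("12.0.0.1", "US"), ("12.0.0.2", "US"), ("80.1.1.1", "PL"),
    ("131.152.25.1", "CA"), ("200.1.1.1", "BR"),
]
rankings = top_countries_by_class(sample)
print(rankings["A"])
```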
5. CONCLUSIONS AND FUTURE WORK
Our approach steps away from the people using the Internet and the registered country address of a Web page by looking at the actual physical location of the Web servers that host the information available on the Web. While our current information is only at the country level, it can be used to determine where the bottlenecks may be for serving Web information, as well as to determine where the concentration of Web information physically resides.

The information presented here is novel because it is the first time that this type of geographical information has been extracted from Web census information. By avoiding the pitfalls associated with counting dozens of website addresses that all point to the same machine, we are able to provide a more definitive count and map of the number of machines serving Internet content. We are now able to provide a geographical distribution of where those Web servers reside.

Future work in this area will begin with determining the language associated with the root Web page for each server that we visited. We will then be able to match Web server root page languages against the languages spoken in the country hosting each server to see if there is any correlation between the two. We will also begin the process of classifying Web server home pages based on category, such as business, education and sports. This categorization will help us with the classification of Web servers, allowing us to determine not only the location of a server, but the language and topic of the site as well.

ACKNOWLEDGEMENTS
We would like to thank NSERC and Acadia University for supporting this research.

REFERENCES
[1] Benoit, D., Slauenwhite, D., and Trudel, A. 2006. A Web Census is Possible. The IEEE/IPSJ Symposium on Applications and the Internet (SAINT2006), (Jan 2006), Phoenix, Arizona, USA, 39-44.
[2] Benoit, D., Slauenwhite, D., and Trudel, A. 2007. On the path to a World Wide Web census: A large scale survey. 2nd International Conference on Internet Technologies and Applications (ITA 07), (Sept 2007), Wrexham, UK.
[3] Giles, C.L. and Lawrence, S. 1998. Searching the World Wide Web. Science Magazine, 280, 98-100.
[4] Giles, C.L. and Lawrence, S. 1999. Searching the Web: General and Scientific Information Access. IEEE Communications Magazine, 116-122.
[5] Giles, C.L. and Lawrence, S. 1999. Accessibility of Information on the Web. Nature Science Journal, 400, 107-109.
[6] Netcraft's Web Server Survey. 2004. Retrieved May 11, 2004 from http://news.netcraft.com/archives/web_server_survey.html.
[7] OCLC. 2004. Online Computer Library Centre Inc.'s Web Characterization Project. Retrieved May 11, 2004 from http://wcp.oclc.org.
[8] Rhodenizer, D. and Trudel, A. 2004. Estimating the size and content of the World Wide Web. Proceedings of the Third IASTED International Conference on Communications, Internet and Information Technology, US Virgin Islands, 564-570.
[9] Benoit, D., and Trudel, A. 2007. World's first Web census. International Journal of Web Information Systems, Volume 3, Number 4, Emerald Group Publishing, (2007), 378-389.
[10] Geo IP Tool. 2008. Retrieved 1 July 2008 from http://www.geoiptool.com/.
[11] Hostip.info. 2008. Retrieved 1 July 2008 from http://www.hostip.info/.
[12] MaxMind. 2008. Retrieved 1 July 2008 from www.maxmind.com.
[13] ISO 3166-1 codes. 2008. Retrieved 1 July 2008 from http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm.
[14] Race to the Bottom. Corporate Complicity in Chinese Internet Censorship, Human Rights Watch, Volume 18, No. 8(C), August 2006.