  
  Tilburg, July 19, 2006. Original Tilburg University press release (July 10, in Dutch); Daily estimated size of the World Wide Web

How big is the World Wide Web right now? Maurice de Kunder, a student of Business Communication and Digital Media at Tilburg University, has devoted his Master's thesis to this question. The most reliable estimate at the time of the thesis defense, July 2006, is that the web contains at least 14.3 billion web pages, says De Kunder. The Dutch web contains at least 291 million web pages.

To estimate the size of the indexable web, the part of the internet that anyone can search through search engines, De Kunder used a method based on the document frequencies of words in different text collections. If a word occurs in thirty articles in a collection of 30,000 newspaper articles, then the expected document frequency of that word is 1 in 1,000. If Google reports that it has indexed 9 million webpages containing that particular word, and the word is assumed to be equally frequent on the web, then the total number of webpages indexed by Google can be extrapolated to 9 billion (9 million times 1,000). By repeating this estimate-by-extrapolation with a large number of words and with the four dominant search engines, Google, Yahoo Search, MSN Search, and Ask, De Kunder was able to estimate the size of the indices of these engines.
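The extrapolation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not De Kunder's actual implementation: the function names, the sample word counts, and the choice to average the per-word estimates with a plain mean are all assumptions made for this example.

```python
from statistics import mean


def estimate_index_size(corpus_doc_freq, corpus_size, reported_hits):
    """Extrapolate a search engine's index size from a single word.

    corpus_doc_freq: number of documents in the reference collection
                     that contain the word
    corpus_size:     total number of documents in that collection
    reported_hits:   number of pages the engine reports for the word
    """
    if corpus_doc_freq <= 0:
        raise ValueError("the word must occur in the reference collection")
    # hits / (doc_freq / corpus_size), rearranged to avoid rounding error
    return reported_hits * corpus_size / corpus_doc_freq


def estimate_from_words(observations, corpus_size):
    """Combine per-word estimates; using the mean is an assumption here."""
    return mean(
        estimate_index_size(doc_freq, corpus_size, hits)
        for doc_freq, hits in observations
    )


# The example from the text: 30 of 30,000 articles, 9 million reported hits.
single_estimate = estimate_index_size(30, 30_000, 9_000_000)
print(f"{single_estimate:,.0f}")  # 9,000,000,000

# Hypothetical (document frequency, reported hits) pairs for three words.
observations = [(30, 9_000_000), (60, 18_600_000), (15, 4_350_000)]
combined_estimate = estimate_from_words(observations, 30_000)
```

Averaging over many words dampens the effect of any single word whose frequency on the web differs from its frequency in the reference collection.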

 



These four search engines index largely the same webpages, which means that the sizes of their indices cannot just be added. De Kunder ran a large-scale sample test to estimate the overlap between the different engines, to correct the resulting total estimate downwards. The resulting estimate is 14.3 billion webpages. De Kunder observed a growth of around 2% per month.
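Why the index sizes cannot simply be added can be illustrated for two engines. The sketch below is not the procedure from the thesis, which covers four engines and is not detailed here. It assumes a simple inclusion-exclusion correction in which the overlap is estimated from a sample of URLs drawn from one index and checked against the other. All names and numbers are hypothetical.

```python
def estimate_union_size(size_a, size_b, sample_from_a, is_in_b):
    """Estimate the combined size of two overlapping indices.

    size_a, size_b: estimated sizes of the two indices
    sample_from_a:  URLs sampled from index A
    is_in_b:        callable returning True if a URL is also in index B
    """
    if not sample_from_a:
        raise ValueError("the sample must not be empty")
    # Fraction of A's sampled pages that B has indexed as well.
    shared = sum(1 for url in sample_from_a if is_in_b(url))
    overlap_fraction = shared / len(sample_from_a)
    # Inclusion-exclusion: count the shared pages only once.
    estimated_overlap = overlap_fraction * size_a
    return size_a + size_b - estimated_overlap


# Hypothetical data: 3 of the 4 sampled URLs from A are also found in B.
sample_from_a = ["u1", "u2", "u3", "u4"]
urls_in_b = {"u1", "u2", "u3", "u9"}

union = estimate_union_size(
    8_000_000_000, 6_000_000_000, sample_from_a, urls_in_b.__contains__
)
print(f"{union:,.0f}")  # 8,000,000,000 rather than the naive sum of 14,000,000,000
```

In this made-up case the second engine adds no pages beyond what the first already covers, so the corrected total is far below the naive sum.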

Inflated?

A remarkable outcome of the study is that Google tends to yield estimates that vary wildly; in a sample period of a month in which measurements were made on a daily basis, the estimated size of Google's index fluctuated between 25 and 45 billion webpages. Yahoo Search appears to have better coverage of the WWW, based on a sample of random URLs (addresses of webpages) that was checked against the four search engines. In addition, Yahoo's index overlaps slightly more with the other three indices than Google's does. Yet the estimated size of Yahoo's index stays well under the varying sizes estimated for Google's index. De Kunder concludes that the estimated size of Google's index cannot be the basis of a reliable estimate of the size of the World Wide Web; the search engine might even be inflating its numbers. Yahoo Search is a better basis for such an estimate.

 

Note to the press

Maurice de Kunder presents his Master's thesis and receives his Master's diploma on Wednesday July 19, at 11 AM in room A187 (building A) of Tilburg University, Warandelaan 2, Tilburg, The Netherlands. Thesis supervisor is associate professor dr. Antal van den Bosch, email Antal.vdnBosch@uvt.nl, phone +31 13 466 3117. Also see: Tilburg University press releases.


© 2006 Tilburg University, Antal.vdnBosch@uvt.nl | Last update: Wed Jul 19, 2006