The Japan Times
July 9, 1999

Internet search engines lost in fast expanding Web data

Study finds all engines combined index just 42%

Los Angeles Times

LOS ANGELES - If searching the World Wide Web for that one nugget of information already seems like a bad trip into a quagmire of data, Internet researchers have a bit of bad news for you - the situation is only getting worse.
Even the most comprehensive search engine in operation is aware of no more than 16 percent of the estimated 800 million pages on the Web, according to a study in the latest issue of the scientific journal Nature. Moreover, the gap between what is posted on the Web and what is retrievable by the search engines is widening fast.
"The amount of information being indexed (by commonly used search engines) is increasing, but it's not increasing as fast as the amount of information being put on the Web," said Steve Lawrence, a researcher at NEC Research Institute in Princeton, N.J., one of the study's authors.
The findings are important because they raise the specter that the Internet may lead to a backward step in the distribution of knowledge at a time of technological revolution: The breakneck pace at which information is added to the Web may actually mean more information is lost to easy public view than is made available.
The study also underscores a little-understood feature of the Internet. While many users believe Web pages are automatically available to the search programs employed by such sites as Yahoo, Excite, and AltaVista, the truth is that finding, identifying, and categorizing new Web pages requires a great expenditure of time, money and technology.
Lawrence and his coauthor, fellow NEC researcher C. Lee Giles, found that most of the major search engines index less than 10 percent of the Web. Even by combining all the major search engines, only 42 percent of the Web has been indexed, they found.
The rest of the Web - trillions of bytes of data ranging from scientific papers to family photo albums - exists in a kind of black hole of information, impenetrable by Web surfers unless they have the exact address of a given Web site. Even the pages that do end up indexed take an average of six months to be discovered by the search engines, Lawrence and Giles found.
The pace of indexing marks a striking decline from that found in a similar study conducted by the same researchers just a year and a half ago.
At that time, they estimated the number of Web pages in the world at about 320 million. The most thorough search engine in that study, HotBot, covered about a third of all Web pages. Combined, the six leading search engines they surveyed covered about 60 percent of the Web.
While Web surfers often complain about retrieving too much information from search engines, said Oren Etzioni, chief technology officer of the portal Go2net and a professor of computer science at the University of Washington, failing to capture the full scope of the Web would mean surrendering one of the most powerful parts of the digital revolution - the ability to seek and share diverse information across the globe.
Etzioni said the mushrooming size of the Web's audience makes the gulf between what is on the Web and what is retrievable increasingly important.
"There is a real price to be paid if you are not comprehensive," he said. "There may be something that is important to only 1 percent of the people. Well, you're talking about maybe 100,000 people."
Lawrence and Giles estimated the number of Web pages by using a special program that searched systematically through 2,500 random Web servers - computers that hold Web pages. They calculated the average number of pages on each server and extrapolated for the 2.8 million servers on the Internet.
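The extrapolation described above can be illustrated with a rough sketch. This is not the researchers' program: the function name and the eight sample page counts are made up for illustration, chosen only so that their average matches the roughly 286 pages per server implied by the article's figures (800 million pages spread over 2.8 million servers).

```python
from statistics import mean


def estimate_total_pages(sampled_page_counts, total_servers):
    """Average the page counts of the sampled servers, then scale that
    average up to the total number of servers."""
    return mean(sampled_page_counts) * total_servers


# Hypothetical page counts for a handful of sampled servers (the study
# sampled 2,500; these eight values are invented for illustration).
sample = [12, 40, 3, 950, 25, 680, 7, 571]

TOTAL_SERVERS = 2_800_000  # server count reported in the article

estimated_pages = estimate_total_pages(sample, TOTAL_SERVERS)
print(f"Estimated pages on the Web: {estimated_pages:,}")

# Share of that estimate known to the most comprehensive engine
# (16 percent, per the article).
best_engine_pages = 0.16 * estimated_pages
print(f"Pages indexed by the best engine: {best_engine_pages:,.0f}")
```

A few very large servers dominate an average like this one, so the quality of the estimate depends heavily on how random and how large the server sample is.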