Crawling and Indexing: Overview
We shall visit a large variety of programs that process textual and hypertextual information in diverse ways, but the capability to quickly fetch a large number of Web pages into a local repository and to index them based on keywords is required in many applications. Large-scale programs that fetch tens of thousands of Web pages per second are called crawlers, spiders, Web robots, or bots. Crawling is usually performed to subsequently index the documents fetched. Together, a crawler and an index form key components of a Web search engine.
One of the earliest search engines to be built was Lycos, founded in January 1994, operational in June 1994, and a publicly traded company in April 1996. Lycos was born from a research project at Carnegie Mellon University by Michael Mauldin. Another search engine, WebCrawler, went online in spring 1994. It was also started as a research project, at the University of Washington, by Brian Pinkerton. During the spring of 1995, Louis Monier, Joella Paquette, and Paul Flaherty at Digital Equipment Corporation’s research labs developed AltaVista, one of the best-known search engines with claims to over 60 patents, the highest in this area so far. Launched in December 1995, AltaVista started fielding two million queries a day within three weeks.
Many others followed the search engine pioneers and offered various innovations. HotBot and Inktomi feature a distributed architecture for crawling and storing pages. Excite could analyze similarity between keywords and match queries to documents having no syntactic match. Many search engines started offering a “more like this” link with each response. Search engines remain some of the most visited sites today.
In the later tutorial we will discuss how to write crawlers of moderate scale and capability and addresses various performance issues. Then, discusses how to process the data into an index suitable for answering queries. Indexing enables keyword and phrase queries and Boolean combinations of such queries. Unlike relational databases, the query engine cannot simply return all qualifying responses in arbitrary order. The user implicitly expects the responses to be ordered in a way that maximizes the chances of the first few responses satisfying the information need. These tutorials can be skimmed if you are more interested in mining per se; more in-depth treatment of information retrieval (IR) can be found in many excellent classic texts cited later.