Web Search & Information Retrieval
In this Web Search and Information Retrieval tutorial we discuss how Web search engines work. Search engines have their roots in information retrieval (IR) systems, which prepare a keyword index for a given corpus and respond to keyword queries with a ranked list of documents. The query language provided by most search engines lets us look for Web pages that contain (or do not contain) specified words and phrases. Conjunctions and disjunctions of such clauses are also permitted. Mature IR technology predates the Web by at least a decade. One of the earliest applications of rudimentary IR systems to the Internet was Archie, which supported title search across sites serving files over the File Transfer Protocol (FTP). It was only in the mid-1990s that IR was widely applied to Web content by early adopters such as AltaVista. The new application revealed several issues peculiar to hypertext and Web data: Web pages have internal tag structure, they are connected to each other in semantically meaningful ways, they are often duplicated, and they sometimes misrepresent their actual contents in order to rank highly in keyword queries. We will review classical IR and discuss some of these new problems and their solutions.
This Web Search and Information Retrieval tutorial includes the following three parts:
- Boolean Queries and the Inverted Index
- Relevance Ranking
- Similarity Search
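
As a preview of the first part, the keyword index and the Boolean query clauses described in the introduction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular search engine is implemented: the three-document corpus, the whitespace tokenizer, and the names `build_index` and `search` (with its `must`, `any_of`, and `must_not` parameters) are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive whitespace tokenization, no stemming or stop words.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, all_ids, must=(), any_of=(), must_not=()):
    """Answer a Boolean query and return the matching document IDs, sorted.

    must     -- every term has to occur (conjunction)
    any_of   -- at least one term has to occur (disjunction)
    must_not -- none of the terms may occur (negation)
    """
    result = set(all_ids)
    for term in must:
        result &= index.get(term, set())
    if any_of:
        result &= set().union(*(index.get(t, set()) for t in any_of))
    for term in must_not:
        result -= index.get(term, set())
    return sorted(result)


# Hypothetical toy corpus, keyed by document ID.
docs = {
    1: "web search engines index web pages",
    2: "information retrieval systems rank documents",
    3: "search engines rank web pages by relevance",
}
index = build_index(docs)

print(search(index, docs.keys(), must=["search", "web"]))            # [1, 3]
print(search(index, docs.keys(), must=["rank"], must_not=["web"]))   # [2]
print(search(index, docs.keys(), any_of=["retrieval", "relevance"])) # [2, 3]
```

Note that this sketch returns an unordered match set (sorted by ID only for readability). Ordering the matches by how well they answer the query is the subject of the Relevance Ranking part.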