Web Crawling: Overview
The World Wide Web (abbreviated as WWW or W3, and commonly known as the Web) is a system of interlinked hypertext documents accessed via the Internet. The Web is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks, which is why they are a form of hypertext. These documents, or Web pages, are typically a few thousand characters long, written in a diversity of languages, and cover essentially all topics of human endeavor. Web pages are served over the Internet using the hypertext transfer protocol (HTTP) to client computers, where they can be viewed using browsers. HTTP is built on top of the transmission control protocol (TCP), which provides reliable transmission of data streams from one computer to another across the Internet.
Throughout this Web Crawling Tutorial, we shall study how automatic programs can analyze hypertext documents and the networks induced by the hyperlinks that connect them. To do so, it is usually necessary to fetch the pages to the computer where those programs will be run. This is the job of a crawler (also called a spider, robot, or bot). In this tutorial we will study in detail how crawlers work.
We assume that you have basic familiarity with computer networking using TCP, to the extent of writing code to open and close sockets and read and write data using a socket. We will focus on the organization of large-scale crawlers, which must handle millions of servers and billions of pages.
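To make that prerequisite concrete, the following is a minimal sketch (in Python, chosen purely for illustration) of fetching a single page by opening a TCP socket, writing an HTTP request to it, and reading back the response. The host name example.com is only a placeholder; a real crawler would add error handling, redirects, politeness rules, and much more.

```python
import socket

def fetch(host, path="/", port=80):
    """Fetch one page over a plain TCP socket using HTTP/1.0."""
    # Open a TCP connection to the web server.
    with socket.create_connection((host, port), timeout=10) as sock:
        # Send a minimal request; HTTP/1.0 means the server closes
        # the connection once the full response has been sent.
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # Read the response (status line, headers, body) until the
        # server closes the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

if __name__ == "__main__":
    response = fetch("example.com")          # placeholder host
    print(response[:200].decode("latin-1"))  # show the start of the reply
```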
This Web Crawling Tutorial includes the following four parts: