Putting Together A Crawler: Tutorial and Course



Putting Together A Crawler: Overview


The World Wide Web Consortium (W3C) has published a reference implementation of the HTTP client protocol in a package called W3C libwww. It is written in C and runs on most operating systems. The flexibility and consequent complexity of the API may be daunting, but the package greatly facilitates the writing of reasonably high-performance crawlers. Commercial crawlers probably resemble crawlers written using this package up to the point where storage management begins.

Because the details of commercial crawlers are carefully guarded, I will focus on the design and use of the W3C libwww library instead. This tutorial has two parts. In the first part, I will discuss the internals of a crawler built along the same lines as w3c-libwww. Since w3c-libwww is large, general, powerful, and complex, I will abstract its basic structure through pseudocode that uses C++ idioms for concreteness. In the second part, I will give code fragments that show how to use W3C libwww.



Putting Together A Crawler: Design Of The Core Components


It is easiest to start building a crawler with a core whose only responsibility is to copy bytes from network sockets to storage media: this is the Crawler class. The Crawler’s contract with the user is expressed in these methods:


class Crawler {
 public:
  void setDnsTimeout(int milliSeconds);
  void setHttpTimeout(int milliSeconds);
  void fetchPush(const string& url);
  virtual bool fetchPull(string& url);        // set url, return success
  virtual void fetchDone(const string& url,
      const ReturnCode returnCode,            // timeout, server not found, ...
      const int httpStatus,                   // 200, 404, 302, ...
      const hash_map<string, string>& mimeHeader,  // Content-type = text/html
                                                   // Last-modified = ...
      const unsigned char* pageBody,
      const size_t pageSize);
};


The user can push a URL to be fetched to the Crawler. The crawler implementation guarantees that within a finite time (preset by the user via setDnsTimeout and setHttpTimeout) the termination callback handler fetchDone will be called with the same URL and the associated fields shown above. (I am hiding many more useful arguments for simplicity.) fetchPush inserts the URL into an in-memory buffer; this may waste too much memory for a Web-scale crawler, and the buffer is volatile, so pending URLs are lost if the crawler crashes. A better option is to check new URLs into a persistent database and override fetchPull to extract new work from this database. The user also overrides the (empty) fetchDone method to process the document, typically storing the page data and metadata from the method arguments, scanning pageBody for outlinks, and recording these for later fetchPulls, as sketched below. Other functions, such as retries, redirections, and scanning for outlinks, are implemented by extending the Crawler class. In a way, “Crawler” is a misnomer for the core class; it just fetches a given list of URLs concurrently.
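
Here is a minimal sketch of such an extension. It is not part of the library; Frontier, storePage, and extractLinks are hypothetical helpers standing in for the user's persistent URL database, page store, and link extractor.


// Frontier, storePage, and extractLinks are hypothetical helpers (see text).
class MyCrawler : public Crawler {
 public:
  explicit MyCrawler(Frontier& frontier) : frontier(frontier) {}

  // Draw the next unseen URL from the persistent frontier.
  bool fetchPull(string& url) {
    return frontier.nextUnseenUrl(url);   // sets url, returns success
  }

  // Store the page, then record its outlinks for later fetchPulls.
  void fetchDone(const string& url, const ReturnCode returnCode,
                 const int httpStatus,
                 const hash_map<string, string>& mimeHeader,
                 const unsigned char* pageBody, const size_t pageSize) {
    if (returnCode != SUCCESS || httpStatus != 200) return;  // retries omitted
    storePage(url, mimeHeader, pageBody, pageSize);          // page data + metadata
    vector<string> outlinks = extractLinks(pageBody, pageSize);
    for (size_t i = 0; i < outlinks.size(); ++i)
      frontier.checkIn(outlinks[i]);      // recorded for later fetchPulls
  }

 private:
  Frontier& frontier;
};
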

Let us now turn to the implementation of the Crawler class. We will need two helper classes called DNS and Fetch. Crawler is started with a fixed set of DNS servers. For each server, a DNS object is created. Each DNS object creates a UDP socket with its assigned DNS server as the destination. The most important data structure included in a DNS object is a list of Fetch contexts waiting for the corresponding DNS server to respond:


class DNS {
  list<Fetch*> waitForDns;  // Fetch contexts waiting for this DNS server to respond
  // ... other members (e.g., the UDP socket and the assigned server's address)
};


A Fetch object contains the context required to complete the fetch of one URL using asynchronous sockets. waitForDns is the list of Fetches waiting for this particular DNS server to respond to their address-resolution requests. Apart from members to hold request and response data and methods to deal with socket events, the main member in a Fetch object is a state variable that records the current stage in retrieving a page:


typedef enum {
STATE_ERROR = -1, STATE_INIT = 0,
STATE_DNS_RECEIVE, STATE_HTTP_SEND, STATE_HTTP_RECEIVE, STATE_FINAL
} State;
State state;
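
Putting these pieces together, a Fetch record might look roughly as follows. Only state is named explicitly above; the remaining members are illustrative guesses based on the description.


class Fetch {
  string url;               // the URL being fetched
  IPAddress serverAddress;  // filled in once DNS resolution completes (illustrative)
  State state;              // current stage in retrieving the page
  long deadline;            // when the current DNS or HTTP step times out
  string httpRequest;       // outgoing request bytes
  string httpResponse;      // incoming response bytes (headers and body)
  // ... methods that react to socket readiness events for this fetch
};
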


For completeness I also list a set of useful ReturnCodes. Most of these are self-explanatory; others have to do with the innards of the DNS and HTTP protocols.


typedef enum {
SUCCESS = 0,
//----------------------------------------------------------------------
DNS_SERVER_UNKNOWN,
DNS_SOCKET, DNS_CONNECT, DNS_SEND, DNS_RECEIVE, DNS_CLOSE, DNS_TIMEOUT,
// and a variety of error codes DNS_PARSE_... if the DNS response
// cannot be parsed properly for some reason
//----------------------------------------------------------------------
HTTP_BAD_URL_SYNTAX, HTTP_SERVER_UNKNOWN,
HTTP_SOCKET, HTTP_CONNECT, HTTP_SEND, HTTP_RECEIVE,
HTTP_TIMEOUT, HTTP_PAGE_TRUNCATED,
//----------------------------------------------------------------------
MIME_MISSING, MIME_PAGE_EMPTY, MIME_NO_STATUS_LINE,
MIME_UNSUPPORTED_HTTP_VERSION, MIME_BAD_CHUNK_ENCODING
} ReturnCode;


The remaining important data structures within the Crawler are given below.


class Crawler {
  deque<string> waitForIssue;
  // Requests (URLs) wait here to limit the number of network connections.
  // When resources are available, they go to...
  hash_map<SocketID, DNS*> dnsSockets;
  // There is one entry for each DNS socket, i.e., for each DNS server.
  // New Fetch record entries join the shortest waitForDns list.
  // Once addresses are resolved, Fetch records go to...
  deque<Fetch*> waitForHttp;
  // When the system can take on a new Http connection, Fetch records
  // move from waitForHttp to...
  hash_map<SocketID, Fetch*> httpSockets;
  // A Fetch record completes its lifetime while attached to an Http socket.
  // To avoid overloading a server, we keep a set of IP addresses that
  // we are nominally connected to at any given time:
  hash_set<IPAddress> busyServers;  // SocketID and IPAddress stand for the
                                    // platform's socket handle and address types
  // ... rest of Crawler definition
};


The heart of the Crawler is a method called Crawler::start(), which starts its event loop. This is the most complex part of the Crawler and is given as pseudocode below. Each Fetch object passes through a workflow. It first joins the waitForDns queue on some DNS object. When the server name resolution step is completed, it goes into the waitForHttp buffer. When we can afford another HTTP connection, it leaves waitForHttp and joins the httpSockets pool, where there are two major steps: sending the HTTP request and filling a byte buffer with the response. Finally, when the page content has been completely received, the user callback fetchDone is called with suitable status information. To make this a real crawler, the user has to extend the Crawler class and redefine fetchDone to parse the page and extract outlinks.


Crawler::start()
while event loop has not been stopped do
if not enough active Fetches to keep system busy then
try a fetchPull to replenish the system with more work
if no pending Fetches in the system then
stop the event loop
end if
end if
if not enough Http sockets busy and
there is a Fetch f in waitForHttp whose server IP address ∉ busyServers then
remove f from waitForHttp
lock the IP address by adding an entry to busyServers (to be polite to the server)
change f.state to STATE_HTTP_SEND
allocate an Http socket s to start the Http protocol
set the Http timeout for f
set httpSockets[s] to the Fetch pointer
continue the outermost event loop
end if
if the shortest waitForDns is “too short” then
remove a URL from the head of waitForIssue
create a Fetch object f with this URL
issue the DNS request for f
set f.state to STATE_DNS_RECEIVE
set the DNS timeout for f
put f on the laziest DNS’s waitForDns
continue the outermost event loop
end if
collect open SocketIDs from dnsSockets and httpSockets
also collect the earliest deadline over all active Fetches
perform the select call on the open sockets, with the earliest deadline as the timeout
if select returned with a timeout then
for each expired Fetch record f in dnsSockets and httpSockets do
remove f
invoke f.fetchDone(...) with suitable timeout error codes
remove any locks held in busyServers
end for
else
find a SocketID s that is ready for read or write
locate a Fetch record f in dnsSockets or httpSockets that was waiting on s
if a DNS request has been completed for f then
move f from waitForDns to waitForHttp
else
if socket is ready for writing then
send request
change f.state to STATE_HTTP_RECEIVE
else
receive more bytes
if receive completed then
invoke f.fetchDone(...) with successful status codes
remove any locks held in busyServers
remove f from httpSockets and destroy f
end if
end if
end if
end if
end while
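
As a concrete illustration of the select step, the fragment below shows how the timeout can be derived from the earliest deadline among the active Fetches. This is a sketch of ours rather than library code; it assumes the fd_sets, maxFd, and the millisecond clock values have already been prepared from dnsSockets and httpSockets.


#include <sys/select.h>
#include <sys/time.h>

// Wait until some socket is ready or the earliest Fetch deadline expires.
// Returns 0 on timeout, a positive count of ready sockets, or -1 on error.
int waitForNextEvent(int maxFd, fd_set* readSet, fd_set* writeSet,
                     long earliestDeadlineMs, long nowMs) {
  long waitMs = earliestDeadlineMs - nowMs;
  if (waitMs < 0) waitMs = 0;              // some deadline has already expired
  struct timeval tv;
  tv.tv_sec = waitMs / 1000;
  tv.tv_usec = (waitMs % 1000) * 1000;
  return select(maxFd + 1, readSet, writeSet, NULL, &tv);
}
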




Putting Together A Crawler: Case Study


So far we have seen a simplified account of how the internals of a package like W3C libwww are designed; now we will see how to use it. The W3C libwww API is extremely flexible and therefore somewhat complex, because it is designed not only for writing crawlers but for general, powerful manipulation of distributed hypertext, including text-mode browsing, composing dynamic content, and so on. Here we will sketch a simple application that issues a batch of URLs to fetch and installs a fetchDone callback routine that just throws away the page contents.
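
The skeleton below gives a rough idea of such an application. It loosely follows the style of the sample programs shipped with w3c-libwww, but it is only a sketch: the exact function names and signatures should be checked against the installed libwww headers, and fetchDone, outstanding, and MAX_OUTSTANDING are names of our own.


#include "WWWLib.h"
#include "WWWInit.h"

#define MAX_OUTSTANDING 64   /* our own cap on concurrent fetches */

int outstanding = 0;         /* number of requests currently in flight */

/* Termination ("after") filter, defined in the next fragment. */
int fetchDone(HTRequest * request, HTResponse * response,
              void * param, int status);

int main(int argc, char ** argv)
{
    int i;

    /* Initialize libwww as a non-caching client. */
    HTProfile_newNoCacheClient("MyCrawler", "1.0");

    /* Install fetchDone as a global termination handler. */
    HTNet_addAfter(fetchDone, NULL, NULL, HT_ALL, HT_FILTER_LAST);

    /* Issue a batch of URLs; each fetch gets its own HTRequest context. */
    for (i = 1; i < argc && outstanding < MAX_OUTSTANDING; i++) {
        HTRequest * request = HTRequest_new();
        HTRequest_setOutputFormat(request, WWW_SOURCE);  /* raw page bytes */
        if (HTLoadAbsolute(argv[i], request))
            outstanding++;
        else
            HTRequest_delete(request);
    }

    /* Enter the event loop; fetchDone is invoked as requests complete. */
    HTEventList_newLoop();

    HTProfile_delete();      /* clean up libwww once the loop has stopped */
    return 0;
}
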

Unlike the simplified design presented in the previous part, W3C libwww can process responses as they stream in and does not need to hold them in a memory buffer. The user can install various processors through which the incoming stream must pass. For example, we can define a handler called hrefHandler to extract HREFs, which would be useful in a real crawler, and register it with the W3C libwww system. Many other objects appear in a complete libwww application, but most of them are not key to understanding the main idea.

The method fetchDone is quite trivial in our case. It checks if the number of outstanding fetches is enough to keep the system busy; if not, it adds some more work. Then it just frees up resources associated with the request that has just completed and returns. Each page fetch is associated with an HTRequest object, similar to our Fetch object. At the very least, a termination handler should free this request object. If there is no more work to be found, it stops the event loop.
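
Continuing the sketch above, the handler might look roughly like this; haveMoreUrls and nextUrl are hypothetical stand-ins for whatever frontier the application maintains, and the exact libwww calls should again be checked against the installed headers.


/* Continues the setup sketch above: outstanding and MAX_OUTSTANDING are
 * defined there; haveMoreUrls() and nextUrl() are hypothetical helpers
 * that report and return (as a C string) the next URL to fetch. */
int fetchDone(HTRequest * request, HTResponse * response,
              void * param, int status)
{
    outstanding--;

    /* Top up the system with more work if it is running dry. */
    while (outstanding < MAX_OUTSTANDING && haveMoreUrls()) {
        HTRequest * next = HTRequest_new();
        HTRequest_setOutputFormat(next, WWW_SOURCE);
        if (HTLoadAbsolute(nextUrl(), next))
            outstanding++;
        else
            HTRequest_delete(next);
    }

    /* Free resources tied to the request that just completed. */
    HTRequest_delete(request);

    /* Nothing in flight and nothing left to issue: stop the event loop. */
    if (outstanding == 0)
        HTEventList_stopLoop();

    return HT_OK;
}
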