Putting Together A Crawler: Tutorial and Course



Putting Together A Crawler: Overview


The World Wide Web Consortium (W3C) has published a reference implementation of the HTTP client protocol in a package called W3C libwww. It is written in C and runs on most operating systems. The flexibility and consequent complexity of the API may be daunting, but the package greatly facilitates the writing of reasonably high-performance crawlers. Commercial crawlers probably resemble crawlers written using this package up to the point where storage management begins.

Because the details of commercial crawlers are carefully guarded, I will focus on the design and use of the W3C libwww library instead. This tutorial has two parts. In the first part, I will discuss the internals of a crawler built along the same lines as w3c-libwww. Since w3c-libwww is large, general, powerful, and complex, I will abstract its basic structure through pseudocode that uses C++ idioms for concreteness. In the second part, I will give code fragments that show how to use W3C libwww.



Putting Together A Crawler: Design Of The Core Components


It is easiest to start building a crawler with a core whose only responsibility is to copy bytes from network sockets to storage media: this is the Crawler class. The Crawler’s contract with the user is expressed in these methods:


class Crawler {
 public:
  void setDnsTimeout(int milliSeconds);
  void setHttpTimeout(int milliSeconds);
  void fetchPush(const string& url);
  virtual bool fetchPull(string& url);        // set url, return success
  virtual void fetchDone(const string& url,
      const ReturnCode returnCode,            // timeout, server not found, ...
      const int httpStatus,                   // 200, 404, 302, ...
      const hash_map<string, string>& mimeHeader,  // Content-type = text/html
                                                   // Last-modified = ...
      const unsigned char* pageBody,
      const size_t pageSize);
};


The user can push a URL to be fetched to the Crawler. The crawler implementation guarantees that within a finite time (preset by the user via setDnsTimeout and setHttpTimeout) the termination callback handler fetchDone will be called with the same URL and the associated fields shown above. (I am hiding many more useful arguments for simplicity.) fetchPush inserts the URL into an in-memory buffer; this may waste too much memory for a Web-scale crawler, and the buffer is volatile, so pending URLs are lost if the crawler crashes. A better option is to check new URLs into a persistent database and override fetchPull to extract new work from this database. The user also overrides the (empty) fetchDone method to process the document, typically storing the page data and metadata from the method arguments, scanning pageBody for outlinks, and recording these for later fetchPulls, as sketched below. Other functions, such as retries, redirections, and scanning for outlinks, are implemented by extending the Crawler class. In a way, “Crawler” is a misnomer for the core class; it just fetches a given list of URLs concurrently.
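
Here is a minimal sketch of such an extension. It is not part of the library; Frontier, storePage, and extractLinks are hypothetical helpers standing in for the user's persistent URL database, page store, and link extractor.


// Frontier, storePage, and extractLinks are hypothetical helpers (see text).
class MyCrawler : public Crawler {
 public:
  explicit MyCrawler(Frontier& frontier) : frontier(frontier) {}

  // Draw the next unseen URL from the persistent frontier.
  bool fetchPull(string& url) {
    return frontier.nextUnseenUrl(url);   // sets url, returns success
  }

  // Store the page, then record its outlinks for later fetchPulls.
  void fetchDone(const string& url, const ReturnCode returnCode,
                 const int httpStatus,
                 const hash_map<string, string>& mimeHeader,
                 const unsigned char* pageBody, const size_t pageSize) {
    if (returnCode != SUCCESS || httpStatus != 200) return;  // retries omitted
    storePage(url, mimeHeader, pageBody, pageSize);          // page data + metadata
    vector<string> outlinks = extractLinks(pageBody, pageSize);
    for (size_t i = 0; i < outlinks.size(); ++i)
      frontier.checkIn(outlinks[i]);      // recorded for later fetchPulls
  }

 private:
  Frontier& frontier;
};
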

Let us now turn to the implementation of the Crawler class. We will need two helper classes called DNS and Fetch. Crawler is started with a fixed set of DNS servers. For each server, a DNS object is created. Each DNS object creates a UDP socket with its assigned DNS server as the destination. The most important data structure included in a DNS object is a list of Fetch contexts waiting for the corresponding DNS server to respond:


class DNS {
  list<Fetch*> waitForDns;  // Fetch contexts waiting for this DNS server to respond
  // ... other members (e.g., the UDP socket and the assigned server's address)
};


A Fetch object contains the context required to complete the fetch of one URL using asynchronous sockets. waitForDns is the list of Fetches waiting for this particular DNS server to respond to their address-resolution requests. Apart from members to hold request and response data and methods to deal with socket events, the main member in a Fetch object is a state variable that records the current stage in retrieving a page:


typedef enum {
STATE_ERROR = -1, STATE_INIT = 0,
STATE_DNS_RECEIVE, STATE_HTTP_SEND, STATE_HTTP_RECEIVE, STATE_FINAL
} State;
State state;
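
Putting these pieces together, a Fetch record might look roughly as follows. Only state is named explicitly above; the remaining members are illustrative guesses based on the description.


class Fetch {
  string url;               // the URL being fetched
  IPAddress serverAddress;  // filled in once DNS resolution completes (illustrative)
  State state;              // current stage in retrieving the page
  long deadline;            // when the current DNS or HTTP step times out
  string httpRequest;       // outgoing request bytes
  string httpResponse;      // incoming response bytes (headers and body)
  // ... methods that react to socket readiness events for this fetch
};
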


For completeness I also list a set of useful ReturnCodes. Most of these are self-explanatory; others have to do with the innards of the DNS and HTTP protocols.


typedef enum {
SUCCESS = 0,
//----------------------------------------------------------------------
DNS_SERVER_UNKNOWN,
DNS_SOCKET, DNS_CONNECT, DNS_SEND, DNS_RECEIVE, DNS_CLOSE, DNS_TIMEOUT,
// and a variety of error codes DNS_PARSE_... if the DNS response
// cannot be parsed properly for some reason
//----------------------------------------------------------------------
HTTP_BAD_URL_SYNTAX, HTTP_SERVER_UNKNOWN,
HTTP_SOCKET, HTTP_CONNECT, HTTP_SEND, HTTP_RECEIVE,
HTTP_TIMEOUT, HTTP_PAGE_TRUNCATED,
//----------------------------------------------------------------------
MIME_MISSING, MIME_PAGE_EMPTY, MIME_NO_STATUS_LINE,
MIME_UNSUPPORTED_HTTP_VERSION, MIME_BAD_CHUNK_ENCODING
} ReturnCode;


The remaining important data structures within the Crawler are given below.


class Crawler {
  deque<string> waitForIssue;
  // Requests (URLs) wait here to limit the number of network connections.
  // When resources are available, they go to...
  hash_map<SocketID, DNS*> dnsSockets;
  // There is one entry for each DNS socket, i.e., for each DNS server.
  // New Fetch record entries join the shortest waitForDns list.
  // Once addresses are resolved, Fetch records go to...
  deque<Fetch*> waitForHttp;
  // When the system can take on a new Http connection, Fetch records
  // move from waitForHttp to...
  hash_map<SocketID, Fetch*> httpSockets;
  // A Fetch record completes its lifetime while attached to an Http socket.
  // To avoid overloading a server, we keep a set of IP addresses that
  // we are nominally connected to at any given time:
  hash_set<IPAddress> busyServers;  // SocketID and IPAddress stand for the
                                    // platform's socket handle and address types
  // ... rest of Crawler definition
};


The heart of the Crawler is a method called Crawler::start(), which starts its event loop. This is the most complex part of the Crawler and is given as pseudocode below. Each Fetch object passes through a workflow. It first joins the waitForDns queue on some DNS object. When the server name resolution step is completed, it goes into the waitForHttp buffer. When we can afford another HTTP connection, it leaves waitForHttp and joins the httpSockets pool, where there are two major steps: sending the HTTP request and filling a byte buffer with the response. Finally, when the page content has been completely received, the user callback fetchDone is called with suitable status information. To make this a real crawler, the user has to extend the Crawler class and redefine fetchDone to parse the page and extract outlinks.


Crawler::start()
while event loop has not been stopped do
if not enough active Fetches to keep system busy then
try a fetchPull to replenish the system with more work
if no pending Fetches in the system then
stop the event loop
end if
end if
if not enough Http sockets busy and
there is a Fetch f in waitForHttp whose server IP address ∉ busyServers then
remove f from waitForHttp
lock the IP address by adding an entry to busyServers (to be polite to the server)
change f.state to STATE_HTTP_SEND
allocate an Http socket s to start the Http protocol
set the Http timeout for f
set httpSockets[s] to the Fetch pointer
continue the outermost event loop
end if
if the shortest waitForDns is “too short” then
remove a URL from the head of waitForIssue
create a Fetch object f with this URL
issue the DNS request for f
set f.state to STATE_DNS_RECEIVE
set the DNS timeout for f
put f on the laziest DNS’s waitForDns
continue the outermost event loop
end if
collect open SocketIDs from dnsSockets and httpSockets
also collect the earliest deadline over all active Fetches
perform the select call on the open sockets, with the earliest deadline as the timeout
if select returned with a timeout then
for each expired Fetch record f in dnsSockets and httpSockets do
remove f
invoke f.fetchDone(...) with suitable timeout error codes
remove any locks held in busyServers
end for
else
find a SocketID s that is ready for read or write
locate a Fetch record f in dnsSockets or httpSockets that was waiting on s
if a DNS request has been completed for f then
move f from waitForDns to waitForHttp
else
if socket is ready for writing then
send request
change f.state to STATE_HTTP_RECEIVE
else
receive more bytes
if receive completed then
invoke f.fetchDone(...) with successful status codes
remove any locks held in busyServers
remove f from httpSockets and destroy f
end if
end if
end if
end if
end while
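
As a concrete illustration of the select step, the fragment below shows how the timeout can be derived from the earliest deadline among the active Fetches. This is a sketch of ours rather than library code; it assumes the fd_sets, maxFd, and the millisecond clock values have already been prepared from dnsSockets and httpSockets.


#include <sys/select.h>
#include <sys/time.h>

// Wait until some socket is ready or the earliest Fetch deadline expires.
// Returns 0 on timeout, a positive count of ready sockets, or -1 on error.
int waitForNextEvent(int maxFd, fd_set* readSet, fd_set* writeSet,
                     long earliestDeadlineMs, long nowMs) {
  long waitMs = earliestDeadlineMs - nowMs;
  if (waitMs < 0) waitMs = 0;              // some deadline has already expired
  struct timeval tv;
  tv.tv_sec = waitMs / 1000;
  tv.tv_usec = (waitMs % 1000) * 1000;
  return select(maxFd + 1, readSet, writeSet, NULL, &tv);
}
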




Putting Together A Crawler: Case Study


So far we have seen a simplified account of how the internals of a package like W3C libwww are designed; now we will see how to use it. The W3C libwww API is extremely flexible and therefore somewhat complex, because it is designed not only for writing crawlers but for general, powerful manipulation of distributed hypertext, including text-mode browsing, composing dynamic content, and so on. Here we will sketch a simple application that issues a batch of URLs to fetch and installs a fetchDone callback routine that just throws away the page contents.
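
The skeleton below gives a rough idea of such an application. It loosely follows the style of the sample programs shipped with w3c-libwww, but it is only a sketch: the exact function names and signatures should be checked against the installed libwww headers, and fetchDone, outstanding, and MAX_OUTSTANDING are names of our own.


#include "WWWLib.h"
#include "WWWInit.h"

#define MAX_OUTSTANDING 64   /* our own cap on concurrent fetches */

int outstanding = 0;         /* number of requests currently in flight */

/* Termination ("after") filter, defined in the next fragment. */
int fetchDone(HTRequest * request, HTResponse * response,
              void * param, int status);

int main(int argc, char ** argv)
{
    int i;

    /* Initialize libwww as a non-caching client. */
    HTProfile_newNoCacheClient("MyCrawler", "1.0");

    /* Install fetchDone as a global termination handler. */
    HTNet_addAfter(fetchDone, NULL, NULL, HT_ALL, HT_FILTER_LAST);

    /* Issue a batch of URLs; each fetch gets its own HTRequest context. */
    for (i = 1; i < argc && outstanding < MAX_OUTSTANDING; i++) {
        HTRequest * request = HTRequest_new();
        HTRequest_setOutputFormat(request, WWW_SOURCE);  /* raw page bytes */
        if (HTLoadAbsolute(argv[i], request))
            outstanding++;
        else
            HTRequest_delete(request);
    }

    /* Enter the event loop; fetchDone is invoked as requests complete. */
    HTEventList_newLoop();

    HTProfile_delete();      /* clean up libwww once the loop has stopped */
    return 0;
}
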

Unlike the simplified design presented in the previous part, W3C libwww can process responses as they stream in and does not need to hold them in a memory buffer. The user can install various processors through which the incoming stream must pass. For example, we can define a handler called hrefHandler to extract HREFs, which would be useful in a real crawler, and register it with the W3C libwww system. Many other objects appear in a complete libwww application, but most of them are not key to understanding the main idea.

The method fetchDone is quite trivial in our case. It checks if the number of outstanding fetches is enough to keep the system busy; if not, it adds some more work. Then it just frees up resources associated with the request that has just completed and returns. Each page fetch is associated with an HTRequest object, similar to our Fetch object. At the very least, a termination handler should free this request object. If there is no more work to be found, it stops the event loop.
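
Continuing the sketch above, the handler might look roughly like this; haveMoreUrls and nextUrl are hypothetical stand-ins for whatever frontier the application maintains, and the exact libwww calls should again be checked against the installed headers.


/* Continues the setup sketch above: outstanding and MAX_OUTSTANDING are
 * defined there; haveMoreUrls() and nextUrl() are hypothetical helpers
 * that report and return (as a C string) the next URL to fetch. */
int fetchDone(HTRequest * request, HTResponse * response,
              void * param, int status)
{
    outstanding--;

    /* Top up the system with more work if it is running dry. */
    while (outstanding < MAX_OUTSTANDING && haveMoreUrls()) {
        HTRequest * next = HTRequest_new();
        HTRequest_setOutputFormat(next, WWW_SOURCE);
        if (HTLoadAbsolute(nextUrl(), next))
            outstanding++;
        else
            HTRequest_delete(next);
    }

    /* Free resources tied to the request that just completed. */
    HTRequest_delete(request);

    /* Nothing in flight and nothing left to issue: stop the event loop. */
    if (outstanding == 0)
        HTEventList_stopLoop();

    return HT_OK;
}
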