Web Crawler

Created on Jan. 29, 2013, 8:27 p.m. by Hevok & updated by Hevok on June 13, 2013, 9:06 p.m.

A web crawler is a program that collects content from the web.

A web page usually contains links, which are basically pointers to other web pages.

A crawler starts with one page, finds all the links on that page, and follows them to find other pages. On each of those pages it then follows the links again, and so on.
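As a rough illustration, link extraction can be done with nothing more than the Python standard library. The sketch below is not taken from any of the crawlers listed here; the names (LinkCollector, get_all_links) are made up for this example.

    # A minimal link-extraction sketch using only the standard library.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def get_all_links(url):
        """Download a page and return the link targets found on it."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(html)
        return parser.links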

Utility Libraries

The following libraries can be used for crawling, but are not crawlers themselves:

Crawler Examples

These crawlers are written in Python:

Resources

Crawling Process

A crawling process needs to start with some page as the seed to be crawled, while the list of crawled pages starts out empty. The algorithm keeps going as long as there are more pages to crawl. At each step it picks one of the pages still to crawl and, importantly, tests whether that page was already crawled. If it was, nothing else is done with it; if it was not, the page is added to crawled to keep track of the fact that it was crawled, and all the link targets found on it are added to the pages still to crawl. When it is done, the algorithm returns the list of crawled pages (a Python sketch of this loop follows the steps below).

  • start with tocrawl = [seed]
  • crawled = []
  • while there are more pages tocrawl:
  • pick a page from tocrawl
  • if that page was not already crawled:
  • add all the link targets on this page to tocrawl
  • add that page to crawled
  • return crawled
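A minimal sketch of this loop in Python might look as follows; it assumes a get_all_links(url) helper (such as the one sketched earlier) that returns the link targets on a page.

    def crawl_web(seed):
        tocrawl = [seed]   # pages still to be crawled
        crawled = []       # pages already crawled
        while tocrawl:
            page = tocrawl.pop()
            if page not in crawled:              # only crawl each page once
                tocrawl.extend(get_all_links(page))
                crawled.append(page)
        return crawled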

Scaling Up

There are three main issues with scaling up a web crawler.

  1. normal politeness that is required on the web
  2. getting a bunch of machines involved in crawling (parallelization)
  3. how to consume a lot of bandwidth so that the expensive resource stays busy while the crawler still remains polite (i.e. maxing out the bandwidth).

Politeness

Each domain can state in its robots.txt file how often a crawler may crawl that domain (for example with a Crawl-delay line). However, multiple domains can be hosted on the same machine (shared hosting), so politeness also has to be considered per physical host. It is also hard to build a very big search engine with a single crawling machine, because it only gets as much as one machine can crawl. In practice crawling is done on thousands of machines, and each one needs to tell the other ones what it has crawled and what it is about to crawl. All of this communication overhead can actually slow down the crawler.
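For the per-domain rules themselves, Python's standard library ships a robots.txt parser. The sketch below is only an illustration: the user agent name "MyCrawler" and the example.com URLs are placeholders, and crawl_delay() is available from Python 3.6 onward.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
        # Crawl-delay directive, if the site specifies one (Python 3.6+)
        delay = rp.crawl_delay("MyCrawler")
        print("allowed, requested delay between requests:", delay)
    else:
        print("robots.txt disallows crawling this page")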

Parallelization

If only one machine crawls, it is easy to keep the state of the crawl on that one machine. As soon as several machines are involved, they need to communicate with each other in order to be polite.

Max Out Bandwidth

One way to max out the bandwidth, instead of trying to lessen the communication, is to pre-process the corpus that has already been crawled. After crawling a set of pages, the crawler can process them and extract all the links they point to. These links can then be normalized, so that for example yahoo.com and www.yahoo.com wind up being the same string, and the resulting strings can be farmed out to thousands of machines, which do not have to communicate because they know they will not hit the same domain. They might still hit the same hosting server and thus the same physical machine, but that can be taken care of ahead of time as well.
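A small sketch of that idea: normalize each link so that host variants such as yahoo.com and www.yahoo.com map to the same string, then assign every host to one of N crawling machines with a stable hash. The normalization rules shown (lower-casing, dropping a leading "www.") are illustrative assumptions, not a complete scheme.

    import zlib
    from urllib.parse import urlsplit

    def normalize_host(url):
        """Reduce a URL to a canonical host name."""
        host = urlsplit(url).netloc.lower()
        if host.startswith("www."):
            host = host[len("www."):]
        return host

    def assign_machine(url, num_machines):
        """Map a URL to a machine; all URLs of one host land on the same
        machine, so the machines need not coordinate per-domain politeness."""
        return zlib.crc32(normalize_host(url).encode("utf-8")) % num_machines

    # www.yahoo.com and yahoo.com end up on the same machine:
    assert assign_machine("http://www.yahoo.com/a", 1000) == \
           assign_machine("http://yahoo.com/b", 1000)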

[Image: web-crawler.jpg]

Tags: Internet, searching, crawling, bot, spider, automation, robot
Categories: Tutorial, Quest
Parent: Search


1 Comment to Web Crawler

Evgenii  on Oct. 24, 2013, 10:11 a.m.

Yet another framework https://bitbucket.org/estin/pomp/
