Denigma: Web Crawler

Web Crawler

A web crawler is a program that collects content from the web.

A web page usually contains links, which are basically pointers to other web pages.

A crawler starts with one page, finds all the links on that page, and follows them to find other pages. On each of those pages it again follows the links it finds, and so on.
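Finding the links on a page can be sketched with Python's standard library alone. This is a minimal illustration, not a robust parser: the names LinkCollector and extract_links are made up for this example, and only href attributes of a tags are treated as links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href targets of all <a> tags in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Calling extract_links(page_source, page_url) on a downloaded page returns the URLs a crawler would follow next.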

Utility Libraries

The following libraries can be used for crawling, but are not crawlers themselves:

Mechanize provides great high-level browsing capabilities (super-simple form filling and submission) [http://wwwsearch.sourceforge.net/mechanize/]
Twill is a simple scripting language built on top of Mechanize [http://twill.idyll.org/]
BeautifulSoup + urllib2 [http://www.crummy.com/software/BeautifulSoup/; http://docs.python.org/library/urllib2.html]
Scrapy is a Twisted-based web crawler framework [http://scrapy.org/]

Crawler Examples

These crawlers are written in Python:

crawler [http://pypi.python.org/pypi/crawler/0.1.0]
pycurl [pycurl.sourceforge.net]
Hound [http://sourceforge.net/projects/houndwc/]
HarvestMan is a multi-threaded web-crawler [http://bulba.sdsu.edu/docwiki/HarvestMan]
spider.py [http://pypi.python.org/pypi/spider.py/0.5]
Python Spider, Web Crawler, Bot Examples [http://www.example-code.com/python/pythonspider.asp]
Ruya [http://ruya.sourceforge.net/]

Resources

Python based web crawlers [http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use]
A lot of Django! [http://djangofeeds.soup.io/tag/crawler]

Crawling Process

A crawling process needs to start with some page as a seed to be crawled. Initially, crawled is empty. The algorithm keeps going as long as there are more pages to crawl. At each step, one of the pages is picked from tocrawl. Importantly, there must be a test to check whether that page was already crawled. If it was, nothing else is done with it. Otherwise it is added to crawled, to keep track of the fact that it was crawled, and all of its link targets are added to tocrawl. When the algorithm is done, it returns the list of crawled pages.

start with tocrawl = [seed]
crawled = []
while there are more pages in tocrawl:
    pick a page from tocrawl
    if it was not already crawled:
        add that page to crawled
        add all the link targets on this page to tocrawl
return crawled
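The same procedure as a small Python sketch. The get_links parameter is an assumption of this example: any function that takes a page and returns its link targets will do, which keeps the sketch independent of how pages are actually downloaded. The extra seen set is only there to make the "already crawled" test fast.

```python
def crawl(seed, get_links):
    """Return the list of pages reachable from seed, each crawled once."""
    tocrawl = [seed]
    crawled = []
    seen = set()  # same contents as crawled, but with fast membership tests
    while tocrawl:
        page = tocrawl.pop()
        if page in seen:
            continue  # already crawled, nothing else to do
        seen.add(page)
        crawled.append(page)
        tocrawl.extend(get_links(page))
    return crawled


if __name__ == "__main__":
    # A tiny made-up "web": each page maps to the pages it links to.
    web = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
    print(crawl("a", lambda page: web.get(page, [])))
```

Because pages are taken from the end of tocrawl, this variant explores depth-first; taking them from the front instead would give a breadth-first crawl.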

Scaling Up

There are three main issues with scaling up a web crawler:

the normal politeness that is required on the web
getting a bunch of machines involved in crawling (parallelization)
how to consume a lot of bandwidth, so that this expensive resource is kept busy while still being polite (i.e. max out the bandwidth)

Politeness

A domain's robots.txt file can contain a line that tells a crawler how often it may crawl that domain. However, multiple domains can be hosted on the same machine (shared hosting), so a per-domain limit does not by itself protect the server behind those domains. A single polite machine can only crawl so much, which makes it hard to build a very big search engine that way. In practice, crawling is done on thousands of machines. Each one then needs to tell the others what it has crawled and what it is about to crawl. All of this communication overhead can actually slow down the crawler.
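As an illustration, Python's standard urllib.robotparser module can read such rules. The sketch below assumes that the "how often" line is the non-standard but widely used Crawl-delay directive. The robots.txt content, the user agent name and the helper politeness_rules are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, already split into lines.
EXAMPLE_ROBOTS = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
]


def politeness_rules(robots_lines, user_agent, url):
    """Return (allowed, delay) for one URL according to robots.txt lines.

    delay is the Crawl-delay in seconds, or None if none is given.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(politeness_rules(EXAMPLE_ROBOTS, "mybot", "http://example.com/private/x"))
    print(politeness_rules(EXAMPLE_ROBOTS, "mybot", "http://example.com/index.html"))
```

A polite crawler would skip URLs that are not allowed and wait at least the given delay between two requests to the same domain.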

Parallelization

If only one machine crawls, it is easy to keep the state of the crawl on that one machine. As soon as several machines are involved, they need to communicate with each other in order to be polite.

Max Out Bandwidth

To max out the bandwidth, instead of trying to lessen the communication, the crawler can pre-process the corpus it already has. Once it has crawled a set of pages, it can process them and extract all the links that they point to. These links can then be normalized so that yahoo.com and www.yahoo.com wind up being the same string. Those strings can then be farmed out to thousands of machines, which do not have to communicate because they know they are not going to hit the same domain. They might still hit the same hosting server, and therefore the same machine, but that can be taken care of ahead of time as well.
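A rough sketch of this pre-processing step, under some simplifying assumptions: URLs are absolute (they include a scheme), normalization only lower-cases the host and strips a leading "www.", and machines are numbered from 0 to num_machines - 1. The function names are made up for this example. A stable hash is used so that every run assigns a given domain to the same machine.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_host(url):
    """Reduce a URL to a canonical host string, e.g. 'yahoo.com'."""
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def assign_machine(url, num_machines):
    """Map a URL to a machine index based only on its normalized host."""
    digest = hashlib.sha256(normalize_host(url).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines


def partition(urls, num_machines):
    """Group URLs into per-machine work lists."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[assign_machine(url, num_machines)].append(url)
    return dict(buckets)
```

Since the machine index depends only on the normalized host, all URLs of one domain land on the same machine, which can then enforce the politeness delay locally. Handling different domains that share one hosting server would need an extra step, for example grouping by resolved IP address before partitioning.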

Tags: internet, searching, crawling, bot, spider, automation, robot

1 Comment to Web Crawler

Evgenii on Oct. 24, 2013, 10:11 a.m.

Yet another framework https://bitbucket.org/estin/pomp/
