WebCrawler
==========
A web crawler is a program that collects Content from the web. A web page usually contains Links, which are basically pointers to other web pages. A crawler starts with one page, finds all the links on that page, and follows them to find other pages. On those other pages it then follows the links it finds there, and so on.

Utility Libraries
-----------------

The following libraries can be used for crawling, but are not crawlers themselves:

* Mechanize provides great high-level browsing capabilities (super-simple form filling and submission) [http://wwwsearch.sourceforge.net/mechanize/]
* Twill is a simple scripting language built on top of Mechanize [http://twill.idyll.org/]
* BeautifulSoup + urllib2 [http://www.crummy.com/software/BeautifulSoup/; http://docs.python.org/library/urllib2.html]
* Scrapy is a Twisted-based web crawler framework [http://scrapy.org/]

Crawler Examples
----------------

These crawlers are written in Python:

* crawler [http://pypi.python.org/pypi/crawler/0.1.0]
* pycurl [pycurl.sourceforge.net]
* Hound [http://sourceforge.net/projects/houndwc/]
* HarvestMan is a multi-threaded web crawler [http://bulba.sdsu.edu/docwiki/HarvestMan]
* spider.py [http://pypi.python.org/pypi/spider.py/0.5]
* Python Spider, Web Crawler, Bot Examples [http://www.example-code.com/python/pythonspider.asp]
* Ruya [http://ruya.sourceforge.net/]

Resources
---------

* Python based web crawlers [http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use]
* A lot of Django! [http://djangofeeds.soup.io/tag/crawler]

Crawling Process
----------------

A crawling process needs to start with some page as the seed to be crawled, while ``crawled`` starts out empty. The algorithm keeps going as long as there are more pages to crawl. At each step one of the pages is picked and added to ``crawled`` to keep track of the fact that it was crawled. Importantly, there must be a test that checks whether a page was already crawled.
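As a rough sketch, this loop might look as follows (``crawl_web`` and the page-fetching callback are illustrative names, not from any particular library):

```python
def crawl_web(seed, get_all_links):
    # get_all_links is assumed to be a function that fetches a page and
    # returns the URLs it links to (e.g. built on urllib2 + BeautifulSoup).
    tocrawl = [seed]    # pages still to be visited
    crawled = []        # pages already visited
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:              # the "already crawled" test
            tocrawl.extend(get_all_links(page))
            crawled.append(page)
    return crawled
```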
If the page was already crawled, nothing further is done with it; otherwise it is added to ``crawled``. When the loop is done it returns the list of crawled pages.

* start with ``tocrawl`` = [seed]
* ``crawled`` = []
* while there are more pages in ``tocrawl``:
    * pick a page from ``tocrawl``
    * check if it was already crawled
    * add that page to ``crawled``
    * add all the link targets on this page to ``tocrawl``
* return ``crawled``

Scaling Up
----------

There are three main issues with scaling up a web crawler:

1. the normal ``politeness`` that is required on the web
2. getting a bunch of machines involved in crawling (``parallelization``)
3. how to ``consume a lot of bandwidth`` so that the expensive resource is kept busy while still being polite (i.e. max out the bandwidth)

### Politeness

Each domain has a line in its robots.txt that tells a crawler how often it may crawl that domain. However, multiple domains can be hosted on the same machine (shared hosting). This makes it hard to build a very big search engine, because it only gets as much as one machine can crawl. In practice crawling is done on thousands of machines, where each one needs to tell the others what it has crawled and what it is about to crawl. All of this communication overhead can actually slow down the crawler.

### Parallelization

If only one machine crawls, it is easy to keep the state of the crawl on that one machine. Once several machines are involved, they need to communicate with each other in order to be polite.

### Max Out Bandwidth

One way to max out the bandwidth, instead of trying to lessen the communication, is to pre-process the corpus that is already available. So if a set of pages has been crawled, they can be processed to extract all the links that they point to.
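This link-extraction step can be sketched with the standard library's ``HTMLParser`` (a simplification; the BeautifulSoup + urllib2 combination listed above copes far better with malformed real-world HTML):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(html):
    """Return the list of link targets found in an HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```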
These links can then be normalized, so that yahoo.com and www.yahoo.com wind up being the same string, and those strings can be farmed out to thousands of machines, which then do not have to communicate because they know they are not going to hit the same domain. They might still hit the same hosting server, and thus the same machine, but that can be taken care of ahead of time as well.
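A minimal sketch of this normalization and farming-out, assuming that stripping a leading ``www.`` is a sufficient canonicalization (real crawlers use more elaborate rules):

```python
from urllib.parse import urlparse

def normalize_host(url):
    """Reduce a URL to a canonical host string, so that
    http://yahoo.com and http://www.yahoo.com compare equal."""
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host

def assign_machine(url, num_machines):
    """Deterministically map a normalized host to one machine, so that
    no two machines ever crawl the same domain and need not communicate."""
    return hash(normalize_host(url)) % num_machines
```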