A web page usually contains links, which are pointers to other web pages.
A crawler starts with one page, finds all the links on that page, and follows them to find other pages. On each of those pages it again follows the links it finds, and so on.
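Finding the links on a page can be sketched with Python's standard library alone. This is only an illustration, not a full crawler: the names LinkExtractor and get_all_links are made up for this example, and the page is passed in as a string rather than downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def get_all_links(html, base_url):
    """Return every link found in the given HTML text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Calling get_all_links(html, url) on each fetched page yields the pointers the crawler follows next.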
The following libraries can be used for crawling, but are not crawlers themselves:
These crawlers are written in Python:
A crawling process needs to start with some page as a seed to be crawled. Initially the list of pages to crawl contains only the seed, and
crawled will be empty. The algorithm keeps going as long as there are more pages to crawl. At each step one of the pages is picked. Importantly, there must be a test to check whether that page was already crawled. If it was, nothing else is done with it. Otherwise the links found on it are added to the pages to crawl, and the page is added to
crawled to keep track of the fact that it was crawled. When the algorithm is done it returns the list of crawled pages.
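The loop described above might look like the following minimal sketch. The names tocrawl, get_links and TOY_WEB are assumptions made for this example; get_links stands in for whatever function downloads a page and returns its links, and the toy dictionary replaces the real web so the code runs offline.

```python
def crawl_web(seed, get_links):
    """Crawl from seed and return the list of crawled pages."""
    tocrawl = [seed]   # pages still waiting to be crawled
    crawled = []       # pages that were already crawled
    while tocrawl:     # keep going as long as there are pages to crawl
        page = tocrawl.pop()
        if page in crawled:
            continue   # already crawled: do nothing else with it
        for link in get_links(page):
            if link not in tocrawl:
                tocrawl.append(link)
        crawled.append(page)
    return crawled


# Hypothetical stand-in for the web: each page maps to the links it contains.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}

pages = crawl_web("a", lambda page: TOY_WEB.get(page, []))
```

Even though "a" and "b" link to each other, the test against crawled makes the loop terminate, and every reachable page ends up in the result exactly once.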
There are three main issues with scaling up a web crawler.
Politeness, which is required on the web.
Bandwidth: crawling consumes a lot of bandwidth, so the crawler should keep this expensive resource busy (i.e. max out the bandwidth) while still being polite.
There is a line in robots.txt with which each domain tells a crawler how often it may crawl that domain. However, multiple domains can be hosted on the same machine (shared hosting). It is also hard to make a very big search engine with a single crawler, because it only gets as much as one machine can crawl. In practice crawling is done on thousands of machines. Each one then needs to tell the others what it has crawled and what it is about to crawl. All of this communication overhead can actually slow down the crawler.
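As an illustration of the politeness rule, Python's standard urllib.robotparser can read such a line. The sketch assumes the line in question is the Crawl-delay directive (a widely used but non-standard extension); the robots.txt content, the user agent "mybot" and the helper read_policy are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the root of the domain it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def read_policy(robots_txt, user_agent, url):
    """Return (allowed, delay) for user_agent fetching url.

    delay is the Crawl-delay value in seconds, or None if the file
    does not specify one for this user agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


allowed, delay = read_policy(ROBOTS_TXT, "mybot", "http://example.com/index.html")
```

A polite crawler would wait at least delay seconds between two requests to the same domain, and skip any URL for which allowed is false.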
If only one machine crawls, it is easy to keep the state of the crawl on that one machine. With many machines, the machines involved need to communicate with each other in order to be polite.
To max out the bandwidth, instead of trying to lessen the communication, the crawler can pre-process the corpus it already has. Once it has crawled a set of pages, it can process them and extract all the links they point to. These links can then be normalized so that yahoo.com and www.yahoo.com wind up being the same string. Those strings can be farmed out to thousands of machines, which do not have to communicate because they know they are not going to hit the same domain. They might still hit the same hosting server, i.e. the same physical machine, but that can be taken care of ahead of time as well.
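One way to picture this farming-out step is the sketch below. It is an assumption-laden illustration, not the method of any particular search engine: the helper names are invented, normalization only lowercases the host and strips a leading "www.", and the machine is chosen by hashing the normalized host. Detecting domains that share a hosting server would need an extra step, such as grouping hosts by resolved IP address, which is not shown.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_host(url):
    """Reduce a URL to a canonical host string, e.g. 'yahoo.com'."""
    if "//" not in url:
        url = "//" + url  # let urlsplit treat a bare 'yahoo.com' as a host
    host = urlsplit(url).hostname or ""  # hostname is already lowercased
    if host.startswith("www."):
        host = host[4:]
    return host


def assign_machine(url, num_machines):
    """Pick the machine responsible for the domain of url."""
    digest = hashlib.sha256(normalize_host(url).encode("utf-8")).digest()
    # A stable hash means every machine computes the same assignment
    # without talking to the others.
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(urls, num_machines):
    """Group urls into one work list per machine."""
    buckets = {i: [] for i in range(num_machines)}
    for url in urls:
        buckets[assign_machine(url, num_machines)].append(url)
    return buckets
```

Because all URLs of one domain land in the same bucket, each machine can enforce the crawl delay for its domains locally, without asking the other machines what they are about to crawl.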