Focused Crawling


Focused crawling aims to index the web according to a specific theme, and thus supports domain-specific search engines and thematic web portals. Reinforcement learning is well suited to training focused crawlers, since a crawler receives only partial feedback, and only at the end of a successful crawl.
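To illustrate why feedback that arrives only at the end of a crawl fits a reinforcement learning setting, here is a minimal sketch of how a final reward can be credited back to the earlier link choices on a crawl path. The function name, the reward values, and the discount factor gamma are illustrative assumptions, not taken from the publications or source code listed on this page.

```python
def discounted_returns(rewards, gamma=0.5):
    """For each step of a crawl path, return the discounted sum of the
    rewards received from that step onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the path backwards so that late rewards flow to earlier steps.
    for i in range(len(rewards) - 1, -1, -1):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


# Hypothetical path of three followed links; only the last one
# reaches a relevant page and earns a reward.
path_rewards = [0.0, 0.0, 1.0]
print(discounted_returns(path_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Links that lead toward a relevant page receive some credit even though they earned no immediate reward, which is the kind of signal a crawler could use to rank unvisited links.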


You can find a collection of publications in the field of Focused Crawling here (last update: May 6, 2008 - 20 papers).

Source code

Scripts for downloading web pages from dmoz:


Here you can find datasets created from Web pages.

Download the datasets in Weka format (.arff) here. The datasets are described in [1].
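For readers who want to inspect an .arff file outside Weka, the following is a rough sketch of a reader for simple, dense ARFF files. It assumes unquoted attribute names and values with no sparse rows, and it keeps every value as a string. The sample relation and attribute names are made up for the example and do not describe the datasets offered here.

```python
import io


def read_simple_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Assumption: no quoted values and no sparse rows; values stay strings.
    """
    names, rows, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True  # every following line is a data row
    return names, rows


# Hypothetical in-memory sample; a real file would be passed via open(path).
sample = """@relation toy_pages
@attribute term_count numeric
@attribute topic {sports,science}
@data
12,science
3,sports
"""
names, rows = read_simple_arff(io.StringIO(sample))
print(names)
print(rows)
```

Files that use quoted strings, sparse rows, or date attributes need a full ARFF parser rather than this sketch.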

Single files:


