Machine Learning & Knowledge Discovery Group

Focused Crawling


Focused crawling aims to index the web according to a specific theme and thus to support domain-specific search engines and thematic web portals. Reinforcement learning is well suited to training focused crawlers, because a crawler receives only partial feedback, and only at the end of a successful crawl.
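To illustrate why delayed feedback fits reinforcement learning, here is a minimal sketch of tabular Q-learning on a toy link graph. It is not the method of [1]: the graph, the page names, the reward of 1.0 for reaching an on-topic page and all parameter values are assumptions made for this example only.

```python
import random

# Hypothetical toy link graph: page -> outgoing links (illustration only).
LINKS = {
    "seed": ["hub", "offtopic"],
    "hub": ["target", "offtopic"],
    "offtopic": [],
    "target": [],
}
RELEVANT = {"target"}  # assumed: the only on-topic page


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a value for every (page, link) pair with tabular Q-learning."""
    rng = random.Random(seed)
    q = {(page, link): 0.0 for page, links in LINKS.items() for link in links}
    for _ in range(episodes):
        page = "seed"
        while LINKS[page]:  # stop at a page with no outgoing links
            links = LINKS[page]
            if rng.random() < epsilon:
                link = rng.choice(links)  # explore
            else:
                link = max(links, key=lambda l: q[(page, l)])  # exploit
            # Reward arrives only when an on-topic page is reached.
            reward = 1.0 if link in RELEVANT else 0.0
            # Best value obtainable from the page the link leads to.
            future = max((q[(link, nxt)] for nxt in LINKS[link]), default=0.0)
            q[(page, link)] += alpha * (reward + gamma * future - q[(page, link)])
            page = link
    return q


if __name__ == "__main__":
    for (page, link), value in sorted(train().items()):
        print(f"{page} -> {link}: {value:.3f}")
```

Following the link from "seed" to "hub" never yields a reward by itself, yet its learned value approaches gamma times the value of the link from "hub" to "target". This is how credit from the end of a successful crawl propagates back to earlier link choices.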


You can find a collection of publications in the field of Focused Crawling here (last update: May 6, 2008 - 20 papers). The list is, of course, incomplete. For suggestions or additions, or if you have a paper in this field, please contact Ioannis Partalas (partalas[at]

Source code

Scripts for downloading web pages from dmoz:


Here you can find datasets created from Web pages.

Download the datasets in Weka format (.arff) here. The description of the datasets can be found in [1].
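For readers who want to inspect an .arff file outside Weka, the following is a minimal sketch of a reader written with the Python standard library only. It assumes a dense (non-sparse) file with unquoted attribute names. The sample content and the attribute names `term_count` and `class` are hypothetical and do not reflect the schema of the datasets described in [1].

```python
import csv
import io

# Hypothetical sample; NOT the schema of the published datasets.
SAMPLE = """\
% comment lines start with a percent sign
@relation pages
@attribute term_count numeric
@attribute class {relevant,irrelevant}
@data
3,relevant
0,irrelevant
"""


def load_arff(text):
    """Parse a dense ARFF string into (attribute_names, rows).

    Values are returned as strings. Sparse rows and quoted attribute
    names that contain spaces are not handled by this sketch.
    """
    names, rows, in_data = [], [], False
    for raw in io.StringIO(text):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            # Data rows are comma-separated; ARFF allows single-quoted values.
            rows.append(next(csv.reader([line], quotechar="'", skipinitialspace=True)))
        elif lower.startswith("@attribute"):
            names.append(line.split(None, 2)[1])  # second token is the name
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


if __name__ == "__main__":
    names, rows = load_arff(SAMPLE)
    print(names)
    print(rows)
```

If SciPy is available, `scipy.io.arff.loadarff` is a more complete alternative for dense files and also returns typed values.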

Single files:


[1] I. Partalas, G. Paliouras, I. Vlahavas, Reinforcement Learning with Classifier Selection for Focused Crawling, 18th European Conference on Artificial Intelligence, 2008 (accepted for presentation)