Focused Crawling

Introduction

Focused crawling aims to index the web according to a specific theme, and thus supports domain-specific search engines and thematic web portals. Reinforcement learning is well suited to training focused crawlers, because a crawler receives only partial feedback, and only at the end of a successful crawl.

Bibliography

You can find a collection of publications in the field of Focused Crawling here (last update: May 6, 2008 - 20 papers). The list is, of course, incomplete. For suggestions or additions, or if you have a paper in this field, please contact Ioannis Partalas (partalas[at]csd.auth.gr).

Source code

Scripts for downloading web pages from dmoz:

Dmoz Link Extractor: extracts the URLs for a specified list of topics as they appear in the dmoz project directory. It requires the MySQL Dmoz RDF Parser and the following template: cfgTemplate.txt
Download pages for topics: downloads the web pages that the hyperlinks of each topic point to. It uses the GNU wget utility.
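
The download step can be sketched in Python as follows. This is only an illustration, not the script offered above: the input format (a mapping from topic name to a list of URLs), the one-folder-per-topic layout, and the function names topic_dir, wget_command and download_topics are all assumptions. Only standard GNU wget options are used.

```python
import subprocess
from pathlib import Path


def topic_dir(root, topic):
    """Map a dmoz-style topic path such as 'Top/Science/Physics' to a local folder."""
    # Assumption: one flat folder per topic, with '/' replaced by '_'.
    return Path(root) / topic.strip("/").replace("/", "_")


def wget_command(url, out_dir, timeout=30, tries=2):
    """Build the wget invocation for a single page (nothing is executed here)."""
    return [
        "wget",
        "--quiet",
        f"--timeout={timeout}",
        f"--tries={tries}",
        f"--directory-prefix={out_dir}",
        url,
    ]


def download_topics(topic_urls, root="pages", dry_run=False):
    """Fetch every URL of every topic and return the commands that were built."""
    commands = []
    for topic, urls in topic_urls.items():
        out_dir = topic_dir(root, topic)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
        for url in urls:
            cmd = wget_command(url, out_dir)
            commands.append(cmd)
            if not dry_run:
                # check=False: a dead link should not abort the whole run.
                subprocess.run(cmd, check=False)
    return commands
```

Calling download_topics with dry_run=True returns the wget commands without touching the network, which is a convenient way to inspect the URL list before starting a long download.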

Datasets

Here you can find datasets created from Web pages.

Download the datasets, in Weka format (.arff), here. The description of the datasets can be found in [1].
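
For readers who want to inspect an .arff file outside Weka, a minimal reader could look like the sketch below. It assumes a dense ARFF file with no quoted values and no sparse rows, which may not hold for the datasets offered here. The SAMPLE text and the attribute names in it are made up for illustration, and parse_arff is a hypothetical helper, not part of any library.

```python
# Toy ARFF content; not taken from the published datasets.
SAMPLE = """% toy example
@relation pages
@attribute term_crawl numeric
@attribute class {relevant,irrelevant}
@data
1,relevant
0,irrelevant
"""


def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            # Assumption: values contain no quoted commas.
            rows.append([value.strip() for value in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
```

All values are returned as strings; converting numeric attributes, or handling sparse rows, is left out to keep the sketch short.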

Single files:

Publications

[1] I. Partalas, G. Paliouras, I. Vlahavas, Reinforcement Learning with Classifier Selection for Focused Crawling, 18th European Conference on Artificial Intelligence, 2008 (accepted for presentation).