Machine Learning & Knowledge Discovery Group

Learning from Multi-Label Data

Introduction

Traditional single-label classification is concerned with learning from a set of examples, each associated with a single label l from a set of disjoint labels L, |L| > 1. In multi-label classification, each example is associated with a set of labels Y ⊆ L. In the past, multi-label classification was mainly motivated by the tasks of text categorization and medical diagnosis. Nowadays, multi-label classification methods are increasingly required by modern applications, such as protein function classification, music categorization and semantic scene classification.

Mulan: An Open Source Library for Multi-Label Learning

We have developed and are constantly enriching a Java library for multi-label learning, called Mulan. Mulan contains several problem transformation and algorithm adaptation methods for multi-label classification and ranking, an evaluation framework that computes several multi-label classification evaluation measures, and a class providing data set statistics. It also contains an algorithm and support for hierarchical multi-label classification. Mulan is built on top of Weka and therefore utilizes its award-winning code base. It is open-source and distributed under the GNU GPL licence. Please contact Grigorios Tsoumakas for bug reports, comments, suggestions or requests for help with the library.
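The problem transformation methods mentioned above convert a multi-label task into one or more single-label tasks. As an illustration only, the following Python sketch shows two well-known transformations, binary relevance and label powerset, applied to a toy dataset. It does not use Mulan or Weka and is not how the library implements them; the function names and the toy labels are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def binary_relevance_targets(
    label_sets: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Binary relevance: one 0/1 target vector per label.

    Each label then becomes an independent binary classification problem.
    """
    return {
        label: [1 if label in y else 0 for y in label_sets]
        for label in labels
    }


def label_powerset_targets(
    label_sets: Sequence[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label powerset: each distinct label set becomes one class id.

    The result is a single multi-class problem over the observed label sets.
    """
    class_ids: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for y in label_sets:
        key = frozenset(y)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids assigned in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


# Hypothetical toy data: four examples over the label set L = {beach, sunset, urban}.
LABELS = ["beach", "sunset", "urban"]
TOY_LABEL_SETS = [{"beach", "sunset"}, {"urban"}, {"beach"}, {"beach", "sunset"}]

br_targets = binary_relevance_targets(TOY_LABEL_SETS, LABELS)
lp_targets, lp_classes = label_powerset_targets(TOY_LABEL_SETS)
print(br_targets)
print(lp_targets)
```

Any single-label learner can then be trained on each binary target vector (binary relevance) or on the single multi-class target (label powerset).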

Mulan is hosted at SourceForge, where you can get the latest releases as well as the latest development source code from the project's public SVN repository.

A wiki serves as the manual for Mulan. API documentation is available together with each release. The API documentation for the latest release is also available online.

Datasets

This is a collection of several multi-label datasets, properly formatted for use with Mulan. We first provide a table with some statistics of the datasets, followed by the actual files and their sources.

Statistics

                                                      attributes
name                           domain      instances  nominal  numeric  labels  cardinality  density      distinct
corel5k                        images      5000       499      0        374     3.522        0.009        3175
corel16k (10 samples)          images      13811±87   500      0        161±9   2.867±0.033  0.018±0.001  4937±158
delicious                      text (web)  16105      500      0        983     19.020       0.019        15806
emotions                       music       593        0        72       6       1.869        0.311        27
EUR-Lex (directory codes)      text        19348      0        5000     412     1.292        0.003        1615
EUR-Lex (subject matters)      text        19348      0        5000     201     2.213        0.011        2504
EUR-Lex (eurovoc descriptors)  text        19348      0        5000     3993    5.310        0.001        16467
genbase                        biology     662        1186     0        27      1.252        0.046        32
mediamill                      video       43907      0        120      101     4.376        0.043        6555
rcv1v2 (subset1)               text        6000       0        47236    101     2.880        0.029        1028
rcv1v2 (subset2)               text        6000       0        47236    101     2.634        0.026        954
rcv1v2 (subset3)               text        6000       0        47236    101     2.614        0.026        939
rcv1v2 (subset4)               text        6000       0        47229    101     2.484        0.025        816
rcv1v2 (subset5)               text        6000       0        47235    101     2.642        0.026        946
scene                          image       2407       0        294      6       1.074        0.179        15
tmc2007                        text        28596      49060    0        22      2.158        0.098        1341
yeast                          biology     2417       0        103      14      4.237        0.303        198
bibtex                         text        7395       1836     0        159     2.402        0.015        2856
bookmarks                      text        87856      2150     0        208     2.028        0.010        18716
enron                          text        1702       1001     0        53      3.378        0.064        753
medical                        text        978        1449     0        45      1.245        0.028        94
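
The last three columns can be read as follows, assuming the usual definitions from the multi-label literature: cardinality is the average number of labels per example, density is cardinality divided by the total number of labels, and distinct is the number of different label sets that occur in the data. The Python sketch below computes them for a hypothetical toy dataset; it is an illustration under those assumptions, not the code of Mulan's statistics class, and the function name is made up.

```python
from typing import Sequence, Set, Tuple


def label_statistics(
    label_sets: Sequence[Set[str]], num_labels: int
) -> Tuple[float, float, int]:
    """Return (cardinality, density, distinct label sets) for a dataset.

    label_sets holds one set of labels per example; num_labels is |L|.
    """
    num_examples = len(label_sets)
    # Cardinality: mean number of labels attached to an example.
    cardinality = sum(len(y) for y in label_sets) / num_examples
    # Density: cardinality normalised by the size of the label set L.
    density = cardinality / num_labels
    # Distinct: how many different label combinations appear at least once.
    distinct = len({frozenset(y) for y in label_sets})
    return cardinality, density, distinct


# Hypothetical toy data: four examples over three possible labels.
toy_label_sets = [{"beach", "sunset"}, {"urban"}, {"beach"}, {"beach", "sunset"}]
cardinality, density, distinct = label_statistics(toy_label_sets, num_labels=3)
print(cardinality, density, distinct)
```

The table is consistent with this reading: for emotions, for example, 1.869 / 6 is approximately 0.311, which matches the listed density.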

Files and Sources

  • corel5k
    files: Train and test sets along with their union and the xml header [corel5k.rar]
    source: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, 7th European Conference on Computer Vision, pp. IV:97-112, 2002.
    More information: http://kobus.ca/research/data/eccv_2002/
  • EUR-Lex
    files: Cross validation splits of TF-IDF representation of the documents with the first 5000 most frequent features selected, as used in the experiments. Usable for a direct comparison. [eurlex-directory-codes.rar] [eurlex-subject-matters.rar] [eurlex-eurovoc-descriptors.rar]
    source: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Walter Daelemans, Bart Goethals, and Katharina Morik, editors, Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50-65, Antwerp, Belgium, 2008. Springer-Verlag.
    More information: Knowledge Engineering Group, TU Darmstadt
  • genbase
    files: [genbase.rar] [genbase-train.rar] [genbase-test.rar]
    source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.
  • yeast
    files: [yeast.rar] [yeast-train.rar] [yeast-test.rar] [yeast.xml]
    source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.

Publications

  1. G. Tsoumakas, I. Katakis, I. Vlahavas, "A Review of Multi-Label Classification Methods", in: Proceedings of the 2nd ADBIS Workshop on Data Mining and Knowledge Discovery (ADMKD 2006), pp 99-109, September 2006, Thessaloniki, Greece.
  2. G. Tsoumakas, I. Katakis, "Multi-Label Classification: An Overview", International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
  3. G. Tsoumakas, I. Vlahavas, "Random k-Labelsets: An Ensemble Method for Multilabel Classification", Proc. 18th European Conference on Machine Learning (ECML 2007), pp. 406-417, Warsaw, Poland, 17-21 September 2007.
  4. K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
  5. E. Spyromitros, G. Tsoumakas, I. Vlahavas, "An Empirical Study of Lazy Multilabel Classification Algorithms", Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008), Springer, Syros, Greece, 2008.
  6. G. Tsoumakas, I. Katakis, I. Vlahavas, "Effective and Efficient Multilabel Classification in Domains with Large Number of Labels", Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), Antwerp, Belgium, 2008.
  7. I. Katakis, G. Tsoumakas, I. Vlahavas, "Multilabel Text Classification for Automated Tag Suggestion", Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.
  8. A. Dimou, G. Tsoumakas, V. Mezaris, I. Kompatsiaris, I. Vlahavas, "An Empirical Study Of Multi-Label Learning Methods For Video Annotation", 7th International Workshop on Content-Based Multimedia Indexing, IEEE, Chania, Crete, 2009.
  9. G. Tsoumakas, I. Katakis, I. Vlahavas, "Mining Multi-label Data", Data Mining and Knowledge Discovery Handbook, O. Maimon, L. Rokach (Eds.), Springer, 2nd edition, 2009.
  10. G. Nasierding, G. Tsoumakas, A. Kouzani, "Clustering Based Multi-Label Classification for Image Annotation and Retrieval", 2009 IEEE International Conference on Systems, Man, and Cybernetics, IEEE, 2009.
  11. G. Tsoumakas, A. Dimou, E. Spyromitros, V. Mezaris, I. Kompatsiaris, I. Vlahavas, "Correlation-Based Pruning of Stacked Binary Relevance Models for Multi-Label Learning", Proceedings of the 1st International Workshop on Learning from Multi-Label Data (MLD'09), G. Tsoumakas, Min-Ling Zhang, Zhi-Hua Zhou (Eds.), pp. 101-116, Bled, Slovenia, 2009.
  12. G. Tsoumakas, E. Loza Mencía, I. Katakis, S. Park, J. Fürnkranz, "On the combination of two decompositive multi-label classification methods", Workshop on Preference Learning, ECML PKDD 09, Eyke Hüllermeier, Johannes Fürnkranz (Eds.), pp. 114-133, Bled, Slovenia, 2009.

Bibliography

Have a look at our new online multi-label learning bibliography at CiteULike (100 papers, September 2009). This format is much more useful, as you can grab BibTeX and RIS records, subscribe to the corresponding RSS feed, follow links to the papers' full PDFs (may require access to digital libraries) and export the complete bibliography for BibTeX or EndNote use (requires a CiteULike account).

