Author(s): G. Tzanis, I. Vlahavas.
Title: “Mining High Quality Clusters of SAGE Data”.
Full text: PDF, 10 pages.
Keywords: Clustering, Gene Expression, SAGE, Cancer.
Proceedings of the 2nd VLDB Workshop on Data Mining in Bioinformatics, Vienna, Austria, 2007.
Abstract: Serial Analysis of Gene Expression (SAGE) is a method that
allows the quantitative and simultaneous analysis of the whole
gene function of a cell. One of the advantages of this method is
that the experimenter does not have to select a priori the mRNA
sequences that will be counted in a sample. This makes SAGE a
powerful tool for analyzing gene expression and studying various
diseases, such as cancer. An important concern in cancer studies
is the discovery of the differences between healthy and cancerous
samples and the accurate separation of these two groups of
samples. However, the high dimensionality of the data, the
multiple cell sources (i.e. bulk and cell line) and the multiple
cancer subtypes make the effective clustering of SAGE libraries
very difficult. Furthermore, the various sources of noise pose an
extra challenge to data miners. For all these reasons, we propose
an approach that involves the discretization of the data, the
selection of the most prominent gene tags and the use of a
clustering algorithm in order to obtain more compact and reliable
clusters that can assist cancer profiling. We experimented with
two families of clustering algorithms, partitional and hierarchical,
and we utilized various cluster validity criteria in order to
evaluate the resulting clustering structures. The experimental
results have shown that our approach provides more interesting