Biometrical Letters

Biometrical Letters Vol. 56(2), 2019, pp. 117-138

GENE SELECTION ENSEMBLES AND CLASSIFIER ENSEMBLES
FOR MEDICAL DIAGNOSIS

Małgorzata Ćwiklińska-Jurkowska

Department of Theoretical Foundations of Biomedical Sciences and Medical
Informatics, Collegium Medicum in Bydgoszcz, Nicolaus Copernicus
University in Toruń, Jagiellońska 13-15, 85-067 Bydgoszcz, Poland,
e-mail: mjurkowska@cm.umk.pl

The usefulness of combining methods is examined on microarray cancer data sets, in which expression levels of very large numbers of genes are reported. Two-group discrimination problems are considered on three such data sets. For each data set, classifier performance was measured by the cross-validation errors evaluated on the remaining half of the data, which had not been used earlier for the selection of genes. Two common single procedures for the selection of genes, Prediction Analysis of Microarrays (PAM) and Significance Analysis of Microarrays (SAM), were compared with the fusion of eight selection procedures, or of a smaller subset of five of them, excluding SAM or PAM. Merging five or eight selection methods gave similar results. Based on the misclassification rates, for every examined ensemble of classifiers, combining gene selection methods was not superior to single PAM or SAM selection for two of the three examined data sets. Additionally, heterogeneous combining of five base classifiers, namely k-nearest neighbors, linear and radial SVM with parameter c=1, the shrunken centroids regularized classifier (SCRDA) and the nearest mean classifier, significantly outperformed resampling classifiers such as bagging of decision trees. Heterogeneously combined classifiers also outperformed double bagging for some ranges of gene numbers and some data sets, but merging was generally not superior to random forests. The preliminary step of combining gene rankings was generally not essential for the performance of either heterogeneously or homogeneously combined classifiers.

Key words: combined methods, discriminant analysis, gene selection
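
The two kinds of combining described in the summary, fusing several gene rankings and merging the decisions of several base classifiers, can be illustrated with a minimal sketch. The fusion rules used here (mean rank for gene rankings, majority vote for class labels) are assumptions chosen for illustration; the summary does not state which rules were applied, and the function names and toy gene identifiers are hypothetical.

```python
from collections import Counter
from statistics import mean


def fuse_rankings(rankings, top_k):
    """Fuse gene rankings by mean rank (an assumed, Borda-style rule).

    Each ranking lists the same gene identifiers, ordered from most
    to least relevant. Returns the top_k genes of the fused ranking.
    """
    genes = rankings[0]
    positions = {g: [] for g in genes}
    for ranking in rankings:
        for pos, gene in enumerate(ranking):
            positions[gene].append(pos)
    # Sort by mean position; ties are broken by gene name for determinism.
    fused = sorted(genes, key=lambda g: (mean(positions[g]), g))
    return fused[:top_k]


def majority_vote(predictions):
    """Merge the labels given by several base classifiers to one sample.

    Ties go to the label that appears first in `predictions`
    (an assumed tie-breaking rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def error_rate(true_labels, predicted_labels):
    """Fraction of misclassified samples."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Toy data: three hypothetical selection procedures ranking four genes.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]
selected = fuse_rankings(rankings, top_k=2)

# Toy data: labels from five hypothetical base classifiers for one sample.
merged_label = majority_vote([0, 1, 1, 0, 1])
```

In a setting like the one summarized above, the fused ranking would be computed on one half of the data, and the error rate of the merged classifier would be estimated on the other half.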