Strong Feature Sets from Small Samples (2002)
| Venue: | Journal of Computational Biology |
| Citations: | 22 - 8 self |
BibTeX
@ARTICLE{Kim02strongfeature,
author = {Seungchan Kim and Edward R. Dougherty and Junior Barrera and Yidong Chen and Michael L. Bittner and Jeffrey M. Trent},
title = {Strong Feature Sets from Small Samples},
journal = {Journal of Computational Biology},
year = {2002},
volume = {9},
pages = {127--146}
}
Abstract
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often give very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing an enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.
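The idea of measuring a feature set's strength as a function of the spread can be illustrated with a minimal Python sketch. It is not the paper's analytic derivation. It assumes an isotropic Gaussian spreading distribution with standard deviation `sigma`, equal class priors, and a Fisher-style linear discriminant as a stand-in classifier. The function names `spread_lda`, `spread_error`, and `strength_curve` are hypothetical. Under these assumptions the error of a fixed linear classifier on the spread distribution has a closed form, because each spread point projects onto the classifier's normal direction as a one-dimensional Gaussian.

```python
import math
import numpy as np


def spread_lda(X0, X1, sigma):
    """Fisher-style linear classifier for the spread distribution.

    Assumption (not from the paper): spreading each sample point with
    N(0, sigma^2 I) adds sigma^2 * I to each class covariance.
    """
    d = X0.shape[1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False, bias=True).reshape(d, d)
    S1 = np.cov(X1, rowvar=False, bias=True).reshape(d, d)
    S = 0.5 * (S0 + S1) + sigma ** 2 * np.eye(d)
    w = np.linalg.solve(S, m1 - m0)
    b = -float(w @ (m0 + m1)) / 2.0  # boundary through the midpoint of the means
    return w, b


def _phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def spread_error(X0, X1, w, b, sigma):
    """Error of sign(w.x + b) when each point x_i becomes N(x_i, sigma^2 I).

    Requires sigma > 0. Class 0 is misclassified when w.x + b > 0,
    class 1 when w.x + b < 0. Equal class priors are assumed.
    """
    scale = sigma * np.linalg.norm(w)
    e0 = np.mean([_phi((float(w @ x) + b) / scale) for x in X0])
    e1 = np.mean([_phi(-(float(w @ x) + b) / scale) for x in X1])
    return 0.5 * (e0 + e1)


def strength_curve(X0, X1, sigmas):
    """Spread-distribution error for each spread value in sigmas."""
    errors = []
    for s in sigmas:
        w, b = spread_lda(X0, X1, s)
        errors.append(spread_error(X0, X1, w, b, s))
    return errors
```

Calling `strength_curve` on the columns of an expression matrix that correspond to a candidate gene set gives one error value per spread. A set whose error stays low as `sigma` grows would count as strong in the sense the abstract describes. Comparing candidate sets at a fixed `sigma` is one possible way to rank them, though the paper's own selection procedure may differ.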







