## Toward optimal feature selection (1995)

### Download Links

- [engr.case.edu]
- [ilpubs.stanford.edu:8090]
- [www.ai.mit.edu]
- [chunnan.iis.sinica.edu.tw]
- [www-diglib.stanford.edu]
- DBLP

### Other Repositories/Bibliography

Venue: In 13th International Conference on Machine Learning

Citations: 478 (9 self)

### Citations

12406 | Elements of Information Theory - Cover, Thomas - 1991
Citation Context: ...tion. In this work, we address both theoretical and empirical aspects of feature selection. We describe a formal framework for understanding feature selection, based on ideas from Information Theory (Cover & Thomas 1991). We then present an efficient implemented algorithm based on these theoretical intuitions. The algorithm overcomes many of the problems with existing methods: it has a sound theoretical foundation; ...

8897 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988
Citation Context: ...number of features grow, our ability to use the training set to approximate this conditional distribution decreases (exponentially). As we now show, we can utilize ideas from probabilistic reasoning (Pearl 1988) to circumvent this problem (to some extent). Intuitively, features that cause a small increase in \Delta are those that give us the least additional information beyond what we would obtain from the ...

6599 | C4.5: Programs for machine learning - Quinlan - 1993
Citation Context: ...ant features when we conditioned on no variables. To test how our method of feature subset selection affected classification, we employed both a Naive Bayesian classifier (Duda & Hart 1973) and C4.5 (Quinlan 1993) as induction algorithms; these were applied both to the original datasets and to the datasets filtered through our feature selection algorithm (using both forward selection and backward elimination)...

4841 | Pattern classification and scene analysis - Duda, Hart - 1973
Citation Context: ...stently selected the 7 relevant features when we conditioned on no variables. To test how our method of feature subset selection affected classification, we employed both a Naive Bayesian classifier (Duda & Hart 1973) and C4.5 (Quinlan 1993) as induction algorithms; these were applied both to the original datasets and to the datasets filtered through our feature selection algorithm (using both forward selection a...

3781 | Introduction to statistical pattern recognition (2nd ed.) - Fukunaga - 1990
Citation Context: ...ses us to lose the least amount of information in these distributions. While other measures of separability (notably divergence) have been suggested in the statistics community for feature selection (Fukunaga 1990), these measures are often aimed at selecting features to enhance the separability of the data and may have difficulty in very large dimensional spaces. Hence, they bring with them an inherent bias w...

2209 | On information and sufficiency - Kullback, Leibler - 1951
Citation Context: ... f G ). Our goal is to select G so that these two distributions are as close as possible. As our distance metric, we use the information-theoretic measure of cross-entropy (also known as KL-distance (Kullback & Leibler 1951)). Thus, we can view this as selecting a set of features G which causes us to lose the least amount of information in these distributions. While other measures of separability (notably divergence) ha...

1059 | C4.5: Programs for machine learning - Quinlan - 1993
Citation Context: ...ed on no variables. To test how our method of feature subset selection affected classification, we employed both a Naive Bayesian classifier (Duda & Hart 1973, Langley, Iba & Thompson 1992) and C4.5 (Quinlan 1993) as induction algorithms; these were applied both to the original datasets and to the datasets filtered through our feature selection algorithm (using both forward selection and backward elimination)...

854 | UCI repository of machine learning databases - Murphy, Aha - 1995
Citation Context: ...se datasets include: the Corral data which was artificially constructed by John et al (1994) specifically for research in feature selection; the LED24, Vote, and DNA datasets from the UCI repository (Murphy & Aha 1995); and two datasets which are a subset of the Reuters document collection (Reuters 1995). These datasets are detailed in Table 1. We selected these datasets as they are either well understood in terms...

755 | Irrelevant Features and the Subset Selection Problem - John, Kohavi, et al. - 1994
Citation Context: ... of the features makes this algorithm impractical for domains with more than 25-30 features. Another feature selection methodology which has recently received much more attention is the wrapper model (John et al. 1994) (Caruana & Freitag 1994) (Langley & Sage 1994). This model employs a search through the space of feature subsets using the estimated accuracy from an induction algorithm as the measure of goodness f...
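The wrapper model described in this context can be sketched as greedy forward selection over feature subsets. The scorer below is a hypothetical stand-in for an induction algorithm's cross-validated accuracy, purely for illustration:

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection in the wrapper style: repeatedly add
    the single feature that most improves the estimated accuracy
    score(subset), stopping when no addition helps.
    `score` stands in for the estimated accuracy of an induction
    algorithm trained on that feature subset."""
    selected = []
    best = score(selected)
    while max_features is None or len(selected) < max_features:
        candidates = [(score(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_f = max(candidates)
        if top_score <= best:  # no single addition improves accuracy
            break
        selected.append(top_f)
        best = top_score
    return selected

# Toy scorer: pretend only features "a" and "c" are relevant, with a
# small penalty per feature so irrelevant additions never help.
relevant = {"a", "c"}
def toy_score(subset):
    return len(relevant & set(subset)) - 0.01 * len(subset)

print(forward_select(["a", "b", "c", "d"], toy_score))  # selects a and c
```

This also illustrates the drawback the context alludes to: each step re-runs the induction algorithm once per remaining feature, which is what makes wrappers expensive in high-dimensional domains.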

439 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992

365 | The Feature Selection Problem: Traditional Methods and a New Algorithm - Kira, Rendell - 1992
Citation Context: ...ction. Thus the bias of the learning algorithm does not interact with the bias inherent in the feature selection algorithm. Two of the most well-known filter methods for feature selection are RELIEF (Kira & Rendell 1992) and FOCUS (Almuallim & Dietterich 1991). In RELIEF, a subset of features is not directly selected, but rather each feature is given a weighting indicating its level of relevance to the class label. ...
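The RELIEF weighting scheme mentioned in this context can be sketched as follows: for each sampled instance, the weight of a feature drops if it differs on the nearest same-class neighbor (hit) and rises if it differs on the nearest other-class neighbor (miss). This is a simplified sketch on binary features, not the authors' implementation:

```python
import random

def relief_weights(X, y, num_iters=100, seed=0):
    """RELIEF-style feature weighting (sketch): each feature receives a
    relevance weight rather than being selected or discarded outright.
    X is a list of binary feature vectors, y the class labels."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features

    def dist(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))  # Hamming distance

    for _ in range(num_iters):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(n_features):
            w[f] -= (X[i][f] != X[hit][f]) / num_iters   # differing on a hit: penalize
            w[f] += (X[i][f] != X[miss][f]) / num_iters  # differing on a miss: reward
    return w

# Feature 0 determines the class; feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
print(w)  # feature 0 gets a high weight, feature 1 a low one
```

Because every feature keeps a weight, correlated redundant features all score highly, which is exactly the "ineffective at removing redundant features" limitation the context goes on to note.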

267 | Pattern Classification and Scene Analysis - Duda, Hart - 1973
Citation Context: ... selected the 7 relevant features, but only when we conditioned on no variables. To test how our method of feature subset selection affected classification, we employed both a Naive Bayesian classifier (Duda & Hart 1973, Langley, Iba & Thompson 1992) and C4.5 (Quinlan 1993) as induction algorithms; these were applied both to the original datasets and to the datasets filtered through our feature selection algorithm (us...

265 | Induction of selective Bayesian classifiers - Langley, Sage - 1994
Citation Context: ...or domains with more than 25-30 features. Another feature selection methodology which has recently received much attention is the wrapper model (John, Kohavi, & Pfleger 1994) (Caruana & Freitag 1994) (Langley & Sage 1994). This model searches through the space of feature subsets using the estimated accuracy from an induction algorithm as the measure of goodness for a particular feature subset. Thus, the feature selec...

252 | Learning with many irrelevant features - Almuallim, Dietterich - 1991
Citation Context: ...ning algorithm does not interact with the bias inherent in the feature selection algorithm. Two of the most well-known filter methods for feature selection are RELIEF (Kira & Rendell 1992) and FOCUS (Almuallim & Dietterich 1991). In RELIEF, a subset of features is not directly selected, but rather each feature is given a weighting indicating its level of relevance to the class label. RELIEF is therefore ineffective at remov...

218 | Greedy attribute selection - Caruana, Freitag - 1994
Citation Context: ...s algorithm impractical for domains with more than 25-30 features. Another feature selection methodology which has recently received much attention is the wrapper model (John, Kohavi, & Pfleger 1994) (Caruana & Freitag 1994) (Langley & Sage 1994). This model searches through the space of feature subsets using the estimated accuracy from an induction algorithm as the measure of goodness for a particular feature subset. T...

125 | Wrappers for Performance Enhancement and Oblivious Decision Graphs - Kohavi - 1995

57 | Occam's razor - Blumer, Ehrenfeucht, et al. - 1987

45 | Localized partial evaluation of belief networks - Draper, Hanks - 1995
Citation Context: ...istic influence tends to attenuate over distance; that is, direct influence is typically stronger than indirect influence. (This has been shown both formally and empirically in certain special cases in (Draper & Hanks 1994, Kozlov & Singh 1995).) Therefore, we heuristically choose, as an approximation to the Markov blanket, some set of K features which are strongly correlated with Fi. We now want to figure out how close M...
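The heuristic described in this context, approximating a feature's Markov blanket by the K features most strongly associated with it, can be sketched directly. Pearson correlation is used here as an illustrative stand-in for whatever association measure is preferred:

```python
def approx_markov_blanket(data, i, k):
    """Heuristic sketch: approximate the Markov blanket of feature i
    by the k other features most strongly correlated with it.
    `data` is a list of rows (numeric feature vectors)."""
    n_features = len(data[0])
    n = len(data)

    def column(j):
        return [row[j] for row in data]

    def pearson(xs, ys):
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        if vx == 0 or vy == 0:
            return 0.0
        return cov / (vx * vy) ** 0.5

    target = column(i)
    scored = [(abs(pearson(column(j), target)), j)
              for j in range(n_features) if j != i]
    scored.sort(reverse=True)  # strongest association first
    return [j for _, j in scored[:k]]

# Feature 1 copies feature 0; feature 2 is unrelated noise.
data = [[0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
print(approx_markov_blanket(data, 0, 1))  # [1]
```

The point of the heuristic, per the context, is that strongly correlated features are the ones most likely to render Fi conditionally independent of the class, making the correlated set a cheap proxy for the true blanket.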

28 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988

19 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992

5 | Sensitivities: An Alternative to Conditional Probabilities for Bayesian Belief Networks - Kozlov, Singh, et al. - 1995
Citation Context: ...to attenuate over distance; that is, direct influence is typically stronger than indirect influence. (This has been shown both formally and empirically in certain special cases in (Draper & Hanks 1994, Kozlov & Singh 1995).) Therefore, we heuristically choose, as an approximation to the Markov blanket, some set of K features which are strongly correlated with Fi. We now want to figure out how close Mi is to being a Marko...

3 | Reuters collection available via anonymous ftp. ftp://ciir-ftp.cs.umass.edu/pub/reuters1 - Reuters - 1995
Citation Context: ... specifically for research in feature selection; the LED24, Vote, and DNA datasets from the UCI repository (Murphy & Aha 1995); and two datasets which are a subset of the Reuters document collection (Reuters 1995). These datasets are detailed in Table 1. We selected these datasets as they are either well understood in terms of feature relevance or they contain many features and are thus good candidates for fe...

1 | On information and sufficiency, Ann. Math. Statistics 22 - Kullback, Leibler - 1951
Citation Context: ... = fG). Our goal is to select G so that these two distributions are as close as possible. As our distance metric, we use the information-theoretic measure of cross-entropy (also known as KL-distance (Kullback & Leibler 1951)). Thus, we can view this as selecting a set of features G which cause us to lose the least amount of information in these distributions. Formally, let and be two distributions over some probability ...

1 | Induction of selective Bayesian classifiers - Langley, Sage - 1994