## Irrelevant Features and the Subset Selection Problem (1994)

### Download Links

- [robotics.stanford.edu]
- [www-cs-students.stanford.edu]
- [ai.stanford.edu]
- [www.machinelearning.net]
- [machine-learning.martinsewell.com]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning: Proceedings of the Eleventh International Conference

Citations: 737 (26 self)

### Citations

6465 | C4.5: Programs for Machine Learning - Quinlan - 1993
Citation Context: ...the feature "correlated" matches the class label 75% of the time. The left subtree is the correct decision tree, which is correctly induced if the "correlated" feature is removed from the data. C4.5 (Quinlan 1992) and CART (Breiman et al. 1984) induce similar trees with the "correlated" feature at the root. Such a split causes all these induction algorithms to generate trees that are less accurate than if thi...

5792 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984
Citation Context: ...matches the class label 75% of the time. The left subtree is the correct decision tree, which is correctly induced if the "correlated" feature is removed from the data. C4.5 (Quinlan 1992) and CART (Breiman et al. 1984) induce similar trees with the "correlated" feature at the root. Such a split causes all these induction algorithms to generate trees that are less accurate than if this feature is completely removed...

4282 | Induction of decision trees - Quinlan - 1986
Citation Context: ...induced concepts which depend on irrelevant features, or in some cases even relevant features that hurt the overall accuracy. Figure 1 shows such a choice of a non-optimal split at the root made by ID3 (Quinlan 1986). The Boolean target concept is (A0 ∧ A1) ∨ (B0 ∧ B1). The feature named "irrelevant" is uniformly random, and the feature "correlated" matches the class label 75% of the time. The left subtree is the co...

1471 | Applied Regression Analysis - Draper, Smith - 1966
Citation Context: ...forward versions is that the backward version starts with all features and the forward version starts with no features. The algorithms are straightforward and are described in many statistics books (Draper & Smith 1981; Neter, Wasserman, & Kutner 1990) under the names backward stepwise elimination and forward stepwise selection. One only has to be careful to set the degradation and improvement margins so that cycle...
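The forward stepwise procedure described in this context can be sketched in a few lines. This is a minimal illustration, not the cited algorithms themselves; the `evaluate` scoring callback and the `margin` parameter are assumptions standing in for whatever subset-evaluation measure and improvement threshold are actually used.

```python
def forward_stepwise(features, evaluate, margin=0.0):
    """Greedy forward stepwise feature selection (sketch).

    `evaluate(subset)` returns an estimated score (higher is better).
    `margin` is the minimum improvement required to keep growing the
    subset, which prevents the cycling the context warns about.
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature; keep the best candidate.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scored)
        if score <= best_score + margin:
            break  # no candidate improves enough; stop
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected
```

Backward stepwise elimination is the mirror image: start with all features and repeatedly drop the feature whose removal degrades the score the least, within the degradation margin.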

885 | Applied linear statistical models - Neter, Wasserman, et al. - 1990 |

841 | UCI repository of machine learning databases - Murphy, Aha - 1994
Citation Context: ...high variance, we call this deterministic variant RelieveD. In our experiments, features with relevancy rankings below 0 were removed. The real-world datasets were taken from the UC-Irvine repository (Murphy & Aha 1994) and from Quinlan (1992). Figures 5 and 6 summarize our results. We give details for those datasets that had the largest differences either in accuracy or tree size. Artificial datasets CorrAL This ...

827 | Pattern Recognition: A Statistical Approach - Devijver, Kittler - 1982
Citation Context: ...l one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated on subset selection using linear regression. Sequential backward elimination, some...

765 | Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm - Littlestone - 1988

538 | Very simple classification rules perform well on most commonly used datasets - Holte - 1993
Citation Context: ...improvement of prediction accuracy over C4.5 is that C4.5 does quite well on most of the datasets tested here, leaving little room for improvement. This seems to be in line with Holte's claims (Holte 1993). Harder datasets might show more significant improvement. Indeed the wrapper model produced the most significant improvement for the two datasets (parity5+5 and CorrAL) on which C4.5 performed the w...

509 | A practical approach to feature selection - Kira, Rendell - 1992
Citation Context: ...will select all strongly relevant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. ...
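As a rough illustration of the Relief idea cited here (weighting each feature by how well it separates an instance's nearest neighbor of a different class versus its nearest neighbor of the same class), the sketch below assumes 0/1 feature vectors and Hamming distance. The function name and parameters are illustrative, not taken from the cited papers.

```python
import random

def relief(X, y, n_samples=100, seed=0):
    """Relief-style relevance weights (sketch, after Kira & Rendell 1992).

    X: list of 0/1 feature vectors; y: list of class labels.
    Returns one weight per feature; features scoring <= 0 are
    candidates for removal, as in the experiments described above.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features

    def nearest(i, same_class):
        # Nearest neighbour of X[i] by Hamming distance, restricted to
        # instances with the same (hit) or a different (miss) label.
        candidates = [j for j in range(len(X))
                      if j != i and (y[j] == y[i]) == same_class]
        return min(candidates,
                   key=lambda j: sum(a != b for a, b in zip(X[i], X[j])))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        hit, miss = nearest(i, True), nearest(i, False)
        for f in range(n_features):
            # Reward features that differ on the nearest miss,
            # penalize features that differ on the nearest hit.
            w[f] += (X[i][f] != X[miss][f]) / n_samples
            w[f] -= (X[i][f] != X[hit][f]) / n_samples
    return w
```

The deterministic "RelieveD" variant mentioned in the Murphy & Aha context would iterate over all instances instead of sampling.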

450 | Estimating attributes: Analysis and extensions of RELIEF - Kononenko - 1994
Citation Context: ...vant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. In this section, we clai...

428 | Computer Systems that Learn - Weiss, Kulikowski - 1991
Citation Context: ...given a subset of features, we want to estimate the accuracy of the induced structure using only the given features. We propose evaluating the subset using n-fold cross validation (Breiman et al. 1984; Weiss & Kulikowski 1991). The training data is split into n approximately equally sized partitions. The induction algorithm is then run n times, each time using n − 1 partitions as the training set and the other partit...
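The n-fold cross-validation scheme described in this context can be sketched as follows. The `induce` callback (returning a classifier function) and the round-robin split are assumptions made for the sake of a self-contained example, not details from the cited text.

```python
def cross_validation_accuracy(data, labels, induce, n=5):
    """Estimate accuracy by n-fold cross validation (sketch).

    Split into n roughly equal partitions, train on n-1 of them,
    test on the held-out partition, and average over all n runs.
    `induce(train_X, train_y)` must return a classifier function.
    """
    # Round-robin assignment of indices to the n folds.
    folds = [list(range(i, len(data), n)) for i in range(n)]
    correct = 0
    for held_out in folds:
        train_idx = [i for i in range(len(data)) if i not in held_out]
        clf = induce([data[i] for i in train_idx],
                     [labels[i] for i in train_idx])
        correct += sum(clf(data[i]) == labels[i] for i in held_out)
    return correct / len(data)
```

In the wrapper model this estimate is computed once per candidate feature subset, which is why the search cost is dominated by repeated runs of the induction algorithm.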

399 | Some comments on Cp - Mallows - 1973
Citation Context: ...many measures have been suggested to evaluate the subset selection (as opposed to cross validation), such as adjusted mean squared error, adjusted multiple correlation coefficient, and the Cp statistic (Mallows 1973). In Mucciardi & Gose (1971), seven different techniques for subset selection were empirically compared for a nine-class electrocardiographic problem. The search for the best subset can be improved by...
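For reference, the Cp statistic mentioned in this context is conventionally written (this is the standard textbook form, not a quotation from the paper) for a submodel with p parameters fit to n observations as:

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - (n - 2p)
```

where $\mathrm{SSE}_p$ is the residual sum of squares of the p-parameter submodel and $\hat{\sigma}^2$ is the error variance estimated from the full model; a subset with little bias satisfies $C_p \approx p$, so small values near p indicate good subsets.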

361 | Subset Selection in Regression - Miller - 2002
Citation Context: ...dundant features. Thus the best feature subset is not always the minimal one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated...

358 | The feature selection problem: Traditional methods and a new algorithm - Kira, Rendell - 1992
Citation Context: ...will select all strongly relevant features, none of the irrelevant ones, and a smallest subset of the weakly relevant features that are sufficient to determine the concept. Algorithms such as Relief (Kira & Rendell 1992a; 1992b; Kononenko 1994) (see Section 3.1) attempt to efficiently approximate the set of relevant features. 3 FEATURE SUBSET SELECTION There are a number of different approaches to subset selection. ...

291 | Stochastic complexity and modeling - Rissanen - 1986
Citation Context: ...ng values to a set of features, and the task is to induce a hypothesis that accurately predicts the label of novel instances. Following Occam's razor (Blumer et al. 1987), minimum description length (Rissanen 1986), and minimum message length (Wallace & Freeman 1987), one usually attempts to find structures that correctly classify a large subset of the training set, and yet are not so complex that they begin t...

260 | A branch and bound algorithm for feature subset selection - Narendra, Fukunaga - 1977
Citation Context: ...induce a hypothesis which makes use of these redundant features. Thus the best feature subset is not always the minimal one. 5 RELATED WORK Researchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decad...

250 | Learning with many irrelevant features - Almuallim, Dietterich - 1991
Citation Context: ...y definition. In Example 1, feature X1 is strongly relevant; features X2 and X4 are weakly relevant; and X3 and X5 are irrelevant. Figure 2 shows our view of relevance. Algorithms such as FOCUS (Almuallim & Dietterich 1991) (see Section 3.1) find a minimal set of features that are sufficient to determine the concept. Given enough data, these algorithms will select all strongly relevant features, none of the irrelevant ...
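The minimal-sufficient-subset search attributed to FOCUS in this context can be illustrated with an exhaustive smallest-first search: find the smallest feature subset on which no two training instances agree while their labels disagree. This sketch assumes small Boolean datasets and is far less efficient than the actual algorithm.

```python
from itertools import combinations

def focus(X, y):
    """FOCUS-style search (sketch, after Almuallim & Dietterich 1991).

    Returns a smallest feature subset sufficient to determine the
    label: projecting the data onto the subset never maps two
    differently-labeled instances to the same point.
    """
    n_features = len(X[0])
    for size in range(n_features + 1):  # smallest subsets first
        for subset in combinations(range(n_features), size):
            projected = {}
            consistent = True
            for row, label in zip(X, y):
                key = tuple(row[f] for f in subset)
                # A key seen before with a different label means the
                # subset cannot determine the concept.
                if projected.setdefault(key, label) != label:
                    consistent = False
                    break
            if consistent:
                return list(subset)
    return list(range(n_features))  # only reached if data is inconsistent
```

As the context notes, such a minimal subset contains all strongly relevant features but may pick only some of the weakly relevant ones.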

226 | Models of incremental concept formation - Gennari, Langley, et al. - 1990 |

223 | Training a 3-node neural network is NP-complete - Blum, Rivest - 1988
Citation Context: ...fit the data. Ideally, the induction algorithm should use only the subset of features that leads to the best performance. Since induction of minimal structures is NP-hard in many cases (Hancock 1989; Blum & Rivest 1992), algorithms usually conduct a heuristic search in the ... [residue of the Figure 1 decision-tree diagram omitted; caption: "Figure 1: An example where ..."]

217 | Estimation and inference by compact coding - Wallace, Freeman - 1987
Citation Context: ...is to induce a hypothesis that accurately predicts the label of novel instances. Following Occam's razor (Blumer et al. 1987), minimum description length (Rissanen 1986), and minimum message length (Wallace & Freeman 1987), one usually attempts to find structures that correctly classify a large subset of the training set, and yet are not so complex that they begin to overfit the data. Ideally, the induction algorithm ...

213 | Boolean feature discovery in empirical learning - Pagallo, Haussler - 1990
Citation Context: ...ure 1) shows that common algorithms such as ID3, C4.5, and CART fail to ignore features which, if ignored, would improve accuracy. Feature subset selection is also useful for constructive induction (Pagallo & Haussler 1990) where features can be constructed and tested using the wrapper model to determine if they improve performance. Finally, in real world applications, features may have an associated cost (i.e., when t...

212 | Greedy attribute selection - Caruana, Freitag - 1994 |

198 | The MONK's Problems: A Performance Comparison of Different Learning Algorithms - Thrun, Mitchell, et al. (Eds.) - 1991 |

160 | Prototype and feature selection by sampling and random mutation hillclimbing algorithms - Skalak - 1994 |

148 | Efficient algorithms for minimizing cross validation error - Moore, Lee - 1994 |

123 | On automatic feature selection - Siedlecki, Sklansky - 1988
Citation Context: ...lus l--take away r." Branch and bound algorithms were introduced by Narendra & Fukunaga (1977). Finally, more recent papers attempt to use AI techniques, such as beam search and bidirectional search (Siedlecki & Sklansky 1988), best first search (Xu, Yan, & Chang 1989), and genetic algorithms (Vafai & De Jong 1992). Many measures have been suggested to evaluate the subset selection (as opposed to cross validation), such a...

113 | Using decision trees to improve case-based learning - Cardie - 1993 |

77 | Decision trees and diagrams - Moret - 1982 |

75 | Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning - Schlimmer - 1993 |

71 | On the effectiveness of receptors in recognition systems - Marill, Green - 1963 |

61 | Feature selection using rough sets theory - Modrzejewski - 1993 |

56 | Efficient pruning methods for separate-and-conquer rule learning systems - Cohen - 1993 |

52 | Genetic Algorithms as a Tool for Feature Selection - Vafaie, De Jong - 1992 |

42 | Oblivious Decision Trees and Abstract Cases - Langley, Sage - 1994 |

41 | A comparison of seven techniques for choosing subsets of pattern recognition properties - Mucciardi, Gose - 1971 |

32 | Use of distance measures, information measures and error bounds on feature evaluation - Ben-Bassat - 1982
Citation Context: ...earchers in statistics (Boyce, Farhi, & Weischedel 1974; Narendra & Fukunaga 1977; Draper & Smith 1981; Miller 1990; Neter, Wasserman, & Kutner 1990) and pattern recognition (Devijver & Kittler 1982; Ben-Bassat 1982) have investigated the feature subset selection problem for decades, but most work has concentrated on subset selection using linear regression. Sequential backward elimination, sometimes called sequ...

22 | Irrelevance Reasoning in Knowledge Based Systems - Levy - 1993 |

12 | Optimal Subset Selection - Boyce, Farhi, et al. - 1974 |

5 | On the Difficulty of Finding Small Consistent Decision Trees - Hancock - 1989
Citation Context: ...begin to overfit the data. Ideally, the induction algorithm should use only the subset of features that leads to the best performance. Since induction of minimal structures is NP-hard in many cases (Hancock 1989; Blum & Rivest 1992), algorithms usually conduct a heuristic search in the ... [residue of the Figure 1 decision-tree diagram omitted]

4 | The Use of Knowledge in Analogy and Induction - Russell - 1989 |

2 | Preliminary steps toward the automation of induction - Russell - 1986 |
