
## Selection of relevant features and examples in machine learning (1997)


### Download Links

- [yaroslavvb.com]
- [www.isle.org]
- [www.cs.cmu.edu]
- [www.cs.tu.ac.th]
- DBLP

### Other Repositories/Bibliography

Venue: Artificial Intelligence

Citations: 589 (1 self)

### Citations

13831 | Computers And Intractability: A Guide to the Theory of NP-Completeness - Garey, Johnson - 1979 |

6458 | C4.5: Programs for Machine Learning - Quinlan - 1993 |

5785 | Classification and regression trees - Breiman, Friedman, et al. - 1984 |

Citation Context: ...these methods also involve routines for combining features into richer descriptions. For example, recursive partitioning methods for induction, such as Quinlan's ID3 (1983) and C4.5 (1993), and CART (Breiman et al., 1984), carry out a greedy search through the space of decision trees, at each stage using an evaluation function to select the attribute that has the best ability to discriminate among the classes. ...

3618 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986 |

3316 | Principal Component Analysis - Jolliffe - 2002 |

Citation Context: ...preserves the existence of a good function in the class C?" This notion of relevance is often most natural for statistical approaches to learning. Indeed, methods such as principal component analysis (Jolliffe, 1986) are commonly used as heuristics for finding these low-dimensional subspaces. Figure 1. Each state in the space of feature subsets specifies the attributes ...
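
The use of PCA as a heuristic for finding low-dimensional subspaces, as this context describes, can be sketched in a few lines. This is an illustrative reconstruction, not code from the cited works; the function name `pca_project` and the toy data are our assumptions, and NumPy is assumed available:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components.

    Minimal PCA sketch: center the data, eigendecompose the
    covariance matrix, and keep the k directions of largest variance.
    """
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors as columns
    return Xc @ top                            # low-dimensional coordinates

# Toy data that varies mostly along one direction in the plane.
X = np.array([[i, 0.1 * i] for i in range(-5, 6)], dtype=float)
Z = pca_project(X, 1)                          # shape (11, 1)
```

The single projected coordinate captures essentially all of the variance of the two correlated input features, which is exactly the heuristic role the context attributes to PCA.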

1798 | Independent component analysis, a new concept - Comon - 1994 |

Citation Context: ...methods of this form, when the target function is an intersection of halfspaces and the examples are chosen from a sufficiently benign distribution. The related method of independent component analysis (Comon, 1994) incorporates similar ideas, but insists only that the new features be independent rather than orthogonal. 2.5 Wrapper Approaches to Feature Selection. A third generic approach for feature selection ...

1343 | Nearest neighbor pattern classification - Cover, Hart - 1967 |

883 | The CN2 induction algorithm - Clark, Niblett - 1989 |

Citation Context: ...the dictionary. For all these cases, the feature-selection process is clearly embedded within another, more complex algorithm. Separate-and-conquer methods for learning decision lists (Michalski, 1980; Clark & Niblett, 1989; Pagallo & Haussler, 1990) embed feature selection in a similar manner. These techniques use an evaluation function to select a feature that helps distinguish a class C from others, then add the ...

853 | The weighted majority algorithm - Littlestone, Warmuth - 1994 |

850 | The strength of weak learnability - Schapire - 1990 |

Citation Context: ...and when it achieves 10% error, it will ignore 90% of the data. In the PAC model, learning algorithms need to roughly double the number of examples seen in order to halve their error rate (Schapire, 1990; Freund, 1992; Blumer et al., 1989). However, for conservative algorithms, since the number of examples actually used for learning is proportional to the error rate, the number of new examples used by ...

827 | Pattern Recognition: A Statistical Approach - Devijver, Kittler - 1982 |

Citation Context: ...training data and using the estimated accuracy of the resulting classifier as its metric. Actually, the wrapper scheme has a long history within the literature on statistics and pattern recognition (e.g., Devijver & Kittler, 1982), where the problem of feature selection has long been an active research topic, but its use within machine learning is relatively recent. The general argument for wrapper approaches is that the ...

820 | Approximation Algorithms for Combinatorial Problems - Johnson - 1974 |

Citation Context: ...with the training set, then this method will find one. In fact, the number of features selected by this method is at most O(log |S|) times larger than the number of relevant features using Definition 4 (Johnson, 1974; Haussler, 1986). We can also use this algorithm to illustrate relationships between some of the definitions in the previous section. For instance, the incrementally useful features for this ...
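
The greedy method this context analyzes amounts to set cover over the negative examples: among features true on every positive example, repeatedly pick the one that rules out the most remaining negatives. The following is a minimal sketch under assumed conventions (examples encoded as 0/1 tuples; the function name is ours, not from the cited works):

```python
def greedy_conjunction(positives, negatives, n_features):
    """Greedy set-cover heuristic for a small consistent conjunction.

    Candidate features are those true on all positives; a feature
    'covers' a negative example on which it is false.  Greedily pick
    the feature covering the most still-uncovered negatives; the
    result uses at most O(log |S|) times the minimum number of
    features needed (Johnson, 1974).
    """
    candidates = {f for f in range(n_features)
                  if all(x[f] for x in positives)}
    uncovered = list(negatives)
    chosen = []
    while uncovered:
        # Feature that falsifies the most remaining negatives.
        best = max(candidates,
                   key=lambda f: sum(1 - x[f] for x in uncovered))
        if all(x[best] for x in uncovered):   # nothing left to cover
            raise ValueError("no consistent conjunction exists")
        chosen.append(best)
        candidates.discard(best)
        uncovered = [x for x in uncovered if x[best]]
    return chosen

# Target concept f0 AND f1, with two irrelevant features f2, f3.
pos = [(1, 1, 0, 1), (1, 1, 1, 0)]
neg = [(0, 1, 1, 1), (1, 0, 1, 1)]
print(sorted(greedy_conjunction(pos, neg, 4)))  # [0, 1]
```

Here each selected feature corresponds to one greedy set-cover step, which is where the O(log |S|) approximation guarantee quoted in the context comes from.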

765 | Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm - Littlestone - 1987 |

754 | Adaptive switching circuits - Widrow, Hoff - 1960 |

Citation Context: ...the perceptron updating rule (Minsky & Papert, 1969), which adds or subtracts weights on a linear threshold unit in response to errors on training instances. The least-mean-squares algorithm (Widrow & Hoff, 1960) for linear units and backpropagation (Rumelhart, Hinton, & Williams, 1986), its generalization for multilayer neural networks, also make additive changes to a set of weights in order to reduce error ...
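
The additive weight changes described here can be illustrated with a minimal least-mean-squares sketch; the toy linear target, learning rate, and epoch count below are our assumptions, not taken from the cited works:

```python
def lms_train(examples, n_features, lr=0.1, epochs=500):
    """Least-mean-squares (Widrow & Hoff, 1960): after each example,
    nudge every weight additively against the gradient of the squared
    error on that example.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, target in examples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = target - pred
            # Additive update: each weight moves by lr * error * input.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

# Toy linear target y = 2*x0 - x1 (an assumed illustration).
data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1),
        ((1, -1), 3), ((-1, 1), -3), ((0.5, 0.5), 0.5)]
w = lms_train(data, 2)
# w converges close to [2.0, -1.0]
```

The contrast with multiplicative schemes such as Winnow, discussed elsewhere on this page, is precisely this additive `lr * err * xi` step.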

737 | Irrelevant features and the subset selection problem - John, Kohavi, et al. - 1994 |

713 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989 |

Citation Context: ...as a conjunction (or disjunction) of a list of functions produced by the induction algorithm. Situations of this form include learning intersections of halfspaces in constant-dimensional spaces (Blumer et al., 1989), and algorithms for learning DNF formulas in n^O(log n) time under the uniform distribution (Verbeurgt, 1990). The above results for the greedy set-cover method are distribution free and worst case, ...

684 | An introduction to computational learning theory - Kearns, Vazirani - 1994 |

670 | Active learning with statistical models - Cohn, Ghahramani, et al. - 1996 |

Citation Context: ...example selection are tasks that seem to be intimately related and we need more studies designed to help understand and quantify this relationship. Much of the empirical work on example selection (e.g., Gross, 1991; Cohn et al., 1996) has dealt with low-dimensional spaces, yet this approach clearly holds even greater potential for domains involving many irrelevant features. Resolving basic issues of this sort promises to keep the ...

665 | Perceptrons: An Introduction to Computational Geometry - Minsky, Papert - 1969 |

663 | Learning Regular Sets from Queries and Counterexamples - Angluin - 1987 |

Citation Context: ...for instance, Cohn, Ghahramani, and Jordan (1996) report successful results with a system that selects examples designed to reduce the learner's variance. In parallel, theoretical researchers (Angluin, 1987; Angluin et al., 1993; Bshouty, 1993; Rivest & Schapire, 1993; Jackson, 1994) have shown that the ability to generate queries greatly enlarges the types of concept classes for which one can guarantee ...

646 | Generalization as search - Mitchell - 1982 |

Citation Context: ...inducing logical descriptions provide the clearest example of feature selection methods embedded within a basic induction algorithm. In fact, many algorithms for inducing logical conjunctions (e.g., Mitchell, 1982; Vere, 1975; Winston, 1975; and the greedy set-cover algorithm given above) do little more than add or remove features from the concept description in response to prediction errors on new instances. ...

538 | Very Simple Classification Rules Perform Well on Most Commonly Used Datasets - Holte - 1993 |

508 | Boosting a Weak Learning Algorithm by Majority - Freund - 1995 |

508 | A practical approach to feature selection - Kira, Rendell - 1992 |

468 | Toward optimal feature selection - Koller, Sahami - 1996 |

449 | Estimating attributes: Analysis and extensions of relief - Kononenko - 1994 |

426 | Query by committee - Seung, Opper, et al. - 1992 |

424 | On the hardness of approximating minimization problems - Lund, Yannakakis - 1993 |

369 | How to use expert advice - Cesa-Bianchi, Freund, et al. - 1997 |

365 | Learning Efficient Classification Procedures And Their Application To Chess End Games - Quinlan - 1983 |

345 | WebWatcher: A learning apprentice for the world wide web - Armstrong, Freitag, et al. - 1995 |

Citation Context: ...Bayesian updating, even when the probabilistic assumptions of that approach are not met. Experimental tests of Winnow and related multiplicative methods on natural domains have revealed good behavior (Armstrong et al., 1995; Blum, 1995), and studies with synthetic data show that they scale very well to domains with even thousands of irrelevant features (Littlestone & Mesterharm, 1997). More generally, weighting methods ...
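
The multiplicative updates credited to Winnow in this context can be sketched roughly as follows; the toy disjunction target and the epoch count are illustrative assumptions, not taken from the cited experiments:

```python
def winnow_train(examples, n, epochs=30):
    """Littlestone's Winnow: weights start at 1; on a false negative,
    double the weights of the active features; on a false positive,
    halve them.  The mistake bound grows only logarithmically in n,
    which is why irrelevant attributes are cheap.
    """
    theta = n / 2.0                 # a common threshold choice
    w = [1.0] * n
    for _ in range(epochs):
        for x, y in examples:
            pred = sum(wi for wi, xi in zip(w, x) if xi) >= theta
            if pred and not y:      # false positive: demote active weights
                w = [wi / 2 if xi else wi for wi, xi in zip(w, x)]
            elif y and not pred:    # false negative: promote active weights
                w = [wi * 2 if xi else wi for wi, xi in zip(w, x)]
    return w, theta

# Toy target: x0 OR x1 over n = 8 features (f2..f7 irrelevant).
data = [([1,0,0,0,0,0,0,0], True),  ([0,1,0,0,1,1,0,0], True),
        ([0,0,1,1,0,0,1,1], False), ([0,0,0,0,0,0,0,0], False),
        ([1,1,1,0,0,0,0,0], True),  ([0,0,1,0,1,0,1,0], False)]
w, theta = winnow_train(data, 8)
preds = [sum(wi for wi, xi in zip(w, x) if xi) >= theta for x, _ in data]
```

After training, the weights on the two relevant features dominate and the learner is consistent with the toy data, while the six irrelevant features contribute only a bounded number of extra mistakes.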

314 | Approximate counting, uniform generation and rapidly mixing Markov chains - Sinclair, Jerrum - 1989 |

313 | Self-improving reactive agents based on reinforcement learning, planning and teaching - Lin - 1992 |

Citation Context: ...Scott and Markovitch (1991) adapt this idea to unsupervised learning situations, and many methods for reinforcement learning include a bias toward exploring unfamiliar parts of the state space (e.g., Lin, 1992). Both approaches can considerably increase learning rates over random presentations. Most work on selecting and querying unlabeled data has used embedded methods, but Angluin et al. (1993) and Blum ...

297 | Aggregating strategies - Vovk - 1990 |

261 | Induction of selective Bayesian classifiers - Langley, Sage - 1994 |

249 | Learning with many irrelevant features - Almuallim, Dietterich - 1991 |

212 | Greedy attribute selection - Caruana, Freitag - 1994 |

191 | Inference of Finite Automata using Homing Sequences - Rivest, Schapire - 1993 |

171 | An efficient membership-query algorithm for learning DNF with respect to the uniform distribution - Jackson - 1997 |

Citation Context: ...results with a system that selects examples designed to reduce the learner's variance. In parallel, theoretical researchers (Angluin, 1987; Angluin et al., 1993; Bshouty, 1993; Rivest & Schapire, 1993; Jackson, 1994) have shown that the ability to generate queries greatly enlarges the types of concept classes for which one can guarantee polynomial-time learning. Although much work on queries and experimentation ...

165 | A random polynomial time algorithm for approximating the volume of convex bodies - Dyer, Frieze, et al. - 1989 |

Citation Context: ...Specifically, this method requires an ability to sample random consistent hypotheses, which can be quite difficult, although it is also a major topic of algorithmic research (e.g., Sinclair & Jerrum, 1989; Dyer, Frieze, & Kannan, 1989; and Lovasz & Simonovits, 1992). There has been a larger body of work on algorithms that generate examples of their own choosing, under the heading of membership query algorithms within the theoretical ...

159 | Prototype and feature selection by sampling and random mutation hill climbing algorithms - Skalak - 1994 |

157 | The power of decision tables - Kohavi - 1995 |

148 | Empirical support for winnow and weighted-majority algorithms: results on a calendar scheduling domain - Blum - 1995 |

Citation Context: ...when the probabilistic assumptions of that approach are not met. Experimental tests of Winnow and related multiplicative methods on natural domains have revealed good behavior (Armstrong et al., 1995; Blum, 1995), and studies with synthetic data show that they scale very well to domains with even thousands of irrelevant features (Littlestone & Mesterharm, 1997). More generally, weighting methods are often ...

148 | Efficient algorithms for minimizing cross validation error - Moore, Lee - 1994 |

146 | Additive versus exponentiated gradient updates for learning linear functions - Kivinen, Warmuth - 1994 |

140 | A comparative evaluation of sequential feature selection algorithms - Aha, Bankert - 1996 |

129 | Feature Selection and Feature Extraction for Text Categorization - Lewis - 1992 |

Citation Context: ...performance. For instance, it is not uncommon in a text classification task to represent examples using 10^4 to 10^7 attributes, with the expectation that only a small fraction of these are crucial (Lewis, 1992a; Lewis, 1992b). In recent years, a growing amount of work in machine learning -- both experimental and theoretical in nature -- has focused on developing algorithms with such desirable properties. ...

128 | Weakly Learning DNF and Characterizing Statistical Query Learning using Fourier Analysis - Blum, Furst, et al. - 1994 |

Citation Context: ...DNF formulas using membership queries, with respect to the uniform distribution. ..."unusual" in the sense that the class has been proven impossible to learn in the statistical query model of Kearns (Blum et al., 1994). Thus, issues of finding relevant features seem to be at the core of what makes those classes hard. As a practical matter, it is unclear how to experimentally test a proposed algorithm for this ...

115 | Learning read-once formulas with queries - Angluin, Hellerstein, et al. - 1989 |

Citation Context: ...for instance, Cohn, Ghahramani, and Jordan (1996) report successful results with a system that selects examples designed to reduce the learner's variance. In parallel, theoretical researchers (Angluin, 1987; Angluin et al., 1993; Bshouty, 1993; Rivest & Schapire, 1993; Jackson, 1994) have shown that the ability to generate queries greatly enlarges the types of concept classes for which one can guarantee polynomial-time learning ...

113 | Using decision trees to improve case-based learning - Cardie - 1993 |

110 | Pattern Recognition as Rule-Guided Inductive Inference - Michalski - 1980 |

Citation Context: ...possible words in the dictionary. For all these cases, the feature-selection process is clearly embedded within another, more complex algorithm. Separate-and-conquer methods for learning decision lists (Michalski, 1980; Clark & Niblett, 1989; Pagallo & Haussler, 1990) embed feature selection in a similar manner. These techniques use an evaluation function to select a feature that helps distinguish a class C from others ...

106 | Learning Concepts By Asking Questions - Sammut, Banerji - 1986 |

102 | Improving performance in neural networks using a boosting algorithm - Drucker, Schapire, et al. - 1992 |

98 | Constructive induction on decision trees - Matheus, Rendell - 1989 |

84 | Exact learning via monotone theory - Bshouty - 1993 |

Citation Context: ...Cohn, Ghahramani, and Jordan (1996) report successful results with a system that selects examples designed to reduce the learner's variance. In parallel, theoretical researchers (Angluin, 1987; Angluin et al., 1993; Bshouty, 1993; Rivest & Schapire, 1993; Jackson, 1994) have shown that the ability to generate queries greatly enlarges the types of concept classes for which one can guarantee polynomial-time learning. ...

81 | A study of instance-based algorithms for supervised learning tasks: Mathematical, empirical, and psychological evaluations (Doctoral dissertation) - Aha - 1990 |

Citation Context: ...accuracy (similar to the PAC notion of sample complexity) grows exponentially with the number of irrelevant attributes, even for conjunctive target concepts. Experimental studies of nearest neighbor (Aha, 1990; Langley & Sage, 1997) are consistent with this discouraging conclusion. At the other extreme lie induction methods that explicitly attempt to select relevant features and reject irrelevant ones. ...

81 | Learning boolean functions in an infinite attribute space - Blum - 1992 |

70 | The acquisition of stress: A data-oriented approach - Daelemans, Gillis, et al. - 1994 |

58 | Learning in the presence of finitely or infinitely many irrelevant attributes - Blum, Hellerstein, et al. - 1995 |

54 | Generating Better Decision Trees - Norton - 1989 |

Citation Context: ...extreme example of this situation, but it also arises with other target concepts. Some researchers have attempted to remedy these problems by replacing greedy search with lookahead techniques (e.g., Norton, 1989), with some success. Of course, more extensive search carries with it a significant increase in computational cost. Others have responded by selectively defining new features as combinations of existing ...

47 | Induction of concepts in the predicate calculus - Vere - 1975 |

Citation Context: ...logical descriptions provide the clearest example of feature selection methods embedded within a basic induction algorithm. In fact, many algorithms for inducing logical conjunctions (e.g., Mitchell, 1982; Vere, 1975; Winston, 1975; and the greedy set-cover algorithm given above) do little more than add or remove features from the concept description in response to prediction errors on new instances. For these ...

46 | An improved boosting algorithm and its implications on learning complexity - Freund - 1992 |

Citation Context: ...when it achieves 10% error, it will ignore 90% of the data. In the PAC model, learning algorithms need to roughly double the number of examples seen in order to halve their error rate (Schapire, 1990; Freund, 1992; Blumer et al., 1989). However, for conservative algorithms, since the number of examples actually used for learning is proportional to the error rate, the number of new examples used by the ...

44 | On-line learning of linear functions - Littlestone, Long, et al. - 1995 |

42 | Average-case analysis of a nearest neighbor algorithm - Langley, Iba - 1993 |

38 | The utility of feature weighting in nearest-neighbor algorithms - Kohavi, Langley, et al. - 1997 |

37 | Rule creation and rule learning through environmental exploration - Shen, Simon - 1989 |

34 | On the randomized complexity of volume and diameter - Lovász, Simonovits - 1992 |

27 | Experimentation in machine discovery - Kulkarni, Simon - 1990 |

27 | Memory-based reasoning applied to English pronunciation - Stanfill - 1987 |

26 | A computational approach to theory revision - Rajamoney - 1990 |

23 | Boosting and other machine learning algorithms - Drucker, Cortes, et al. - 1994 |

23 | Efficient domain-independent experimentation - Gil - 1993 |

21 | PAC Learning with Irrelevant Attributes - Dhagat, Hellerstein - 1994 |

19 | Peepholing: Choosing attributes efficiently for megainduction - Catlett - 1992 |

18 | Learning an intersection of k halfspaces over a uniform distribution - Blum, Kannan - 1993 |

18 | A framework for average case analysis of conjunctive learning algorithms - Pazzani, Sarrett - 1990 |

18 | Discretization of continuous-valued attributes and instance-based learning - Ting - 1994 |

10 | Quantifying the inductive bias in concept learning - Haussler - 1986 |

Citation Context: ...set, then this method will find one. In fact, the number of features selected by this method is at most O(log |S|) times larger than the number of relevant features using Definition 4 (Johnson, 1974; Haussler, 1986). We can also use this algorithm to illustrate relationships between some of the definitions in the previous section. For instance, the incrementally useful features for this algorithm (Definition ...

10 | Oblivious decision trees and abstract cases. Working - Langley, Sage - 1994 |

9 | Representation and learning in information retrieval (Doctoral dissertation) - Lewis - 1992 |

Citation Context: ...performance. For instance, it is not uncommon in a text classification task to represent examples using 10^4 to 10^7 attributes, with the expectation that only a small fraction of these are crucial (Lewis, 1992a; Lewis, 1992b). In recent years, a growing amount of work in machine learning -- both experimental and theoretical in nature -- has focused on developing algorithms with such desirable properties. ...

8 | Concept acquisition through attribute evolution and experiment selection - Gross - 1991 |

Citation Context: ...example selection are tasks that seem to be intimately related and we need more studies designed to help understand and quantify this relationship. Much of the empirical work on example selection (e.g., Gross, 1991; Cohn et al., 1996) has dealt with low-dimensional spaces, yet this approach clearly holds even greater potential for domains involving many irrelevant features. Resolving basic issues of this sort ...

8 | An Apobayesian Relative of Winnow, in - Littlestone, Mesterharm - 1997 |

7 | Scaling to domains with many irrelevant features - Langley, Sage - 1993 |

7 | Heterogeneous Uncertainty Sampling for - Lewis, Catlett - 1994 |

7 | Representation generation in an exploratory learning system - Scott, Markovitch - 1991 |

7 | Learning boolean functions in an infinite attribute space - Blum - 1992 |

6 | A method for inferring context-free grammars - Knobe, Knobe - 1976 |

6 | Efficient learning of selective Bayesian network classifiers - Provan - 1996 |

5 | An evaluation of feature-selection methods and their application to computer security - Doak - 1992 |

5 | A comparison of induction algorithms for selective and non-selective Bayesian classifiers - Provan - 1995 |

4 | Static vs. dynamic sampling for data mining - John - 1996 |

4 | Discovering patterns in EEG signals: Comparative study of a few methods - Kubat, Flotzinger, et al. - 1993 |

4 | Instance-based prediction of continuous values - Townsend-Weber, Kibler - 1994 |

3 | Learning DNF under the uniform distribution in polynomial time - Verbeurgt - 1990 |

Citation Context: ...this form include learning intersections of halfspaces in constant-dimensional spaces (Blumer et al., 1989), and algorithms for learning DNF formulas in n^O(log n) time under the uniform distribution (Verbeurgt, 1990). The above results for the greedy set-cover method are distribution free and worst case, but Pazzani and Sarrett (1992) report an average-case analysis of even simpler methods for conjunctive learning ...

2 | How useful is relevance? Working - Caruana, Freitag - 1994 |

2 | Efficiently inducing determinations: A complete and efficient search algorithm that uses optimal pruning - Schlimmer - 1987 |
