## Solving multiclass learning problems via error-correcting output codes (1995)

### Download Links

- [www.cs.cmu.edu]
- [www.jair.org]
- [jair.org]
- [arxiv.org]
- [www.cs.orst.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Artificial Intelligence Research

Citations: 721 (8 self)

### Citations

5843 |
Classification and Regression Trees.
- Breiman, Friedman, et al.
(Show Context)
Citation Context ...⟨xi, f(xi)⟩. For cases in which f takes only the values {0, 1} (binary functions), there are many algorithms available. For example, the decision-tree methods, such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984) can construct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algo... |

3644 |
Learning internal representations by error propagation.
- Rumelhart, Hinton, et al.
- 1986
(Show Context)
Citation Context ...onstruct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991). In many real-world learning... |

1968 | A theory of the learnable.
- Valiant
- 1984
(Show Context)
Citation Context ...pagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991). In many real-world learning tasks, however, the unknown function f often takes values from a discrete set of "classes": {c1, ..., ck}. For example, in medical diagnosis, the functio... |

1129 | The perceptron: a probabilistic model for information storage and organization in the brain.
- Rosenblatt
- 1958
(Show Context)
Citation Context ... 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984) can construct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosenblatt, 1958) and the error backpropagation (BP) algorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on lear... |

975 |
Statistical Methods.
- Snedecor, Cochran
- 1967
(Show Context)
Citation Context ...dicates that the difference is statistically significant at the p < 0.05 level according to the test for the difference of two proportions (using the normal approximation to the binomial distribution, see Snedecor & Cochran, 1989, p. 124). From this figure, we can see that the one-per-class method performs significantly worse than the multiclass method in four of the eight domains and that its behavior is statistically indisting... |
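The significance test this snippet refers to can be sketched in a few lines. This is a minimal illustration of the two-proportion z-test under the normal approximation to the binomial, assuming equal test-set sizes n for both methods; the function name and the toy error counts are illustrative, not taken from the paper.

```python
import math

def two_proportion_z(err1, err2, n):
    """Pooled-variance z statistic for the difference of two proportions,
    using the normal approximation to the binomial distribution."""
    p1, p2 = err1 / n, err2 / n
    p = (err1 + err2) / (2 * n)          # pooled proportion
    se = math.sqrt(2 * p * (1 - p) / n)  # standard error of the difference
    return (p1 - p2) / se

# |z| > 1.96 corresponds to significance at the p < 0.05 level (two-sided).
z = two_proportion_z(err1=140, err2=100, n=1000)
print(abs(z) > 1.96)
```

With these toy counts (14.0% vs. 10.0% error on 1000 test cases each), the difference clears the 0.05 threshold.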

855 | The strength of weak learnability.
- Schapire
- 1990
(Show Context)
Citation Context ...de result in this independence? A closely related open problem concerns the relationship between the ECOC approach and various "ensemble", "committee", and "boosting" methods (Perrone & Cooper, 1993; Schapire, 1990; Freund, 1992). These methods construct multiple hypotheses which then "vote" to determine the classification of an example. An error-correcting code can also be viewed as a very compact form of votin... |

844 |
UCI Repository of Machine Learning Databases.
- Murphy, Aha
- 1994
(Show Context)
Citation Context ... 4 summarizes the data sets employed in the study. The glass, vowel, soybean, audiologyS, ISOLET, letter, and NETtalk data sets are available from the Irvine Repository of machine learning databases (Murphy & Aha, 1994). The POS (part of speech) data set was provided by C. Cardie (personal communication); an earlier version of the data set was described by Cardie (1993). We did not use the entire NETtalk data set... |

725 | A new method for solving hard satisfiability problems. - Selman, Levesque, et al. - 1992 |

545 | Parallel networks that learn to pronounce English text. - Sejnowski, Rosenberg - 1987 |

467 |
Backpropagation applied to handwritten zip code recognition.
- LeCun, Boser, et al.
- 1989
(Show Context)
Citation Context ...on might map a description of a patient to one of k possible diseases. In digit recognition (e.g., LeCun, Boser, Denker, Henderson, Howard, Hubbard, & Jackel, 1989), the function maps each hand-printed digit to one of k = 10 classes. Phoneme recognition systems (e.g., Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) typically classify a speech segment into one ... |

407 | Connectionist Learning Procedures
- Hinton
- 1989
(Show Context)
Citation Context ... Both the CNAPS and opt attempt to minimize the squared error between the computed and desired outputs of the network. Many researchers have employed other error measures, particularly cross-entropy (Hinton, 1989) and classification figure-of-merit (CFM, Hampshire II & Waibel, 1990). Many researchers also advocate using a softmax normalizing layer at the outputs of the network (Bridle, 1990). While each of these... |

382 |
Error-Correcting Codes.
- Peterson, Weldon
- 1972
(Show Context)
Citation Context ...e relatively uncorrelated, so that the number of simultaneous errors in many bit positions is small. If there are many simultaneous errors, the error-correcting code will not be able to correct them (Peterson & Weldon, 1972). The errors in columns i and j will also be highly correlated if the bits in those columns are complementary. This is because algorithms such as C4.5 and backpropagation treat a class and its comple... |

348 | When networks disagree: Ensemble methods for hydrid neural networks.
- Perrone, Cooper
- 1993
(Show Context)
Citation Context ...ror-correcting output code result in this independence? A closely related open problem concerns the relationship between the ECOC approach and various "ensemble", "committee", and "boosting" methods (Perrone & Cooper, 1993; Schapire, 1990; Freund, 1992). These methods construct multiple hypotheses which then "vote" to determine the classification of an example. An error-correcting code can also be viewed as a very compa... |

314 | Neural network classifiers estimate Bayesian a posteriori probabilities. - Richard, Lippmann - 1991 |

209 | A time-delay neural network architecture for isolated word recognition, - Lang, Waibel, et al. - 1990 |

174 | Learning Machines
- Nilsson
- 1965
(Show Context)
Citation Context ...fk, one for each class. To assign a new case, x, to one of these classes, each of the fi is evaluated on x, and x is assigned the class j of the function fj that returns the highest activation (Nilsson, 1965). We will call this the one-per-class approach, since one binary function is learned for each class. An alternative approach explored by some researchers is to employ a distributed output code. This a... |
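The one-per-class scheme described in this snippet amounts to an argmax over k separately learned binary functions. A minimal sketch follows; the toy activation functions and the function name are invented for illustration.

```python
def one_per_class_predict(binary_learners, x):
    """One-per-class decision rule: evaluate each class's binary function
    f_i on x and return the index of the function with the highest
    activation (the Nilsson-style approach quoted above)."""
    activations = [f(x) for f in binary_learners]
    return max(range(len(activations)), key=activations.__getitem__)

# Toy "learners": each activation peaks when x is near that class's center.
fs = [lambda x, c=c: 1.0 - abs(x - c) for c in (0, 1, 2)]
print(one_per_class_predict(fs, 0.9))  # class 1 scores highest here
```

In practice each f_i would be a trained classifier's real-valued output, not a closed-form function; the decision rule is the same.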

134 |
On a class of error correcting binary group codes
- Bose, Ray-Chaudhuri
- 1960
(Show Context)
Citation Context ...110000, which corresponds to class 4. Hence, this predicts that f(x) = 4. This process of mapping the output string to the nearest codeword is identical to the decoding step for error-correcting codes (Bose & Ray-Chaudhuri, 1960; Hocquenghem, 1959). This suggests that there might be some advantage to employing error-correcting codes as a distributed representation. Indeed, the idea of employing error-correcting, distributed ... |
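The decoding step this snippet describes, mapping the learned bit string to the class with the nearest codeword, can be sketched directly. The 6-bit code below is a made-up example, not the paper's actual code table.

```python
def hamming_decode(output_bits, codewords):
    """ECOC decoding: return the class whose codeword is nearest to the
    learned bit predictions in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(codewords, key=lambda c: hamming(output_bits, codewords[c]))

# Illustrative 6-bit codewords for three classes.
codes = {3: [1, 1, 0, 0, 0, 0], 4: [0, 1, 1, 0, 0, 0], 5: [0, 0, 1, 1, 0, 0]}
print(hamming_decode([0, 1, 1, 0, 1, 0], codes))  # nearest codeword: class 4
```

One bit of the prediction is wrong, yet the nearest-codeword rule still recovers a single class: this is exactly the error-correcting behavior the snippet points to.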

127 |
Machine learning: a theoretical approach
- Natarajan
- 1991
(Show Context)
Citation Context ...lgorithm (Rumelhart, Hinton, & Williams, 1986), are best suited to learning binary functions. Theoretical studies of learning have focused almost entirely on learning binary functions (Valiant, 1984; Natarajan, 1991). In many real-world learning tasks, however, the unknown function f often takes values from a discrete set of "classes": {c1, ..., ck}. For example, in medical diagnosis, the function might map a des... |

122 |
Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters
- Bridle
- 1989
(Show Context)
Citation Context ...ularly cross-entropy (Hinton, 1989) and classification figure-of-merit (CFM, Hampshire II & Waibel, 1990). Many researchers also advocate using a softmax normalizing layer at the outputs of the network (Bridle, 1990). While each of these configurations has good theoretical support, Richard and Lippmann (1991) report that squared error works just as well as these other measures in producing accurate posterior prob... |

115 | Using Decision Trees to Improve Case-Based Learning. - Cardie - 1993 |

67 | A novel objective function for improved phoneme recognition using time delay neural networks - Hampshire, Waibel - 1989 |

67 |
A new method for solving hard satisfiability problems
- Selman, Levesque, et al.
- 1992
(Show Context)
Citation Context ...xhaustive Codes When 8 ≤ k ≤ 11, we construct an exhaustive code and then select a good subset of its columns. We formulate this as a propositional satisfiability problem and apply the GSAT algorithm (Selman, Levesque, & Mitchell, 1992) to attempt a solution. A solution is required to include exactly L columns (the desired length of the code) while ensuring that the Hamming distance between every two columns is between d and L - d,... |
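For context, one standard way to build the exhaustive code this snippet starts from is to enumerate every usable binary split of the k classes; the separation check mirrors the column constraint quoted above. Both function names are illustrative, and the GSAT column-selection step itself is omitted.

```python
def exhaustive_code(k):
    """Exhaustive ECOC matrix: one row per class, one column per usable
    binary split. Holding the last class at 0 rules out complementary and
    constant columns, leaving 2^(k-1) - 1 distinct splits."""
    n_cols = 2 ** (k - 1) - 1
    return [[(m >> i) & 1 for m in range(1, n_cols + 1)] for i in range(k)]

def column_separation_ok(code, d):
    """Every pair of columns must differ in at least d rows and agree in
    at least d rows (i.e., be neither identical nor complementary)."""
    cols = list(zip(*code))
    k = len(code)
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    return all(d <= dist(a, b) <= k - d
               for i, a in enumerate(cols) for b in cols[i + 1:])

code = exhaustive_code(5)  # 5 classes -> 2^4 - 1 = 15 columns
print(len(code[0]), column_separation_ok(code, 1))
```

For k in the 8 to 11 range the exhaustive matrix has 127 to 1023 columns, which is why the snippet's subset-selection step (picking exactly L well-separated columns) becomes a search problem.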

47 |
An improved boosting algorithm and its implications on learning complexity.
- Freund
- 1992
(Show Context)
Citation Context ...s independence? A closely related open problem concerns the relationship between the ECOC approach and various "ensemble", "committee", and "boosting" methods (Perrone & Cooper, 1993; Schapire, 1990; Freund, 1992). These methods construct multiple hypotheses which then "vote" to determine the classification of an example. An error-correcting code can also be viewed as a very compact form of voting in which a c... |

38 |
A neural-net training program based on conjugate-gradient optimization
- Barnard, Cole
- 1989
(Show Context)
Citation Context ...en low order bits are lost due to shifting or multiplication). On the speech recognition, letter recognition, and vowel data sets, we employed the opt system distributed by Oregon Graduate Institute (Barnard & Cole, 1989). This implements the conjugate gradient algorithm and updates the gradient after each complete pass through the training examples (known as per-epoch updating). No learning rate is required for this... |

36 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
(Show Context)
Citation Context ...aining examples of the form ⟨xi, f(xi)⟩. For cases in which f takes only the values {0, 1} (binary functions), there are many algorithms available. For example, the decision-tree methods, such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984) can construct trees whose leaves are labeled with binary values. Most artificial neural network algorithms, such as the perceptron algorithm (Rosen... |

19 | Converting English text to speech: A machine learning approach - Bakiri - 1991 |

10 |
Codes correcteurs d'erreurs
- Hocquenghem
- 1959
(Show Context)
Citation Context ... class 4. Hence, this predicts that f(x) = 4. This process of mapping the output string to the nearest codeword is identical to the decoding step for error-correcting codes (Bose & Ray-Chaudhuri, 1960; Hocquenghem, 1959). This suggests that there might be some advantage to employing error-correcting codes as a distributed representation. Indeed, the idea of employing error-correcting, distributed representations can... |

9 | Phoneme recognition using time-delay networks - Dietterich, Waibel, et al. - 1989 |

3 |
Error-correcting output codes
- Kong, Dietterich
- 1995
(Show Context)
Citation Context ...dence, the error-correcting output code method would fail. We address this question, for the case of decision-tree algorithms, in a companion paper (Kong & Dietterich, 1995). 2. Methods This section describes the data sets and learning algorithms employed in this study. It also discusses the issues involved in the design of error-correcting codes and describes four algo... |

1 |
Function modeling experiments
- Duda, Machanik
- 1963
(Show Context)
Citation Context ...ge to employing error-correcting codes as a distributed representation. Indeed, the idea of employing error-correcting, distributed representations can be traced to early research in machine learning (Duda, Machanik, & Singleton, 1963). [Table 1: A distributed code for the digit recognition task; code-word bits vl hl dl cc ol or per class.] ... |

1 |
Measuring the effective number of dimensions during backpropagation training
- Weigend
- 1993
(Show Context)
Citation Context ...well as these other measures in producing accurate posterior probability estimates. Furthermore, cross-entropy and CFM tend to overfit more easily than squared error (Lippmann, personal communication; Weigend, 1993). We chose to minimize squared error because this is what the CNAPS and opt systems implement. With either neural network algorithm, several parameters must be chosen by the user. For the CNAPS, we m... |
