S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In IEEE Workshop on Neural Networks for Signal Processing 4, pages 177--186, 1994.
....sites. Several methods for integrating ensembles of models have been studied, including techniques that combine the set of models in some linear fashion [1, 2, 3, 12, 20, 27, 29, 37, 39, 21], techniques that employ referee functions to arbitrate among the predictions generated by the classifiers [16, 17, 18, 19, 28, 36], methods that rely on principal components analysis [23, 24], or methods that apply inductive learning techniques to learn the behavior and properties of the candidate classifiers [6, 40]. Constructing ensembles of classifiers is not cheap and produces a final outcome that is expensive due to the ....
Waterhouse S. R. and Robinson A. J. Classification using hierarchical mixtures of experts. In IEEE Workshop on Neural Networks for Signal Processing IV, pages 177--186, 1994.
....of hidden Markov models (HMMs), an important example of predictive data modeling, EM generally converges rapidly in this setting. Similarly, in the case of hierarchical mixtures of experts the empirical results on convergence in likelihood have been quite promising (Jordan & Jacobs, 1994; Waterhouse & Robinson, 1994). Finally, EM can play an important conceptual role as an organizing principle in the design of learning algorithms. Its role in this case is to focus attention on the missing variables in the problem. This clarifies the structure of the algorithm and invites comparisons with statistical ....
Waterhouse, S. R., and Robinson, A. J., (1994), Classification using hierarchical mixtures of experts, in IEEE Workshop on Neural Networks for Signal Processing.
....binary tree hierarchy were employed. It is seen that, deeper the network, the better it performs. The same phenomena is also reported in classification type HME networks [10] [11]. [Figure 2: Regions of expertise determined by a softmax based gating network.] Unfortunately, the computational cost increases as the tree height increases, as more gating and expert networks need to be trained. The depth of the tree also leads to the question of what is the right size and structure of the tree to best solve a problem, i.e. one is faced with the model ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixture of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177--186. IEEE Press, New York, 1994.
....The divide and conquer approach has shown particularly useful in attributing experts to different regimes in piece wise stationary time series [9] and modeling discontinuities in the input output mapping. Mixtures of experts have also been successfully applied to classification problems [4][8], though a proof that minimization of the ME error function (based on the formulation as a mixture model) leads to ME outputs estimating the a posteriori probabilities of class membership, is still lacking. The purpose of this paper is to show that at the global minimum of this ME error function, ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Proceedings 1994 IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, Long Beach CA, 1994. IEEE Press.
....the fact that hybrid HMM/ANN systems are based on posterior probabilities makes it easier to merge multiple recognizers, each of them having different properties. Also, advanced techniques initially developed in the framework of neural networks to recombine statistical experts (mixture of experts [22]) can also be used. Many (relatively simple) speech recognition systems based on this hybrid HMM/ANN approach have been proved, on controlled tests, to be both effective in terms of accuracy (comparable or better than equivalent state of the art systems) and efficient in terms of CPU and memory ....
S. Waterhouse and A. Robinson, "Classification using hierarchical mixtures of experts," in Proc. 1994 IEEE Workshop on Neural Networks for Signal Processing, pp. IV--177--186, 1994.
....which perform local function approximation. The expert outputs $y_i(t)$ are combined with the outputs $g_i(t)$ of a gate to form the overall output $y(t) = \sum_i g_i(t)\, y_i(t)$ (7.1). In the case of classification, the experts compute vectors of class conditional probabilities as observed in [99]. [Figure 7.4: A mixture of 2 linear input networks (LIN) for adapting the recurrent network (RNN).] Figure 7.4 shows the mixture of linear ....
S. R. Waterhouse and A. J. Robinson. Classification using Hierarchical Mixtures of Experts. In IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, 1994.
....sinc function. A simple mixture of experts network consisting of 16 experts, a 4 level binary tree hierarchy and an 8 level binary tree hierarchy were employed. It is seen that, deeper the network, the better it performs. The same phenomena is also reported in classification type HME networks [3] [4]. Unfortunately, the computational cost increases as the tree height increases, as more gating and expert networks need to be trained. The depth of the tree also leads to the question of what is the right size and structure of the tree to best solve a problem, i.e. one is faced with the model ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixture of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177--186. IEEE Press, New York, 1994.
....an output vector ij for each input vector. These vectors proceed up the tree, judged by the gating networks which have the task to decide which expert connected produces the best output. In this discussion we focus on classification tasks (a complete description of the algorithm can be found in [3]) but there is also a regression approach which can for example be found in [2]. All learning algorithms used here are supervised algorithms using a set of paired observations $(x^{(t)}, y^{(t)})$, with observation $t$ in $1, \ldots, T$. The HME net can be viewed as modeling the probabilistic mapping of ....
S. Waterhouse, A. Robinson, Classification using Hierarchical Mixture of Experts, Proc. 1994 IEEE Workshop on Neural Networks for Signal Processing IV pp.177-186
....sinc function. A simple mixture of experts network consisting of 16 experts, a 4 level binary tree hierarchy and an 8 level binary tree hierarchy were employed. It is seen that, deeper the network, the better it performs. The same phenomena is also reported in classification type HME networks [17] [18]. Unfortunately, the computational cost increases as the tree height increases, as more gating and expert networks need to be trained. The depth of the tree also leads to the question of what is the right size and structure of the tree to best solve a problem, i.e. one is faced with the model ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixture of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177--186. IEEE Press, New York, 1994.
....each mapping a particular portion of the input space, are combined in a probabilistic way by gating net which is modeling the probability that each portion of the input space generated the output. The HME has been successful in a number of regression and some classification problems [9] [11], yielding significantly faster training through the use of the Expectation Maximization (EM) algorithm. In addition, it is also applied successfully to the non linear prediction of acoustic vectors for speech processing [12]. In our previous work, we have already applied HME along with EM ....
S.R. Waterhouse and A.J. Robinson, "Classification using hierarchical mixtures of experts," Proceedings of IEEE Conference on Neural Networks and Signal Processing, 1994.
....to approximate a sinc function. A simple mixture of experts network consisting of 16 experts, a 4 level binary and an 8 level binary tree hierarchies were employed. It is seen that, deeper the network, the better it performs. The same phenomena is also reported in classification type HME networks [5][4]. Unfortunately, the computational cost increases as the tree height increases, as more gating and expert networks need to be trained. The depth of the tree also leads to the question of what is the right size and structure of the tree to best solve a problem, i.e. one is faced with the model ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixture of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177--186. IEEE Press, New York, 1994.
....and attributes expert networks to these different regions. The divide and conquer approach has shown particularly useful in attributing experts to different regimes in piece wise stationary time series [12], modeling discontinuities in the input output mapping, and classification problems [4][11]. The ME error function is based on the interpretation of MEs as a mixture model [9] with conditional densities as mixture components (for the experts) and gating network outputs as mixing coefficients. This error function is in fact a generalization of the sum of squares and cross entropy error ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Proceedings 1994 IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, Long Beach CA, 1994. IEEE Press.
.... Jordan, 1993) where it takes the form of an adaptive gain scheduling controller. In related work, a variant of the ME architecture has been applied by Cacciatore and Nowlan (1994) to the control of jump linear systems. There have been applications of the HME architecture to speech recognition (Waterhouse & Robinson, 1994); see also Hampshire and Waibel (1989). A theoretical analysis of the ME and HME architectures has been provided by Jordan & Xu (in press); these authors present results on the convergence rates of the EM algorithm. Jordan (1994) discusses the model selection problem for HME architectures. ....
Waterhouse, S. R., and Robinson, A. J., 1994, Classification using hierarchical mixtures of experts, in IEEE Workshop on Neural Networks for Signal Processing.
....reduced training times. In the EM formulation, the M step is solved by employing the iteratively re-weighted least squares (IRLS) technique. The IRLS is a multi-pass algorithm. An efficient one-pass algorithm was presented to solve the M step of the EM algorithm for regression problems [1]. In [3], Waterhouse and Robinson extended the HME model to solve classification problems. But they employ the IRLS method to solve the M step. In this paper, an efficient one-pass algorithm is described to solve the M step for classification problems. The gating network in the mixture of experts model is ....
....of a Gaussian density function). Under this assumption, the IRLS method to determine the expert network parameters reduces directly to the standard, one-pass weighted least squares problem. They obtain an approximate weighted least squares formulation to determine the gating network parameters. In [3], a multinomial logit probability model is employed for classification problems. They use the IRLS technique to solve the M step. Here, a one-pass algorithm to solve the M step is presented for classification problems. As in [3], a multinomial logit probability model is assumed for classification ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixture of experts. In Neural Networks for Signal Processing IV: Proceedings of the IEEE Workshop, pages 177--186. IEEE Press, New York, 1994.
....and attributes expert networks to these different regions. The divide and conquer approach has shown particularly useful in attributing experts to different regimes in piece wise stationary time series [20], modeling discontinuities in the input output mapping, and classification problems [6][13][19]. The ME error function is based on the interpretation of MEs as a mixture model [12] with conditional densities as mixture components (for the experts) and gating network outputs as mixing coefficients. The purpose of this note is to describe various existing methods for minimizing this ME error ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Proceedings 1994 IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, Long Beach CA, 1994. IEEE Press.
....The length of this thesis is approximately 62,000 words, including appendices and bibliography. Some of the work in this thesis has previously been published in: Waterhouse and Cook [233], Waterhouse, Kershaw and Robinson [234], Waterhouse, MacKay and Robinson [235], Waterhouse and Robinson [236], Waterhouse and Robinson [237] and Waterhouse and Robinson [238]. Also, some of the chapters were inspired by joint work with David MacKay, Gary Cook and Dan Kershaw. The level of collaboration was as follows. Chapter 7: Bayesian Methods for Mixtures of Experts. I first thought of using ....
....logit function (or softmax function [23]), which is encapsulated in a multinomial logit model, is an appropriate one for dichotomous or polytomous classification. The use of multinomial logit models as experts in an HME model was first suggested by Jordan and Jacobs [112]. Waterhouse and Robinson [236] subsequently applied this model to simulated classification tasks. Peng et al. [172] also used this form of model on experiments with the Peterson Barney data but employed a Bayesian inference method via Gibbs sampling. Alpaydin and Jordan [1] used the multinomial logit for the experts in their ....
Waterhouse, S. R. and Robinson, A. J. [1994], Classification using hierarchical mixtures of experts, in `Proceedings of the IEEE Workshop on Neural Networks for Signal Processing', IEEE Press, Long Beach, CA, pp. 177--186.
....of a set of experts which perform local function approximation. The expert outputs $y_i(t)$ are combined with the outputs $g_i(t)$ of a gate to form the overall output $y(t) = \sum_i g_i(t)\, y_i(t)$ (1). In the case of classification, the experts compute vectors of class conditional probabilities [7]. In Figure 2, we show the mixture of linear input networks (MLIN) architecture. Each expert consists of a LIN and a recurrent network. The gate consists of a single layer network with softmax activation function which computes the conditional probability of selecting each expert given the current ....
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In IEEE Workshop on Neural Networks for Signal Processing, pages 177--186, 1994.
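The combination rule quoted in the context above (gate outputs weighting expert outputs, with a softmax gate) can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names, the gate scores, and the expert probability vectors are all hypothetical values chosen for illustration, and the gate scores stand in for the output of a single-layer gating network.

```python
import math

def softmax(scores):
    """Numerically stable softmax: turns gate scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(gate_scores, expert_outputs):
    """Compute y = sum_i g_i * y_i, where g = softmax(gate_scores).

    expert_outputs[i] is expert i's vector of class probabilities.
    """
    g = softmax(gate_scores)
    n_classes = len(expert_outputs[0])
    return [
        sum(g_i * y_i[c] for g_i, y_i in zip(g, expert_outputs))
        for c in range(n_classes)
    ]

# Hypothetical values: two experts, two classes.
gate_scores = [2.0, 0.0]                   # gate favours expert 0
expert_outputs = [[0.9, 0.1], [0.2, 0.8]]  # each row sums to 1
y = mixture_output(gate_scores, expert_outputs)
print(y)
```

Because the gate weights are non-negative and sum to one, the mixture of valid class-probability vectors is itself a valid probability vector.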
....was used to solve the 8 bit parity classification task. As can be seen in Figures 4(a) and (b), the factorisation enabled by the freezing algorithm significantly speeds up computation over the standard EM method. The final tree shape is also shown in Figure 4(c). We showed in an earlier paper (Waterhouse & Robinson 1994) that the XOR problem may be solved using at least 2 experts and a gate. The 8 bit parity problem is therefore being solved by a series of XOR classifiers, each gated by its parent node, which is an intuitively appealing form with an efficient use of parameters. The algorithm is sensitive to the ....
Waterhouse, S. R. & Robinson, A. J. (1994), Classification using hierarchical mixtures of experts, in `IEEE Workshop on Neural Networks for Signal Processing', pp. 177--186.
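The statement in the context above, that XOR can be solved with two experts and a gate, can be made concrete with a small sketch. The parameters below are picked by hand for illustration and are an assumption of this example; they are not the trained solution reported by Waterhouse and Robinson. The gate splits on the first input, one logistic expert outputs the second input, and the other outputs its negation. The sharpness constant `K` and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

K = 10.0  # hypothetical sharpness of the hand-picked logistic units

def gate(x1):
    """Two-way softmax gate on x1, written as a sigmoid of the score difference."""
    g_b = sigmoid(K * (x1 - 0.5))  # weight of expert B (active when x1 = 1)
    return 1.0 - g_b, g_b

def expert_a(x2):
    """P(y = 1) is high when x2 = 1 (used when x1 = 0)."""
    return sigmoid(K * (x2 - 0.5))

def expert_b(x2):
    """P(y = 1) is high when x2 = 0 (used when x1 = 1)."""
    return sigmoid(-K * (x2 - 0.5))

def xor_probability(x1, x2):
    """Mixture output: gate-weighted sum of the two experts' P(y = 1)."""
    g_a, g_b = gate(x1)
    return g_a * expert_a(x2) + g_b * expert_b(x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_probability(x1, x2)))
```

Each expert is only a linear (logistic) classifier in one input; the gate supplies the nonlinearity by choosing which expert applies in each half of the input space.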
....such as the Auto Regressive (AR) model or via nonlinear models such as connectionist feed forward or recurrent networks. The HME overcomes a number of problems associated with traditional connectionist models via its architecture and statistical framework. Recently, Jordan & Jacobs (1994) and Waterhouse & Robinson (1994) have shown that via the EM algorithm and a 2nd order optimization scheme known as Iteratively Reweighted Least Squares (IRLS), the HME is faster than standard Multilayer Perceptrons (MLP) by at least an order of magnitude on regression and classification tasks respectively. Jordan & Jacobs also ....
Waterhouse, S. R. & Robinson, A. J. (1994), Classification using hierarchical mixtures of experts, in `IEEE Workshop on Neural Networks for Signal Processing', pp. 177--186.
No context found.
S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In IEEE Workshop on Neural Networks for Signal Processing 4, pages 177--186, 1994.
No context found.
Waterhouse, S.R., Robinson, A.J. (1994) Classification using Hierarchical Mixtures of Experts. In Proc. 1994 IEEE Workshop on Neural Networks for Signal Processing IV, pp. 177-186.
No context found.
Waterhouse, S. R., Robinson, A. J. (1994) Classification using Hierarchical Mixtures of Experts. IEEE Workshop on Neural Networks for Signal Processing IV, 177-186.