Jordan, M.I. & Jacobs, R.A. (1992). Hierarchies of adaptive experts. In J.E. Moody, S. Hanson & R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo: Morgan Kaufmann, 985-992.
....of multiple models methods. There is special interest in the development of clustering, classification, prediction and parameter estimation algorithms for time series (dynamic) problems. Some remarkable efforts in this direction include partition algorithms [10, 19], mixtures of experts [5, 12, 13, 14, 15, 16, 25], ensembles of neural networks [3, 7, 26], trees of neural networks [17, 32], threshold models [35], Takagi-Sugeno fuzzy models [34] and much more. For an extensive bibliographical coverage see the books [22, 31]. The predictor architecture proposed in this paper is modular in the sense that it ....
M.I. Jordan and R.A. Jacobs, "Hierarchies of Adaptive Experts", in Neural Information Processing Systems 4, 1992, J. Moody, S. Hanson and R. Lippmann (eds.), San Mateo, CA, Morgan Kaufmann.
....regarding the convergence properties of the posteriors, will be presented elsewhere. The basic result is that, if certain mild identifiability conditions are satisfied, we can prove convergence to the true or best model with probability one. Our work has much in common with, for example, [6, 7, 9, 10, 21]. In [21] a hierarchical architecture is presented which is very similar to our Partition Network, with a gating network (similar to our decision module) combining the outcomes of local expert networks (corresponding to our predictor modules). The same idea is used by Jordan et al. in [6, 7, 9] ....
....depends in an essential way on the dynamic behavior of both the data and the posterior probabilities. This is particularly obvious in classification tasks with source switching, as will be explained in later sections. Secondly, PA learns by explicit application of Bayes rule, whereas in [6, 7, 9, 10, 21] learning is formulated as a Maximum Likelihood problem which requires the use of some approximate optimization algorithm. Thirdly, we perform classification using the criterion of predictive power. A fourth difference is that, in [6, 7, 9, 10, 21] the emphasis is on actually learning the ....
[Article contains additional citation context not shown here]
M.I. Jordan and R.A. Jacobs, Hierarchies of Adaptive Experts, in Neural Information Processing Systems 4, 1992, J. Moody, S. Hanson and R. Lippmann (eds.), San Mateo, CA, Morgan Kaufmann.
....neural networks perform the effective calculation at each assigned region separately. An extension of this approach is the hierarchical mixture of experts method, where the outputs of the different experts are nonlinearly combined by different supervisor gating networks hierarchically organized [64, 65, 59]. Cohen and Intrator extended the idea of constructing local simple base learners for different regions of input space, searching for appropriate architectures that should be locally used and for a criterion to select a proper unit for each region of input space [24, 25]. They proposed a hybrid ....
M. Jordan and R. Jacobs. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems, volume 4, pages 985--992. Morgan Kaufmann, San Mateo, CA, 1992.
....of two classification systems, which are characterized by symbolic and sub-symbolic processing capability respectively. Yet another point is the modularity of the hybrid system. Although modular neural networks have received the attention of many researchers in recent years, for example in [91, 92], integrating symbolic knowledge and prior knowledge [93, 94] with classical neural networks has only been applied to speech recognition. There has not been explicit integration of symbolic representation and sub-symbolic representation for solving lower-level image recognition tasks. ....
M. I. Jordan and R. A. Jacobs, "Hierarchies of adaptive experts," in Advances in Neural Information Processing Systems 4. Proceedings of the 1991.
....localisation of objects [see also 103] and control of a multi-payload robot [see also 102]. In all these applications the ME models were trained by direct gradient ascent of the log likelihood. The first description of a hierarchical mixture of experts model was given by Jordan and Jacobs [111]. The HME was applied to a binary image classification problem and a system identification problem which involved learning the simulated dynamics of a four-joint, three-dimensional robot arm (prediction of joint accelerations from joint positions, velocities and torques). Both the experts and ....
Jordan, M. I. and Jacobs, R. A. [1992], Hierarchies of adaptive experts, in J. E. Moody, S. J. Hanson and R. P. Lippmann, eds, `Advances in Neural Information Processing Systems 4', Morgan Kaufmann, San Mateo, California, pp. 985--992.
....of multiple models methods. There is special interest in the development of clustering, classification, prediction and parameter estimation algorithms for time series (dynamic) problems. Some remarkable efforts in this direction include partition algorithms [10, 19], mixtures of experts [5, 12, 13, 14, 15, 16, 25], ensembles of neural networks [3, 7, 26], trees of neural networks [17, 32], threshold models [35], Takagi-Sugeno fuzzy models [34] and much more. For an extensive bibliographical coverage see the books [22, 31]. The predictor architecture proposed in this paper is modular in the sense that it ....
M.I. Jordan and R.A. Jacobs, "Hierarchies of Adaptive Experts", in Neural Information Processing Systems 4, 1992, J. Moody, S. Hanson and R. Lippmann (eds.), San Mateo, CA, Morgan Kaufmann.
....into an ensemble method with constant coefficients, because the coefficients are forced towards a common average. We conclude that the optimal value of can give valuable information about the training set. 4 The Family of Gradient Descent ME Methods DynCo is a variation on the ME method presented in [6, 7, 10]. There are a number of similarities: A group of predictors (experts) are combined using weighting with non-constant coefficients (called a gating network in the literature). [Footnote 1: The training sets are described in section 5.] In all methods the experts are trained using gradient descent. The ....
....are to be found in the error function and the architecture. In [6] the error function for each expert is the MSE function. This encourages competition (see below). The error function for the gating network is an ad hoc function that does not guarantee normalized coefficients. The methods in [7, 10] and DynCo use the SOFTMAX function to obtain auto-normalization of the coefficients. In [7] (Jacobs et al.), three different error functions are discussed. The first two are MSE functions for the combined predictor and the experts, respectively. The first error function, $(y - \sum_j c_j f_j)^2$, is the ....
[Article contains additional citation context not shown here]
Jordan, M. I., and Jacobs, R. A. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems (1992), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds., vol. 4, Morgan Kaufmann Publishers, Inc., pp. 985-992.
....training set. The results for the XuME method do not seem to follow the interpretation of (see section 3.6) and we conclude that the information is method dependent. 3.5.3 DynCo Compared to Known Methods DynCo is, as mentioned, a ME method and is related to methods presented in [12], [13] and [16]. There are a number of similarities: A group of predictors are combined using weighting with non-constant coefficients (or gating network). All the different combined predictors are trained using gradient descent. The differences are to be found in the error function and the architecture. In [12] ....
....space towards one for good experts, towards zero for bad experts, and towards 1/(no. of experts) for mediocre experts. At the same time, the error function forces the coefficients towards either one or zero and the sum of the coefficients towards one. In contrast, DynCo and the methods in [13] and [16] use the SOFTMAX function to obtain auto-normalization of the coefficients. In [13] (Jacobs et al.), three different error functions are discussed. The first is $(y - \sum_j c_j f_j)^2$, which (apart from a constant) is the error function used in DynCo. Jacobs et al. don't consider this a well-suited ....
[Article contains additional citation context not shown here]
Jordan, M. I., and Jacobs, R. A. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems (1992), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds., vol. 4, Morgan Kaufmann Publishers, Inc., pp. 985-992.
....the complexity of the remaining problem. What is left are the two U's in figures 12 and 18 and the top left part of the expert in 16, where the output is constant. 3.5.4 DynCo Compared to Known Methods DynCo is, as mentioned, a ME method and is related to methods presented in [14], [15] and [18]. There are a number of similarities: A group of predictors are combined using weighting with non-constant coefficients (or gating network). All the different combined predictors are trained using gradient descent. The differences are to be found in ....
....space towards one for good experts, towards zero for bad experts, and towards 1/(no. of experts) for mediocre experts. At the same time, the error function forces the coefficients towards either one or zero and the sum of the coefficients towards one. In contrast, DynCo and the methods in [15] and [18] use the SOFTMAX function to obtain auto-normalization of the coefficients. In [15] (Jacobs et al.), three different error functions are discussed. The first is $(y - \sum_j c_j f_j)^2$, which (apart from a constant) is the error function used in DynCo. Jacobs et al. don't consider this a well-suited ....
[Article contains additional citation context not shown here]
Jordan, M. I., and Jacobs, R. A. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems (1992), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds., vol. 4, Morgan Kaufmann Publishers, Inc., pp. 985-992.
....the complexity of the remaining problem. What is left are the two U's in figures 12 and 18 and the top left part of the expert in 16, where the output is constant. 3.5.3 DynCo Compared to Known Methods DynCo is, as mentioned, a ME method and is related to methods presented in [14], [15] and [18]. There are a number of similarities: A group of predictors are combined using weighting with non-constant coefficients (or gating network). All the different combined predictors are trained using gradient descent. The differences are to be found in the error function and the architecture. In [14] ....
....space towards one for good experts, towards zero for bad experts, and towards 1/(no. of experts) for mediocre experts. At the same time, the error function forces the coefficients towards either one or zero and the sum of the coefficients towards one. In contrast, DynCo and the methods in [15] and [18] use the SOFTMAX function to obtain auto-normalization of the coefficients. In [15] (Jacobs et al.), three different error functions are discussed. The first is $(y - \sum_j c_j f_j)^2$, which (apart from a constant) is the error function used in DynCo. Jacobs et al. don't consider this a well-suited ....
[Article contains additional citation context not shown here]
Jordan, M. I., and Jacobs, R. A. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems (1992), J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds., vol. 4, Morgan Kaufmann Publishers, Inc., pp. 985-992.
.... et al. 1994). The mixture of experts approach has been extended to a recursively defined hierarchical mixture of experts (HME) architecture in which a tree of gating networks combines the expert networks into successively larger groupings that are defined over nested regions of the input space (Jordan and Jacobs, 1992). A maximum likelihood learning algorithm for the HME architecture has been derived (Jordan and Jacobs, 1994) based on the Expectation-Maximization (EM) principle from statistics (Dempster et al. 1977). The multiple experts model has been extended to deal with unsupervised learning in which the ....
Jordan, M.I., & Jacobs, R.A. (1992). Hierarchies of adaptive experts. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing (pp. 985--993). San Mateo, CA: Morgan Kaufmann.
....an additional network which we call the classification network or the C Net, as illustrated in Fig. 1. The C Net classifies the given views by directly partitioning the input space. This type of modular architecture has been proposed by (Jacobs et al. 1991) based on a stochastic model (see also Jordan & Jacobs, 1992). In this architecture, the final output, y, is given by $y = \sum_i g_i y_i$ (3), where $y_i$ denotes the output of the i-th I Net, and $g_i$ is given by the softmax function $g_i = \exp[s_i] / \sum_j \exp[s_j]$ (4), where $s_i$ is the weighted sum arriving at the i-th output unit of the C Net. For the ....
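The softmax-gated combination quoted in the excerpt above can be sketched in a few lines. This is only an illustration: the function names `softmax` and `mixture_output` are hypothetical, and the gate scores and expert outputs are passed in as plain numbers rather than being produced by trained networks.

```python
import math

def softmax(scores):
    """Return g_i = exp(s_i) / sum_j exp(s_j) for a list of scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(gate_scores, expert_outputs):
    """Combine scalar expert outputs y_i with softmax weights g_i."""
    g = softmax(gate_scores)
    return sum(g_i * y_i for g_i, y_i in zip(g, expert_outputs))

# Equal gate scores give equal weights, so the output is the plain average.
print(mixture_output([0.0, 0.0], [1.0, 3.0]))
```

Because the weights always sum to one, the combined output stays within the range spanned by the individual expert outputs.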
Jordan, M. I. and Jacobs, R. A. (1992). Hierarchies of adaptive experts. In Moody, J. E., Hanson, S. J. & Lippmann, R. P., (eds), Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA. 985-992.
....Further localization is achieved by giving higher learning rates to the better performing expert, on each pattern. This idea was later extended into a tree structure termed Hierarchical Mixture of Experts (HME) in which experts may be built from lower level experts and gating functions (Jordan and Jacobs, 1992). In later work, the EM algorithm was used for training the HME (Jordan and Jacobs, 1994). Waterhouse and Robinson describe how to gradually grow these recursive learning machines (Waterhouse and Robinson, 1996). The Mixture of Experts procedure achieves superior generalization and fast learning ....
Jordan, M. I. and Jacobs, R. A. (1992). Hierarchies of adaptive experts. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, volume 4, pages 985--992. Morgan Kaufmann, San Mateo, CA.
....of the input space. This is in sharp contrast to the hard segmenting discussed earlier for our superresolution methodology. Two gating network structures are reported in the literature: a single layer gating network [Jacobs et al. 1991; Jacobs and Jordan, 1991] and a hierarchical gating network [Jordan and Jacobs, 1992, 1994]. Both of these structures model P(c | x) as previously discussed but the hierarchical gating structure provides a greater flexibility in how the input space of x is soft segmented. The gating function, in our methodology, would be expressed as = g = otherwise c c P 0 ) 1 ....
Jordan, M.I. and Jacobs, R.A. (1992), "Hierarchies of Adaptive Experts," Advances in Neural Information Processing Systems 4, NIPS-4, pp. 985-992.
.... examples of the use of GA and EM methods to find maximum likelihood estimates, in relation to the shared mixture model classifier, are given in [8]. Alternative derivations and estimation procedures for the mixture of experts classifier will be presented in [7]. Other methodologies can be found in [3, 9, 10, 13, 14]. 4. PROCESSING REGIONS OF INTEREST Pixel-level processing generates detections, many of which are false alarms. The purpose of ROI-level processing is to remove the major proportion of these false alarms by processing a region of interest (ROI) surrounding each pixel-level detection. The number ....
M. I. Jordan and R. A. Jacobs. Hierarchies of adaptive experts. In D. Touretzky, editor, Advances in neural information processing systems. Morgan Kaufmann, 1992.
.... the gating module is adjusted so that the a priori probability of selecting each Q module becomes equal to the a posteriori probability of selecting that Q module, given the estimated desired output. [Footnote 3: The interested reader is referred to the descriptions and derivations of GMM in Jacobs et al. [56], Nowlan [81] and Jordan and Jacobs [58].] Because of the different initial values of the free parameters in the different Q modules, over time, different Q modules start winning the competition for different elemental tasks, and the gating module learns to select the appropriate ....
M.I. Jordan and R.A. Jacobs. Hierarchies of adaptive experts. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 985--992. Morgan Kaufmann, 1992.
No context found.
Jordan, M.I. & Jacobs, R.A. (1992). Hierarchies of adaptive experts. In J.E. Moody, S. Hanson & R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo: Morgan Kaufmann, 985-992.
No context found.
Jordan, M. I., & Jacobs, R. A. (1992). Hierarchies of adaptive experts. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann. pp. 985-993.
....is to use piecewise linear approximations to nonlinear functions. This approach generally requires all of the data to be stored so that the piecewise fits can be constructed on the fly (Atkeson, 1990). It is also possible to treat the problem of splitting the space as part of the learning problem (Jordan & Jacobs, 1992). Another large class of algorithms are both nonlinear in the inputs and nonlinear in the parameters. These algorithms include the generalized splines (Wahba, 1990; Poggio & Girosi, 1990), the feedforward neural network (Hinton, 1989) and regression trees (Breiman, Friedman, Olshen, & Stone, ....
....of algorithms are both nonlinear in the inputs and nonlinear in the parameters. These algorithms include the generalized splines (Wahba, 1990; Poggio & Girosi, 1990), the feedforward neural network (Hinton, 1989) and regression trees (Breiman, Friedman, Olshen, & Stone, 1984; Friedman, 1990; Jordan & Jacobs, 1992). For example, the standard two-layer feedforward neural network can be written in the form: $y_i = f(\sum_j w_{ij} f(\sum_k v_{jk} x_k))$ (39), where the parameters $w_{ij}$ and $v_{jk}$ are the weights of the network and the function f is a fixed nonlinearity. Because the weights $v_{jk}$ appear inside the ....
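The two-layer form quoted in the excerpt above can be sketched as a forward pass. This is a minimal illustration, not the cited authors' code: the function name `two_layer_forward` is hypothetical, and `tanh` is an assumed choice for the fixed nonlinearity f, which the excerpt does not specify.

```python
import math

def two_layer_forward(x, v, w, f=math.tanh):
    """Compute y_i = f(sum_j w[i][j] * f(sum_k v[j][k] * x[k])).

    x: list of inputs, v: hidden-layer weights (one row per hidden unit),
    w: output-layer weights (one row per output unit).
    f defaults to tanh, which is an assumption made for this sketch.
    """
    hidden = [f(sum(v_jk * x_k for v_jk, x_k in zip(v_j, x))) for v_j in v]
    return [f(sum(w_ij * h_j for w_ij, h_j in zip(w_i, hidden))) for w_i in w]

# With the identity in place of f, the network reduces to a linear map,
# which shows that the nonlinearity in the parameters comes from f.
identity = lambda z: z
print(two_layer_forward([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], f=identity))
```

The inner weights pass through f before reaching the output, which is why the model is nonlinear in those parameters, as the excerpt notes.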
[Article contains additional citation context not shown here]
Jordan, M. I., & Jacobs, R. A. (1992). Hierarchies of adaptive experts. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in Neural Information Processing Systems 4, pp. 985-993. San Mateo, CA: Morgan Kaufmann.
....(EM) algorithm is an iterative approach to maximum likelihood parameter estimation. Jordan and Jacobs (1994) recently proposed an EM algorithm for the mixture of experts architecture of Jacobs, Jordan, Nowlan and Hinton (1991) and the hierarchical mixture of experts architecture of Jordan and Jacobs (1992). They showed empirically that the EM algorithm for these architectures yields significantly faster convergence than gradient ascent. In the current paper we provide a theoretical analysis of this algorithm. We show that the algorithm can be regarded as a variable metric algorithm with its ....
....architecture for supervised learning. The architecture involves a set of function approximators (expert networks) that are combined by a classifier (gating network). These networks are trained simultaneously so as to split the input space into regions where particular experts can specialize. Jordan and Jacobs (1992) extended this approach to a recursively defined architecture in which a tree of gating networks combines the expert networks into successively larger groupings that are defined over nested regions of the input space. This hierarchical mixture of experts (HME) architecture is closely related to ....
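The recursive idea described in the excerpt above, a tree of gating networks combining experts into successively larger groupings, can be sketched for a two-level tree. This is a hedged illustration only: the names `softmax` and `hme_output` are hypothetical, the softmax gate is assumed, and the gate scores are supplied directly instead of being computed from an input by trained gating networks.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hme_output(top_scores, branch_scores, expert_outputs):
    """Two-level hierarchical mixture: sum_i g_i * sum_j g_{j|i} * y_ij.

    top_scores: one score per branch (top-level gate).
    branch_scores[i]: scores of the nested gate inside branch i.
    expert_outputs[i][j]: scalar output of expert j in branch i.
    """
    top = softmax(top_scores)
    total = 0.0
    for g_i, scores_i, outputs_i in zip(top, branch_scores, expert_outputs):
        nested = softmax(scores_i)
        branch_value = sum(g_ji * y for g_ji, y in zip(nested, outputs_i))
        total += g_i * branch_value
    return total

# Uniform gates at both levels average all four experts.
print(hme_output([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Each branch behaves like a small mixture of experts in its own right, and the top gate mixes the branches, which is what makes the definition recursive.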
[Article contains additional citation context not shown here]
Jordan, M.I. & Jacobs, R.A. (1992). Hierarchies of adaptive experts. In J.E. Moody, S. Hanson & R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo: Morgan Kaufmann, 985-992.
....may also play a role. Appendix For completeness, this appendix provides the equations governing learning in the modular architecture. These equations and a fuller discussion of them may be found in Jacobs, Jordan, Nowlan, and Hinton (1991), Jacobs and Jordan (1991), Nowlan and Hinton (1991), and Jordan and Jacobs (1992). The parameters of the expert and gating networks are adjusted simultaneously using the backpropagation algorithm (Rumelhart, Hinton, and Williams, 1986) so as to maximize the objective function $\ln L = \ln \sum_{i=1}^{n} \frac{g_i}{\sigma_i} e^{-\frac{1}{2\sigma_i^2} \|y - y_i\|^2}$ (5), where y denotes ....
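The objective in the excerpt above is damaged in extraction. The sketch below assumes it reads as the log of a sum over experts of (g_i / sigma_i) * exp(-||y - y_i||^2 / (2 sigma_i^2)); that reading, the function name `log_likelihood`, and the argument layout are all assumptions made for illustration.

```python
import math

def log_likelihood(g, sigma, y, expert_outputs):
    """Evaluate ln sum_i (g_i / sigma_i) * exp(-||y - y_i||^2 / (2 sigma_i^2)).

    g: gating weights, sigma: per-expert scale parameters,
    y: reference vector, expert_outputs[i]: output vector y_i of expert i.
    The functional form is an assumed reading of the garbled source equation.
    """
    total = 0.0
    for g_i, s_i, y_i in zip(g, sigma, expert_outputs):
        sq_dist = sum((a - b) ** 2 for a, b in zip(y, y_i))
        total += (g_i / s_i) * math.exp(-sq_dist / (2.0 * s_i ** 2))
    return math.log(total)

# An expert that matches y exactly contributes its full gating weight.
print(log_likelihood([0.5, 0.5], [1.0, 1.0], [0.0], [[0.0], [2.0]]))
```

Under this reading, experts whose outputs lie close to y dominate the sum, so raising their gating weights raises the objective.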
Jordan, M. I. & Jacobs, R. A. (1992) Hierarchies of adaptive experts. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers.
No context found.
M. I. Jordan and R. A. Jacobs. Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems 4, pages 985--992. Morgan Kaufmann, San Mateo, CA, 1992.
No context found.
M. I. Jordan and R. A. Jacobs. Hierarchies of adaptive experts. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 985-993. Morgan Kaufmann, 1992.
No context found.
Jordan, M., & Jacobs, R. (1992). Hierarchies of Adaptive Experts. In Moody, J., Hanson, S., & Lippmann, R. (Eds.), Advances in Neural Information Processing Systems, Vol. 4, pp.
No context found.
Jordan, M. I. and Jacobs, R. A. (1992). Hierarchies of Adaptive Experts. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4, pages 985--992. Morgan Kaufmann, San Mateo, CA.