Results 1–10 of 154
An introduction to variational methods for graphical models
 To appear in: M. I. Jordan (Ed.), Learning in Graphical Models
Training Products of Experts by Minimizing Contrastive Divergence
, 2002
Abstract

Cited by 850 (75 self)
It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual “expert” models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called “contrastive divergence”, whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
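The training scheme this abstract describes can be illustrated on the simplest product of experts, a small Bernoulli restricted Boltzmann machine, where each hidden unit is one "expert". This is only a sketch: the network sizes, learning rate, and toy dataset are invented, and the reconstruction error printed at the end is merely a rough progress proxy, not the contrastive-divergence objective itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny Bernoulli RBM; each hidden unit acts as one "expert", and the joint
# model is the renormalized product of their distributions. Sizes are made up.
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def cd1_step(v0, lr=0.1):
    """One CD-1 update: approximate the likelihood gradient by
    <v h>_data - <v h>_reconstruction, sidestepping the intractable
    renormalization term."""
    global W, b_v, b_h
    ph0 = sigmoid(v0 @ W + b_h)            # exact inference: hiddens conditionally independent given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b_v)          # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + b_h)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2)) # reconstruction error, a rough progress proxy

# Toy data: copies of two binary prototypes.
protos = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = protos[rng.integers(0, 2, size=200)]
errs = [cd1_step(data) for _ in range(300)]
print(f"reconstruction error: {errs[0]:.3f} -> {errs[-1]:.3f}")
```

Note how the positive phase uses the data while the negative phase uses only a one-step reconstruction, which is what makes the gradient cheap to approximate.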
Loopy belief propagation for approximate inference: An empirical study
 In: Proceedings of Uncertainty in AI
, 1999
Abstract

Cited by 676 (15 self)
Abstract. Recently, researchers have demonstrated that "loopy belief propagation" (the use of Pearl's polytree algorithm in a Bayesian network with loops) can perform well in the context of error-correcting codes. The most dramatic instance of this is the near Shannon-limit performance of "Turbo Codes", codes whose decoding algorithm is equivalent to loopy belief propagation in a chain-structured Bayesian network. In this paper we ask: is there something special about the error-correcting code context, or does loopy propagation work as an approximate inference scheme in a more general setting? We compare the marginals computed using loopy propagation to the exact ones in four Bayesian network architectures, including two real-world networks: ALARM and QMR. We find that the loopy beliefs often converge and, when they do, they give a good approximation to the correct marginals. However, on the QMR network, the loopy beliefs oscillated and had no obvious relationship to the correct posteriors. We present some initial investigations into the cause of these oscillations, and show that some simple methods of preventing them lead to the wrong results.

Introduction

The task of calculating posterior marginals on nodes in an arbitrary Bayesian network is known to be NP-hard. In this paper we investigate the approximation performance of "loopy belief propagation": using the well-known Pearl polytree algorithm [12] on a Bayesian network with loops (undirected cycles). The algorithm is an exact inference algorithm for singly connected networks: the beliefs converge to the correct marginals in a number of iterations equal to the diameter of the graph. However, as Pearl noted, the same algorithm will not give the correct beliefs for multiply connected networks: "When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble."
We believe there are general undiscovered theorems about the performance of belief propagation on loopy DAGs. These theorems, which may have nothing directly to do with coding or decoding, will show that in some sense belief propagation "converges with high probability to a near-optimum value" of the desired belief on a class of loopy DAGs.

Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop:
• Unless all the conditional probabilities are deterministic, belief propagation will converge.
• There is an analytic expression relating the correct marginals to the loopy marginals. The approximation error is related to the convergence rate of the messages: the faster the convergence, the more exact the approximation.
• If the hidden nodes are binary, then thresholding the loopy beliefs is guaranteed to give the most probable assignment, even though the numerical value of the beliefs may be incorrect. This result only holds for nodes in the loop.

Related results have been obtained by Weiss for the max-product (or "belief revision") version, and by Richardson for the case of networks with multiple loops. To summarize, what is currently known about loopy propagation is that (1) it works very well in an error-correcting code setting and (2) there are conditions for a single-loop network under which it can be guaranteed to work well. In this paper we investigate loopy propagation empirically under a wider range of conditions. Is there something special about the error-correcting code setting, or does loopy propagation work as an approximation scheme for a wider range of networks?

Pearl's algorithm computes the belief at a node X as

    BEL(x) = α λ(x) π(x)    (1)

where

    λ(x) = ∏_j λ_{Y_j}(x)    and    π(x) = Σ_u P(x | u) ∏_i π_X(u_i).

The message X passes to its parent U_i is given by

    λ_X(u_i) = α Σ_x λ(x) Σ_{u_k : k≠i} P(x | u) ∏_{k≠i} π_X(u_k),

and the message X sends to its child Y_j is given by

    π_{Y_j}(x) = α π(x) ∏_{k≠j} λ_{Y_k}(x).

For noisy-or links between parents and children, there exists an analytic expression for π(x) and λ_X(u_i) that avoids the exhaustive enumeration over parent configurations. We made a slight modification to the update rules in that we normalized both λ and π messages at each iteration; as Pearl points out, this normalization has no effect on the final beliefs. Nodes were updated in parallel: at each iteration all nodes calculated their outgoing messages based on the incoming messages of their neighbors from the previous iteration. The messages were said to converge if none of the beliefs in successive iterations changed by more than a small threshold (10^-4). All messages were initialized to a vector of ones; random initialization yielded similar results, since the initial conditions rapidly get "washed out". For comparison, we also implemented likelihood weighting.

3.1 The PYRAMID network

All nodes were binary and the conditional probabilities were represented by tables; entries in the conditional probability tables (CPTs) were chosen uniformly in the range (0, 1].

3.2 The toyQMR network

All nodes were binary and the conditional probabilities of the leaves were represented by a noisy-or:

    P(Child = 0 | Parents) = exp(-θ_0 - Σ_i θ_i Parent_i),

where θ_0 represents the "leak" term.

The QMR-DT network

The QMR-DT is a bipartite network whose structure is the same as that shown in Figure 2, but the size is much larger. There are approximately 600 disease nodes and approximately 4000 finding nodes, with a number of observed findings that varies per case. Due to the form of the noisy-or CPTs, the complexity of inference is exponential in the number of positive findings.

Results

Initial experiments

The experimental protocol for the PYRAMID network was as follows. For each experimental run, we first generated random CPTs.
We then sampled from the joint distribution defined by the network and clamped the observed nodes (all nodes in the bottom layer) to their sampled values. Given a structure and observations, we then ran three inference algorithms: junction tree, loopy belief propagation, and sampling. We found that loopy belief propagation always converged in this case, with the average number of iterations equal to 10.2.

The experimental protocol for the toyQMR network was similar to that of the PYRAMID network, except that we randomized over structure as well. Again we found that loopy belief propagation always converged, with the average number of iterations equal to 8.65.

The protocol for the ALARM network experiments differed from the previous two in that the structure and parameters were fixed; only the observed evidence differed between experimental runs. We assumed that all leaf nodes were observed and calculated the posterior marginals of all other nodes. Again we found that loopy belief propagation always converged, with the average number of iterations equal to 14.55.

[Figure 2: The structure of a toyQMR network. This is a bipartite structure where the conditional distributions of the leaves are noisy-or's. The network shown represents one sample from randomly generated structures where the parents of each symptom were a random subset of the diseases.]

The results presented up until now show that loopy propagation performs well for a variety of architectures involving multiple loops. We now present results for the QMR-DT network, which are not as favorable. In the QMR-DT network there was no randomization. We used the fixed structure and calculated posteriors for the four cases for which posteriors have been calculated exactly by Heckerman.

What causes convergence versus oscillation?

What our initial experiments show is that loopy propagation does a good job of approximating the correct posteriors if it converges.
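As a rough illustration of the scheme described above (parallel updates, normalized messages, a 10^-4 convergence threshold, comparison against exact marginals), the sketch below runs sum-product belief propagation on a small pairwise network with a single loop. This is the undirected pairwise analogue of Pearl's λ/π rules, not the paper's directed noisy-or networks, and all potentials here are made up.

```python
import itertools
import numpy as np

# Four binary nodes arranged in a single loop; random positive potentials.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rng = np.random.default_rng(1)
unary = rng.random((4, 2)) + 0.1
pair = {e: rng.random((2, 2)) + 0.1 for e in edges}

neighbors = {n: [] for n in range(4)}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def edge_pot(i, j, xi, xj):
    return pair[(i, j)][xi, xj] if (i, j) in pair else pair[(j, i)][xj, xi]

def beliefs(msgs):
    b = np.array([unary[n] * np.prod([msgs[(k, n)] for k in neighbors[n]], axis=0)
                  for n in range(4)])
    return b / b.sum(axis=1, keepdims=True)

msgs = {(i, j): np.ones(2) / 2 for i in range(4) for j in neighbors[i]}
for it in range(200):                      # parallel ("flooding") schedule
    old = beliefs(msgs)
    new_msgs = {}
    for i in range(4):
        for j in neighbors[i]:
            prod = unary[i].copy()         # incoming messages, excluding the one from j
            for k in neighbors[i]:
                if k != j:
                    prod = prod * msgs[(k, i)]
            m = np.array([sum(edge_pot(i, j, xi, xj) * prod[xi] for xi in range(2))
                          for xj in range(2)])
            new_msgs[(i, j)] = m / m.sum() # normalize each message, as in the paper
    msgs = new_msgs
    if np.max(np.abs(beliefs(msgs) - old)) < 1e-4:   # the paper's convergence threshold
        break

# Exact marginals by brute force over all 2^4 joint states.
joint = np.zeros((2,) * 4)
for x in itertools.product(range(2), repeat=4):
    p = np.prod([unary[n][x[n]] for n in range(4)])
    for i, j in edges:
        p *= pair[(i, j)][x[i], x[j]]
    joint[x] = p
joint /= joint.sum()
exact = np.array([joint.sum(axis=tuple(a for a in range(4) if a != n)) for n in range(4)])
loopy = beliefs(msgs)
print("max |loopy - exact| over node marginals:", np.abs(loopy - exact).max())
```

Consistent with the single-loop results quoted earlier, the iteration converges here (the potentials are not deterministic) and the loopy marginals land close to, but not exactly on, the enumerated ones.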
Unfortunately, on the most challenging case, the QMR-DT network, the algorithm did not converge. We wanted to see if this oscillatory behavior in the QMR-DT case was related to the size of the network: does loopy propagation tend to converge less for large networks than for small networks?

To investigate this question, we tried to cause oscillation in the toyQMR network. We first asked what, besides the size, is different between toyQMR and real QMR? An obvious difference is in the parameter values: while the CPTs for toyQMR are random, the real QMR parameters are not. In particular, the prior probability of a disease node being on is extremely low in the real QMR (typically of the order of 10^-3). Would low priors cause oscillations in the toyQMR case? To answer this question we repeated the experiments reported in the previous section, but rather than having the prior probability of each node be randomly selected in the range [0, 1], we selected the prior uniformly in the range [0, U] and varied U. Unlike the previous simulations, we did not set the observed nodes by sampling from the joint: for low priors all the findings would be negative and inference would be trivial. Rather, each finding was independently set to positive or negative.

If indeed small priors are responsible for the oscillation, then we would expect the real QMR network to converge if the priors were sampled randomly in the range [0, …]. Small priors are not the only thing that causes oscillation; small weights can, too.

[Figure: the effect of both. The exact marginals are represented by the circles; the ends of the "error bars" represent the loopy marginals at the last two iterations. We only plot the diseases which had non-negligible posterior probability.]

To test the hypothesis that untypical evidence causes oscillation, we reparameterized the pyramid network as follows: we set the prior probability of the "1" state of the root nodes to 0.9, and we utilized the noisy-OR model for the other nodes with a small (0.1) inhibition probability (apart from the leak term, which we inhibited with probability 0.9). This parameterization has the effect of propagating 1's from the top layer to the bottom. Thus the true marginal at each leaf is approximately (0.1, 0.9), i.e., the leaf is 1 with high probability. We then generated untypical evidence at the leaves by sampling from the uniform distribution, (0.5, 0.5), or from the skewed distribution (0.9, 0.1). We found that loopy propagation still converged², and that, as before, the marginals to which it converged were highly correlated with the correct marginals. Thus there must be some other explanation, besides untypicality of the evidence, for the oscillations observed in QMR.

Can we fix oscillations easily?

When loopy propagation oscillates between two steady states, it seems reasonable to try to find a way to combine the two values. The simplest thing to do is to average them. Unfortunately, this gave very poor results, since the correct posteriors do not usually lie at the midpoint of the interval.

We also tried to avoid oscillations by using "momentum": replacing the messages that were sent at time t with a weighted average of the messages at times t and t-1. That is, we replaced the reference to λ^(t) in the update equations with

    λ̃^(t) = μ λ^(t-1) + (1 - μ) λ^(t),

and similarly for π^(t), where 0 ≤ μ ≤ 1 is the momentum term. It is easy to show that if the modified system of equations converges to a fixed point F, then F is also a fixed point of the original system (since if λ^(t) = λ^(t-1), the damped update yields λ^(t)).

² More precisely, we found that with a convergence threshold of 10^-4, 98 out of 100 cases converged; when we lowered the threshold to 10^-3, all 100 cases converged.
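The effect of the momentum term can be seen in a scalar caricature: a fixed-point iteration that settles into a period-2 oscillation under the plain update converges to its (unchanged) fixed point once damped. The logistic map below is only an illustrative stand-in for an oscillating message update; it is not the paper's system.

```python
# Momentum (damping) applied to an oscillating fixed-point iteration.

def f(x):
    return 3.2 * x * (1.0 - x)           # plain update: oscillates (period 2)

def iterate(mu, steps=500):
    x = 0.3
    for _ in range(steps):
        x = mu * x + (1.0 - mu) * f(x)   # mu = 0 recovers the plain update
    return x

plain_a, plain_b = iterate(0.0, 500), iterate(0.0, 501)   # two successive plain iterates
damped = iterate(0.5, 500)
fixed = 1.0 - 1.0 / 3.2                                   # exact fixed point of f
print(f"plain: {plain_a:.4f} vs {plain_b:.4f}; damped: {damped:.4f}; fixed point: {fixed:.4f}")
```

Just as the text argues, any fixed point of the damped map (x = μx + (1-μ)f(x) with μ < 1) satisfies x = f(x), so damping changes which points are reachable, not where they are.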
In the experiments for which loopy propagation converged (PYRAMID, toyQMR and ALARM), we found that adding the momentum term did not change the results: the beliefs that resulted were the same beliefs found without momentum. In the experiments which did not converge (toyQMR with small priors and real QMR), we found that momentum significantly reduced the chance of oscillation. However, in several cases the beliefs to which the algorithm converged were quite inaccurate.

Discussion

The experimental results presented here suggest that loopy propagation can yield accurate posterior marginals in a more general setting than that of error-correcting coding: the PYRAMID, toyQMR and ALARM networks are quite different from the error-correcting coding graphs, yet the loopy beliefs show high correlation with the correct marginals.

In error-correcting codes the posterior is typically highly peaked, and one might think that this feature is necessary for the good performance of loopy propagation. Our results suggest that is not the case: in none of our simulations were the posteriors highly peaked around a single joint configuration. If the probability mass were concentrated at a single point, the marginal probabilities should all be near zero or one; this is clearly not the case, as can be seen in the figures.

It might be expected that loopy propagation would only work well for graphs with large loops. However, our results, and previous results on turbo codes, show that loopy propagation can also work well for graphs with many small loops.

At the same time, our experimental results suggest a cautionary note about loopy propagation, showing that the marginals may exhibit oscillations that have very little correlation with the correct marginals. We presented some preliminary results investigating the cause of the oscillations and showed that it is not simply a matter of the size of the network or the number of parents.
Rather, the same structure with different parameter values may oscillate or exhibit stable behavior. For all our simulations, we found that when loopy propagation converges, it gives a surprisingly good approximation to the correct marginals. Since the distinction between convergence and oscillation is easy to make after a small number of iterations, this may suggest a way of checking whether loopy propagation is appropriate for a given problem.

Acknowledgements

We thank Tommi Jaakkola, David Heckerman and David MacKay for useful discussions. We also thank Randy Miller and the University of Pittsburgh for the use of the QMR-DT database. Supported by MURI ARO DAAH04-96-1-0341.
The Infinite Hidden Markov Model
 Machine Learning
, 2002
Abstract

Cited by 637 (41 self)
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite: consider, for example, symbols being possible words appearing in English text.
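A quick way to see the "expected number of distinct states in a finite sequence" point is to simulate a plain Dirichlet process via the Chinese restaurant process, a deliberate simplification of the paper's hierarchical construction; the concentration values and sequence length below are arbitrary. Although the state space is countably infinite, a finite sequence touches only a slowly growing number of states, controlled by the concentration parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_num_states(alpha, T):
    """Run a length-T Chinese restaurant process with concentration alpha
    and return the number of distinct states (tables) used."""
    counts = []                              # occupancy of each state seen so far
    for _ in range(T):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)                 # a brand-new state is created
        else:
            counts[k] += 1
    return len(counts)

means = {}
for alpha in (0.5, 2.0, 10.0):
    ks = [crp_num_states(alpha, 1000) for _ in range(20)]
    means[alpha] = float(np.mean(ks))
    print(f"alpha={alpha:5}: mean distinct states over 1000 steps = {means[alpha]:.1f}")
```

The number of occupied states grows roughly logarithmically with the sequence length and linearly with the concentration, which is the sense in which a hyperparameter controls the expected number of distinct hidden states.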
Independent Factor Analysis
 Neural Computation
, 1999
Abstract

Cited by 277 (9 self)
We introduce the independent factor analysis (IFA) method for recovering independent hidden sources from their observed mixtures. IFA generalizes and unifies ordinary factor analysis (FA), principal component analysis (PCA), and independent component analysis (ICA), and can handle not only square noiseless mixing, but also the general case where the number of mixtures differs from the number of sources and the data are noisy. IFA is a two-step procedure. In the first step, the source densities, mixing matrix and noise covariance are estimated from the observed data by maximum likelihood. For this purpose we present an expectation-maximization (EM) algorithm, which performs unsupervised learning of an associated probabilistic model of the mixing situation. Each source in our model is described by a mixture of Gaussians; thus all the probabilistic calculations can be performed analytically. In the second step, the sources are reconstructed from the observed data by an optimal nonlinear ...
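The generative model this abstract describes (non-Gaussian sources, non-square mixing, additive noise) can be sketched as follows. All dimensions and parameter values are invented, and the snippet only samples from the model and checks its second-order structure; it does not implement the paper's EM estimation or source reconstruction.

```python
import numpy as np

# Generative model assumed by IFA: each hidden source is an independent
# mixture of Gaussians, observations are a linear mixture plus Gaussian
# noise, and the mixing matrix need not be square.
rng = np.random.default_rng(0)
n_src, n_obs, T = 2, 5, 20000

def sample_source(size):
    """A 2-component Gaussian mixture: bimodal, hence clearly non-Gaussian."""
    comp = rng.integers(0, 2, size)                  # mixture component per sample
    means = np.where(comp == 0, -2.0, 2.0)
    return rng.normal(means, 0.5)

S = np.stack([sample_source(T) for _ in range(n_src)])   # sources, shape (n_src, T)
H = rng.standard_normal((n_obs, n_src))                  # non-square mixing matrix
noise_var = 0.1
X = H @ S + np.sqrt(noise_var) * rng.standard_normal((n_obs, T))

# Sanity check of the model's second-order structure:
# cov(X) should be close to H cov(S) H^T + noise_var * I.
emp = np.cov(X)
model = H @ np.cov(S) @ H.T + noise_var * np.eye(n_obs)
print("max covariance mismatch:", np.abs(emp - model).max())
```

Second-order statistics alone cannot separate such sources (that is PCA/FA territory); it is the mixture-of-Gaussians source densities that let IFA go further while keeping every expectation analytic.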
A Variational Bayesian Framework for Graphical Models
 In Advances in Neural Information Processing Systems 12
, 2000
Abstract

Cited by 267 (7 self)
This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models. Our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. These posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. Unlike in large-sample approximations, the posteriors are generally non-Gaussian and no Hessian needs to be computed. Predictive quantities are obtained analytically. The resulting algorithm generalizes the standard Expectation-Maximization algorithm, and its convergence is guaranteed. We demonstrate that this approach can be applied to a large class of models in several domains, including mixture models and source separation.

1 Introduction

A standard method to learn a graphical model from data is maximum likelihood (ML). Given a training dataset, ML estimates a single optimal value f...
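The flavor of such free-form conjugate updates can be shown on a textbook special case: variational Bayes for the mean and precision of a one-dimensional Gaussian under a Normal-Gamma prior. This is not the paper's framework itself, and the prior values and synthetic data below are made up, but it exhibits the advertised properties: a factorized posterior q(mu)q(tau) found by coordinate updates, every quantity analytic, no Hessian, and a non-Gaussian factor (the Gamma over the precision).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=500)          # synthetic data: mean 3, std 2
N, xbar = len(x), x.mean()

# Conjugate Normal-Gamma prior: mu | tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3   # vague prior (values made up)

E_tau = 1.0                                  # initial guess for E_q[tau]
for _ in range(100):                         # free-form coordinate ascent
    # q(mu) = Normal(m, 1/lam_mu)
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_mu = (lam0 + N) * E_tau
    # q(tau) = Gamma(aN, bN), using E_q[(x_i - mu)^2] = (x_i - m)^2 + 1/lam_mu
    aN = a0 + (N + 1) / 2.0
    bN = b0 + 0.5 * (lam0 * ((m - mu0) ** 2 + 1.0 / lam_mu)
                     + np.sum((x - m) ** 2) + N / lam_mu)
    E_tau = aN / bN
print(f"posterior mean of mu: {m:.3f}; implied variance 1/E[tau]: {bN / aN:.3f}")
```

Each update is the expectation of the log joint under the other factor, so the iteration is guaranteed to converge, mirroring the EM-like guarantee claimed in the abstract.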
The Bayesian Structural EM Algorithm
, 1998
Abstract

Cited by 260 (13 self)
In recent years there has been a flurry of works on learning Bayesian networks from data. One of the hard problems in this area is how to effectively learn the structure of a belief network from incomplete data, that is, in the presence of missing values or hidden variables. In a recent paper, I introduced an algorithm called Structural EM that combines the standard Expectation-Maximization (EM) algorithm, which optimizes parameters, with structure search for model selection. That algorithm learns networks based on penalized likelihood scores, which include the BIC/MDL score and various approximations to the Bayesian score. In this paper, I extend Structural EM to deal directly with Bayesian model selection. I prove the convergence of the resulting algorithm and show how to apply it for learning a large class of probabilistic models, including Bayesian networks and some variants thereof.
A Guide to the Literature on Learning Probabilistic Networks From Data
, 1996
Abstract

Cited by 204 (0 self)
This literature review discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics. Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The presentation avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples.

Keywords: Bayesian networks, graphical models, hidden variables, learning, learning structure, probabilistic networks, knowledge discovery.

I. Introduction

Probabilistic networks or probabilistic gra...
Learning Deep Architectures for AI
Abstract

Cited by 183 (30 self)
Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Deep architectures are composed of multiple levels of nonlinear operations, such as in neural nets with many hidden layers or in complicated propositional formulae reusing many subformulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This paper discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.