
## Structured Stochastic Variational Inference

### Citations

4363 | Latent Dirichlet allocation
- Blei, Ng, et al.
- 2003
Citation Context ...ee hierarchical Bayesian models. In each case, relaxing the mean-field approximation allows SSVI and SSVI-A to find significantly better parameter estimates than mean-field. We also find evidence suggesting that this superior performance is primarily due to SSVI/SSVI-A’s ability to avoid local optima. 4.1 Latent Dirichlet allocation We evaluated the quality of parameter estimates from SSVI and SSVI-A on the latent Dirichlet allocation (LDA) topic model fit to the 3,800,000-document Wikipedia dataset from (Hoffman et al., 2013). We compared with full mean-field stochastic variational inference (Blei et al., 2003; Hoffman et al., 2010a), a mean-field M-step with a Gibbs sampling E-step (Mimno et al., 2012), SSVI with Gibbs, and SSVI-A with Gibbs. Results for other E-step/M-step combinations are in the supplement. To speed up learning, at each update we subsample a minibatch of 1,000 documents rather than analyzing the whole dataset each iteration. We also experimented with various settings of the hyperparameters α and η, to which mean-field variational inference for LDA is known to be quite sensitive (Asuncion et al., 2009). For all algorithms we used a step size schedule ρ(t) = t^(-0.75). We held out a tes...
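The decaying step size ρ(t) = t^(-0.75) mentioned in this context is a standard Robbins–Monro schedule. A minimal sketch of such a schedule and the kind of blended stochastic update it drives (function names and the optional delay parameter are mine, not the paper's):

```python
# Sketch of a Robbins-Monro step-size schedule, rho(t) = (t + tau)^(-kappa).
# For convergence of stochastic updates, kappa should lie in (0.5, 1];
# the experiment described above uses kappa = 0.75 with tau = 0 (assumed).

def step_size(t, kappa=0.75, tau=0.0):
    """Step size at iteration t >= 1."""
    return (t + tau) ** (-kappa)

def stochastic_update(lam, noisy_estimate, t, kappa=0.75):
    """Blend the current estimate with a noisy minibatch-based estimate."""
    rho = step_size(t, kappa)
    return (1.0 - rho) * lam + rho * noisy_estimate
```

Note that ρ(1) = 1, so the first update simply overwrites the initialization with the first noisy estimate; later updates are damped more and more heavily.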

2193 | Bayesian Data Analysis
- Gelman
- 1995
Citation Context ...den variables z1:N are conditionally independent of one another given the global parameters β. We will restrict our attention to conditionally conjugate models. Specifically, we assume that p(β) = h(β) exp{η · t(β) − A(η)}, p(yn, zn|β) = exp{t(β) · ηn(yn, zn) + gn(yn, zn)}, (1) where the base measure h and log-normalizer A are scalar-valued functions, η is a vector of natural parameters, t(β) is a vector-valued sufficient statistic function, gn is a scalar-valued function and ηn is a vector-valued function. This form for p(yn, zn, β) includes all conjugate pairs of distributions p(β), p(yn, zn|β) (Gelman et al., 2013); that is, it is the most general family of distributions for which the conditional p(β|y, z) is in the same family as the prior p(β). This conditional is p(β|y, z) = h(β) exp{(η + ∑_n ηn(yn, zn)) · t(β) − A(η + ∑_n ηn(yn, zn))}. (2) These restrictions are a weaker version of those imposed by Hoffman et al. (2013); the difference is that we make no assumptions about the tractability of the conditional distributions p(zn|yn, β) or p(zn,m|yn, zn,\m, β). This work is therefore applicable to any model that fits in the SVI framework, including mixture models, LDA, hidden Markov models (HMMs), factor...
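Equation (2) says the conjugate posterior keeps the prior's exponential-family form, with natural parameters η + ∑_n ηn(yn, zn). A small illustration for one concrete conjugate pair (a hypothetical Beta–Bernoulli example of mine, not taken from the paper): the Beta(a, b) prior has η = (a − 1, b − 1) with t(β) = (log β, log(1 − β)), and each Bernoulli observation contributes ηn(yn) = (yn, 1 − yn).

```python
# Beta-Bernoulli instance of equation (2): posterior natural parameters are
# the prior's natural parameters plus the summed per-observation statistics.

def beta_bernoulli_posterior(a, b, ys):
    eta = (a - 1.0, b - 1.0)                       # prior natural parameters
    eta_sum = (sum(ys), sum(1 - y for y in ys))    # sum_n eta_n(y_n)
    post = (eta[0] + eta_sum[0], eta[1] + eta_sum[1])
    # Convert back to the usual Beta(a', b') parameterization.
    return post[0] + 1.0, post[1] + 1.0

# Beta(2, 2) prior with observations [1, 1, 0] yields Beta(4, 3).
```

This recovers the familiar conjugate update a' = a + ∑ yn, b' = b + ∑ (1 − yn), which is exactly what (2) asserts in general.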

1145 | Data analysis using regression and multilevel/hierarchical models
- Gelman, Hill
- 2006
Citation Context ...n et al. (2013); the difference is that we make no assumptions about the tractability of the conditional distributions p(zn|yn, β) or p(zn,m|yn, zn,\m, β). This work is therefore applicable to any model that fits in the SVI framework, including mixture models, LDA, hidden Markov models (HMMs), factorial HMMs, Kalman filters, factor analyzers, probabilistic matrix factorizations, hierarchical linear regression, hierarchical probit regression, and many other hierarchical models. Unlike SVI, it can also address models without tractable local conditionals, such as multilevel logistic regressions (Gelman and Hill, 2007) or the correlated topic model (Blei and Lafferty, 2006). The conjugacy assumptions above simplify the form of the SSVI updates in algorithm 1, but they could be relaxed. However, the simpler SSVI-A updates in algorithm 1 depend strongly on these assumptions. 2.2 Approximating Distribution Our goal is to approximate the intractable posterior p(z, β|y) with a distribution q(z, β) in some restricted, tractable family. We will choose a q distribution from this family by solving an optimization problem, minimizing the Kullback-Leibler (KL) divergence between q(z, β) and the posterior p(z, β|y). Th... |

1062 | Pattern recognition and machine learning
- Bishop
- 2009
Citation Context ... q(β) to be in the same exponential family as the prior p(β), so that q(β) = h(β) exp{λ · t(β) − A(λ)}. λ is a vector of free parameters that controls q(β). We also require that any dependence under q between zn and β be mediated by some vector-valued function γn(β), so that we may write q(zn|β) = q(zn|γn(β)). This form for q allows for rich dependencies between nearly all model variables. This comes at a cost, however. In mean-field variational inference, we maximize a lower bound on the marginal probability of the data; this is equivalent to minimizing the KL divergence from q to the posterior (Bishop, 2006). However, this lower bound contains expectations that become impossible to compute when we allow zn to depend on β in q. This issue may seem insurmountable, but even though we cannot compute the bound, we can still optimize it using stochastic optimization. 2.3 The Structured Variational Objective Our goal is to find a distribution q(β, z) that has low KL divergence to the posterior p(β, z|y). The KL divergence between q and the full posterior is KL(q(z, β) || p(z, β|y)) = −E_q[log p(y, z, β)] + E_q[log q(z, β)] + log p(y). (4) Because the KL divergence must be non-negative, this yields the evidence lo...
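The key move in this context is that an expectation with no closed form can still be estimated unbiasedly by sampling, which is all stochastic optimization needs. A minimal sketch of that idea (a toy setup of mine, not the paper's model or objective):

```python
import random

# When E_q[f(beta)] in the lower bound cannot be computed analytically,
# an average of f over draws beta ~ q is still an unbiased estimate of it.

def monte_carlo_expectation(sample_q, f, num_samples=5000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(sample_q(rng))
    return total / num_samples

# Toy check: E[beta^2] for beta ~ Uniform(0, 1) is 1/3.
estimate = monte_carlo_expectation(lambda rng: rng.random(), lambda b: b * b)
```

In SSVI the same trick is applied to the intractable terms of the structured bound: noisy but unbiased gradient estimates built from samples of β replace the closed-form expectations that mean-field relies on.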

818 | Graphical models, exponential families, and variational inference
- Wainwright, Jordan
- 2008
Citation Context ...difficult to trust and interpret in this setting. Mean-field variational inference approximates the intractable posterior distribution implied by the model and data with a factorized approximating distribution in which all parameters are independent. This mean-field distribution is then tuned to minimize its Kullback-Leibler divergence to the posterior, which is equivalent to maximizing a lower bound on the marginal probability of the data. The restriction to factorized distributions makes the problem tractable, but reduces the fidelity of the approximation and introduces local optima (Wainwright and Jordan, 2008). A partial remedy is to weaken the mean-field factorization by restoring some dependencies, resulting in “structured” mean-field approximations (Saul and Jordan, 1996). The applicability, speed, effectiveness, and ease of implementation of standard structured mean-field algorithms are limited because the lower bound implied by the structured distribution must be available in closed form. More recent work manages these intractable variational lower bounds using stochastic optimization, which allows one to optimize functions that can only be computed approximately. For example, Ji et al. (2010) u...

681 | Dynamic Topic Models
- Blei, Lafferty
- 2006
Citation Context ...umptions about the tractability of the conditional distributions p(zn|yn, β) or p(zn,m|yn, zn,\m, β). This work is therefore applicable to any model that fits in the SVI framework, including mixture models, LDA, hidden Markov models (HMMs), factorial HMMs, Kalman filters, factor analyzers, probabilistic matrix factorizations, hierarchical linear regression, hierarchical probit regression, and many other hierarchical models. Unlike SVI, it can also address models without tractable local conditionals, such as multilevel logistic regressions (Gelman and Hill, 2007) or the correlated topic model (Blei and Lafferty, 2006). The conjugacy assumptions above simplify the form of the SSVI updates in algorithm 1, but they could be relaxed. However, the simpler SSVI-A updates in algorithm 1 depend strongly on these assumptions. 2.2 Approximating Distribution Our goal is to approximate the intractable posterior p(z, β|y) with a distribution q(z, β) in some restricted, tractable family. We will choose a q distribution from this family by solving an optimization problem, minimizing the Kullback-Leibler (KL) divergence between q(z, β) and the posterior p(z, β|y). The simplest approach is to make the mean-field approximat... |

628 | Markov chain sampling methods for Dirichlet process mixture models
- Neal
Citation Context ...ynthetic spectra used in the GaPKL-NMF experiment. Magnitudes are shown in dB. π ∼ Dirichlet(α/K); φk,d ∼ Beta(1, 1); zn ∼ Multinomial(π); yn,d ∼ Bernoulli(φzn,d), where the size of π is K = 100 and the hyperparameter α = 20. Each observation yn is a 100-dimensional binary vector, and 1000 such vectors were sampled. The model ultimately used only 56 of the 100 mixture components to generate data. Given the correct hyperparameters, we used mean-field and SSVI-A (using the full dataset as a “minibatch”) to approximate the posterior p(z, π, φ|y). We also applied collapsed Gibbs sampling (CGS) (Neal, 2000), which yields samples from the posterior that are asymptotically unbiased. SSVI-A’s performance closely mirrored that of CGS; both methods were more accurate than mean-field. Mean-field only discovered 17 of the 56 mixture components; the rest were not significantly associated with data. By contrast, SSVI-A and CGS used 54 and 55 components respectively. We also estimated (using Monte Carlo) the KL divergence between the true data-generating distribution p(y|π, φ) and p(y|π̂, φ̂), where π̂ and φ̂ are the estimates of the posterior means of π and φ obtained by mean-field, SSVI-A, and CGS. Mean...
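The synthetic mixture model described above is simple enough to sample directly. A stdlib-only sketch of that generative process (Dirichlet draws built from normalized Gamma variates, and Beta(1, 1) taken as Uniform(0, 1); helper names and the seeding are mine):

```python
import random

# Generative process from the context: pi ~ Dirichlet(alpha/K),
# phi[k][d] ~ Beta(1, 1), z_n ~ Multinomial(pi), y[n][d] ~ Bernoulli(phi[z_n][d]).
# Defaults follow the text: K = 100 components, D = 100 binary dims, N = 1000.

def generate(K=100, D=100, N=1000, alpha=20.0, seed=0):
    rng = random.Random(seed)
    gammas = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    total = sum(gammas)
    pi = [g / total for g in gammas]                   # Dirichlet(alpha/K) draw
    phi = [[rng.random() for _ in range(D)] for _ in range(K)]  # Beta(1,1)
    z = [rng.choices(range(K), weights=pi)[0] for _ in range(N)]
    y = [[1 if rng.random() < phi[zn][d] else 0 for d in range(D)] for zn in z]
    return pi, phi, z, y
```

Because α/K = 0.2 is small, the Dirichlet draw is sparse, which is why only a subset of the 100 components (56 in the reported run) ends up generating data.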

209 | Online learning for latent Dirichlet allocation
- Hoffman, Blei, et al.
- 2010
Citation Context ...esian models. In each case, relaxing the mean-field approximation allows SSVI and SSVI-A to find significantly better parameter estimates than mean-field. We also find evidence suggesting that this superior performance is primarily due to SSVI/SSVI-A’s ability to avoid local optima. 4.1 Latent Dirichlet allocation We evaluated the quality of parameter estimates from SSVI and SSVI-A on the latent Dirichlet allocation (LDA) topic model fit to the 3,800,000-document Wikipedia dataset from (Hoffman et al., 2013). We compared with full mean-field stochastic variational inference (Blei et al., 2003; Hoffman et al., 2010a), a mean-field M-step with a Gibbs sampling E-step (Mimno et al., 2012), SSVI with Gibbs, and SSVI-A with Gibbs. Results for other E-step/M-step combinations are in the supplement. To speed up learning, at each update we subsample a minibatch of 1,000 documents rather than analyzing the whole dataset each iteration. We also experimented with various settings of the hyperparameters α and η, to which mean-field variational inference for LDA is known to be quite sensitive (Asuncion et al., 2009). For all algorithms we used a step size schedule ρ(t) = t^(-0.75). We held out a test set of 10,000 docume...

131 | Stochastic variational inference.
- Hoffman, Blei, et al.
- 2013
Citation Context ...rs to resort to approximate inference algorithms such as mean-field variational inference or Markov chain Monte Carlo (MCMC). These classes of methods have complementary strengths and weaknesses: MCMC methods have strong asymptotic guarantees of unbiasedness but are often slow, while mean-field variational inference is often faster but tends to misrepresent important qualities of the posterior of interest and is more vulnerable to local optima. Incremental versions of both methods based on stochastic optimization have been developed that are applicable to large datasets (Welling and Teh, 2011; Hoffman et al., 2013). In this paper we focus on variational inference. In particular, we are interested in estimating the parameters of high-dimensional Bayesian models with highly multimodal posteriors, such as mixture models, topic models, and factor models. We focus less on uncertainty estimates, which are difficult to trust and interpret in this setting. (Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.) Mean-field variational inference approximates the intractable poste...

119 | On smoothing and inference for topic models
- Asuncion, Welling, et al.
- 2009

117 | Exploiting tractable substructures in intractable networks. Neural Information Processing Systems.
- Saul, Jordan
- 1996
Citation Context ...th a factorized approximating distribution in which all parameters are independent. This mean-field distribution is then tuned to minimize its Kullback-Leibler divergence to the posterior, which is equivalent to maximizing a lower bound on the marginal probability of the data. The restriction to factorized distributions makes the problem tractable, but reduces the fidelity of the approximation and introduces local optima (Wainwright and Jordan, 2008). A partial remedy is to weaken the mean-field factorization by restoring some dependencies, resulting in “structured” mean-field approximations (Saul and Jordan, 1996). The applicability, speed, effectiveness, and ease of implementation of standard structured mean-field algorithms are limited because the lower bound implied by the structured distribution must be available in closed form. More recent work manages these intractable variational lower bounds using stochastic optimization, which allows one to optimize functions that can only be computed approximately. For example, Ji et al. (2010) use mean-field approximations to the posteriors of “collapsed” models where some parameters have been analytically marginalized out, Salimans and Knowles (2013) apply a ...

111 | Evaluation methods for topic models
- Wallach, Murray, et al.
- 2009
Citation Context ...whole dataset each iteration. We also experimented with various settings of the hyperparameters α and η, to which mean-field variational inference for LDA is known to be quite sensitive (Asuncion et al., 2009). For all algorithms we used a step size schedule ρ(t) = t^(-0.75). We held out a test set of 10,000 documents, and periodically evaluated the average per-word marginal log probability assigned by the model to each test document, using the expected value under the variational distribution as a point estimate of the topics. We estimated marginal log probabilities with a Chib-style estimator (Wallach et al., 2009). Figure 2 summarizes the results for α = 0.1, which yielded the best results for all algorithms. The method of Mimno et al. (2012) outperforms the online LDA algorithm of Hoffman et al. (2010a), but both methods are very sensitive to hyperparameter selection. SSVI achieves good results regardless of hyperparameter choice. SSVI-A’s performance is very slightly worse than that of SSVI. The approximating Dirichlet distributions found by SSVI were more diffuse than those found by SVI or SSVI-A, having lower overall parameter values λ. Many elements of λ were smaller than the prior hyperparameter ...

69 | Online model selection based on the variational Bayes
- Sato
- 2001

50 | Bayesian Learning via Stochastic Gradient Langevin Dynamics.
- Welling, Teh
- 2011
Citation Context ...dels drives practitioners to resort to approximate inference algorithms such as mean-field variational inference or Markov chain Monte Carlo (MCMC). These classes of methods have complementary strengths and weaknesses: MCMC methods have strong asymptotic guarantees of unbiasedness but are often slow, while mean-field variational inference is often faster but tends to misrepresent important qualities of the posterior of interest and is more vulnerable to local optima. Incremental versions of both methods based on stochastic optimization have been developed that are applicable to large datasets (Welling and Teh, 2011; Hoffman et al., 2013). In this paper we focus on variational inference. In particular, we are interested in estimating the parameters of high-dimensional Bayesian models with highly multimodal posteriors, such as mixture models, topic models, and factor models. We focus less on uncertainty estimates, which are difficult to trust and interpret in this setting. Mean-field variational inference approximate...

43 | Sparse stochastic inference for latent Dirichlet allocation.
- Mimno, Hoffman, et al.
- 2012
Citation Context ... structured mean-field algorithms is limited because the lower bound implied by the structured distribution must be available in closed form. More recent work manages these intractable variational lower bounds using stochastic optimization, which allows one to optimize functions that can only be computed approximately. For example, Ji et al. (2010) use mean-field approximations to the posteriors of “collapsed” models where some parameters have been analytically marginalized out, Salimans and Knowles (2013) apply a structured approximation to the posterior of a stochastic volatility model, and Mimno et al. (2012) use a structured approximation to the posterior of a collapsed model. In parallel, Hoffman et al. (2013) proposed the stochastic variational inference (SVI) framework, which uses stochastic optimization to apply mean-field variational inference to massive datasets. SVI splits the unobserved variables in a hierarchical model into global parameters β (which are shared across all observations) and groups of local hidden variables z1, . . . , zN (each of which is specific to a small group of observations yn). The goal is to minimize the Kullback-Leibler (KL) divergence between a tractable approxi... |

31 | Streaming variational Bayes
- Broderick, Boyd, et al.
- 2013
Citation Context ... approximation to p(β|y). 2.6 Extensions There are several ways in which the basic algorithms presented above can be extended: Subsampling the data. As in SVI, we can compute an unbiased estimate of the sum over n in equation 8 by only computing ηn for some randomly sampled subset of S observations, resulting in the update λ′ = (1 − ρ)λ + ρ(η + V(β, λ)(N/S) ∑_n ηn). (12) For large datasets, the reduced computational effort of only looking at a fraction of the data far outweighs the noise that this subsampling introduces. [p. 364, Matthew D. Hoffman, David M. Blei] Taking a cue from the recent work of Broderick et al. (2013) and Wang and Blei (2012), we suggest gradually ramping up the multiplier N over the course of the first sweep over the dataset. Hyperparameter updates and parameter hierarchies. As in the mean-field stochastic variational inference framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampli...
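The subsampled update in equation (12) rescales a minibatch's summed statistics by N/S so that the minibatch stands in, unbiasedly, for the full dataset. A scalar sketch of that update (V(β, λ) is passed in as a precomputed value, and all names are mine, not the paper's):

```python
# Equation (12), scalar case: lambda' = (1 - rho) * lambda
#   + rho * (eta + V(beta, lambda) * (N / S) * sum_n eta_n),
# where the sum runs over a minibatch of S of the N observations.

def subsampled_update(lam, eta_prior, v, minibatch_etas, N, rho):
    S = len(minibatch_etas)
    scaled_sum = (N / S) * sum(minibatch_etas)   # unbiased estimate of full sum
    return (1.0 - rho) * lam + rho * (eta_prior + v * scaled_sum)
```

The "ramping up the multiplier N" suggestion in the text amounts to passing a smaller effective N during the first sweep, so early updates are not dominated by an extrapolation from very few observations.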

31 | Bayesian nonparametric matrix factorization for recorded music
- Hoffman, Blei, et al.
- 2010
Citation Context ...esian models. In each case, relaxing the mean-field approximation allows SSVI and SSVI-A to find significantly better parameter estimates than mean-field. We also find evidence suggesting that this superior performance is primarily due to SSVI/SSVI-A’s ability to avoid local optima. 4.1 Latent Dirichlet allocation We evaluated the quality of parameter estimates from SSVI and SSVI-A on the latent Dirichlet allocation (LDA) topic model fit to the 3,800,000-document Wikipedia dataset from (Hoffman et al., 2013). We compared with full mean-field stochastic variational inference (Blei et al., 2003; Hoffman et al., 2010a), a mean-field M-step with a Gibbs sampling E-step (Mimno et al., 2012), SSVI with Gibbs, and SSVI-A with Gibbs. Results for other E-step/M-step combinations are in the supplement. To speed up learning, at each update we subsample a minibatch of 1,000 documents rather than analyzing the whole dataset each iteration. We also experimented with various settings of the hyperparameters α and η, to which mean-field variational inference for LDA is known to be quite sensitive (Asuncion et al., 2009). For all algorithms we used a step size schedule ρ(t) = t^(-0.75). We held out a test set of 10,000 docume...

25 | Variational Bayesian inference with stochastic search.
- Blei, Jordan, et al.
- 2012
Citation Context ...stochastic variational inference framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variational distributions to optimize intractable variational inference problems has been proposed previously in several contexts. Ji et al. (2010); Nott et al. (2012); Gerrish (2013); Paisley et al. (2012), and Ranganath et al. (2014) proposed sampling without a change of variables as a way of coping with non-conjugacy. Kingma and Welling (2014) and Titsias and Lazaro-Gredilla (2014) proposed methods that do use a change of variables, although their methods focus more on speed and/or dealing with nonconjugacy than on improving the accuracy of the variational approximation. Salimans and Knowles (2013) also used a change of variables to dramatically reduce the variance of their stochastic gradients. Although some of the above methods use stochastic optimization to improve the quality of the mean... |

19 | Black box variational inference.
- Ranganath, Gerrish, et al.
- 2014
Citation Context ...rence framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variational distributions to optimize intractable variational inference problems has been proposed previously in several contexts. Ji et al. (2010); Nott et al. (2012); Gerrish (2013); Paisley et al. (2012), and Ranganath et al. (2014) proposed sampling without a change of variables as a way of coping with non-conjugacy. Kingma and Welling (2014) and Titsias and Lazaro-Gredilla (2014) proposed methods that do use a change of variables, although their methods focus more on speed and/or dealing with nonconjugacy than on improving the accuracy of the variational approximation. Salimans and Knowles (2013) also used a change of variables to dramatically reduce the variance of their stochastic gradients. Although some of the above methods use stochastic optimization to improve the quality of the mean-field approximation, there a... |

18 | Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114
- Kingma, Welling
- 2014
Citation Context ...he direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variational distributions to optimize intractable variational inference problems has been proposed previously in several contexts. Ji et al. (2010); Nott et al. (2012); Gerrish (2013); Paisley et al. (2012), and Ranganath et al. (2014) proposed sampling without a change of variables as a way of coping with non-conjugacy. Kingma and Welling (2014) and Titsias and Lazaro-Gredilla (2014) proposed methods that do use a change of variables, although their methods focus more on speed and/or dealing with nonconjugacy than on improving the accuracy of the variational approximation. Salimans and Knowles (2013) also used a change of variables to dramatically reduce the variance of their stochastic gradients. Although some of the above methods use stochastic optimization to improve the quality of the mean-field approximation, there are major differences between these methods and SSVI. Ji et al. (2010) apply their method to models where some par... |

15 | Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis,
- Salimans, Knowles
- 2013
Citation Context ...proximations (Saul and Jordan, 1996). The applicability, speed, effectiveness, and easeof-implementation of standard structured mean-field algorithms is limited because the lower bound implied by the structured distribution must be available in closed form. More recent work manages these intractable variational lower bounds using stochastic optimization, which allows one to optimize functions that can only be computed approximately. For example, Ji et al. (2010) use mean-field approximations to the posteriors of “collapsed” models where some parameters have been analytically marginalized out, Salimans and Knowles (2013) apply a structured approximation to the posterior of a stochastic volatility model, and Mimno et al. (2012) use a structured approximation to the posterior of a collapsed model. In parallel, Hoffman et al. (2013) proposed the stochastic variational inference (SVI) framework, which uses stochastic optimization to apply mean-field variational inference to massive datasets. SVI splits the unobserved variables in a hierarchical model into global parameters β (which are shared across all observations) and groups of local hidden variables z1, . . . , zN (each of which is specific to a small group o... |

14 | Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD
- Foulds, Boyles, et al.
- 2013
Citation Context ...(t))λ(t−1) + ρ(t)(η + ∑_n ηn). 10: until convergence reduces the algorithm’s ability to approximate the marginal posterior p(β|y). That said, this is still better than a full mean-field approach where one also breaks the dependencies between z and β. Non-bound-preserving approaches: Although we derived SSVI assuming that q(zn|β) is chosen to maximize the local ELBO Ln, one could obtain a distribution over zn in other ways. For example, for latent Dirichlet allocation one could adapt the CVB0 method of Asuncion et al. (2009) to use a fixed value of β, resulting in an algorithm akin to that of Foulds et al. (2013). All of these choices of local variational distribution could also be used in a traditional mean-field setup where q(β, z) = q(β)q(z), or as part of a variational maximum a posteriori (MAP) estimation algorithm. So for any model that enjoys conditional conjugacy we have a matrix of possible variational inference algorithms: we can match any “E-step” (e.g. mean-field or sampling from the exact conditional) used to approximate p(zn|yn, β) with any “M-step” (e.g. MAP, mean-field, SSVI, SSVI-A) used to update our approximation to p(β|y). 2.6 Extensions There are several ways in which the basic al...

13 | Stochastic gradient Riemannian Langevin dynamics on the probability simplex
- Patterson, Teh
- 2013
Citation Context ... Hoffman et al. (2010a), but both methods are very sensitive to hyperparameter selection. SSVI achieves good results regardless of hyperparameter choice. SSVI-A’s performance is very slightly worse than that of SSVI. The approximating Dirichlet distributions found by SSVI were more diffuse than those found by SVI or SSVI-A, having lower overall parameter values λ. Many elements of λ were smaller than the prior hyperparameter η, a solution that is not stable under the SVI and SSVI-A updates. We also evaluated the stochastic gradient Riemannian Langevin dynamics (SGRLD) algorithm for LDA, which Patterson and Teh (2013) found outperformed the Gibbs-within-SVI method of Mimno et al. (2012). We experimented with various hyperparameter settings, including ... [Figure 2: Predictive accuracy for various algorithms and hyperparameter settings as a function of wallclock time when fitting LDA to 3... Panels: Hoffman et al. (2010a); Mimno et al. (2012); SSVI (this paper); SSVI-A (this paper). Axes: time in seconds (x) vs. held-out per-word log-likelihood (y); legend: η = 0.1, 0.3, 1.]

11 | Doubly stochastic variational Bayes for non-conjugate inference
- Titsias, Lazaro-Gredilla
- 2014

10 | Bayesian nonparametric spectrogram modeling based on infinite factorial infinite hidden Markov model
- Nakano, Roux, et al.
- 2011

6 | Regression density estimation with variational methods and stochastic approximation.
- Nott, Tan, et al.
- 2012
Citation Context ...r hierarchies. As in the mean-field stochastic variational inference framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variational distributions to optimize intractable variational inference problems has been proposed previously in several contexts. Ji et al. (2010); Nott et al. (2012); Gerrish (2013); Paisley et al. (2012), and Ranganath et al. (2014) proposed sampling without a change of variables as a way of coping with non-conjugacy. Kingma and Welling (2014) and Titsias and Lazaro-Gredilla (2014) proposed methods that do use a change of variables, although their methods focus more on speed and/or dealing with nonconjugacy than on improving the accuracy of the variational approximation. Salimans and Knowles (2013) also used a change of variables to dramatically reduce the variance of their stochastic gradients. Although some of the above methods use stochastic optimiza... |

5 | Truncation-free stochastic variational inference for Bayesian nonparametric models.
- Wang, Blei
- 2012
Citation Context ...6 Extensions There are several ways in which the basic algorithms presented above can be extended: Subsampling the data. As in SVI, we can compute an unbiased estimate of the sum over n in equation 8 by only computing ηn for some randomly sampled subset of S observations, resulting in the update λ′ = (1 − ρ)λ + ρ(η + V(β, λ)(N/S) ∑_n ηn). (12) For large datasets, the reduced computational effort of only looking at a fraction of the data far outweighs the noise that this subsampling introduces. Taking a cue from the recent work of Broderick et al. (2013) and Wang and Blei (2012), we suggest gradually ramping up the multiplier N over the course of the first sweep over the dataset. Hyperparameter updates and parameter hierarchies. As in the mean-field stochastic variational inference framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variationa...

4 | Applications of Latent Variable Models in Modeling Influence and Decision Making. PhD thesis,
- Gerrish
- 2013
Citation Context ... the mean-field stochastic variational inference framework of Hoffman et al. (2013), we can optimize any hyperparameters in our model by taking steps in the direction of the gradient of the ELBO with respect to those hyperparameters. We can also extend the framework developed in this paper to models with hierarchies of global parameters as in appendix A of (Hoffman et al., 2013). 3 Related Work The idea of sampling from global variational distributions to optimize intractable variational inference problems has been proposed previously in several contexts. Ji et al. (2010); Nott et al. (2012); Gerrish (2013); Paisley et al. (2012), and Ranganath et al. (2014) proposed sampling without a change of variables as a way of coping with non-conjugacy. Kingma and Welling (2014) and Titsias and Lazaro-Gredilla (2014) proposed methods that do use a change of variables, although their methods focus more on speed and/or dealing with nonconjugacy than on improving the accuracy of the variational approximation. Salimans and Knowles (2013) also used a change of variables to dramatically reduce the variance of their stochastic gradients. Although some of the above methods use stochastic optimization to improve ... |

3 | Bounded approximations for marginal likelihoods.
- Ji, Shen, et al.
- 2010
Citation Context ... and Jordan, 2008). A partial remedy is to weaken the mean-field factorization by restoring some dependencies, resulting in “structured” mean-field approximations (Saul and Jordan, 1996). The applicability, speed, effectiveness, and easeof-implementation of standard structured mean-field algorithms is limited because the lower bound implied by the structured distribution must be available in closed form. More recent work manages these intractable variational lower bounds using stochastic optimization, which allows one to optimize functions that can only be computed approximately. For example, Ji et al. (2010) use mean-field approximations to the posteriors of “collapsed” models where some parameters have been analytically marginalized out, Salimans and Knowles (2013) apply a structured approximation to the posterior of a stochastic volatility model, and Mimno et al. (2012) use a structured approximation to the posterior of a collapsed model. In parallel, Hoffman et al. (2013) proposed the stochastic variational inference (SVI) framework, which uses stochastic optimization to apply mean-field variational inference to massive datasets. SVI splits the unobserved variables in a hierarchical model into... |