@MISC{_figure1:, author = {}, title = {Stochastic variational inference for the collaborative topic Poisson factorization model (CTPF)}, year = {} }


Abstract

Stochastic variational inference for the collaborative topic Poisson factorization model combines stochastic gradient algorithms and variational inference [3]. Stochastic gradient algorithms follow noisy estimates of the gradient with a decreasing step-size. If the expectation of the noisy gradient equals the true gradient and the step-size decreases according to a certain schedule, the algorithm converges to a local optimum [4]. To obtain noisy gradients, assume that we subsample a single document d uniformly at random from the D documents. This sampling strategy is similar to online LDA [2]; however, our approach differs in its use of separate learning rates for each user, which allows the inference to update only the relevant users in each iteration. We are given observations about a single document in each iteration.

Following [3], we use the conditional dependencies in our graphical model to divide the variational parameters into local and global. The multinomial parameters (φ_dv, ξ_ud) for the sampled document d and for all u ∈ U, and the Gamma parameters (θ_dk, ε_dk), are local. All other variational parameters are global.

In each iteration of the algorithm, we first subsample a document. We then update the local multinomial parameters and the local topic intensities and offset parameters for this document using the coordinate updates from Figure 2; this optimizes the local parameters with respect to the subsample. We then compute scaled natural gradients [1] for the global user preference parameters (η̃_uk^shp, η̃_uk^rte) for the users u who have rated document d, and for all topic parameters (β̃_vk^shp, β̃_vk^rte). The global step for the global parameters follows the noisy gradient with an appropriate step-size. We maintain a separate learning rate ρ_u for each user and only update the users who have rated document d. We proceed similarly for words.
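The per-iteration structure described above (sample a document, optimize local parameters, then take a natural-gradient step on only the touched global parameters) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the array shapes, hyperparameter values, and the stand-in natural-gradient targets are all assumptions, and the local coordinate updates are elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes (illustrative only).
D, U, V, K = 100, 50, 200, 10   # documents, users, vocabulary, topics
tau0, kappa = 1.0, 0.7          # step-size schedule hyperparameters

# Global variational parameters: Gamma shape/rate pairs for user
# preferences (eta) and topics (beta). Names are illustrative.
eta_shp, eta_rte = rng.gamma(1.0, size=(U, K)), np.ones((U, K))
beta_shp, beta_rte = rng.gamma(1.0, size=(V, K)), np.ones((V, K))

# Per-user iteration counters, so each user u gets its own rho_u,
# plus one counter for the global topic learning rate rho'.
t_user = np.zeros(U)
t_global = 0

def rho(t):
    """Decreasing step-size rho(t) = (tau0 + t)^(-kappa)."""
    return (tau0 + t) ** (-kappa)

for it in range(5):
    d = rng.integers(D)                             # subsample one document
    raters = rng.choice(U, size=3, replace=False)   # users who rated d (toy stand-in)

    # ... local step: coordinate updates for phi_dv, xi_ud, theta_dk, eps_dk ...

    # Stand-in for the noisy natural-gradient targets that the real local
    # step would produce for the sampled document.
    eta_shp_hat = rng.gamma(1.0, size=(len(raters), K))
    beta_shp_hat = rng.gamma(1.0, size=(V, K))

    # Global step: only the users who rated d are updated, each with its
    # own learning rate rho_u.
    for i, u in enumerate(raters):
        r = rho(t_user[u])
        eta_shp[u] = (1 - r) * eta_shp[u] + r * eta_shp_hat[i]
        t_user[u] += 1

    # Topic parameters use the single global learning rate every iteration.
    r = rho(t_global)
    beta_shp = (1 - r) * beta_shp + r * beta_shp_hat
    t_global += 1
```

Keeping a separate counter per user means a rarely-rated user still takes large early steps when it is finally touched, instead of being penalized by a step-size that decayed while it sat idle.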
We maintain a global learning rate ρ′ for the topic parameters, which are updated in each iteration. For each of these learning rates ρ, convergence to a local optimum requires ∑_t ρ(t)² < ∞ and ∑_t ρ(t) = ∞ [4]. We set ρ(t) = (τ0 + t)^(−κ), where κ ∈ (0.5, 1] is the forgetting rate and τ0 ≥ 0 down-weights early iterations [3].
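The step-size schedule above is easy to inspect numerically; τ0 = 1 and κ = 0.7 below are arbitrary example values within the stated ranges, not settings from the paper.

```python
import numpy as np

def rho(t, tau0=1.0, kappa=0.7):
    # rho(t) = (tau0 + t)^(-kappa). With kappa in (0.5, 1], the
    # Robbins-Monro conditions hold: sum_t rho(t)^2 < inf (since
    # 2*kappa > 1) while sum_t rho(t) = inf (since kappa <= 1).
    return (tau0 + t) ** (-kappa)

steps = rho(np.arange(1000.0))

# The schedule is strictly decreasing, and larger tau0 shrinks the
# earliest steps, down-weighting the first few noisy iterations.
assert np.all(np.diff(steps) < 0)
assert rho(0, tau0=10.0) < rho(0, tau0=1.0)
```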