
## Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering

by Simon Lacoste-Julien, Fredrik Lindsten, Francis Bach (AISTATS 2015)

### Citations

1540 | Handbook of mathematical functions with formulas, graphs, and mathematical tables - Abramowitz, Stegun - 1970
Citation Context ... the proof for d = m = 1 and y_t = 0 (constant observations) to make the proof simpler. We conjecture that similar results hold more generally. We use the Mehler formula for w such that 2w/(1 − w^2) = 1/τ^2 (Abramowitz and Stegun, 2012), where H_n is the n-th Hermite polynomial (H_n(x) := (−1)^n e^{x^2} (d^n/dx^n) e^{−x^2}): e^{2xyw/(1−w^2)} = √(1 − w^2) e^{(x^2+y^2) w^2/(1−w^2)} Σ_{n=0}^∞ (1/n!) (w/2)^n H_n(x) H_n(y). Thus p(x_{t+1}|x_t) p(y_t|x_t) = (1/(√(2π) τ)) e^{−x_t^2/(2τ^2)} ... |

1049 | On sequential Monte Carlo sampling methods for Bayesian filtering - Doucet, Godsill, et al. - 2000
Citation Context ...national Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors. where xt ∈ X denotes the latent state variable and yt ∈ Y the observation at time t. Exact state inference in SSMs is possible, essentially, only when the model is linear and Gaussian or when the state-space X is a finite set. For solving the inference problem beyond these restricted model classes, sequential Monte Carlo methods, i.e. particle filters (PFs), have emerged as a key tool; see e.g., Doucet and Johansen (2011); Cappe et al. (2005); Doucet et al. (2000). However, since these methods are based on Monte Carlo integration they are inherently affected by sampling variance, which can degrade the performance of the estimators. Particular challenges arise in the case when the observation likelihood p(yt|xt) is computationally expensive to evaluate. For instance, this is common in robotics applications where the observation model relates the sensory input of the robot, which can comprise vision-based systems, laser rangefinders, synthetic aperture radars, etc. For such systems, simply evaluating the observation function for a fixed value of xt can t... |

772 | Filtering via simulation: auxiliary particle filter - Pitt, Shephard - 1999
Citation Context ...simply evaluating the observation function for a fixed value of xt can therefore involve computationally expensive operations, such as image processing, point-set registration, and related tasks. This poses difficulties for particle-filtering-based solutions for two reasons: (1) the computational bottleneck arising from the likelihood evaluation implies that we cannot simply increase the number of particles to improve the accuracy, and (2) this type of “complicated” observation model will typically not allow for adaptation of the proposal distribution used within the filter, in the spirit of Pitt and Shephard (1999), leaving us with the standard (but inefficient) bootstrap proposal as the only viable option. On the contrary, for these systems, the dynamical model p(xt|xt−1) is often comparatively simple, e.g. being a linear and Gaussian “nearly constant acceleration” model (Ristic et al., 2004). The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios when the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of ... |

567 | Beyond the Kalman filter: Particle filters for tracking applications. Artech House - Ristic, Arulampalam, et al. - 2004
Citation Context ...putational bottleneck arising from the likelihood evaluation implies that we cannot simply increase the number of particles to improve the accuracy, and (2) this type of “complicated” observation model will typically not allow for adaptation of the proposal distribution used within the filter, in the spirit of Pitt and Shephard (1999), leaving us with the standard (but inefficient) bootstrap proposal as the only viable option. On the contrary, for these systems, the dynamical model p(xt|xt−1) is often comparatively simple, e.g. being a linear and Gaussian “nearly constant acceleration” model (Ristic et al., 2004). The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios when the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles. By doing so, we can avoid excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal. Contributions. We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approxima... |

308 | An algorithm for quadratic programming - Frank, Wolfe - 1956
Citation Context ...µ(p)‖_H. We note that µ(p) lies in the marginal polytope M ⊂ H, defined as the closure of the convex hull of Φ(X). We suppose that Φ(x) is uniformly bounded in the feature space, that is, there is a finite R such that ‖Φ(x)‖_H ≤ R for all x ∈ X. This means that M is a closed bounded convex subset of H, and we could in theory optimize over it. This insight was used by Bach et al. (2012), who considered using the Frank-Wolfe optimization algorithm to optimize the convex function J(g) := (1/2)‖g − µ(p)‖²_H over M to obtain adaptive quadrature rules. The Frank-Wolfe algorithm (also called conditional gradient) (Frank and Wolfe, 1956) is a simple first-order iterative constrained optimization algorithm for optimizing smooth functions over closed bounded convex sets like M (see Dunn (1980) for its convergence analysis on general infinite-dimensional Banach spaces). At every iteration, the algorithm finds a good feasible search vertex of M by minimizing the linearization of J at the current iterate g_k: ḡ_{k+1} = argmin_{g∈M} ⟨J′(g_k), g⟩. The next iterate is then obtained by a suitable convex combination of the search vertex ḡ_{k+1} and the previous iterate g_k: g_{k+1} = (1 − γ_k) g_k + γ_k ḡ_{k+1} for a suitable step size γ_k from a fixed sch... |
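The iteration just quoted (a linear minimization oracle over the vertices, then a convex averaging step) is compact enough to sketch. Below is a minimal finite-dimensional illustration, not code from the paper: a finite candidate set stands in for the vertices Φ(X), the function name `frank_wolfe_quadrature` is ours, and the step size γ_k = 1/(k+1) is the herding-style schedule mentioned in the excerpt.

```python
import numpy as np

def frank_wolfe_quadrature(mu, candidates, num_iters):
    """Frank-Wolfe on J(g) = 0.5 * ||g - mu||^2 over the convex hull of
    `candidates` (a finite stand-in for Phi(X)). The iterate stays a
    convex combination of visited vertices, i.e. a sparse quadrature rule."""
    n = len(candidates)
    weights = np.zeros(n)
    weights[0] = 1.0                      # start at an arbitrary vertex
    g = candidates[0].astype(float).copy()
    for k in range(1, num_iters):
        grad = g - mu                     # J'(g) = g - mu
        idx = int(np.argmin(candidates @ grad))  # linear minimization oracle
        gamma = 1.0 / (k + 1)             # herding-style step size
        weights *= 1.0 - gamma
        weights[idx] += gamma
        g = (1.0 - gamma) * g + gamma * candidates[idx]
    return weights, g
```

When µ lies inside the hull, g_k converges to µ and the sparse weights form the quadrature rule.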

267 | Improved particle filter for nonlinear problems - Carpenter, Clifford, et al. - 1999
Citation Context ... mixture weight (i.e., choosing a past particle x^{(i)}_{1:t} to propagate), and then sampling its next extension state x^{(i)}_{t+1} with probability p(x_{t+1}|x^{(i)}_t). The standard bootstrap particle filter is thus obtained by maintaining uniform weights for the predictive distribution (w^{(i)}_t = 1/N) and randomly sampling from (4) to obtain the particles at time t+1. This gives an unbiased estimate of p̄_{t+1}: E[p̂_{t+1}] = p̄_{t+1}. Lower-variance estimators can be obtained by using a different resampling mechanism for the particles than this multinomial sampling scheme, such as stratified resampling (Carpenter et al., 1999), and these are usually used in practice instead. One way to improve the particle filter is thus to replace the random sampling stage of step 3 with different sampling mechanisms with lower variance or better approximation properties of the distribution p̄_{t+1} that we are trying to approximate. As we obtain the normalization constants Ŵ_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philo... |
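To illustrate the resampling point, here is a hedged sketch (not the paper's code) contrasting multinomial resampling with the stratified scheme of Carpenter et al. (1999): one uniform draw per stratum keeps each offspring count within one of its expectation, which is where the variance reduction comes from.

```python
import numpy as np

def multinomial_resample(weights, rng):
    """Baseline: N independent draws from the particle weights."""
    n = len(weights)
    return rng.choice(n, size=n, p=weights)

def stratified_resample(weights, rng):
    """One uniform draw per stratum [i/n, (i+1)/n): a lower-variance
    alternative to multinomial resampling (Carpenter et al., 1999)."""
    n = len(weights)
    u = (np.arange(n) + rng.random(n)) / n
    cs = np.cumsum(weights)
    cs[-1] = 1.0                 # guard against floating-point round-off
    return np.searchsorted(cs, u)
```

Comparing the variance of the offspring count of a given particle over many repetitions shows the stratified scheme's advantage directly.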

222 | Mixture Kalman filters - Chen, Liu - 2000 |

216 | Reproducing Kernel Hilbert Spaces in Probability and Statistics - Berlinet, Thomas-Agnan - 2004 |

214 | A tutorial on particle filtering and smoothing: fifteen years later. In The Oxford Handbook of Nonlinear Filtering - Doucet, Johansen - 2011
Citation Context ...t), (1) Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors. where xt ∈ X denotes the latent state variable and yt ∈ Y the observation at time t. Exact state inference in SSMs is possible, essentially, only when the model is linear and Gaussian or when the state-space X is a finite set. For solving the inference problem beyond these restricted model classes, sequential Monte Carlo methods, i.e. particle filters (PFs), have emerged as a key tool; see e.g., Doucet and Johansen (2011); Cappe et al. (2005); Doucet et al. (2000). However, since these methods are based on Monte Carlo integration they are inherently affected by sampling variance, which can degrade the performance of the estimators. Particular challenges arise in the case when the observation likelihood p(yt|xt) is computationally expensive to evaluate. For instance, this is common in robotics applications where the observation model relates the sensory input of the robot, which can comprise vision-based systems, laser rangefinders, synthetic aperture radars, etc. For such systems, simply evaluating the observ... |

84 | Inference in hidden Markov models - Cappé, Moulines, et al. - 2005 |

84 | Revisiting Frank-Wolfe: Projection-free sparse convex optimization - Jaggi - 2013
Citation Context ...ex ḡ_{k+1} and the previous iterate g_k: g_{k+1} = (1 − γ_k) g_k + γ_k ḡ_{k+1} for a suitable step size γ_k from a fixed schedule (e.g. 1/(k+1)) or by using line search. A crucial property of this algorithm is that the iterate g_k is thus a convex combination of the vertices of M visited so far. This provides a sparse expansion for the iterate, and makes the algorithm suitable for high-dimensional (or even infinite-dimensional) optimization; this explains in part the renewed interest in machine learning over the last decade in this old optimization algorithm (see Jaggi (2013) for a recent survey). In our setup where M is the convex hull of Φ(X), the vertices of M are thus of the form ḡ_{k+1} = Φ(x^{(k+1)}) for some x^{(k+1)} ∈ X. Running Frank-Wolfe on M thus yields g_k = Σ_{i=1}^k w^{(i)}_k Φ(x^{(i)}) = E_p̂[Φ] for some weighted set of points {w^{(i)}_k, x^{(i)}}_{i=1}^k. The iterate g_k thus corresponds to a quadrature rule p̂ of the form of (2) with g_k = E_p̂[Φ], and this is the relationship that was explored in Bach et al. (2012). Running Frank-Wolfe optimization with the step size γ_k = 1/(k+1) reduces to the kernel herding algorithm proposed by Chen et al. (2010). See also Huszar an... |
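To make the herding connection concrete, here is a sketch of such a quadrature rule for a 1-D Gaussian mixture. It is an illustration under stated assumptions, not the paper's FW-Quad implementation: it uses the fact that the mean embedding of a Gaussian mixture under a Gaussian kernel is available in closed form, a grid search stands in for the argmax step, and the names `mean_map` and `kernel_herding` are ours.

```python
import numpy as np

def mean_map(y, pis, ms, ss, sigma):
    """Closed-form mean embedding mu_p(y) of the mixture sum_j pis[j] *
    N(ms[j], ss[j]^2) under the kernel k(x, y) = exp(-(x-y)^2 / (2 sigma^2))."""
    y = np.atleast_1d(y)[:, None]
    var = sigma ** 2 + ss ** 2
    return np.sum(pis * sigma / np.sqrt(var)
                  * np.exp(-(y - ms) ** 2 / (2 * var)), axis=1)

def kernel_herding(pis, ms, ss, sigma, grid, n_points):
    """Frank-Wolfe with step size 1/(k+1) on the MMD objective, i.e.
    kernel herding: uniformly weighted points whose empirical kernel
    mean g_k tracks mu_p."""
    mu = mean_map(grid, pis, ms, ss, sigma)
    g = np.zeros_like(grid)               # current kernel mean on the grid
    points = []
    for k in range(n_points):
        x_new = grid[np.argmax(mu - g)]   # maximize <mu_p - g_k, Phi(x)>
        points.append(x_new)
        k_new = np.exp(-(grid - x_new) ** 2 / (2 * sigma ** 2))
        g = (k * g + k_new) / (k + 1)     # gamma_k = 1/(k+1): running average
    return np.array(points)
```

On a two-component mixture, the herded points split between the modes so that the mixture's moments are approximately matched with uniform weights.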

71 | Hilbert space embeddings and metrics on probability measures. - Sriperumbudur, Gretton, et al. - 2010 |

47 | A Kernel Two-Sample Test. - Gretton, Borgwardt, et al. - 2012 |

42 | Summing and nuclear norms in Banach space theory - Jameson - 1987
Citation Context ... Theorem 1 Proof sketch. We assume that the function f_t : (x_{t+1}, x_t) ↦ p(y_t|x_t) p(x_{t+1}|x_t) is in the tensor product F_t ⊗ H_t, with F_t defined as in the statement of the theorem. We consider the nuclear norm (Jameson, 1987): ‖f_t‖_{F_t⊗H_t} = inf_{{α_i, β_i}_{i=1}^∞} Σ_i ‖α_i‖_{F_t} ‖β_i‖_{H_t} over all possible decompositions {α_i, β_i}_{i=1}^∞ of f_t such that, for all x_t, x_{t+1}: f_t(x_{t+1}, x_t) = Σ_i α_i(x_{t+1}) β_i(x_t). In the following, let {α_i, β_i}_{i=1}^∞ be ... |

25 | Quasi-random sampling for condensation - Philomin, Duraiswami, et al. - 2000
Citation Context ...1999) and these are usually used in practice instead. One way to improve the particle filter is thus to replace the random sampling stage of step 3 with different sampling mechanisms with lower variance or better approximation properties of the distribution p̄_{t+1} that we are trying to approximate. As we obtain the normalization constants Ŵ_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philomin et al., 2000; Ormoneit et al., 2001; Gerber and Chopin, 2014). The main contribution of our work is to instead propose to use Frank-Wolfe quadrature in step 3 of the particle filter to obtain better (adapted) point sets. 3.2 Sequential kernel herding. In the sequential kernel herding (SKH) algorithm, we simply replace step 3 of Algorithm 2 with p̂_t = FW-Quad(p̄_t, H_t, N). As mentioned in the introduction, many dynamical models used in practice assume Gaussian transitions. Therefore, we will put particular emphasis on the case when (more generally) p(x_t... |

20 | Super-samples from kernel herding - Chen, Welling, et al. - 2010
Citation Context ...on” model (Ristic et al., 2004). The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios when the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles. By doing so, we can avoid excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal. Contributions. We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approximate the integrals appearing in the Bayesian filtering recursions. We make use of the Frank-Wolfe (FW) quadrature to approximate, in particular, mixtures of Gaussians, which often arise in a particle filtering context as the mixture over past particles in the distribution over the next state. We use this approach within a filtering framework and prove theoretical convergence results for the resulting method, denoted sequential kernel herding (SKH), giving one of the first explicitly better convergence rates than for a particle filter. Our preliminary experiments show that SKH can gi... |

19 | On the equivalence between herding and conditional gradient algorithms - Bach, Lacoste-Julien, et al. - 2012
Citation Context ...d Gaussian “nearly constant acceleration” model (Ristic et al., 2004). The method developed in this paper is geared toward this class of filtering problems. The basic idea is that, in scenarios when the likelihood evaluation is the computational bottleneck, we can afford to spend additional computations to improve upon the sampling of the particles. By doing so, we can avoid excessive variance arising from simple Monte Carlo sampling from the bootstrap proposal. Contributions. We build on the optimization view from Bach et al. (2012) of kernel herding (Chen et al., 2010) to approximate the integrals appearing in the Bayesian filtering recursions. We make use of the Frank-Wolfe (FW) quadrature to approximate, in particular, mixtures of Gaussians, which often arise in a particle filtering context as the mixture over past particles in the distribution over the next state. We use this approach within a filtering framework and prove theoretical convergence results for the resulting method, denoted sequential kernel herding (SKH), giving one of the first explicitly better convergence rates than for a particle filter. Our prelim... |

15 | Convergence rates for conditional gradient sequences generated by implicit step length rules - Dunn - 1980
Citation Context ...e space, that is, there is a finite R such that ‖Φ(x)‖_H ≤ R for all x ∈ X. This means that M is a closed bounded convex subset of H, and we could in theory optimize over it. This insight was used by Bach et al. (2012), who considered using the Frank-Wolfe optimization algorithm to optimize the convex function J(g) := (1/2)‖g − µ(p)‖²_H over M to obtain adaptive quadrature rules. The Frank-Wolfe algorithm (also called conditional gradient) (Frank and Wolfe, 1956) is a simple first-order iterative constrained optimization algorithm for optimizing smooth functions over closed bounded convex sets like M (see Dunn (1980) for its convergence analysis on general infinite-dimensional Banach spaces). At every iteration, the algorithm finds a good feasible search vertex of M by minimizing the linearization of J at the current iterate g_k: ḡ_{k+1} = argmin_{g∈M} ⟨J′(g_k), g⟩. The next iterate is then obtained by a suitable convex combination of the search vertex ḡ_{k+1} and the previous iterate g_k: g_{k+1} = (1 − γ_k) g_k + γ_k ḡ_{k+1} for a suitable step size γ_k from a fixed schedule (e.g. 1/(k+1)) or by using line search. A crucial property of this algorithm is that the iterate g_k is thus a convex combination of the vertices of M v... |

11 | A Hilbert space embedding for distributions. In Algorithmic Learning Theory, - Smola, Gretton, et al. - 2007 |

8 | Particle filter SLAM with high dimensional vehicle model - Törnqvist, Schön, et al. - 2009 |

7 | Optimally-weighted herding is Bayesian quadrature. - Huszar, Duvenaud - 2012 |

6 | Lattice particle filters - Ormoneit, Lemieux, et al. - 2001
Citation Context ...sed in practice instead. One way to improve the particle filter is thus to replace the random sampling stage of step 3 with different sampling mechanisms with lower variance or better approximation properties of the distribution p̄_{t+1} that we are trying to approximate. As we obtain the normalization constants Ŵ_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philomin et al., 2000; Ormoneit et al., 2001; Gerber and Chopin, 2014). The main contribution of our work is to instead propose to use Frank-Wolfe quadrature in step 3 of the particle filter to obtain better (adapted) point sets. 3.2 Sequential kernel herding. In the sequential kernel herding (SKH) algorithm, we simply replace step 3 of Algorithm 2 with p̂_t = FW-Quad(p̄_t, H_t, N). As mentioned in the introduction, many dynamical models used in practice assume Gaussian transitions. Therefore, we will put particular emphasis on the case when (more generally) p(x_t|x_{1:(t−1)}, y_{1:(t−1)}) is... |

5 | Long-term stability of sequential Monte Carlo methods under verifiable conditions - Douc, Moulines, et al. - 2014
Citation Context ... (1 + ρ_{T−1}) Σ_{t=1}^{T−1} χ_t ε_t (Π_{k=t}^{T−2} ρ_k), where ρ_t := ‖f_t‖_{F_t⊗H_t}/W_t, χ_t := Π_{k=1}^{t−1} (Ŵ_k/W_k), and ε_t is the FW error reported at time t by the algorithm: ε_t := ‖µ(p̂_t) − µ(p̄_t)‖_{H_t}. We note that χ_t ≈ 1, as we expect the errors on Ŵ_k to go in either direction and thus to cancel each other over time (though in the worst case it could grow exponentially in t). If ε_t ≤ ε and ρ_t ≤ ρ, we basically have ‖µ(p̂_T) − µ(p̄_T)‖ = O(ε ρ^T) if ρ > 1; O(Tε) if ρ = 1; and O(ε) if ρ < 1 (a contraction). The exponential dependence in T is similar to that of a standard particle filter for general distributions; see Douc et al. (2014), though, for conditions to get a contraction for the PF. Importantly, for a fixed T it follows that the rates of convergence for Frank-Wolfe in N translate to rates of errors for integrals of functions in H with respect to the predictive distribution p̄_T. Thus if we suppose that H is finite-dimensional, that p̄_t has full support on X for all t, and that the kernel κ is continuous, then by Proposition 1 in Bach et al. (2012) we have that the faster rates for Frank-Wolfe hold, and in particular we could obtain an error bound of O(1/N) with N particles. As far as we know, this is the first explicit... |
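The three regimes in the quoted bound are just the behavior of a geometric sum once the per-step quantities are bounded uniformly (ε_t ≤ ε, ρ_t ≤ ρ, χ_t ≈ 1); a sketch of this step, in our rendering rather than the paper's exact constants:

```latex
\|\mu(\hat p_T) - \mu(\bar p_T)\|
  \;\lesssim\; (1+\rho)\,\varepsilon \sum_{t=1}^{T-1} \rho^{\,T-1-t}
  \;=\;
  \begin{cases}
    O(\varepsilon \rho^{T}) & \rho > 1,\\[2pt]
    O(T\varepsilon)         & \rho = 1,\\[2pt]
    O(\varepsilon)          & \rho < 1 \text{ (a contraction)}.
  \end{cases}
```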

5 | Novel approach to nonlinear/non-Gaussian Bayesian state estimation - Gordon, Salmond, et al. - 1993 |

4 | New Analysis and Results for the Frank-Wolfe Method. arXiv e-prints - Freund, Grigas - 2013
Citation Context ...e note that the standard step size for Frank-Wolfe optimization to get a O(1/k) rate is γ_k = 2/(k+2). The best rate known for general objectives when using FW with γ_k = 1/(k+1) is actually O(log(k)/k) (Freund and Grigas, 2013, Bound 3.2). We make use of the specific form of the MMD objective here to prove the O(1/k) rate, as well as the faster O(1/k^2) rate under additional assumptions. For the rest of this section, we use... |

2 | Using random quasi-Monte-Carlo within particle filters, with application to financial time series - Fearnhead - 2005
Citation Context ...rticle at time t, whereas x^{(i)}_{1:t} is the particle trajectory. While principally the PF provides an approximation of the full joint distribution r_t(x_{1:t}), it is well known that this approximation deteriorates for any marginal of x_s for s ≪ t (Doucet and Johansen, 2011). Hence, the PF is typically only used to approximate marginals of x_s for s ≲ t (fixed-lag smoothing) or s = t (filtering), or for prediction. Algorithm 2 presents the bootstrap particle filtering algorithm (Gordon et al., 1993) from the point of view of propagating an approximate posterior distribution forward in time (see e.g. Fearnhead, 2005). We describe it as propagating an approximation p̂_t(x_{1:t}) of the joint predictive distribution one time step forward with the model dynamics to obtain p̄_{t+1}(x_{t+1}, x_{1:t}) (step 5), and then randomly sampling from it (step 3) to get the new predictive approximation p̂_{t+1}(x_{t+1}, x_{1:t}). As p̂_t is an empirical distribution, p̄_{t+1} is a mixture distribution (the mixture components come from the particles at time t): p̄_{t+1}(x_{t+1}, x_{1:t}) = (1/Ŵ_t) Σ_{i=1}^N [p(y_t|x^{(i)}_t) w^{(i)}_t] [p(x_{t+1}|x^{(i)}_t)] δ_{x^{(i)}_{1:t}}(x_{1:t}), (4) where the first bracketed factor is the mixture weight and the second is the mixture component. We denote the conditional normalization const... |
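The mixture view in (4) suggests a direct implementation of one bootstrap step: pick a mixture component (a past particle, with probability proportional to its weight times the likelihood) and then draw the new state from the dynamics. The sketch below assumes generic callables for the model; the function names are illustrative, not from the paper.

```python
import numpy as np

def bootstrap_pf_step(particles, weights, y, trans_sample, lik, rng):
    """One bootstrap particle filter step, viewed as N draws from the
    mixture (4). `lik(y, x)` evaluates p(y_t | x_t) and
    `trans_sample(x, rng)` draws from p(x_{t+1} | x_t); both are
    model-specific stand-ins."""
    n = len(particles)
    mix_w = weights * lik(y, particles)        # unnormalized mixture weights
    W = mix_w.sum()                            # normalization constant W_t
    idx = rng.choice(n, size=n, p=mix_w / W)   # choose mixture components
    new_particles = trans_sample(particles[idx], rng)  # propagate dynamics
    new_weights = np.full(n, 1.0 / n)          # uniform weights after resampling
    return new_particles, new_weights, W
```

On a linear-Gaussian toy model, the resampled cloud shifts toward the observation before being propagated through the dynamics.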

2 | Sequential quasi-Monte Carlo. arXiv preprint arXiv:1402.4039v5 - Gerber, Chopin - 2014
Citation Context ... One way to improve the particle filter is thus to replace the random sampling stage of step 3 with different sampling mechanisms with lower variance or better approximation properties of the distribution p̄_{t+1} that we are trying to approximate. As we obtain the normalization constants Ŵ_t by integrating the observation probability, it seems natural to look for particle point sets with better integration properties. By replacing random sampling with a quasi-random number sequence, we obtain the already proposed sequential quasi-Monte Carlo scheme (Philomin et al., 2000; Ormoneit et al., 2001; Gerber and Chopin, 2014). The main contribution of our work is to instead propose to use Frank-Wolfe quadrature in step 3 of the particle filter to obtain better (adapted) point sets. 3.2 Sequential kernel herding. In the sequential kernel herding (SKH) algorithm, we simply replace step 3 of Algorithm 2 with p̂_t = FW-Quad(p̄_t, H_t, N). As mentioned in the introduction, many dynamical models used in practice assume Gaussian transitions. Therefore, we will put particular emphasis on the case when (more generally) p(x_t|x_{1:(t−1)}, y_{1:(t−1)}) is a mixture of Gaussians, w... |

1 | Learning with submodular functions: A convex optimization perspective. Foundations and Trends - Bach - 2013