## A new view of automatic relevance determination (2008)

### Download Links

- [books.nips.cc]
- [dsp.ucsd.edu]
- DBLP

### Other Repositories/Bibliography

Venue: NIPS 20

Citations: 68 (9 self)

### Citations

7181 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context: ...hese types of problems, especially when structured dictionaries Φ are being used. 2.1 Algorithm Derivation. To start we note that the log-determinant term of L(γ) is concave in γ (see Section 3.1.5 of [1]), and so can be expressed as a minimum over upper-bounding hyperplanes via log|Σ_y| = min_z z^T γ − g*(z), (7) where g*(z) is the concave conjugate of log|Σ_y| that is defined by the duality rela...
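The hyperplane bound in (7) can be checked numerically. The sketch below assumes the standard SBL covariance Σ_y = λI + Φ diag(γ) Φᵀ (its exact form is not shown in this excerpt); the tangent slope z is the gradient of log|Σ_y| with respect to γ, and g*(z) is recovered from the duality relation g*(z) = zᵀγ₀ − log|Σ_y(γ₀)|. Function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 5, 8, 0.1
Phi = rng.standard_normal((n, m))

def log_det_sigma_y(gamma):
    # Sigma_y = lam*I + Phi diag(gamma) Phi^T  (assumed SBL covariance)
    Sy = lam * np.eye(n) + (Phi * gamma) @ Phi.T
    return np.linalg.slogdet(Sy)[1]

def tangent(gamma0):
    # d/d(gamma_i) log|Sigma_y| = phi_i^T Sigma_y^{-1} phi_i, the slope z
    # of the upper-bounding hyperplane touching at gamma0.
    Sy = lam * np.eye(n) + (Phi * gamma0) @ Phi.T
    z = np.einsum('ij,jk,ki->i', Phi.T, np.linalg.inv(Sy), Phi)
    gstar = z @ gamma0 - log_det_sigma_y(gamma0)  # concave conjugate value
    return z, gstar

gamma0 = rng.uniform(0.5, 2.0, m)
z, gstar = tangent(gamma0)
# Concavity of log|Sigma_y| in gamma implies z^T gamma - g*(z) >= log|Sigma_y(gamma)|
# for every gamma >= 0, with equality at gamma0.
for _ in range(100):
    g = rng.uniform(0.0, 3.0, m)
    assert z @ g - gstar >= log_det_sigma_y(g) - 1e-9
assert abs(z @ gamma0 - gstar - log_det_sigma_y(gamma0)) < 1e-9
```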

3989 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context: ...procedures can be used for the minimization required by Step 2. For example, one attractive option is to convert the problem to an equivalent least absolute shrinkage and selection operator, or 'Lasso' [14], optimization problem according to the following: Lemma 2. The objective function in (11) can be minimized by solving the weighted convex ℓ1-regularized cost function x∗ = arg min_x ‖y − Φx‖_2^2 + 2λ ∑ ...
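A weighted ℓ1 subproblem of the kind named in Lemma 2 can be solved with a generic proximal-gradient (ISTA) loop. A minimal sketch, assuming the cost ‖y − Φx‖²₂ + 2λ ∑_i w_i |x_i| with a hypothetical weight vector `w` (the actual weights are truncated in this excerpt):

```python
import numpy as np

def weighted_lasso_ista(Phi, y, w, lam, n_iter=500):
    """Minimize ||y - Phi x||_2^2 + 2*lam * sum_i w_i*|x_i| by ISTA
    (gradient step on the quadratic term, then soft-thresholding)."""
    n, m = Phi.shape
    L = np.linalg.norm(Phi, 2) ** 2     # gradient of the quadratic is 2L-Lipschitz
    x = np.zeros(m)
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ x - y)
        v = x - grad / (2.0 * L)        # step size 1/(2L)
        thresh = lam * w / L            # = step size * (2*lam*w_i)
        x = np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
    return x
```

With Φ orthonormal and w = 1 this reduces to a single soft-thresholding step at level λ.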

1474 | Linear and nonlinear programming
- Luenberger, Ye
- 2008
Citation Context: ...)) is guaranteed to converge monotonically to a local minimum (or saddle point) of (2). The proof is relatively straightforward and stems directly from the Global Convergence Theorem (see for example [6]). A sketch is as follows: First, it must be shown that the mapping A(·) is compact. This condition is satisfied because if any element of γ is unbounded, L(γ) diverges to infinity. In fact, for a...

947 | Sparse Bayesian learning and the relevance vector machine
- Tipping
- 2001
Citation Context: ...tively prunes away redundant or superfluous features [10]. Here we will describe a special case of ARD called sparse Bayesian learning (SBL) that has been very successful in a variety of applications [15]. Later in Section 4 we will address extensions to more general models. The basic ARD prior incorporated by SBL is p(x; γ) = N(x; 0, diag[γ]), where γ ∈ R^m_+ is a vector of m non-negative hyperpar...
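The ARD prior p(x; γ) = N(x; 0, diag[γ]) quoted here leads to the classic EM-style SBL iteration: compute the Gaussian posterior over x for the current γ, then set each γ_i to the posterior second moment of x_i. A minimal sketch, assuming a Gaussian likelihood y = Φx + e with isotropic noise variance λ (conventional in SBL but not stated in this excerpt); names are illustrative:

```python
import numpy as np

def sbl_em(Phi, y, lam, n_iter=100):
    """EM-style sparse Bayesian learning under the ARD prior
    p(x; gamma) = N(0, diag(gamma)): E-step gives posterior moments of x,
    M-step sets gamma_i <- E[x_i^2] = mu_i^2 + Sigma_x[i, i]."""
    n, m = Phi.shape
    gamma = np.ones(m)
    for _ in range(n_iter):
        G = np.diag(gamma)
        Sy = lam * np.eye(n) + Phi @ G @ Phi.T              # Sigma_y
        mu = G @ Phi.T @ np.linalg.solve(Sy, y)             # posterior mean of x
        Sx = G - G @ Phi.T @ np.linalg.solve(Sy, Phi @ G)   # posterior covariance
        gamma = mu ** 2 + np.diag(Sx)                       # EM update drives many
    return mu, gamma                                        # gamma_i toward zero
```

As γ_i → 0 the corresponding column of Φ is effectively pruned, which is the "prunes away redundant or superfluous features" behavior described in the snippet.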

791 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...e determination (ARD) addresses this problem by regularizing the solution space using a parameterized, data-dependent prior distribution that effectively prunes away redundant or superfluous features [10]. Here we will describe a special case of ARD called sparse Bayesian learning (SBL) that has been very successful in a variety of applications [15]. Later in Section 4 we will address extensions to mo...

715 | Bayesian Interpolation
- MacKay
- 1992
Citation Context: ...hese hyperparameters are estimated from the data by first marginalizing over the coefficients x and then performing what is commonly referred to as evidence maximization or type-II maximum likelihood [7, 10, 15]. Mathematically, this is equivalent to minimizing L(γ) ≜ − log ∫ p(y|x) p(x; γ) dx = − log p(y; γ) ≡ log|Σ_y| + y^T Σ_y^{−1} y, (2) ∗ This research was supported by NIH grants R01DC04855 and R01DC006435....
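Equation (2) can be sanity-checked numerically: integrating x out of a Gaussian likelihood times the ARD prior must reproduce log|Σ_y| + yᵀΣ_y⁻¹y up to a factor of 2 and the additive constant n log 2π (the snippet's ≡). A small Monte Carlo sketch, assuming Σ_y = λI + Φ diag(γ) Φᵀ and noise variance λ (both conventional in SBL but not stated in this excerpt):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 2, 3, 0.5
Phi = rng.standard_normal((n, m))
gamma = np.array([1.0, 0.2, 2.0])
y = rng.standard_normal(n)

# Closed form: with x ~ N(0, diag(gamma)) and y|x ~ N(Phi x, lam*I),
# marginalizing x gives y ~ N(0, Sigma_y), so up to constants
# L(gamma) = log|Sigma_y| + y^T Sigma_y^{-1} y.
Sy = lam * np.eye(n) + (Phi * gamma) @ Phi.T
L = np.linalg.slogdet(Sy)[1] + y @ np.linalg.solve(Sy, y)

# Monte Carlo estimate of the integral in (2), averaging p(y|x) over prior draws.
N = 200_000
X = rng.standard_normal((N, m)) * np.sqrt(gamma)   # samples from the ARD prior
resid = y - X @ Phi.T                              # shape (N, n)
loglik = -0.5 * (resid ** 2).sum(axis=1) / lam - 0.5 * n * np.log(2 * np.pi * lam)
log_py = np.log(np.mean(np.exp(loglik)))           # log of the marginal integral
L_mc = -2.0 * log_py - n * np.log(2.0 * np.pi)     # undo the 1/2 and constant
assert abs(L - L_mc) < 0.3                         # agreement up to MC error
```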

462 | On the convergence properties of the EM algorithm
- Wu
- 1983

268 | Sparse solutions to linear inverse problems with multiple measurement vectors
- Cotter, Rao, et al.
- 2005
Citation Context: ...so show that, in certain settings, no λ-independent, factorial regularization term can achieve similar results. Consequently, the widely used family of ℓp quasi-norms, i.e., ‖x‖_p ≜ ∑_i |x_i|^p, p < 1 [2], or the Gaussian entropy measure ∑_i log|x_i| based on the Jeffreys prior [4] provably fail in this regard. 3.2 Benefits of λ dependency. To explore the properties of h*(x²) regarding λ dependency...
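For concreteness, the two λ-independent, factorial (separable) penalties named in this snippet are easy to write down. The sketch below just evaluates them using the snippet's definitions; the small `eps` keeping the log penalty finite at zero is an implementation convenience, not part of the paper:

```python
import numpy as np

def lp_quasi_norm(x, p):
    """The l_p quasi-norm penalty ||x||_p = sum_i |x_i|^p with p < 1, as in [2]."""
    return float(np.sum(np.abs(x) ** p))

def log_penalty(x, eps=1e-12):
    """Gaussian entropy measure sum_i log|x_i| (Jeffreys prior, [4]),
    with a small eps so exact zeros stay finite."""
    return float(np.sum(np.log(np.abs(x) + eps)))

# Both penalties are separable across coordinates and contain no lambda --
# exactly the class of regularizers the paper argues cannot match ARD.
x = np.array([2.0, 0.5, 0.1])
```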

219 | Sparse signal reconstruction perspective for source localization with sensor arrays
- Malioutov, Cetin, et al.
- 2005
Citation Context: ...olved using the methodology described herein. The primary difference is that Step 2 becomes a second-order cone (SOC) optimization problem for which a variety of techniques exist for its minimization [2, 9]. Another very useful adaptation involves adding a non-negativity constraint on the coefficients x, e.g., non-negative sparse coding. This is easily incorporated into the MAP cost function (15) and op...

120 | Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices
- Fazel, Hindi, et al.
- 2003
Citation Context: ... extended to handle additional constraints (e.g., non-negativity) or model complexity as occurs with general covariance component estimation. A related optimization strategy has also been reported in [3]. The analysis used in deriving this algorithm reveals that ARD is exactly equivalent to performing MAP estimation in x space using a principled, sparsity-inducing prior that is non-factorable and dep...

115 | Fast marginal likelihood maximisation for sparse bayesian models
- Tipping, Faul
- 2003
Citation Context: ...y iteration and so early stopping is always feasible if desired. This produces a highly efficient, global competition among features that is potentially superior to the sequential (greedy) updates of [16] in terms of local minima avoidance in certain cases when Φ is highly overcomplete (i.e., m ≫ n). Moreover, it is also easily extended to handle additional constraints (e.g., non-negativity) or model ...

87 | Comparison of approximate methods for handling hyperparameters
- MacKay
- 1999
Citation Context: ... this matter is not discussed in [16]. 3 Relating ARD to MAP Estimation. In hierarchical models such as ARD and SBL there has been considerable debate over how to best perform estimation and inference [8]. Do we add a hyperprior and then integrate out γ and perform MAP estimation directly on x? Or is it better to marginalize over the coefficients x and optimize the hyperparameters γ as we have describ...

82 | Wavelet shrinkage denoising using the non-negative garrote
- Gao
- 1998
Citation Context: ... As a final point of comparison, the actual weight estimate obtained from solving (15) when Φ^T Φ = I is equivalent to the non-negative garrote estimator that has been advocated for wavelet shrinkage [5, 18]. [Figure residue omitted: axis labels and tick values from plots of − log p(x) comparing the ARD prior against ∑_i |x_i| and I[x_i ≠ 0] penalties over x_i ∈ [−2, 2].]
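For orthonormal Φ, the non-negative garrote referenced here has a simple closed form: shrink each least-squares coefficient x̂_i by the factor (1 − λ/x̂_i²)₊. A minimal sketch under that ΦᵀΦ = I assumption (the exact placement and scaling of λ is a convention and may differ from the paper's (15)):

```python
import numpy as np

def nn_garrote(Phi, y, lam):
    """Non-negative garrote for orthonormal Phi (Phi^T Phi = I):
    x_i = xls_i * max(1 - lam / xls_i^2, 0), with xls = Phi^T y."""
    xls = Phi.T @ y  # least-squares coefficients under orthonormality
    with np.errstate(divide="ignore"):
        shrink = np.maximum(1.0 - lam / xls ** 2, 0.0)
    return np.where(xls == 0.0, 0.0, xls * shrink)
```

Large coefficients are barely shrunk while coefficients with x̂_i² ≤ λ are set exactly to zero, a middle ground between hard and soft thresholding.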

47 | Adaptive sparseness using Jeffreys prior
- Figueiredo
- 2001
Citation Context: ...term can achieve similar results. Consequently, the widely used family of ℓp quasi-norms, i.e., ‖x‖_p ≜ ∑_i |x_i|^p, p < 1 [2], or the Gaussian entropy measure ∑_i log|x_i| based on the Jeffreys prior [4] provably fail in this regard. 3.2 Benefits of λ dependency. To explore the properties of h*(x²) regarding λ dependency alone, we adopt the simplifying assumption Φ^T Φ = I. (Later we investigate t...

28 | Recovery of jointly sparse signals from few random projections
- Wakin, Sarvotham, et al.
- 2005
Citation Context: ...ularly useful ARD-based model. But much of the analysis can be extended to handle a variety of alternative data likelihoods and priors. A particularly useful adaptation relevant to compressed sensing [17], manifold learning [13], and neuroimaging [12, 18] is as follows. First, the data y can be replaced with an n × t observation matrix Y which is generated via an unknown coefficient matrix X. The assum...

20 | Bayesian Methods for Finding Sparse Representations
- Wipf
- 2006
Citation Context: ...resembles a scaled version of the ℓ1 norm. The implicit ARD prior naturally handles this transition, becoming sparser as λ decreases and vice versa. Hence the following property, which is easy to show [18]: Lemma 3. When Φ^T Φ = I, (15) has no local minima whereas (17) has 2^M local minima. Use of the ℓ1 norm in place of h*(x²) also yields no local minima; however, it is a much looser approximation...

19 | Selecting landmark points for sparse manifold learning
- Silva, Marques, et al.
- 2005
Citation Context: ...model. But much of the analysis can be extended to handle a variety of alternative data likelihoods and priors. A particularly useful adaptation relevant to compressed sensing [17], manifold learning [13], and neuroimaging [12, 18] is as follows. First, the data y can be replaced with an n × t observation matrix Y which is generated via an unknown coefficient matrix X. The assumed likelihood model and ...

7 | Neuromagnetic source imaging of spontaneous and evoked human brain dynamics
- Ramirez
- 2005
Citation Context: ...analysis can be extended to handle a variety of alternative data likelihoods and priors. A particularly useful adaptation relevant to compressed sensing [17], manifold learning [13], and neuroimaging [12, 18] is as follows. First, the data y can be replaced with an n × t observation matrix Y which is generated via an unknown coefficient matrix X. The assumed likelihood model and prior are p(Y|X) ∝ exp(−...

5 | Wavelet footprints and sparse bayesian learning for DNA copy number change analysis
- Pique-Regi
- 2007
Citation Context: ...enting drastically fewer local minima than competing priors. This might possibly explain the superior performance of ARD/SBL over Lasso in a variety of disparate disciplines where sparsity is crucial [11, 12, 18]. These ideas raise a key question: If we do not limit ourselves to factorable, Φ- and λ-independent regularization terms/priors as is commonly done, then what is the optimal prior p(x) in the context...