## regression. Submitted to the Electronic Journal of Statistics. 2013. <hal-00846715> (2013)

### Citations

7738 | Matrix Analysis
- Horn, Johnson
- 1990
Citation Context ...ds to the fixed-design estimator $\hat f_M = A_M y \in \mathbb{R}^{np}$, with $A_M = A_{M,K} := \tilde K_M (\tilde K_M + np I_{np})^{-1} = (M^{-1} \otimes K)\,((M^{-1} \otimes K) + np I_{np})^{-1}$, where $\otimes$ denotes the Kronecker product (see the textbook of Horn and Johnson =-=[16]-=- for simple properties of the Kronecker product). Remark 3. This setting also captures the single-task setting. Taking $j \in \{1, \dots, p\}$, $f^j = (f^j(X_1), \dots, f^j(X_n))^\top$ being the target-signal fo... |
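
The fixed-design estimator quoted in this excerpt can be illustrated numerically. The following is a minimal sketch, assuming NumPy, a symmetric positive-definite regularization matrix `M`, and a stacked response vector `y` in which the first `n` entries belong to the first task. The function name `multitask_ridge_fit` is hypothetical and does not come from the cited paper.

```python
import numpy as np

def multitask_ridge_fit(K, M, y):
    """Compute A_M y with A_M = (M^{-1} ⊗ K)((M^{-1} ⊗ K) + n p I)^{-1}.

    K : (n, n) kernel matrix, M : (p, p) regularization matrix (assumed
    invertible), y : (n * p,) responses stacked task by task.
    """
    n = K.shape[0]
    p = M.shape[0]
    # Kronecker product: p x p grid of n x n blocks, matching task-major stacking.
    K_tilde = np.kron(np.linalg.inv(M), K)
    # Solve the linear system instead of forming the inverse explicitly.
    return K_tilde @ np.linalg.solve(K_tilde + n * p * np.eye(n * p), y)
```

With `p = 1` and `M = [[lam]]`, the result reduces to the single-task kernel ridge fit `K (K + n * lam * I)^{-1} y`, which agrees with the remark that this setting also captures the single-task case.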

1297 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context ...t task always being stored in the first entries of the vector, and so on. We want to estimate $f$ using elements of a particular function set. Let $\mathcal{F} \subset L^2(P)$ be a reproducing kernel Hilbert space (RKHS) =-=[4]-=-, with kernel $k$ and feature map $\Phi : \mathcal{X} \to \mathcal{F}$, which give us the positive semidefinite kernel matrix $K = (k(X_i, X_\ell))_{1 \le i, \ell \le n} \in \mathcal{S}_n^+(\mathbb{R})$. As done by Solnon et al. [27] we extend the multi-task estimators ge... |

677 | Multitask learning.
- Caruana
- 1997
Citation Context ...s on the multi-task setting. Advantages of the multi-task procedure over the single task one were first shown experimentally in various situations by, for instance, Thrun and O’Sullivan [29], Caruana =-=[11]-=- or Bakker and Heskes [6]. For classification, Ben-David and Schuller [8] compare upper bounds on multi-task and single-task classification errors, and showed that the multi-task estimator could, in s... |

483 | Estimation with quadratic loss.
- James, Stein
- 1961
Citation Context ...that uniformly attains a lower quadratic risk by shrinking the estimators along the different dimensions towards an arbitrary point. An explicit form of such an estimator was given by James and Stein =-=[19]-=-, yielding the famous James-Stein estimator. This phenomenon, now known as “Stein’s paradox”, was widely studied in the following years and the behaviour of this estimator was confirmed by empiric... |
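
The shrinkage described in this excerpt can be sketched in a few lines. The block below uses the classical James-Stein formula for a single observation `y` of a Gaussian mean with known noise variance, which is an assumption here because the excerpt does not state the formula. It requires dimension at least 3. The function name `james_stein` and the parameters `sigma2` and `target` are hypothetical.

```python
import numpy as np

def james_stein(y, sigma2=1.0, target=None):
    """Shrink the observation y towards `target` (default: the origin).

    Classical form: target + (1 - (d - 2) * sigma2 / ||y - target||^2) * (y - target),
    assuming y ~ N(mu, sigma2 * I_d) with d >= 3.
    """
    y = np.asarray(y, dtype=float)
    d = y.size
    if d < 3:
        raise ValueError("the James-Stein estimator requires dimension d >= 3")
    if target is None:
        target = np.zeros(d)
    diff = y - target
    shrink = 1.0 - (d - 2) * sigma2 / np.dot(diff, diff)
    return target + shrink * diff
```

For example, with `y = (3, 4, 0)` and unit variance, the shrinkage factor is `1 - 1/25 = 0.96`, so every coordinate is pulled slightly towards the origin.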

443 | A framework for learning predictive structures from multiple tasks and unlabeled data.
- Ando, Zhang
- 2005
Citation Context ... linear set up (also known as group lasso) by Obozinski et al. [25] and Lounici et al. [23], in multiple kernel learning by Koltchinskii and Yuan [21] or in semi-supervised learning by Ando and Zhang =-=[1]-=-. The kernel version of this was also studied [2, 18], a convex relaxation leading to a trace norm regularization and allowing the calibration of parameters. Another point of view was brought by Ben-D... |

343 | Inadmissibility of the usual estimator for the mean of a multivariate distribution.
- Stein
- 1956

251 | Learning multiple tasks with kernel methods.
- Evgeniou, CA, et al.
- 2005
Citation Context ...1. Introduction. Increasing the sample size is the most common way to improve the performance of statistical estimators. In some cases (see, for instance, the experiments of Evgeniou et al. =-=[13]-=- on customer data analysis or those of Jacob et al. [18] on molecule binding problems), having access to some new data may be impossible, often due to experimental limitations. One way to circumvent t... |

209 | Smoothing Spline ANOVA Models.
- Gu
- 2002
Citation Context ...n equality, although only the equivalence $\asymp$ is needed. Example 1. This example, related to Assumptions $(H_{AV}(\delta, C_1, C_2))$ and $(H_K(\beta))$ by taking $\beta = m$ and $2\delta = k + 2$, is detailed by Wahba [30] and by Gu =-=[15]-=-. Let $P(2\pi)$ be the set of all square-integrable $2\pi$-periodic functions on $\mathbb{R}$, $m \in \mathbb{N}^\star$, and define $\mathcal{H} = \{ f \in P(2\pi),\ f^{(m)}|_{[0,2\pi]} \in L^2[0, 2\pi] \}$. This set $\mathcal{H}$ has a RKHS structure, with a reproducing kernel ... |

193 | A model of inductive bias learning.
- Baxter
- 2000
Citation Context ...d single-task oracles 4 problems being similar if, given a group of permutations of the input set, a dataset of the one can be permuted in a dataset of the other. They followed the analysis of Baxter =-=[7]-=-, which shows very general bounds on the risk of a multitask estimator in a model-selection framework, the sets of all models reflecting the insight the statistician has on the multi-task setting. Adv... |

189 | Spline Models for Observational Data, volume 59
- Wahba
- 1990
Citation Context ...C1, C2)) hold in equality, although only the equivalence $\asymp$ is needed. Example 1. This example, related to Assumptions $(H_{AV}(\delta, C_1, C_2))$ and $(H_K(\beta))$ by taking $\beta = m$ and $2\delta = k + 2$, is detailed by Wahba =-=[30]-=- and by Gu [15]. Let $P(2\pi)$ be the set of all square-integrable $2\pi$-periodic functions on $\mathbb{R}$, $m \in \mathbb{N}^\star$, and define $\mathcal{H} = \{ f \in P(2\pi),\ f^{(m)}|_{[0,2\pi]} \in L^2[0, 2\pi] \}$. This set $\mathcal{H}$ has a RKHS structure, with a repr... |

154 | Task clustering and gating for Bayesian multitask learning.
- Bakker, Heskes
- 2003
Citation Context ...g. Advantages of the multi-task procedure over the single task one were first shown experimentally in various situations by, for instance, Thrun and O’Sullivan [29], Caruana [11] or Bakker and Heskes =-=[6]-=-. For classification, Ben-David and Schuller [8] compare upper bounds on multi-task and single-task classification errors, and showed that the multi-task estimator could, in some settings, need less t... |

114 | Exploiting task relatedness for multiple task learning.
- Ben-David, Schuller
- 2003
Citation Context ...on of this was also studied [2, 18], a convex relaxation leading to a trace norm regularization and allowing the calibration of parameters. Another point of view was brought by Ben-David and Schuller =-=[8]-=-, defining a multi-task framework in classification, two classification M. Solnon/Comparison between multi-task and single-task oracles 4 problems being similar if, given a group of permutations of th... |

111 | Discovering Structure in Multiple Learning Tasks: The TC Algorithm.
- O’Sullivan, Thrun
- 1996
Citation Context ...tatistician has on the multi-task setting. Advantages of the multi-task procedure over the single task one were first shown experimentally in various situations by, for instance, Thrun and O’Sullivan =-=[29]-=-, Caruana [11] or Bakker and Heskes [6]. For classification, Ben-David and Schuller [8] compare upper bounds on multi-task and single-task classification errors, and showed that the multi-task estimat... |

85 | Clustered multi-task learning: A convex formulation.
- Jacob, Vert, et al.
- 2009
Citation Context ... the most common way to improve the performance of statistical estimators. In some cases (see, for instance, the experiments of Evgeniou et al. [13] on customer data analysis or those of Jacob et al. =-=[18]-=- on molecule binding problems), having access to some new data may be impossible, often due to experimental limitations. One way to circumvent those constraints is to use datasets from several related... |

78 | Support union recovery in high-dimensional multivariate regression.
- Obozinski, Wainwright, et al.
- 2011
Citation Context ...s that the functions all share a few common features, and can be expressed by a similar regularization term. This idea was expressed in a linear set up (also known as group lasso) by Obozinski et al. =-=[25]-=- and Lounici et al. [23], in multiple kernel learning by Koltchinskii and Yuan [21] or in semi-supervised learning by Ando and Zhang [1]. The kernel version of this was also studied [2, 18], a convex ... |

74 | Concentration Inequalities and Model Selection: École d’Été de Probabilités de Saint-Flour 23.
- Massart
- 2003
Citation Context ...andom variables $(B_i)_{i \in \{1, \dots, N\}}$ follow a Bernoulli distribution of parameter $\mathbb{P}(R^\star_{\mathrm{MT}} < R^\star_{\mathrm{ST}})$, we can apply Hoeffding’s inequality =-=[24]-=- and see that, for every $\varepsilon > 0$, $[\bar B_N - \varepsilon, 1]$ is a confidence interval of level $1 - e^{-2N\varepsilon^2}$ for $\mathbb{P}(R^\star_{\mathrm{MT}} < R^\star_{\mathrm{ST}})$. This leads to the following p-value: $\pi_1 = e^{-2N(\bar B_N - 0.5)^2}$ if $\bar B_N \ge 0.5$, and $\pi_1 = 0$ otherwise. I... |
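
The confidence interval and p-value in this excerpt translate directly into code. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the "otherwise" branch simply reproduces the value printed in the excerpt rather than an independently derived one.

```python
import math

def hoeffding_lower_interval(indicators, eps):
    """Return ((lower, 1.0), level) for the interval [mean - eps, 1]."""
    n = len(indicators)
    b_bar = sum(indicators) / n
    level = 1.0 - math.exp(-2.0 * n * eps ** 2)
    return (b_bar - eps, 1.0), level

def hoeffding_p_value(indicators):
    """p-value pi_1 as written in the excerpt, from 0/1 indicators B_i."""
    n = len(indicators)
    b_bar = sum(indicators) / n
    if b_bar >= 0.5:
        return math.exp(-2.0 * n * (b_bar - 0.5) ** 2)
    # Value taken verbatim from the excerpt for the case b_bar < 0.5.
    return 0.0
```

Here each indicator records whether the multi-task oracle risk was below the single-task oracle risk in one of the `N` repetitions, so a small p-value supports the claim that this happens more than half of the time.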

70 | Estimation of high-dimensional low rank matrices.
- Rohde, Tsybakov
- 2011
Citation Context ... showed that the multi-task estimator could, in some settings, need less training data to reach the same upper bounds. The low dimensional linear regression setting was analysed by Rohde and Tsybakov =-=[26]-=-, who showed that, under sparsity assumptions, restricted isometry conditions and using the trace-norm regularization, the multi-task estimator achieves the rates of a single-task estimator with a np-s... |

67 | Oracle inequalities and optimal inference under group sparsity.
- Lounici, Pontil, et al.
- 2011
Citation Context ... share a few common features, and can be expressed by a similar regularization term. This idea was expressed in a linear set up (also known as group lasso) by Obozinski et al. [25] and Lounici et al. =-=[23]-=-, in multiple kernel learning by Koltchinskii and Yuan [21] or in semi-supervised learning by Ando and Zhang [1]. The kernel version of this was also studied [2, 18], a convex relaxation leading to a ... |

64 | Convex multi-task feature learning.
- Argyriou, Evgeniou, Pontil
- 2008
Citation Context ...zinski et al. [25] and Lounici et al. [23], in multiple kernel learning by Koltchinskii and Yuan [21] or in semi-supervised learning by Ando and Zhang [1]. The kernel version of this was also studied =-=[2, 18]-=-, a convex relaxation leading to a trace norm regularization and allowing the calibration of parameters. Another point of view was brought by Ben-David and Schuller [8], defining a multi-task framewor... |

62 | Stein’s paradox in statistics.
- Efron, Morris
- 1977
Citation Context ...enon, now known as “Stein’s paradox”, was widely studied in the following years and the behaviour of this estimator was confirmed by empirical studies, in particular the one from Efron and Morris =-=[12]-=-. This first example clearly shows the goals of the multi-task procedure: an advantage is gained by borrowing information from different tasks (here, by shrinking the estimators along the different di... |

54 | Sparsity in multiple kernel learning.
- Koltchinskii, Yuan
- 2010
Citation Context ...ilar regularization term. This idea was expressed in a linear set up (also known as group lasso) by Obozinski et al. [25] and Lounici et al. [23], in multiple kernel learning by Koltchinskii and Yuan =-=[21]-=- or in semi-supervised learning by Ando and Zhang [1]. The kernel version of this was also studied [2, 18], a convex relaxation leading to a trace norm regularization and allowing the calibration of p... |

53 | Optimal rates for the regularized least-squares algorithm.
- Caponnetto, De Vito
- 2007
Citation Context ...al functions, which have the form of the risk of a kernel ridge estimator. The risk of those estimators has already been widely studied. Johnstone [20] (see also the article of Caponnetto and De Vito =-=[10]-=- for random design) showed that, for a single-task ridge estimator, if the coefficients of the decomposition of the input function on the eigenbasis of the kernel decrease as $i^{-2\delta}$, with $2\delta > 1$, then t... |

45 | Minimax Bayes, asymptotic minimax and sparse wavelet priors. Statistical decision theory and related topics.
- Johnstone
- 1994
Citation Context ...in the multi-task risk, we just had to optimize several functions, which have the form of the risk of a kernel ridge estimator. The risk of those estimators has already been widely studied. Johnstone =-=[20]-=- (see also the article of Caponnetto and De Vito [10] for random design) showed that, for a single-task ridge estimator, if the coefficients of the decomposition of the input function on the eigenbasi... |

27 | Data-driven calibration of linear estimators with minimal penalties
- Arlot, Bach
- 2011
Citation Context ...s prefer the second formulation and use the matrices MAV instead of the matrices MSD. 3. Decomposition of the risk A fully data-driven selection of the hyper-parameters was proposed by Arlot and Bach =-=[3]-=-, for the single-task ridge estimator, and by Solnon et al. [27] for the multi-task estimator. The single-task estimator is shown to have a risk which is close to the single-task oracle-risk (with a f... |

16 | Adaptive multivariate ridge regression.
- Brown, Zidek
- 1980
Citation Context ...wer to this question. The multi-task regression setting, which could also be called “multivariate regression”, has already been studied in different papers. It was first introduced by Brown and Zidek =-=[9]-=- in the case of ridge regression, and then adapted by Evgeniou et al. [13] in its kernel form. Another view of the meaning of “task similarity” is that the functions all share a few common features, a... |

13 | Sharp analysis of low-rank kernel matrix approximations.
- Bach
- 2013
Citation Context ...minimax rates for the estimation of this input function is of order $n^{1/(2\delta) - 1}$. The kernel ridge estimator is then known to be minimax optimal, under certain regularity assumptions (see the work of Bach =-=[5]-=- for more details). If the eigenvalues of the kernel are known to decrease as $i^{-2\beta}$, then a single-task ridge estimator is minimax optimal under the following assumption: $1 < 2\delta < 4\beta + 1$. $(H_M(\beta, \delta))$ T... |

12 | An analysis of random design linear regression
- Hsu, Kakade, et al.
- 2011
Citation Context ...t points as fixed and want to predict the output of the functions $F^j$ on those input points only. The analysis could be transferred to the random-design setting by using tools developed by Hsu et al. =-=[17]-=-. For an estimator $(\hat F^1, \dots, \hat F^p)$, the natural quadratic risk to consider is $\mathbb{E}\left[ \frac{1}{np} \sum_{j=1}^{p} \sum_{i=1}^{n} (\hat F^j(X_i) - F^j(X_i))^2 \,\middle|\, (X_1, \dots, X_n) \right]$. For the sake of simplicity, all the expectations t... |

10 | Asymptotically optimal regularization in smooth parametric models. Neural Information Processing Systems.
- Liang, Bach, et al.
- 2009
Citation Context ...nder sparsity assumptions, restricted isometry conditions and using the trace-norm regularization, the multi-task estimator achieves the rates of a single-task estimator with a np-sample. Liang et al. =-=[22]-=- also obtained a theoretical criterion, applicable to the linear regression setting and unfortunately non-observable, which tells when the multi-task estimator asymptotically has a lower risk than the... |

8 | Multi-task regression using minimal penalties.
- Solnon, Arlot, et al.
- 2012
Citation Context ...ifferent tasks together. One of the main questions that is asked is to assert whether the multi-task estimator has a lower risk than any single-task estimator. It was recently proved by Solnon et al. =-=[27]-=- that a fully data-driven calibration of this procedure is possible, given some assumptions on the set of matrices used to regularize—which correspond to prior knowledge on the tasks. Under those assu... |

6 | Multi-task averaging.
- Feldman, Gupta, et al.
- 2012