
## The complexity of large-scale convex programming under a linear optimization oracle. (2013)

Citations: | 11 - 1 self |

### Citations

670 | Simulation and the Monte Carlo Method - Rubinstein - 1981
Citation Context .... The intuition underlying such an approach is that convolving two functions yields a new function that is at least as smooth as the smoother of the two original functions. In particular, let $\mu$ denote the density of a random variable with respect to Lebesgue measure and consider the function $f_\mu$ given by $f_\mu(x) := (f * \mu)(x) = \int_{\mathbb{R}^n} f(y)\mu(x - y)\,dy = \mathbb{E}_\mu[f(x + Z)]$, where $Z$ is a random variable with density $\mu$. Since $\mu$ is a density with respect to Lebesgue measure, $f_\mu$ is differentiable [5]. The above convolution-based smoothing technique has been extensively studied in stochastic optimization, e.g., [5,11,20,30,31]. For the sake of simplicity, we assume throughout this subsection that $\|\cdot\| = \|\cdot\|_2$ and $Z$ is uniformly distributed over a certain Euclidean ball. The following result is known in the literature (see, e.g., [11]). Lemma 3. Let $\xi$ be uniformly distributed over the $\ell_2$-ball $B_2(0, 1) := \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ and let $u > 0$ be given. Suppose that (1.4) holds for any $x, y \in X + uB_2(0, 1)$. Then the following statements hold for the function $f_u(\cdot)$ given by $f_u(x) := \mathbb{E}_\mu[f(x + u\xi)]$. (3.24) a) $f(x) \le f_u(x) \le f(x) + Mu$; b) $f_u(x)$ has $M\sqrt{n}/u$-Lipschitz continuous gradient with respect to $\|\cdot\|_2$; c) $\mathbb{E}[f'(x + u\xi)] = f$... |
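
The bounds in Lemma 3 a) can be checked numerically. The following is a minimal sketch and not code from the paper: it estimates $f_u(x)$ by Monte Carlo sampling for an illustrative choice of $f$ (the $\ell_1$-norm, for which $M = \sqrt{n}$ with respect to $\|\cdot\|_2$). The function names `sample_unit_ball` and `smoothed_value`, the sample size, and the test point are all assumptions made for illustration.

```python
import numpy as np

def sample_unit_ball(rng, n, size):
    """Draw `size` points uniformly from the unit l2-ball in R^n."""
    g = rng.standard_normal((size, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.random(size) ** (1.0 / n)               # radius with density proportional to r^(n-1)
    return g * r[:, None]

def smoothed_value(f, x, u, rng, num_samples=10_000):
    """Monte Carlo estimate of f_u(x) = E[f(x + u*xi)], xi uniform on the unit ball."""
    xi = sample_unit_ball(rng, x.size, num_samples)
    return float(np.mean([f(x + u * z) for z in xi]))

rng = np.random.default_rng(0)
n, u = 5, 0.1
f = lambda v: float(np.abs(v).sum())   # illustrative nonsmooth f (assumption)
M = np.sqrt(n)                         # Lipschitz constant of the l1-norm w.r.t. l2
x = np.zeros(n)                        # the kink of f, where smoothing changes the value
f_u = smoothed_value(f, x, u, rng)
print(f(x), f_u, f(x) + M * u)         # Lemma 3 a): f(x) <= f_u(x) <= f(x) + M*u
```

At the kink $x = 0$ the smoothed value is strictly larger than $f(x)$, while away from the kinks (where $f$ is locally linear over the whole ball) the two coincide, so the estimate there would differ from $f(x)$ only by sampling noise.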

539 | Introductory lectures on convex optimization: a basic course - Nesterov - 2003 |

519 | Smooth minimization of non-smooth functions - Nesterov
Citation Context ...oint $x \in X$ s.t. $f(x) - f^* \le \epsilon$, cannot be smaller than $\mathcal{O}(1/\epsilon^2)$ if $n$ is sufficiently large. In addition, if $f$ is a general smooth convex function satisfying $\|f'(x) - f'(y)\|_* \le L\|x - y\|, \ \forall x, y \in X$, (1.5) then the number of iterations required by any first-order method to find an $\epsilon$-solution of (1.1) cannot be smaller than $\mathcal{O}(1/\sqrt{\epsilon})$ if $n$ is large enough. These lower complexity bounds can be achieved, for example, by the aforementioned subgradient (mirror) descent method and Nesterov’s method, respectively, for nonsmooth and smooth convex optimization. In addition, in a recent breakthrough paper [28], Nesterov studied an important class of saddle point problems with $f$ given by $f(x) = \max_{y \in Y} \{\langle Ax, y\rangle - \hat{f}(y)\}$. (1.6) Here $Y \subseteq \mathbb{R}^m$ is a convex compact set, $A : \mathbb{R}^n \to \mathbb{R}^m$ a linear operator and $\hat{f} : Y \to \mathbb{R}$ is a simple convex function. Although $f$ given by (1.6) is nonsmooth in general, Nesterov showed that it can be closely approximated by a smooth function. Accordingly, he devised a novel smoothing scheme that can achieve the $\mathcal{O}(1/\epsilon)$ complexity for solving this class of saddle point problems. However, under the assumption that only linear subproblems given in the form of (1.2) (rather than (1.3)) are all... |

491 | A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967 |

335 | Problem complexity and method efficiency in optimization - Nemirovsky, Yudin - 1983
Citation Context ...ER Award CMMI-1254446. Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611 (email: glan@ise.ufl.edu). ... solution of subproblems given in the form of $\operatorname{Argmin}_{x \in X} \langle p, x\rangle$. (1.2) In particular, if $p$ is computed based on first-order information, then we call these algorithms first-order LCP methods. Clearly, the difference between first-order LCP methods and the more general first-order methods lies in the restrictions on the format of the subproblems. For example, in the well-known subgradient (mirror) descent method [25] and Nesterov’s method [26,27], we solve the projection (or prox-mapping) subproblems given in the form of $\operatorname{argmin}_{x \in X} \{\langle p, x\rangle + d(x)\}$. (1.3) Here $d : X \to \mathbb{R}$ is a certain strongly convex function (e.g., $d(x) = \|x\|_2^2/2$). The development of LCP methods dates back to the conditional gradient (CG) method (a.k.a. the Frank-Wolfe algorithm) developed by Frank and Wolfe in 1956 [13]. This method has recently regained interest from both the optimization and machine learning communities (see, e.g., [1–3,7,17–19,23,32,33]) mainly for the following reasons. – Low iteration cost. In many cases, the solution of ... |

309 | An algorithm for quadratic programming - Frank, Wolfe - 1956
Citation Context ...thods. Clearly, the difference between first-order LCP methods and the more general first-order methods lies in the restrictions on the format of the subproblems. For example, in the well-known subgradient (mirror) descent method [25] and Nesterov’s method [26,27], we solve the projection (or prox-mapping) subproblems given in the form of $\operatorname{argmin}_{x \in X} \{\langle p, x\rangle + d(x)\}$. (1.3) Here $d : X \to \mathbb{R}$ is a certain strongly convex function (e.g., $d(x) = \|x\|_2^2/2$). The development of LCP methods dates back to the conditional gradient (CG) method (a.k.a. the Frank-Wolfe algorithm) developed by Frank and Wolfe in 1956 [13]. This method has recently regained interest from both the optimization and machine learning communities (see, e.g., [1–3,7,17–19,23,32,33]) mainly for the following reasons. – Low iteration cost. In many cases, the linear subproblem (1.2) is much easier to solve than the nonlinear subproblem (1.3). For example, if $X$ is a spectrahedron given by $X = \{x \in \mathbb{R}^{n \times n} : \operatorname{Tr}(x) = 1, x \succeq 0\}$, the solution of (1.2) can be much faster than that of (1.3). – Simplicity. The CG method is simple to implement since it does not require the selection of the distance function $d(x)$ in (1.3) and the fine... |
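
The spectrahedron example can be made concrete. The sketch below is an illustration rather than the paper's implementation: it relies on the standard fact that a linear function over the spectrahedron is minimized at a rank-one matrix $vv^T$, where $v$ is a unit eigenvector for the smallest eigenvalue of the symmetric part of $p$. The function name `lo_oracle_spectrahedron` is hypothetical, and `numpy.linalg.eigh` (a full eigendecomposition) is used only for brevity; a practical oracle would compute just the extreme eigenvector with an iterative method, which is where the saving over a projection comes from.

```python
import numpy as np

def lo_oracle_spectrahedron(p):
    """Minimize <p, x> over {x : Tr(x) = 1, x PSD}; returns (minimizer, optimal value)."""
    p_sym = 0.5 * (p + p.T)                    # <p, x> depends only on the symmetric part
    eigvals, eigvecs = np.linalg.eigh(p_sym)   # eigenvalues returned in ascending order
    v = eigvecs[:, 0]                          # unit eigenvector of the smallest eigenvalue
    return np.outer(v, v), eigvals[0]

rng = np.random.default_rng(0)
p = rng.standard_normal((4, 4))                # arbitrary illustrative input
x_star, lam_min = lo_oracle_spectrahedron(p)
print(np.trace(x_star), np.sum(p * x_star), lam_min)
```

The returned matrix has rank one, which is the source of the low-rank structure of CG iterates noted in the excerpt.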

267 | Robust stochastic approximation approach to stochastic programming - Nemirovski, Juditsky, et al. - 2009
Citation Context ... (CP) models for machine learning, image processing, and polynomial optimization, etc. The CP problems arising from these applications, however, are often of high dimension and hence challenging to solve. In particular, they are generally beyond the capability of second-order interior-point methods due to the highly demanding iteration costs of these optimization techniques. This has motivated the currently active research on first-order methods, which possess cheaper iteration costs for large-scale CP, including Nesterov’s optimal method [26–28] and several stochastic first-order algorithms in [24,21]. These optimization algorithms are relatively simple, and suitable for situations where low or moderate solution accuracy is sought. In this paper, we study a different class of optimization algorithms, referred to as linear-optimization-based convex programming (LCP) methods, for large-scale CP. Specifically, consider the CP problem of $f^* := \min_{x \in X} f(x)$, (1.1) where $X \subseteq \mathbb{R}^n$ is a convex compact set and $f : X \to \mathbb{R}$ is a closed convex function. The LCP methods solve problem (1.1) by iteratively calling a linear optimization (LO) oracle, which, for a given input vector $p \in \mathbb{R}^n$, computes the... |

180 | On accelerated proximal gradient methods for convex-concave optimization. - Tseng - 2008
Citation Context ... $= \sum_{i=1}^{k} \theta_i$. Call the LO oracle to compute $x_k \in \operatorname{Argmin}_{x \in X} \langle p_k, x\rangle$. Set $y_k = (1 - \alpha_k)y_{k-1} + \alpha_k x_k$ for some $\alpha_k \in [0, 1]$. end for. Clearly, the above PDA-CG method is also a special LCP algorithm. While the input vector $p_k$ to the LO oracle is set to $f'(z_{k-1})$ in the PA-CG method in the previous subsection, the vector $p_k$ in the PDA-CG method is defined as a weighted average of $f'(z_{i-1})$, $i = 1, \ldots, k$, for some properly chosen weights $\theta_i$, $i = 1, \ldots, k$. This algorithm can also be viewed as the projection-free version of an $\infty$-memory variant of Nesterov’s accelerated gradient method as stated in [28,34]. Note that by convexity of $f$, the function $\Psi_k(x)$ given by $\Psi_k(x) := \begin{cases} 0, & k = 0, \\ \Theta_k^{-1} \sum_{i=1}^{k} \theta_i l_f(z_{i-1}; x), & k \ge 1, \end{cases}$ (4.6) underestimates $f(x)$ for any $x \in X$. In particular, by the definition of $x_k$ in Algorithm 5, we have $\Psi_k(x_k) \le \Psi_k(x) \le f(x), \ \forall x \in X$, (4.7) and hence $\Psi_k(x_k)$ provides a lower bound on the optimal value $f^*$ of problem (1.1). In order to establish the convergence of the PDA-CG method, we first need to show a simple technical result about $\Psi_k(x_k)$. Lemma 4. Let $\{x_k\}$ and $\{z_k\}$ be the two sequences computed by the PDA-CG method. We have $\theta_k l_f(z_{k-1}; x_k) \le \Theta_k \Psi_k(x_k) - \Theta_{k-1}\Psi_{k-1}(x_{k-1}), \ k = 1, 2,$ ... |
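
The lower bound in (4.7) can be illustrated with a short sketch. It assumes that $l_f(z; x)$ denotes the linearization $f(z) + \langle f'(z), x - z\rangle$, which the excerpt does not spell out, and it does not reproduce the PDA-CG rule for choosing $z_k$ (elided in the excerpt): the query points `zs`, the weights `thetas`, the simplex feasible set, and the quadratic objective are arbitrary illustrative choices, and all function names are hypothetical. Since $\Psi_k$ is affine in $x$, minimizing it over $X$ is a single LO oracle call with $p_k$.

```python
import numpy as np

def lo_oracle_simplex(p):
    """LO oracle for the probability simplex: a vertex minimizing <p, x>."""
    x = np.zeros(p.size)
    x[np.argmin(p)] = 1.0
    return x

def lower_bound(f, grad, zs, thetas, lo_oracle):
    """Return (Psi_k(x_k), x_k) with x_k in Argmin_{x in X} <p_k, x>."""
    thetas = np.asarray(thetas, dtype=float)
    grads = np.array([grad(z) for z in zs])
    big_theta = thetas.sum()                       # Theta_k = sum_i theta_i
    p_k = thetas @ grads / big_theta               # weighted average of gradients
    # Psi_k(x) = const + <p_k, x>, where const collects the x-independent terms
    const = sum(t * (f(z) - g @ z) for t, z, g in zip(thetas, zs, grads)) / big_theta
    x_k = lo_oracle(p_k)
    return const + p_k @ x_k, x_k

b = np.array([0.2, 0.5, 0.3])                      # minimizer lies in the simplex, so f* = 0
f = lambda x: 0.5 * float(np.sum((x - b) ** 2))
grad = lambda x: x - b
rng = np.random.default_rng(0)
zs = rng.dirichlet(np.ones(3), size=6)             # arbitrary query points (assumption)
thetas = np.arange(1, 7)                           # arbitrary positive weights (assumption)
psi, x_k = lower_bound(f, grad, zs, thetas, lo_oracle_simplex)
print(psi)                                         # never exceeds f* = 0
```

The bound holds for any query points by convexity; how tight it is depends on how the points and weights are chosen, which is what the algorithm in the paper controls.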

109 | A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$ - Nesterov - 1983
Citation Context ...rtment of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611 (email: glan@ise.ufl.edu). ... solution of subproblems given in the form of $\operatorname{Argmin}_{x \in X} \langle p, x\rangle$. (1.2) In particular, if $p$ is computed based on first-order information, then we call these algorithms first-order LCP methods. Clearly, the difference between first-order LCP methods and the more general first-order methods lies in the restrictions on the format of the subproblems. For example, in the well-known subgradient (mirror) descent method [25] and Nesterov’s method [26,27], we solve the projection (or prox-mapping) subproblems given in the form of $\operatorname{argmin}_{x \in X} \{\langle p, x\rangle + d(x)\}$. (1.3) Here $d : X \to \mathbb{R}$ is a certain strongly convex function (e.g., $d(x) = \|x\|_2^2/2$). The development of LCP methods dates back to the conditional gradient (CG) method (a.k.a. the Frank-Wolfe algorithm) developed by Frank and Wolfe in 1956 [13]. This method has recently regained interest from both the optimization and machine learning communities (see, e.g., [1–3,7,17–19,23,32,33]) mainly for the following reasons. – Low iteration cost. In many cases, the solution of the linear subproblem (1.2) is... |

86 | Revisiting Frank-Wolfe: Projection-free sparse convex optimization. - Jaggi - 2013
Citation Context ...th $f$ satisfying assumption (1.5). Our goal is to derive a lower bound on the number of iterations required by any LCP method for solving this class of problems. Complexity analysis has been an important topic in convex programming (see Nemirovski and Yudin [25], and Nesterov [27]). However, the study of the complexity of LCP methods is quite limited. Existing results focus on a specific algorithm, namely the classic CG method. More specifically, in 1968, Canon and Cullum [6] proved an asymptotic lower bound of $\Omega(1/k^{1+\mu})$, for any $\mu > 0$, on the rate of convergence of the CG method. Jaggi [18] revisited this algorithm and established a lower bound on the number of iterations performed by this algorithm for finding an approximate solution with a certain sparsity pattern. Similarly to the classic complexity analysis for CP in [25,27], we assume that the LO oracle used in the LCP algorithm is resisting, implying that: i) the LCP algorithm does not know how the solution of (1.2) is computed; and ii) in the worst case, the LO oracle provides the least amount of information for the LCP algorithm to solve problem (1.1). Using this assumption, we will construct a class of worst-case instances i... |

84 | Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm - Clarkson |

67 | An optimal method for stochastic composite optimization. - Lan - 2012
Citation Context ... (CP) models for machine learning, image processing, and polynomial optimization, etc. The CP problems arising from these applications, however, are often of high dimension and hence challenging to solve. In particular, they are generally beyond the capability of second-order interior-point methods due to the highly demanding iteration costs of these optimization techniques. This has motivated the currently active research on first-order methods, which possess cheaper iteration costs for large-scale CP, including Nesterov’s optimal method [26–28] and several stochastic first-order algorithms in [24,21]. These optimization algorithms are relatively simple, and suitable for situations where low or moderate solution accuracy is sought. In this paper, we study a different class of optimization algorithms, referred to as linear-optimization-based convex programming (LCP) methods, for large-scale CP. Specifically, consider the CP problem of $f^* := \min_{x \in X} f(x)$, (1.1) where $X \subseteq \mathbb{R}^n$ is a convex compact set and $f : X \to \mathbb{R}$ is a closed convex function. The LCP methods solve problem (1.1) by iteratively calling a linear optimization (LO) oracle, which, for a given input vector $p \in \mathbb{R}^n$, computes the... |

54 | Proximal minimization methods with generalized Bregman functions - Kiwiel - 1997 |

49 | A simple algorithm for nuclear norm regularized problems. - Jaggi, Sulovský - 2010 |

48 | Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. - Ghadimi, Lan - 2012
Citation Context ...specifically, we assume throughout this subsection that we have access to an enhanced LO oracle, which can solve optimization problems given in the form of $\min \{\langle p, x\rangle : x \in X, \|x\| \le R\}$. (3.33) For example, we can assume that the norm $\|\cdot\|$ is chosen such that problem (3.33) is relatively easy to solve. In particular, if $X$ is a polytope, we can set $\|\cdot\| = \|\cdot\|_\infty$ or $\|\cdot\| = \|\cdot\|_1$ and then the complexity of solving (3.33) will be comparable to that of solving (1.2). Note, however, that such a selection of $\|\cdot\|$ will possibly increase the value of the condition number given by $L/\mu$. Motivated by [15], we present a shrinking CG method under the above assumption on the enhanced LO oracle.¹ Algorithm 3: The Shrinking Conditional Gradient (CG) Method. Let $p_0 \in X$ be given. Set $R_0 = D_X$. for $t = 1, \ldots$ do Set $y_0 = p_{t-1}$. for $k = 1, \ldots, 8L/\mu$ do Call the enhanced LO oracle to compute $x_k \in \operatorname{Argmin}_{x \in X_{t-1}} \langle f'(y_{k-1}), x\rangle$, where $X_{t-1} := \{x \in X : \|x - p_{t-1}\| \le R_{t-1}\}$. Set $y_k = (1 - \alpha_k)y_{k-1} + \alpha_k x_k$ for some $\alpha_k \in [0, 1]$. end for Set $p_t = y_k$ and $R_t = R_{t-1}/\sqrt{2}$; end for. ¹ We recently noticed that Garber and Hazan [14] have made some interesting developments for CG methods applied to strongly convex problems. It... |

47 | Interior gradient and proximal methods for convex and conic optimization - Auslender, Teboulle - 2006 |

44 | Random gradient-free minimization of convex functions. - Nesterov - 2010
Citation Context .... The intuition underlying such an approach is that convolving two functions yields a new function that is at least as smooth as the smoother of the two original functions. In particular, let $\mu$ denote the density of a random variable with respect to Lebesgue measure and consider the function $f_\mu$ given by $f_\mu(x) := (f * \mu)(x) = \int_{\mathbb{R}^n} f(y)\mu(x - y)\,dy = \mathbb{E}_\mu[f(x + Z)]$, where $Z$ is a random variable with density $\mu$. Since $\mu$ is a density with respect to Lebesgue measure, $f_\mu$ is differentiable [5]. The above convolution-based smoothing technique has been extensively studied in stochastic optimization, e.g., [5,11,20,30,31]. For the sake of simplicity, we assume throughout this subsection that $\|\cdot\| = \|\cdot\|_2$ and $Z$ is uniformly distributed over a certain Euclidean ball. The following result is known in the literature (see, e.g., [11]). Lemma 3. Let $\xi$ be uniformly distributed over the $\ell_2$-ball $B_2(0, 1) := \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ and let $u > 0$ be given. Suppose that (1.4) holds for any $x, y \in X + uB_2(0, 1)$. Then the following statements hold for the function $f_u(\cdot)$ given by $f_u(x) := \mathbb{E}_\mu[f(x + u\xi)]$. (3.24) a) $f(x) \le f_u(x) \le f(x) + Mu$; b) $f_u(x)$ has $M\sqrt{n}/u$-Lipschitz continuous gradient with respect to $\|\cdot\|_2$; c) $\mathbb{E}[f'(x + u\xi)] = f$... |

40 | Large-scale convex minimization with a low-rank constraint. arXiv preprint arXiv:1106.1622 - Shalev-Shwartz, Gonen, et al. - 2011 |

39 | Sparse approximate solutions to semidefinite programs. - Hazan - 2008 |

36 | Non-Euclidean restricted memory level method for large-scale convex optimization. - Ben-Tal, Nemirovski - 2005
Citation Context ...he following reasons. – Low iteration cost. In many cases, the linear subproblem (1.2) is much easier to solve than the nonlinear subproblem (1.3). For example, if $X$ is a spectrahedron given by $X = \{x \in \mathbb{R}^{n \times n} : \operatorname{Tr}(x) = 1, x \succeq 0\}$, the solution of (1.2) can be much faster than that of (1.3). – Simplicity. The CG method is simple to implement since it does not require the selection of the distance function $d(x)$ in (1.3) or the fine-tuning of stepsizes, which are required in most other first-order methods (with exceptions to some extent for a few level-type first-order methods, see [4,22]). – Structural properties for the generated solutions. The output solutions of the CG method may have certain desirable structural properties, e.g., sparsity and low rank, as they can often be written as the convex combination of a small number of extreme points of $X$. Numerical studies (e.g., [17]) indicate that the CG method can be competitive with the more involved gradient-type methods for solving certain classes of CP problems. It is also worth noting that the CG method, when applied to linear feasibility problems, is closely related to the von Neumann algorithm studied by Dantzig [8,9]... |

25 | Condition number complexity of an elementary algorithm for computing a reliable solution of a conic linear system. - Epelman, Freund - 2000
Citation Context ... generated solutions. The output solutions of the CG method may have certain desirable structural properties, e.g., sparsity and low rank, as they can often be written as the convex combination of a small number of extreme points of $X$. Numerical studies (e.g., [17]) indicate that the CG method can be competitive with the more involved gradient-type methods for solving certain classes of CP problems. It is also worth noting that the CG method, when applied to linear feasibility problems, is closely related to the von Neumann algorithm studied by Dantzig [8,9], and later in Epelman and Freund [12]. This paper focuses on the complexity analysis of CP under an LO oracle, as well as the development of new LCP methods for large-scale CP. Although there exists a rich complexity theory for general first-order methods for large-scale CP in the literature, the study of the complexity of CP under an LO oracle is still limited. More specifically, in view of the classic CP complexity theory [25,27], if $f$ is a general nonsmooth Lipschitz continuous convex function such that $|f(x) - f(y)| \le M\|x - y\|, \ \forall x, y \in X$, (1.4) then the number of iterations required by any first-order method to find an $\epsilon$-solut... |

20 | Randomized smoothing for stochastic optimization. - Duchi, Bartlett, et al. - 2012
Citation Context ...1.8) after properly smoothing the objective function. Note that, although a similar bound has been developed in [7], the optimality of this bound has not yet been established. In addition, the smoothing technique developed here is slightly different from those in [28,7] as we do not require explicit knowledge of $D_X$, $D_Y$ and the target accuracy given in advance. c) If $f$ is a general nonsmooth function satisfying (1.4), we show that the CG method can achieve a nearly optimal complexity bound in terms of its dependence on $\epsilon$ after properly incorporating the randomized smoothing technique (e.g., [11]). In particular, by applying this method to the bilinear saddle point problems with $f$ given by (1.6), we obtain a first-order algorithm which only requires linear optimization in both the primal and dual spaces to solve this class of problems. It appears to us that no such techniques have been presented before in the literature (see discussions in Section 1 of [29]). d) We also discuss the possibility of improving the complexity of the CG method under a strong convexity assumption about $f(\cdot)$ and with an enhanced LO oracle. Thirdly, we present a few new LCP methods, namely the primal averaging CG (PA-... |

20 | Some comments on Wolfe’s ‘away step’ - Guélat, Marcotte - 1986 |

19 | On the equivalence between herding and conditional gradient algorithms. - Bach, Lacoste-Julien, et al. - 2012 |

19 | Bregman monotone optimization algorithms. - Bauschke, Borwein, et al. - 2003 |

18 | Stochastic optimization problems with nondifferentiable cost functionals. - Bertsekas - 1973
Citation Context ...e basic idea is to approximate the general nonsmooth CP problems $\mathcal{F}^0_{M,\|\cdot\|}(X)$ by using convolution-based smoothing. The intuition underlying such an approach is that convolving two functions yields a new function that is at least as smooth as the smoother of the two original functions. In particular, let $\mu$ denote the density of a random variable with respect to Lebesgue measure and consider the function $f_\mu$ given by $f_\mu(x) := (f * \mu)(x) = \int_{\mathbb{R}^n} f(y)\mu(x - y)\,dy = \mathbb{E}_\mu[f(x + Z)]$, where $Z$ is a random variable with density $\mu$. Since $\mu$ is a density with respect to Lebesgue measure, $f_\mu$ is differentiable [5]. The above convolution-based smoothing technique has been extensively studied in stochastic optimization, e.g., [5,11,20,30,31]. For the sake of simplicity, we assume throughout this subsection that $\|\cdot\| = \|\cdot\|_2$ and $Z$ is uniformly distributed over a certain Euclidean ball. The following result is known in the literature (see, e.g., [11]). Lemma 3. Let $\xi$ be uniformly distributed over the $\ell_2$-ball $B_2(0, 1) := \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ and let $u > 0$ be given. Suppose that (1.4) holds for any $x, y \in X + uB_2(0, 1)$. Then the following statements hold for the function $f_u(\cdot)$ given by $f_u(x) := \mathbb{E}_\mu[f(x + u\xi)]$. (3... |

17 | Approximate Methods in Optimization Problems. - Demyanov, Rubinov - 1970
Citation Context ... section is to establish the optimality or near optimality of the classic CG method and its variants for solving different classes of CP problems under an LO oracle. More specifically, we discuss the classic CG method for solving smooth CP problems $\mathcal{F}^{1,1}_{L,\|\cdot\|}(X)$ in Subsection 3.1, and then present different variants of the CG method to solve nonsmooth CP problems $\mathcal{F}^0_{\|A\|}(X, Y)$ and $\mathcal{F}^0_{M,\|\cdot\|}(X)$, respectively, in Subsections 3.2 and 3.3. Some discussions about strongly convex problems are included in Subsection 3.4. 3.1 Optimal CG methods for $\mathcal{F}^{1,1}_{L,\|\cdot\|}(X)$ under an LO oracle. The classic CG method [13,10] is one of the earliest iterative algorithms to solve problem (1.1). The basic scheme of this algorithm is stated as follows. Algorithm 2: The Classic Conditional Gradient (CG) Method. Let $x_0 \in X$ be given. Set $y_0 = x_0$. for $k = 1, \ldots$ do Call the LO oracle to compute $x_k \in \operatorname{Argmin}_{x \in X} \langle f'(y_{k-1}), x\rangle$. Set $y_k = (1 - \alpha_k)y_{k-1} + \alpha_k x_k$ for some $\alpha_k \in [0, 1]$. end for. We now add a few remarks about the classic CG method. Firstly, it can be easily seen that the classic CG method is a special case of the LCP algorithm discussed in Subsection 2.1. More specifically, the search direction $p_k$ appearing in the gen... |
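
Algorithm 2 translates almost line by line into code. The following is a minimal sketch under stated assumptions, not the paper's implementation: the feasible set is taken to be the probability simplex (so the LO oracle returns a vertex), the objective is a simple quadratic, and the stepsize $\alpha_k = 2/(k+1)$ is a common choice assumed here because the excerpt only requires $\alpha_k \in [0, 1]$. The names `lo_oracle_simplex` and `classic_cg` are hypothetical.

```python
import numpy as np

def lo_oracle_simplex(p):
    """LO oracle for the probability simplex: a vertex minimizing <p, x>."""
    x = np.zeros(p.size)
    x[np.argmin(p)] = 1.0
    return x

def classic_cg(grad, lo_oracle, x0, num_iters):
    """Classic conditional gradient (Frank-Wolfe) iteration, following Algorithm 2."""
    y = np.asarray(x0, dtype=float)           # y_0 = x_0
    for k in range(1, num_iters + 1):
        x = lo_oracle(grad(y))                # x_k in Argmin_{x in X} <f'(y_{k-1}), x>
        alpha = 2.0 / (k + 1)                 # assumed stepsize in [0, 1]
        y = (1.0 - alpha) * y + alpha * x     # convex combination keeps y_k feasible
    return y

b = np.array([0.2, 0.5, 0.3])                 # illustrative target inside the simplex
f = lambda y: 0.5 * float(np.sum((y - b) ** 2))
grad = lambda y: y - b
y_final = classic_cg(grad, lo_oracle_simplex, np.array([1.0, 0.0, 0.0]), 2000)
print(y_final, f(y_final))
```

No projection is ever computed: each iterate is a convex combination of the starting point and simplex vertices, which is why it stays feasible.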

16 | Convergence of a class of random search algorithms. Automation and Remote Control - Katkovnik, Kulchitsky - 1972
Citation Context .... The intuition underlying such an approach is that convolving two functions yields a new function that is at least as smooth as the smoother of the two original functions. In particular, let $\mu$ denote the density of a random variable with respect to Lebesgue measure and consider the function $f_\mu$ given by $f_\mu(x) := (f * \mu)(x) = \int_{\mathbb{R}^n} f(y)\mu(x - y)\,dy = \mathbb{E}_\mu[f(x + Z)]$, where $Z$ is a random variable with density $\mu$. Since $\mu$ is a density with respect to Lebesgue measure, $f_\mu$ is differentiable [5]. The above convolution-based smoothing technique has been extensively studied in stochastic optimization, e.g., [5,11,20,30,31]. For the sake of simplicity, we assume throughout this subsection that $\|\cdot\| = \|\cdot\|_2$ and $Z$ is uniformly distributed over a certain Euclidean ball. The following result is known in the literature (see, e.g., [11]). Lemma 3. Let $\xi$ be uniformly distributed over the $\ell_2$-ball $B_2(0, 1) := \{x \in \mathbb{R}^n : \|x\|_2 \le 1\}$ and let $u > 0$ be given. Suppose that (1.4) holds for any $x, y \in X + uB_2(0, 1)$. Then the following statements hold for the function $f_u(\cdot)$ given by $f_u(x) := \mathbb{E}_\mu[f(x + u\xi)]$. (3.24) a) $f(x) \le f_u(x) \le f(x) + Mu$; b) $f_u(x)$ has $M\sqrt{n}/u$-Lipschitz continuous gradient with respect to $\|\cdot\|_2$; c) $\mathbb{E}[f'(x + u\xi)] = f$... |

15 | Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. - Dunn - 1979 |

15 | Convergence rates for conditional gradient sequences generated by implicit step length rules. - Dunn - 1980 |

14 | Stochastic first- and zeroth-order methods for nonconvex stochastic programming. - Ghadimi, Lan - 2012
Citation Context ...$f^*] \le \epsilon$ can be bounded by $\mathcal{O}(1)\sqrt{n}M^2D_X^2/\epsilon^2$. According to the lower complexity bound in (2.6), we conclude that the above complexity bound is nearly optimal for the following reasons: i) the above result is of the same order of magnitude as (2.6) with an additional factor of $\sqrt{n}$; and ii) the termination criterion is in terms of expectation. Note that while it is possible to show that the relation (3.29) holds with overwhelming probability by developing certain large deviation results associated with (3.29), such a result has been skipped in this paper for the sake of simplicity; see, e.g., [16] for some similar developments. 3.4 CG methods for strongly convex problems under an enhanced LO oracle. In this subsection, we assume that the objective function $f(\cdot)$ in (1.1) is smooth and strongly convex, i.e., in addition to (1.5), it also satisfies $f(y) - f(x) - \langle f'(x), y - x\rangle \ge \frac{\mu}{2}\|y - x\|^2, \ \forall x, y \in X$. (3.32) These problems have been extensively studied in the literature. For example, it has been shown in [26,27] that the optimal complexity for the general first-order methods to solve this class of problems is given by $\mathcal{O}(1)\sqrt{L/\mu}\,\max(\log(\mu D_X/\epsilon), 1)$. On the other hand, as noted in Su... |

14 | Positive semidefinite metric learning using boosting-like algorithms - Shen, Kim, et al. |

12 | Bundle-level type methods uniformly optimal for smooth and non-smooth convex optimization. - Lan - 2013
Citation Context ...he following reasons. – Low iteration cost. In many cases, the linear subproblem (1.2) is much easier to solve than the nonlinear subproblem (1.3). For example, if $X$ is a spectrahedron given by $X = \{x \in \mathbb{R}^{n \times n} : \operatorname{Tr}(x) = 1, x \succeq 0\}$, the solution of (1.2) can be much faster than that of (1.3). – Simplicity. The CG method is simple to implement since it does not require the selection of the distance function $d(x)$ in (1.3) or the fine-tuning of stepsizes, which are required in most other first-order methods (with exceptions to some extent for a few level-type first-order methods, see [4,22]). – Structural properties for the generated solutions. The output solutions of the CG method may have certain desirable structural properties, e.g., sparsity and low rank, as they can often be written as the convex combination of a small number of extreme points of $X$. Numerical studies (e.g., [17]) indicate that the CG method can be competitive with the more involved gradient-type methods for solving certain classes of CP problems. It is also worth noting that the CG method, when applied to linear feasibility problems, is closely related to the von Neumann algorithm studied by Dantzig [8,9]... |

11 | A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. - Canon, Cullum - 1968
Citation Context ... class of smooth CP problems, denoted by $\mathcal{F}^{1,1}_{L,\|\cdot\|}(X)$, which consists of any CP problems given in the form of (1.1) with $f$ satisfying assumption (1.5). Our goal is to derive a lower bound on the number of iterations required by any LCP method for solving this class of problems. Complexity analysis has been an important topic in convex programming (see Nemirovski and Yudin [25], and Nesterov [27]). However, the study of the complexity of LCP methods is quite limited. Existing results focus on a specific algorithm, namely the classic CG method. More specifically, in 1968, Canon and Cullum [6] proved an asymptotic lower bound of $\Omega(1/k^{1+\mu})$, for any $\mu > 0$, on the rate of convergence of the CG method. Jaggi [18] revisited this algorithm and established a lower bound on the number of iterations performed by this algorithm for finding an approximate solution with a certain sparsity pattern. Similarly to the classic complexity analysis for CP in [25,27], we assume that the LO oracle used in the LCP algorithm is resisting, implying that: i) the LCP algorithm does not know how the solution of (1.2) is computed; and ii) in the worst case, the LO oracle provides the least amount of information f... |

11 | A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666 - Garber, Hazan - 2013
Citation Context ...of $\|\cdot\|$ will possibly increase the value of the condition number given by $L/\mu$. Motivated by [15], we present a shrinking CG method under the above assumption on the enhanced LO oracle.¹ Algorithm 3: The Shrinking Conditional Gradient (CG) Method. Let $p_0 \in X$ be given. Set $R_0 = D_X$. for $t = 1, \ldots$ do Set $y_0 = p_{t-1}$. for $k = 1, \ldots, 8L/\mu$ do Call the enhanced LO oracle to compute $x_k \in \operatorname{Argmin}_{x \in X_{t-1}} \langle f'(y_{k-1}), x\rangle$, where $X_{t-1} := \{x \in X : \|x - p_{t-1}\| \le R_{t-1}\}$. Set $y_k = (1 - \alpha_k)y_{k-1} + \alpha_k x_k$ for some $\alpha_k \in [0, 1]$. end for Set $p_t = y_k$ and $R_t = R_{t-1}/\sqrt{2}$; end for. ¹ We recently noticed that Garber and Hazan [14] have made some interesting developments for CG methods applied to strongly convex problems. It should be noted, however, that the algorithm and analysis given here seem to be different from those in [14]. Note that an outer (resp., inner) iteration of the above shrinking CG method occurs whenever $t$ (resp., $k$) increases by 1. Observe also that the feasible set $X_t$ will be reduced at every outer iteration $t$. The following result summarizes the convergence properties of this algorithm. Theorem 6. Suppose that conditions (1.5) and (3.32) hold. If the stepsizes $\{\alpha_k\}$ in the shrinking CG method are s... |

11 | Positive semidefinite metric learning using boosting-like algorithms. - Shen, Kim, et al. - 2012 |

10 | A conditional gradient method with linear rate of convergence for solving convex linear systems - Beck, Teboulle |

9 | An $\epsilon$-precise feasible solution to a linear program with a convexity constraint in $1/\epsilon^2$ iterations independent of problem size. - Dantzig - 1992
Citation Context ...4,22]). – Structural properties for the generated solutions. The output solutions of the CG method may have certain desirable structural properties, e.g., sparsity and low rank, as they can often be written as the convex combination of a small number of extreme points of $X$. Numerical studies (e.g., [17]) indicate that the CG method can be competitive with the more involved gradient-type methods for solving certain classes of CP problems. It is also worth noting that the CG method, when applied to linear feasibility problems, is closely related to the von Neumann algorithm studied by Dantzig [8,9], and later in Epelman and Freund [12]. This paper focuses on the complexity analysis of CP under an LO oracle, as well as the development of new LCP methods for large-scale CP. Although there exists a rich complexity theory for general first-order methods for large-scale CP in the literature, the study of the complexity of CP under an LO oracle is still limited. More specifically, in view of the classic CP complexity theory [25,27], if $f$ is a general nonsmooth Lipschitz continuous convex function such that $|f(x) - f(y)| \le M\|x - y\|, \ \forall x, y \in X$, (1.4) then the number of iterations required by any ... |

8 | Conditional gradient algorithms for machine learning. NIPS OPT workshop - Harchaoui, Juditsky, et al. - 2012 |

8 | Conditional gradient algorithms for rank one matrix approximations with a sparsity constraint. - Luss, Teboulle - 2013 |

8 | Barrier subgradient method. - Nesterov - 2008
Citation Context .... c) If $f$ is a general nonsmooth function satisfying (1.4), we show that the CG method can achieve a nearly optimal complexity bound in terms of its dependence on $\epsilon$ after properly incorporating the randomized smoothing technique (e.g., [11]). In particular, by applying this method to the bilinear saddle point problems with $f$ given by (1.6), we obtain a first-order algorithm which only requires linear optimization in both the primal and dual spaces to solve this class of problems. It appears to us that no such techniques have been presented before in the literature (see discussions in Section 1 of [29]). d) We also discuss the possibility of improving the complexity of the CG method under a strong convexity assumption about $f(\cdot)$ and with an enhanced LO oracle. Thirdly, we present a few new LCP methods, namely the primal averaging CG (PA-CG) and primal-dual averaging CG (PDA-CG) algorithms, for solving large-scale CP problems under an LO oracle. These methods are obtained by replacing the projection subproblems with linear optimization subproblems in Nesterov’s accelerated gradient methods. We demonstrate that these new LCP methods not only exhibit the aforementioned optimal (or nearly optimal) ... |

7 | Dual subgradient algorithms for large-scale nonsmooth learning problems.
- Cox, Juditsky, et al.
- 2013
Citation Context ...so well-known that for general first-order methods, one can employ a non-Euclidean norm ‖ · ‖ and the distance function d(x) in (1.3) to accelerate the solution of CP problems with certain types of feasible sets X. However, we demonstrate that the CG method is invariant to the selection of ‖ · ‖ and thus self-adaptive to the geometry of the feasible region X. b) If f is a special nonsmooth function given by (1.6), we show that the CG method can achieve the lower complexity bound in (1.8) after properly smoothing the objective function. Note that, although a similar bound has been developed in [7], the optimality of this bound has not yet been established. In addition, the smoothing technique developed here is slightly different from those in [28,7], as we do not require explicit knowledge of $D_X$, $D_Y$ and the target accuracy ε given in advance. c) If f is a general nonsmooth function satisfying (1.4), we show that the CG method can achieve a nearly optimal complexity bound in terms of its dependence on ε after properly incorporating the randomized smoothing technique (e.g., [11]). In particular, by applying this method to the bilinear saddle point problems with f given by (1.6), we obtai... |

5 |
Converting a converging algorithm into a polynomially bounded algorithm.
- Dantzig
- 1991
Citation Context ...4,22]). – Structural properties for the generated solutions. The output solutions of the CG method may have certain desirable structural properties, e.g., sparsity and low rank, as they can often be written as a convex combination of a small number of extreme points of X. Numerical studies (e.g., [17]) indicate that the CG method can be competitive with more involved gradient-type methods for solving certain classes of CP problems. It is also worth noting that the CG method, when applied to linear feasibility problems, is closely related to the von Neumann algorithm studied by Dantzig [8,9] and later by Epelman and Freund [12]. This paper focuses on the complexity analysis of CP under an LO oracle, as well as the development of new LCP methods for large-scale CP. Although a rich complexity theory exists for general first-order methods for large-scale CP, the study of the complexity of CP under an LO oracle is still limited. More specifically, in view of the classic CP complexity theory [25,27], if f is a general nonsmooth Lipschitz continuous convex function such that $|f(x) - f(y)| \le M\|x - y\|, \ \forall x, y \in X$, (1.4) then the number of iterations required by any ... |

4 | New Analysis and Results for the Frank-Wolfe Method. ArXiv e-prints, - Freund, Grigas - 2013 |

4 | Sparse Convex Optimization Methods for - Jaggi - 2011 |

4 | Numerical methods in extremal problems - Pshenichnyi, Danilin - 1978 |

3 |
A modified Frank-Wolfe algorithm for computing minimum-area enclosing ellipsoidal cylinders: Theory and algorithms.
- Ahipasaoglu, Todd
- 2013
Citation Context ...nsmooth CP problems $F^{0}_{\|A\|}(X, Y)$ and $F^{0}_{M,\|\cdot\|}(X)$, respectively, in Subsections 3.2 and 3.3. Some discussions about strongly convex problems are included in Subsection 3.4. 3.1 Optimal CG methods for $F^{1,1}_{L,\|\cdot\|}(X)$ under an LO oracle. The classic CG method [13,10] is one of the earliest iterative algorithms to solve problem (1.1). The basic scheme of this algorithm is stated as follows. Algorithm 2 (The Classic Conditional Gradient (CG) Method): Let $x_0 \in X$ be given. Set $y_0 = x_0$. For $k = 1, 2, \ldots$: call the LO oracle to compute $x_k \in \operatorname{Argmin}_{x \in X} \langle f'(y_{k-1}), x \rangle$; set $y_k = (1 - \alpha_k) y_{k-1} + \alpha_k x_k$ for some $\alpha_k \in [0, 1]$. We now add a few remarks about the classic CG method. Firstly, it can be easily seen that the classic CG method is a special case of the LCP algorithm discussed in Subsection 2.1. More specifically, the search direction $p_k$ appearing in the generic LCP algorithm is simply set to the gradient $f'(y_{k-1})$ in Algorithm 2, and the output $y_k$ is taken as a convex combination of $y_{k-1}$ and $x_k$. Secondly, in order to guarantee the convergence of the classic CG method, we need to properly specify the stepsizes $\alpha_k$ used in the definition of $y_k$. There are two popular options for selecting $\alpha_k$: one is... |
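The classic CG scheme quoted in the context above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: the feasible set X is assumed to be the probability simplex (so the LO oracle simply returns a vertex), the objective is an assumed toy quadratic f(y) = 0.5‖y − b‖², and the stepsize α_k = 2/(k + 1) is one commonly used choice picked here for illustration, since the quoted context is truncated before it names its stepsize options. The names `lo_oracle_simplex`, `classic_cg`, `f`, `grad_f`, and `b` are hypothetical.

```python
def lo_oracle_simplex(g):
    """LO oracle for the probability simplex (assumed feasible set X):
    returns a vertex e_i that minimizes <g, x> over X."""
    i = min(range(len(g)), key=lambda j: g[j])
    x = [0.0] * len(g)
    x[i] = 1.0
    return x


def classic_cg(grad, lo_oracle, x0, num_iters):
    """Classic conditional gradient loop: y_k = (1 - a_k) y_{k-1} + a_k x_k."""
    y = list(x0)  # y_0 = x_0
    for k in range(1, num_iters + 1):
        x = lo_oracle(grad(y))      # x_k in Argmin_{x in X} <f'(y_{k-1}), x>
        alpha = 2.0 / (k + 1)       # assumed stepsize, always in [0, 1]
        y = [(1.0 - alpha) * yi + alpha * xi for yi, xi in zip(y, x)]
    return y


# Toy smooth objective (an assumption for illustration): f(y) = 0.5 * ||y - b||^2.
b = [0.1, 0.6, 0.3]


def f(y):
    return 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, b))


def grad_f(y):
    return [yi - bi for yi, bi in zip(y, b)]


y = classic_cg(grad_f, lo_oracle_simplex, [1.0, 0.0, 0.0], 200)
print(y, f(y))
```

Every iterate is a convex combination of simplex vertices, so it stays feasible without any projection step, which is the property the LO-oracle setting relies on.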

2 | An optimal affine invariant smooth minimization algorithm. arXiv preprint arXiv:1301.0465 - d’Aspremont, Jaggi - 2013 |

1 | Rounding of polytopes in the real number model of computation - Khachian - 1996 |