NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 5518: A unified variance-reduced accelerated gradient method for convex optimization

### Reviewer 1

This paper proposes Varag, a novel randomized incremental gradient algorithm for minimizing convex problems with a finite-sum structure. The algorithm efficiently incorporates the variance reduction technique to achieve the optimal convergence rate for both convex and strongly convex problems. The algorithm is at once simple and rich, and admits many interesting features. For instance, it admits a non-uniform sampling strategy to handle the case where the Lipschitz constants are unbalanced, and it can handle the non-Euclidean case via Bregman distances. Combining all these components in a relatively simple algorithm is a great contribution. Moreover, extensions to the stochastic setting and to error bound conditions are also included. Experiments on logistic regression and Lasso show that the algorithm is competitive with existing accelerated algorithms.

Overall, I am very positive about the paper, though I have some minor concerns. One of the key features of the proposed algorithm is a phase transition in the parameters: in particular, the parameters are more "aggressive" in the early iterations. This leads to the very surprising result that the algorithm converges linearly in the early iterations even in the convex but non-strongly-convex setting. More precisely, the linear convergence holds roughly until all the individual functions have been visited once, and the convergence parameter in this phase depends only on m, the number of individual functions. Could the authors provide more intuition or explanation of why a linear convergence rate is achievable in the early stages? Furthermore, in the early stages the number of inner-loop iterations T_s increases exponentially, which is quite similar to the stochastic minimization setting where the sample size is increased geometrically. Could the authors provide some motivation for this choice?
In the experiments, is the non-uniform sampling strategy used? How does it compare to uniform sampling?

EDIT after author feedback: I thank the authors for the clarification; I believe the paper deserves to be published.
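For concreteness, the two mechanisms the review asks about — a variance-reduced gradient estimator and geometrically growing inner-loop lengths — can be sketched as follows. This is an illustrative SVRG-style loop under assumed parameter choices (the doubling schedule `T_s = T0 * 2**s`, the step size, and the function `svrg_style_run` are hypothetical), not the paper's exact Varag update.

```python
import numpy as np

def svrg_style_run(grad_i, x0, n, lr=0.1, num_epochs=6, T0=2, rng=None):
    """Illustrative variance-reduced loop with geometric epoch growth.

    grad_i(x, i) returns the gradient of the i-th component function at x.
    Epoch s runs T_s = T0 * 2**s inner steps, mirroring the exponentially
    increasing inner-loop lengths discussed in the review.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for s in range(num_epochs):
        x_snap = x.copy()  # snapshot point for this epoch
        full_grad = np.mean([grad_i(x_snap, i) for i in range(n)], axis=0)
        T_s = T0 * 2 ** s  # geometrically increasing inner-loop length
        for _ in range(T_s):
            i = rng.integers(n)
            # Variance-reduced estimator: unbiased, and its variance shrinks
            # as the iterate approaches the snapshot point.
            g = grad_i(x, i) - grad_i(x_snap, i) + full_grad
            x -= lr * g
    return x

# Toy usage: minimize (1/n) * sum_i 0.5 * (x - a_i)^2, whose minimizer
# is the mean of the a_i.
a = np.linspace(-1.0, 1.0, 5)
x_star = svrg_style_run(lambda x, i: x - a[i], np.array([5.0]), n=5)
```

On this quadratic toy problem the estimator is exact (the component-gradient differences cancel), so the iterate contracts geometrically toward the mean of the `a_i`, which loosely echoes the early-phase linear convergence the reviewer finds surprising.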

### Reviewer 2

Originality. The problem of finite-sum minimization is very popular and has attracted substantial attention in recent years. Despite the number of successful works in this field, results on methods that are both accelerated and variance-reduced are still under investigation and require additional research. The authors propose a new accelerated stochastic gradient method with variance reduction, together with its global complexity analysis. The method matches the best currently known convergence bounds and has a number of nice features, among others:
* tackling both convex and strongly convex problems in a unified way;
* using only one composite (proximal) step per iteration;
* working with general Bregman distances;
* being relatively easy to implement and analyze.
To the best of my knowledge, no accelerated variance-reduced method to date has all of these advantages simultaneously.

Quality. The paper contains a number of theoretical contributions with proofs. Two algorithms (Varag and Stochastic Varag) are presented in a clear and self-contained way. A number of numerical experiments on logistic regression against other state-of-the-art optimization methods are provided.

Clarity. The results are written in a clear and well-organized way. A small concern from my side would be to make the introduction more structured. In particular, more concrete examples of cases where we cannot exploit the strong convexity of $f$ (but still know the constant) would be interesting. It would also be good to add more discussion and examples related to "Stochastic finite-sum optimization" (Section 2.2).

Significance. The results of this work seem important and significant to the area of stochastic variance-reduced accelerated methods.
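One of the features singled out above, working with general Bregman distances, can be illustrated with a minimal sketch. The example below is a standard mirror-descent-style proximal step under the negative-entropy Bregman distance on the probability simplex; the function name `bregman_prox_step` is hypothetical and this is not the paper's exact update rule.

```python
import numpy as np

def bregman_prox_step(x, g, lr):
    """One proximal step with the KL (negative-entropy) Bregman distance.

    Solves  argmin_{y in simplex}  <g, y> + (1/lr) * KL(y, x),
    whose closed-form solution is the exponentiated-gradient update
    followed by renormalization onto the simplex.
    """
    y = x * np.exp(-lr * g)
    return y / y.sum()

# Toy usage: starting from the uniform distribution, a gradient that
# penalizes the first coordinate shifts mass away from it.
x = np.ones(3) / 3
y = bregman_prox_step(x, np.array([1.0, 0.0, 0.0]), lr=1.0)
```

The attraction of the Bregman formulation, as the review notes, is that a single prox computation of this kind replaces a Euclidean projection that may be more expensive or less well matched to the problem geometry.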