NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 776, "Stochastic Composite Mirror Descent: Optimal Bounds with High Probabilities"

### Reviewer 1

This paper studied stochastic composite mirror descent, for both convex and strongly convex objectives, and established high-probability bounds. The results are stronger than existing ones in several respects: smoothness assumptions are removed or the rates are improved, the bounded-subgradient assumption is dropped, and the bounds hold with high probability rather than only in expectation. The authors also studied how the estimation error scales with the number of SGD iterations. The authors appear to claim that the introduction of Assumption 1 is novel (lines 113-117), and there are some interesting martingale arguments. The results are clearly stated. The reviewer finds this paper non-trivial, but it is not clear whether the contributions are significant. In particular, the rates in this paper have essentially been obtained in the past, and it is the reviewer's understanding that a more careful martingale analysis together with Assumption 1 weakens the previous assumptions. The reviewer thinks the paper has some incremental value.

### Reviewer 2

* Summary: In this paper, the authors studied the optimization and generalization of the stochastic composite mirror descent (SCMD) method for machine learning problems. In the convergence analysis, a common assumption, namely the boundedness of the gradient vector, was not required. The generalization ability of the kernel-based SGD method was then established. The generalization bound revealed that the estimation errors in the generalization error never essentially dominate the approximation and computation errors; as a result, one can run SGD for a sufficient number of iterations with little overfitting if the step sizes are square-summable. Some numerical experiments were conducted to confirm these theoretical findings.
* Comments: The authors stated their motivation clearly, and they successfully removed the common boundedness assumption on the gradient vector from the theoretical analysis of SGD. Some related works are discussed in detail, and the numerical experiments agree with the theoretical findings.
  - It is not clear whether removing the bounded-gradient assumption is the key to obtaining the assertion in line 309. How does the following finding in line 309 relate to the bounded-gradient assumption?
    > the estimation errors in the generalization error will never essentially dominate the approximation and computation errors and one can run SGD with a sufficient number of iterations with little overfitting if step sizes are square-summable.
  - Line 194: Is Psi(w) equal to 1/2 |w|^2?
  - Assumption 3: Showing a typical value of alpha for several problem setups would be helpful for readers.
  - Should the kernel function be universal in order to satisfy Assumption 3? (Universal kernels are defined in [Steinwart and Christmann, Support Vector Machines, 2008].) A supplementary comment on the relation between Assumption 3 and properties of the kernel function would be nice.
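Regarding the reviewer's question about Psi(w) in line 194: when the mirror map is the squared Euclidean norm, Psi(w) = 1/2 ||w||^2, the SCMD update reduces to proximal SGD, i.e. a stochastic gradient step on the loss followed by the proximal operator of the penalty. A minimal sketch of this standard reduction (not the authors' code; the l1 penalty, least-squares loss, and step sizes are illustrative assumptions):

```python
import numpy as np

def scmd_step(w, grad, step, lam):
    """One SCMD step with Psi(w) = 1/2 ||w||^2 and penalty r(w) = lam * ||w||_1.

    With the squared-Euclidean mirror map, the mirror-descent update
    reduces to proximal SGD: a gradient step on the loss part followed
    by the soft-thresholding proximal operator of the l1 penalty.
    """
    v = w - step * grad                                           # gradient step
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)   # prox of l1

# Toy run: least-squares loss with a small l1 penalty (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(200)

w = np.zeros(5)
for t in range(1, 2001):
    i = rng.integers(200)
    g = (X[i] @ w - y[i]) * X[i]               # stochastic gradient of 1/2 (x_i . w - y_i)^2
    w = scmd_step(w, g, step=1.0 / t, lam=0.01)  # square-summable steps eta_t = 1/t
```

The step-size choice eta_t = 1/t is square-summable but not summable, matching the regime the review highlights for the little-overfitting claim.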

### Reviewer 3

The paper focused on stochastic composite mirror descent (SCMD) algorithms, which update model parameters sequentially with cheap per-iteration computation, making them amenable to large-scale streaming data analysis. The setting is quite general, covering general convex objectives and penalty terms that may be merely convex or strongly convex. The main contribution of the paper is almost optimal high-probability rates that match the minimax lower bounds for stochastic first-order optimization algorithms. The techniques, summarized in Lemma 1 and Theorem 3, are quite delicate and novel; another technique is a weighted summation which originates from [29]. In summary, the paper presents novel theoretical convergence rates for SCMD that are optimal up to a logarithmic term. These comprehensive results hold with high probability and are a non-trivial extension and refinement of the existing literature. The paper is well presented, and Section 5 on related work summarizes the main contributions well through a detailed comparison with existing work.

Minor comments:
1. The discussion of Theorem 4 and Corollary 5 in lines 136-140 should be moved after the statements of Theorem 4 and Corollary 5.
2. In line 198, Assumption 3 is mentioned before it is introduced.
3. Personally, I do not see the main purpose of Section 6 (simulations); the main meat of the paper is the novel theory, which is quite satisfactory to me.

I read the authors' feedback and am satisfied with their response.
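The "cheap per-iteration computation" the review highlights can be seen in a concrete mirror-descent instance: with the negative-entropy mirror map on the probability simplex, each step is a multiplicative O(d) update (exponentiated gradient). A small sketch under illustrative assumptions (the linear loss and step sizes are not from the paper):

```python
import numpy as np

def eg_step(w, grad, step):
    """One mirror-descent step with the negative-entropy mirror map
    Psi(w) = sum_j w_j log w_j over the probability simplex.

    The resulting update is multiplicative (exponentiated gradient),
    so each iteration costs only O(d): the cheap per-iteration
    property that makes such methods suited to streaming data.
    """
    v = w * np.exp(-step * grad)
    return v / v.sum()          # renormalize back onto the simplex

# Illustrative run: minimize the linear loss <c, w> over the simplex;
# the minimizer concentrates on the coordinate with the smallest cost.
c = np.array([0.3, 0.1, 0.9])
w = np.full(3, 1.0 / 3.0)
for t in range(1, 501):
    w = eg_step(w, c, step=1.0 / np.sqrt(t))
```

With the squared-Euclidean mirror map instead, the same scheme specializes to (proximal) SGD, which is the other instance the reviews discuss.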