NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4202 Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond

Reviewer 1

The full paper is much longer and compressing it into an 8-page NeurIPS version makes for a terse read; I would advise the authors to summarize certain results so as to make the shorter version easier to read. 1. The discretization in (13) and (14) are given out of the blue. A derivation starting from (6) will be helpful. 2. Line 177: Theorem 1 should say “has _uniform_ local deviation orders”? 3. Section 4.2 is scant on details. It was not clear to me upon a first reading what the non-convex potential is, especially since Section 4.1 uses overdamped Langevin equation with constant diffusion for a strongly-convex potential. 4. Can you plot the dependence of W-2 with dimension in Figure 1? This will help ascertain the improved convergence rate numerically. Similarly, you should use the number of gradient evaluations as the X-axis in Fig. 1b. The SRK scheme uses three gradient evaluations whereas the EM scheme uses one. 5. How do you compute the MSE in Fig. 1c? Why does it increase with the number of iterations?

Reviewer 2

Thanks to the authors for responding to my questions. I am content with the submission and have kept my score at 8. ========= This work proposes a Stochastic Runge Kutta discretization of the overdamped Langevin Monte Carlo chain and establishes a d/eps^{2/3} mixing rate (upto eps accuracy in Wasserstein distance) for d-dimensional strongly log concave distributions which is 1/eps^{1/3} faster than the previously best known rate of order d/eps for the unadjusted Langevin algorithm. They also establish speed-ups for low dimensional non-convex cases when the Hessian operator norm grows as c+sqrt{||x||}. In fact their result in Theorem 1 is general enough to apply to general discretization schemes that satisfy certain uniform (in iteration) local deviation order. The authors note that establishing this condition uniformly is non-trivial in general and in their case, they proceed by bounding the Markov chain moments carefully. Overall, the paper is pretty well written with clearly stated results, contributions, and highlights about the key ideas. In their numerical experiments, they use same step size for ULA and SRK and show that SRK-LD converges to a smaller asymptotic bias (which is consistent with their results). I have three remarks: — If would be nice if the authors could clearly point out the condition number dependency in their theorem 2. — What are some other possible ways to establish a uniform deviation condition? — Do we expect any benefit with such higher-order discretization? (it would have to be in dimension dependency since the mixing rates already depend only logarithmically in epsilon, e.g., Theorem 2 of https://arxiv.org/pdf/1905.12247v1.pdf).

Reviewer 3

After Rebuttal: Thank you for the responses. I that believe the paper will be even stronger with the inclusion of the stochastic gradient-variant. ########################################################### Originality: Given an integrator with some local deviation orders, the paper provides a generic theorem to obtain finite-time Wasserstein bounds. This is a very valuable theorem, which will be useful for other theoreticians working in this field. On the other hand, to the best of my knowledge, this is the first paper that uses a stochastic Runge-Kutta integrator for sampling from strongly log-concave densities with explicit guarantees. The authors further show that their proposed numerical scheme improves upon the existing guarantees when applied to the overdamped Langevin dynamics. Quality: On the theoretical side, I have gone through the proofs; however, I didn’t go over them line by line. The proofs are overall clearly written and the roadmap is clear. On the experimental side, I think the experiments are sufficient since the main contributions of this paper are theoretical. However, I must admit that I don’t believe that the experimental results are charming so that a practitioner would decide to use this integrator in their applications. In this line of research, I think the authors should mention the following paper: “Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo”, Durmus et al, Neurips 2016, where the authors aim at obtaining a higher-order integration scheme for the overdamped Langevin equation. Clarity: The paper is very-well written and organized. I enjoyed reading it. Significance: Sampling methods based on diffusions have been a major subject in large-scale Bayesian machine learning and computational statistics. In this context the paper is very well-placed and is of great interest to the community. I don’t believe that the paper proposes a practical algorithm that could replace the existing simpler methods, but it has strong theoretical contributions, which should be sufficient for this conference proceeding. A small criticism in terms of significance could be the following. What makes diffusion-based sampling techniques interesting to the ML community is their ability to scale up to large problems since they enable the possibility of replacing full gradients with stochastic gradients. In that respect, an algorithm which requires full gradients is not very interesting to a large portion of the ML community. In the current version of this paper, it is not clear at all what would happen if we apply the same ideas to SGLD. To make the algorithm more appealing for the ML community, I suggest the authors to at least add a discussion for the case where the full gradients are replaced with stochastic ones.