NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2109 FALKON: An Optimal Large Scale Kernel Method

### Reviewer 1

Comments: - line 19. "... the a variety of ..." -> "... a variety of ..." - line 25. "... show that there are large class..." -> "... show that there is a large class..." - line 113. "... that in this case t=sqrt(n)log(n) are needed to achieve..." I guess that the word "iterations" is missing? The same concern is with line 114 - Table 1. The last but one line in the table. There is a problem with a reference for "Nystrom + sgd" - line 195. "Intuitevely, these conditions measure the size of H, if gamma is small then H is small rates are faster, and the regularity of f, if r is big f is regular and rates are faster." -> "Intuitevely, these conditions measure the size of H (if gamma is small then H is small and rates are faster) and the regularity of f (if r is big f is regular and rates are faster)."? - line 204. There should be a space between "O(1/n)." and "By selecting" Conclusion: 1) In FALCON we need to use the number of inducing points M ~ sqrt(n). Thus in general we are limited in the number of inducing points we can use if n >> 1. Gaussian Process regression is a probabilistic counterpart of the kernel ridge regression and provides the same regression function, as well as uncertainty quantification, not accessible in case of KRR. Sparse Gaussian Processes (in fact this is the Nystrom approximation), are used to process large samples. Recently, they propose to approximate not the GP model (using Nystrom approximation), but its posterior by optimizing evidence lower bound (ELBO), see https://www.prowler.io/sparse-gps-approximate-the-posterior-not-the-model/ Thus, by considering inducing points as latent variables, we can optimize ELBO to tune their values. By restricting the family of variational distributions and limiting inducing points to belong to some tensor (see applications of tensor methods for GP e.g. in "A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured gaussian processes // In International Conference on Machine Learning, pages 1775–1784, 2015", "E. Burnaev, M. Belyaev, E. Kapushev. Computationally efficient algorithm for Gaussian Processes based regression in case of structured samples // Computational Mathematics and Mathematical Physics, 2016, Vol. 56, No. 4, pp. 499–513, 2016") we can use tensor algebra to efficiently perform SGD optimization of ELBO even in case when M ~ n and so we can better approximate the posterior. Could the authors comment on this approach to kernel regression learning? What connections does ELBO optimization by SGD wrt inducing points may have with the proposed learning scheme? 2) The topic of the paper is important. The paper is well written and contains not only efficient algorithm, but also useful theoretical results. Do the authors have available programming code with results of experiments? It could be useful to specify a link to github with this code.

### Reviewer 2

The authors introduce a general framework for solving Kernel Ridge Regression problems. They demonstrate and prove that their approach has probably one of the lowest time and memory complexity. The efficiently reduced complexity is mainly reached by combining several well known methods, e.g. the Nyström approximation and a variant of the preconditioned conjugate gradient method. The exploitation of some sophisticated technical tricks also yield reasonable contribution. The presented method has an extensive theoretical foundation and encouraging results on reasonable large data set are provided as well. I have some concerns on the results compared to other methods in Table 2 and Table 3. It is not clear how the computational environments of the different methods were compared. What was the source of the computational times of some alternatives; do they come from the references i.e. they are not directly computed on the same configuration? The comparison could be more convincing if they are obtained at the same conditions. Another remark relates to the conditions exploited in Theorem 1, since all other theorems depend and refer to them. What is assumed about the family of the distribution appearing in Expression (1), e.g. bounded support? It should be explicitly stated there.

### Reviewer 3

Summary: A number of approaches developed in recent years are combined to develop a new fast kernel ridge regressor that can be computed in O(n^1.5) and which achieves state-of-the-art rates of convergence in certain situations. Review: The paper is well written and has good experimental results. The downside of this paper is that it is weak on novelty and it is heavily based on “A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational, NIPS 2015”. In the first 2-3 pages there is quite some overlap between these two papers: preliminaries overlap quite a bit, sentences appear to be at times slightly perturbed versions of the NIPS 2015 paper, e.g. “Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.” => “Extensive experiments show that state of the art results on available large scale datasets can be achieved even on a single machine.” etc. Preliminaries do not need to be rewritten every time and it is fair enough to reuse parts. Similarly with parts of the introduction, but the amount of similarity suggests strongly that this is a somewhat incremental paper. The main new idea seems to be that random projections are used to speed up preconditioning of the regularized kernel matrix. The paper also develops theory around this idea, showing how you can gain speedups while preserving optimal rates of convergence. I did not go through the supplementary and it is hard for me to judge how much novelty is in the proofs. I imagine that on a high level you have to deal with a number of triangular inequalities: relating the new approximation to Nystrom, Nystrom to KRR, KRR to the true solution and that the main work lies in bounding the distance between the new approximation and Nystrom. In total, I think it is an incremental but valuable paper and it contains just enough material for acceptance. Further comments: Abstract: -“They rely on … and enjoy optimal statistical properties.”  This sentence doesn’t make sense to me. A particular method can be optimal in a particular statistical setting but kernel methods are not per se optimal. P.2 — The KRR runtime seems too pessimistic — I assume the O(n^3) is because of the kernel inverse. Inverting the kernel matrix can apparently done in less than O(n^2.4). This should also reduce the computation time for D&C to below n^1.7 (I assume you use m=n^1/2 many samples) — vs n^1.5 for your method.