NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 514 Posterior Concentration for Sparse Deep Learning

### Reviewer 1

The paper presents a few results on the concentration rates of the posterior distribution of the parameters of ReLU networks, along the lines of those already available in the non-Bayesian setting. My overall score reflects the fact that, in my view, the organisation of the paper is not clear. The new results are only mentioned at the end of the paper, with few comments, and the prior distributions are chosen without cross-reference to any known rule. The introduction presents a description of the literature without a presentation of the problem, which is given only in Section 2 (i.e. after the introduction). The authors would like to present the results as applicable to a broader range of deep learning techniques, but they do not show why the results generalise to cases other than ReLU networks. Regarding clarity, many acronyms and mathematical symbols are never defined, which makes the paper difficult to read for someone who is not familiar with the subject: since the paper sits between deep learning and Bayesian nonparametrics, an interested reader may be unfamiliar with part of the literature or notation. This point is particularly important given my previous comment that the definitions are not presented in a linear order.

### Reviewer 2

Summary: This paper significantly extends known results on the consistency and rate-optimality of posteriors associated with neural networks, focusing in particular on currently relevant deep architectures with ReLUs as opposed to sigmoidal units. By imposing the classical spike-and-slab prior over models, a fully Bayesian analysis is enabled, whereby the model-averaged posterior is shown to converge to a neighborhood of the true regression function at a minimax-optimal rate.

The analysis first focuses on deep ReLU networks with a uniform number of hidden units per layer, where the sparsity, network parameters, and outputs are bounded. For a fixed number of layers, number of units, and sparsity, the first key result states that the posterior mass outside an \epsilon_n M_n neighborhood of f_0 vanishes in probability, where \epsilon_n is (~) the minimax rate for an \alpha-smooth function. The specific statement under the architectural and sparsity assumptions appears in Theorem 5.1, which is proved by verifying conditions (11-13) of Ghosal and van der Vaart. This is in turn done by invoking Lemma 5.1 (Schmidt-Hieber), which shows the existence of a deep ReLU network approximating an \alpha-smooth function with bounded L_\infty norm.

Unfortunately, this result prescribes setting the sparsity (s) and network size (N) as a function of \alpha, which is usually unknown in practice. To circumvent this, a hierarchical Bayes procedure is devised by placing a Poisson prior on the network size and an exponential prior over the sparsity. Using an approximating sieve, the posterior under this hierarchical prior is shown to attain the same rate as when the smoothness is known; furthermore, the posterior probability that both the optimal network size and sparsity are exceeded goes to zero in probability.
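
To make the role of \alpha in the rate concrete, here is a minimal sketch of the standard minimax rate n^{-\alpha/(2\alpha+p)} for estimating an \alpha-smooth regression function of p inputs. The function name `minimax_rate`, the input dimension `p`, and the optional logarithmic factor `log_power` are illustrative assumptions, not taken from the paper; the "(~)" above suggests the paper's \epsilon_n matches this rate only up to such a factor.

```python
import math


def minimax_rate(n: int, alpha: float, p: int, log_power: float = 0.0) -> float:
    """Return n^(-alpha / (2*alpha + p)) * (log n)^log_power.

    n         : sample size
    alpha     : smoothness of the true regression function
    p         : input dimension (assumed name, not from the review)
    log_power : exponent of an optional log factor (assumption; 0 disables it)
    """
    if n < 2 or alpha <= 0 or p < 1:
        raise ValueError("need n >= 2, alpha > 0 and p >= 1")
    polynomial_part = n ** (-alpha / (2.0 * alpha + p))
    return polynomial_part * math.log(n) ** log_power


# Smoother functions (larger alpha) give a faster rate at the same n and p.
for alpha in (0.5, 1.0, 2.0, 4.0):
    print(f"alpha={alpha}: rate={minimax_rate(10_000, alpha, p=2):.4f}")
```

Because the exponent depends on \alpha, any fixed choice of s and N tuned to one smoothness level is mismatched for another, which is the motivation for the hierarchical prior described above.
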
This is a high-quality theoretical paper, which I believe also did a very nice job of practically motivating its theoretical contributions. Theoretical investigation into the behavior of deep networks is proceeding along many fronts, and this work provides valuable insights and extensions to the works mentioned above.

Clarity: The paper is very well-written and clear. My background is not in the approximation theory of neural networks, but with my working understanding of Bayesian decision theory I had few problems following the work and its line of reasoning.

Significance: As mentioned, the paper is timely and significant.

Technical Correctness: I did not painstakingly verify each detail of the proofs, but the results are technically correct as far as I could tell.

Minor comments:

- This is obviously a theoretical paper, but it may be useful to say a few things about how much of a role the function smoothness plays in some practical problems.
- Just to be clear: the *joint* parameter prior is not explicitly assumed to factorize, only that its marginals are spike-and-slab (equation 8) and that the model probabilities at a given s are uniform, correct?