NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, Vancouver Convention Center
Paper ID: 1398
Title: Initialization of ReLUs for Dynamical Isometry

### Reviewer 2

The authors analyze the propagation of low-order moments in artificial neural networks, with a particular focus on networks with ReLU activation functions. The difference between the proposed approach and those used in previous works is that they consider a non-asymptotic regime; i.e., the authors do not require the width of each layer to go to infinity. Instead, the authors show that for Gaussian weight ensembles, the distribution of the pre-activations conditioned on the preceding layer's post-activations is Gaussian. They then recursively compute the distribution of the 2-norms of the outputs and the expected covariances, both conditioned on the inputs. The authors analyze the change of measure between two layers using an integral operator and study its spectrum. They use these results to propose an initialization scheme for ReLU networks which allows them to train very deep networks without skip connections and batch normalization.

The results in this manuscript are interesting; however, their relation to preceding work, both cited and uncited, ought to be better explained. Firstly, regarding the novelty of the main contribution: previous work by Pennington et al. has devised isometric and nearly isometric initializations for ReLU networks by shifting the bias appropriately. Another approach, essentially identical to the one taken in this work, has emerged from the Shattered Gradients paper [Balduzzi]. While those considerations were not explicitly motivated by mean-field theory, they do consider two-point correlations just like the current work, casting doubt on the novelty of this work. Secondly, the relation between the integral operator and the Jacobian matrix used in the previous mean-field-theory papers is not made explicit. Both the authors of the work under review and the authors of the mean field papers use a weak-derivative operator, but either focus on the change of measure or treat it as a random matrix. It should therefore be stressed that the approach proposed in this work is another interpretation of the same method applied to the same problem. This in no way detracts from the value of the paper, and making it explicit would only strengthen its connections to the existing literature.

Finally, a few small issues make it harder to understand the authors' points:
* Theorem 1 is presented in an overly dense fashion.
* Independence and distribution claims are sometimes ambiguous (conditional independence? distribution with respect to the random weights?).
* The N-fold convolution $p^{*N_{l-1}}_{\phi(h_y)}$ and the L-fold application of $T_l$ should be clearly explained.
* The generalized inverse $\phi(\cdot)^{-1}$ is not defined.
* Figure 2 is not clearly labeled; I would suggest changing the colors and/or the labeling.
* Figures 3 and 4 would benefit from having the legend outside the axes, to make it clearer that the labels apply to both.
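A minimal numerical sketch of the single-layer moment-propagation step the review summarizes, assuming a generic He-style Gaussian scaling $W_{ij} \sim \mathcal{N}(0, \sigma^2/N)$ with $\sigma^2 = 2$ (not necessarily the initialization the paper proposes): conditioned on the previous layer's output $x$, each pre-activation entry is Gaussian with variance $\sigma^2 \lVert x \rVert^2 / N$, and the expected squared norm after the ReLU is preserved.

```python
import numpy as np

# One layer of moment propagation, conditioned on a fixed input x from the
# previous layer. Uses generic He-style scaling W_ij ~ N(0, sigma^2 / N) with
# sigma^2 = 2 (an assumption for illustration, not the paper's exact scheme).
rng = np.random.default_rng(0)
N = 256                       # finite layer width; no infinite-width limit is taken
sigma2 = 2.0
x = rng.standard_normal(N)    # fixed post-activation vector of the previous layer

n_trials = 2000
h1_samples = np.empty(n_trials)   # first pre-activation entry across draws of W
sq_norms = np.empty(n_trials)     # ||ReLU(h)||^2 across draws of W
for t in range(n_trials):
    W = rng.normal(0.0, np.sqrt(sigma2 / N), size=(N, N))
    h = W @ x                     # pre-activations; Gaussian given x
    h1_samples[t] = h[0]
    sq_norms[t] = np.sum(np.maximum(h, 0.0) ** 2)

print("Var[h_1 | x]         empirical:", h1_samples.var(), " predicted:", sigma2 * (x @ x) / N)
print("E[||ReLU(h)||^2 | x] empirical:", sq_norms.mean(), " predicted:", sigma2 * (x @ x) / 2)
```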

### Reviewer 3

Originality: I think the proposed initialization is very interesting and seems to work well for fully-connected networks with ReLU. Could it work well for conv-nets and perform comparably to ResNets (on CIFAR-10 and ImageNet)? Also, several theorems in this paper seem to be known from previous works, and it would be better to phrase them as lemmas: Thm 2 is standard and has been widely used in recent mean field/overparameterization papers; Thm 3 seems to be similar to Cor. 1 on page 8 of [HR]; Equations 9 and 10 of Thm 4 seem to be known from [CS].

Quality: the biggest question I have is: can the framework of this paper (Theorems 1 and 4) give us new insights that we cannot obtain from the recent mean field papers, [HR], [H], etc.? Also, Sections 3 and 4 seem to be only loosely related to the theory part, i.e. Section 2; they are mostly about dynamical isometry. Could you explain a tighter connection?

Clarity: the paper is well-written.

Significance: the impact of the paper could be improved if the authors could: 1. use Theorems 1 and 4 to obtain new insights about finite-width networks; 2. illustrate the success of the initialization method on CIFAR-10 and ImageNet.

[HR] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture.
[H] Boris Hanin. Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?
[CS] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, 2009.
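A short check of the known [CS] identity the review points to: for $w \sim \mathcal{N}(0, I_d)$, the ReLU correlation $\mathbb{E}[\mathrm{ReLU}(w \cdot x)\,\mathrm{ReLU}(w \cdot y)]$ equals $\tfrac{1}{2\pi}\lVert x\rVert \lVert y\rVert (\sin\theta + (\pi - \theta)\cos\theta)$, i.e. half the first-order arc-cosine kernel of Cho and Saul. Whether Equations 9 and 10 of Thm 4 reduce to exactly this form is the reviewer's conjecture; the sketch below only illustrates the [CS] formula itself.

```python
import numpy as np

# Monte Carlo check of the first-order arc-cosine kernel identity from [CS]:
# for w ~ N(0, I_d),
#   E[ReLU(w.x) * ReLU(w.y)] = ||x|| * ||y|| * (sin(t) + (pi - t) * cos(t)) / (2*pi),
# where t is the angle between x and y.
rng = np.random.default_rng(1)
d = 8
x = rng.standard_normal(d)
y = rng.standard_normal(d)

W = rng.standard_normal((500_000, d))                     # samples of w
mc = np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

nx, ny = np.linalg.norm(x), np.linalg.norm(y)
t = np.arccos(np.clip((x @ y) / (nx * ny), -1.0, 1.0))    # angle between x and y
closed = nx * ny * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2.0 * np.pi)

print("Monte Carlo estimate:", mc)
print("Closed-form value:   ", closed)
```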