Paper ID: | 1398 |
---|---|
Title: | Initialization of ReLUs for Dynamical Isometry |

***Additional comments after author response:*** I expressed no major concerns in the review that needed addressing in the author response. The response did elaborate on the relationship between the approaches to ReLU initialization considered and the earlier portion of the paper; this should be made clearer in the paper. However, as pointed out by the other reviewers, the structure in the proposed Gaussian submatrix initialization was previously proposed in Balduzzi et al. [2].

---

Paper overview: This paper considers the problem of neural network initialization. It analyzes how signals are transformed through the layers of a feedforward neural network, assuming weights are initialized from Gaussian distributions. Previous work used a mean-field assumption to study these dynamics, and used the results to identify parameters for the Gaussians that ensure stable propagation of the mean of the signal variance through the layers, a necessary condition for training deep networks. This work considers how the distribution of the initial signal variance is transformed through the layers of the network. This is done by introducing an integral operator with an activation-dependent kernel to model the transformation. This operator can be viewed as a generalization of the input-output Jacobian considered in other work, and so its spectrum is of interest for understanding stable signal propagation.

Having established these results, the authors consider ReLU networks specifically for the remainder of the paper. They derive the integral operator kernel for ReLU and analyze its spectrum. The analysis shows that there is no way to guarantee stable signal propagation with regular Gaussian initialization in general (i.e. not just in expectation). Further study of the correlation dynamics illustrates that the correlations increase as one moves through the network (as was previously known), but also that narrower networks exhibit higher variance in the correlation distribution.
Thus it is possible that networks with narrower layers might be able to more easily capture and maintain small correlations, yielding better trainability. (On the other hand, the authors also point out that broader layers protect against operator eigenvalues larger than one, which are problematic for stable gradient propagation.)

The authors next propose a novel initialization scheme for deep ReLU feedforward networks. The approach uses parameter sharing to essentially set up the initialized network to behave like a suitably initialized linear network with half as many neurons in each layer. This allows negative correlations to be implicitly captured and propagated using the “duplicated” neurons. The required parameters can now be initialized either through suitable Gaussian initialization or through orthogonal initialization. Empirical results on MNIST and (clipped) CIFAR-10 give preliminary indications that the proposed initializations train better at moderate depth than the standard He initialization.

Originality: The paper is the first I have encountered that explicitly models the dynamics of the finite-width signal variance distribution. The resulting integral operator and kernel that are studied in the context of ReLU are thus in some sense new objects being studied for the first time. The proposed new ReLU initializations combine a neat parameter-sharing idea with existing approaches to initializing neural networks.

Quality: This is high quality work, and I did not detect any noteworthy technical issues. One minor criticism is that the ReLU initialization proposed does not really use the machinery from the earlier part of the paper, even if it might have been inspired by it(?). Furthermore, because the weight sharing violates the iid assumption on the weights, the traditional theory for analyzing signal propagation is not strictly applicable.
As a result, the paper feels a bit like a mix of two papers: a development of some fascinating theory (which could probably do with more details provided), and an empirical study of a proposed initialization scheme which performs well, but arguably could do with a little more theoretical underpinning. The authors also do not mention the (non-)applicability of existing theory in the weight sharing setting, which I think should be discussed.

Clarity: The paper is generally well-written; I list some minor corrections at the end of my review.

Significance: I think the idea of analyzing the dynamics of the signal distribution could be quite influential, and is likely to be developed further. The proposed initialization scheme certainly seems to warrant larger-scale investigation; if it holds up it may become a new standard for initializing ReLU networks.

Other corrections/suggestions:

* Thm 1: “as transformation by” -> “as transformed by”
* L.109: k_l is referred to simply as k
* L.127: correct “give rise of the signal propagation”
* L.132 and elsewhere: use \langle and \rangle for inner product
* L.139: x_1 and x_1 -> x_1 and x_2
* Theorem 4: define \delta_0 and p_{\chi_k^2}.
* L.178 and below: x^m in text vs. y^m in Theorem 4 is somewhat confusing; similarly \lambda_{l,m} vs \lambda_{m,l}. Please tidy up notation consistency.
* L.190: it is unclear to me why the eigenfunction converges to the given \delta_0, which seems different to f_{-1} as specified in Thm 4.
* L.198: this is the inner product, not cosine similarity; as you point out, it is unnormalized.
* Figure 2 caption: diamonds, not squares, would be a better description.
* L.241: Note that the He initialization does not yield stable signal propagation in conjunction with dropout. An alternative initialization scheme [1] indicates one should take the dropout rate into consideration. How does that interact with this conjecture?
* L.259: why include \sigma_{w,l} here?
* L.271: “reduce a number” -> “reduce the number”
* Please specify the batch size used for SGD in your experiments.
* L.278: “He Initialization” -> “He initialization”
* Ll.291-292: “more parallel” does not make sense. Perhaps: closer to parallel, more aligned, or more correlated.
* The right-hand plot in Figure 4 does not add much, unlike Figure 3; it may be more valuable to use this space for more details in the text.
* L.287: “but is not” -> “but it is not”
* Please improve the references section (correct capitalization, missing sources such as for reference [4], cite published versions instead of preprints such as for reference [8]).

[1] A. Pretorius, E. van Biljon, S. Kroon, H. Kamper. Critical initialisation for deep signal propagation in noisy rectifier neural networks. NeurIPS 2018.

[2] D. Balduzzi, M. Frean, L. Leary, J.P. Lewis, K.W. Ma, B. McWilliams. The Shattered Gradients Problem: If resnets are the answer, then what is the question? ICML 2017.
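
The parameter-sharing idea summarized in the overview above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the mirrored block structure W = [[W0, -W0], [-W0, W0]] (in the spirit of the construction in [2]), zero biases, equal layer widths, and a variance of 1/n for the Gaussian half-matrix W0; the helper names `mirrored_block`, `gaussian_half`, `orthogonal_half`, `relu_forward`, and `linear_forward` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mirrored_block(w0):
    """Tie the weights of a ReLU layer as [[W0, -W0], [-W0, W0]] (assumed structure)."""
    return np.block([[w0, -w0], [-w0, w0]])

def gaussian_half(n_half, rng):
    """Gaussian W0 with variance 1/n_half (assumed scaling for a linear network)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_half), size=(n_half, n_half))

def orthogonal_half(n_half, rng):
    """Orthogonal W0 from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n_half, n_half)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def relu_forward(half_weights, h0):
    """Propagate the mirrored pre-activation [h0, -h0] through ReLU layers (zero bias)."""
    h = np.concatenate([h0, -h0])
    for w0 in half_weights:
        h = mirrored_block(w0) @ relu(h)
    return h

def linear_forward(half_weights, h0):
    """Reference: the linear network of half the width built from the same W0 matrices."""
    h = h0
    for w0 in half_weights:
        h = w0 @ h
    return h

rng = np.random.default_rng(0)
n_half, depth = 16, 20
h0 = rng.normal(size=n_half)
gauss = [gaussian_half(n_half, rng) for _ in range(depth)]
orth = [orthogonal_half(n_half, rng) for _ in range(depth)]

out_gauss = relu_forward(gauss, h0)
out_orth = relu_forward(orth, h0)

# The top half of the ReLU network tracks the half-width linear network.
print(np.max(np.abs(out_gauss[:n_half] - linear_forward(gauss, h0))))
# With orthogonal W0 the norm of the top half is preserved across layers.
print(np.linalg.norm(out_orth[:n_half]), np.linalg.norm(h0))
```

Because relu(h0) - relu(-h0) = h0, each mirrored layer maps [h0, -h0] to [W0 h0, -W0 h0], which is why the ReLU network behaves like a linear one at initialization under these assumptions.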

The authors analyze the propagation of low-order moments in artificial neural networks, with a particular focus on networks with ReLU activation functions. The difference between the proposed approach and those used in previous works is that they consider a non-asymptotic regime; i.e. the authors do not require the width of each layer to go to infinity. Instead, the authors show that for Gaussian weight ensembles, the distribution of the pre-activations conditioned on the preceding layer's post-activations is Gaussian. They then recursively compute the distribution of 2-norms of outputs and expected covariances, both conditioned on inputs. The authors analyze the change of measure between two layers using an integral operator and study its spectrum. They use these results to propose an initialization scheme for ReLU networks which allows them to train very deep networks without skip connections and batch normalization.

The results in this manuscript are interesting; however, their relation to preceding work, both cited and uncited, ought to be better explained.

Firstly, regarding the novelty of the main contribution: previous work by Pennington et al. has devised isometric and nearly isometric initializations for ReLU networks by shifting the bias appropriately. Another approach, identical to the one taken in this work, has emerged from the Shattered Gradients paper [Balduzzi]. While the considerations there were not explicitly motivated by mean-field theory, they do consider two-point correlations just like the current work, casting doubt on the novelty of this work.

Secondly, the relation between the integral operator and the Jacobian matrix used in the previous papers using mean field theory is not made explicit. Both the authors of the work under review and the authors of the mean field papers used a weak-derivative operator, but either focused on the change of measure or treated them as random matrices.
It should therefore be stressed that the approach proposed in this work is another interpretation of the same method applied to the same problem. Saying so in no way detracts from the value of the paper, and would only strengthen its connections to existing literature.

Finally, a few small issues make it harder to understand the authors' points:

* Theorem 1 is presented in a very dense fashion.
* Independence and distribution claims are sometimes ambiguous (conditional independence and distribution over random weight distributions?).
* The N-fold convolution $p^{*N_{l-1}}_{\phi(h_y)}$ and the L-fold application of $T_l$ should be clearly explained.
* The generalized inverse $\phi(\cdot)^{-1}$ is not defined.
* Figure 2 is not clearly labeled. I would suggest changing the colors and/or labeling.
* Figures 3 and 4 would benefit from having the legend outside, to make it clearer that the labels apply to both.
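
The finite-width point raised in this review can be illustrated with a small Monte Carlo sketch. It is not the paper's integral-operator analysis. It assumes zero biases, equal layer widths, and He scaling (weight variance 2/width), and the function names `he_relu_norm_ratio` and `ratio_samples` are hypothetical. It samples random ReLU networks and records how the squared 2-norm of the signal changes between input and output.

```python
import numpy as np

def he_relu_norm_ratio(width, depth, rng):
    """Return ||x_L||^2 / ||x_0||^2 for one randomly initialized ReLU network."""
    x = rng.normal(size=width)
    start = np.sum(x ** 2)
    for _ in range(depth):
        # Fresh Gaussian weights with variance 2/width (He scaling, assumed here).
        w = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        x = np.maximum(w @ x, 0.0)
    return np.sum(x ** 2) / start

def ratio_samples(width, depth, trials, seed):
    """Sample the squared-norm ratio over independently drawn networks."""
    rng = np.random.default_rng(seed)
    return np.array([he_relu_norm_ratio(width, depth, rng) for _ in range(trials)])

narrow = ratio_samples(width=32, depth=5, trials=1000, seed=0)
wide = ratio_samples(width=128, depth=5, trials=1000, seed=1)

print("narrow: mean", narrow.mean(), "variance", narrow.var())
print("wide:   mean", wide.mean(), "variance", wide.var())
```

Under these assumptions the sample mean of the ratio should stay close to one at both widths, while the spread across networks should be visibly larger for the narrower network. That spread is the quantity a mean-only analysis cannot see.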

Originality: I think the proposed initialization is very interesting and seems to work well for fully-connected networks with ReLU. Could it work well for conv-nets and perform comparably to ResNet (on CIFAR-10 and ImageNet)? Also, several theorems in this paper seem to be known from previous work, and it would be better to phrase them as lemmas: Thm 2 is standard and has been widely used in recent mean field/overparameterization papers; Thm 3 seems to be similar to Cor 1, page 8 of [HR]; Equations 9 and 10 of Thm 4 seem to be known from [CS].

Quality: The biggest question I have is: can the framework of this paper (Theorems 1 and 4) give us new insight that we cannot obtain from recent mean field papers, [HR], [H], etc.? Also, Sections 3 and 4 seem to be only loosely related to the theory part, i.e. Section 2. They are mostly about dynamical isometry. Could you draw a tighter connection?

Clarity: The paper is well-written.

Significance: The impact of the paper could be improved if the authors could: 1. use Theorems 1 and 4 to obtain new insights about finite-width networks; 2. illustrate the success of the initialization method on CIFAR-10 and ImageNet.

[HR] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture.

[H] Boris Hanin. Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?

[CS] Youngmin Cho and Lawrence K. Saul. Kernel Methods for Deep Learning. In Advances in Neural Information Processing Systems, 2009.
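
For context on the kind of result attributed to [CS] above, the following sketch iterates the standard infinite-width correlation map for ReLU that follows from the degree-1 arc-cosine kernel, c -> (sin θ + (π - θ) cos θ) / π with θ = arccos(c). It assumes He scaling and zero biases. Whether this expression coincides with Equations 9 and 10 of Thm 4 is not confirmed here, and the function names `relu_correlation_map` and `iterate_correlation` are hypothetical.

```python
import numpy as np

def relu_correlation_map(c):
    """Infinite-width correlation after one ReLU layer, given input correlation c."""
    c = np.clip(c, -1.0, 1.0)  # guard arccos against rounding outside [-1, 1]
    theta = np.arccos(c)
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

def iterate_correlation(c0, depth):
    """Apply the map repeatedly and return the correlation at every layer."""
    cs = [float(c0)]
    for _ in range(depth):
        cs.append(float(relu_correlation_map(cs[-1])))
    return cs

trajectory = iterate_correlation(c0=0.0, depth=50)
print(trajectory[:5], trajectory[-1])
```

In this mean-field picture the correlation between two inputs increases monotonically with depth toward one. The finite-width question raised in the reviews is how the distribution around this expectation behaves.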