NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 2592
Title: The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network

Reviewer 1

1) Summary

This paper computes the spectrum of the Fisher information matrix of a single-hidden-layer neural network with squared loss, under two main assumptions: i) the number of hidden units is large and ii) the weights are Gaussian. Some further work is dedicated to a conditioning measure of this matrix. It is argued that neural networks with a linear activation function suffer from worse conditioning than those with non-linearities.

2) Major comments

I am overall very impressed by this article. Like the Pennington and Worah NIPS 2017 paper cited in the article, it successfully uses random matrix theory to analyze the behavior of neural networks, while going one step further. Even though the model is simple, the results are rigorous and quantitative. My main concerns are about the assumptions made in the main result. Detailed comments follow.

- While I find the argument to ``consider the weights as random'' (ll. 52--58) very convincing, I think that the Gaussian assumption is quite restrictive. Surely, it is valid during the first steps of training (especially when the weights are initialized at random as Gaussian random variables). But my feeling is that after a very short time training the network, the weights are no longer Gaussian. More precisely, while the `randomness' could very well be Gaussian, some structure should emerge and the zero-mean assumption should fail. Do the authors have any comments / experiments regarding the evolution of the distribution of the weights during training?
- A closely related question: keeping the Gaussian distribution as a first step, I am curious as to how robust the proof mechanism is to introducing some mean / variance structure in the distribution of the weights.
- Th. 1 is very hard to read at first sight, and Section 2.5 is very welcome. Is it possible to prove another intermediate result / corollary for a non-linear activation function that is also explicit?
- Reading the statement of Th. 1, it seems that f needs to be differentiable. In particular, ReLU-like activation functions would be excluded from the present analysis. However, srelu_0 is investigated in Table 1. Can you clarify (even briefly) what regularity assumptions one needs on f, if any?
- Reading the proof of the main result, I had a lot of trouble figuring out the graph structure associated with the indices in the trace. As the article puts it, the figure of a circle with arcs is ``deceptively simple.'' I suggest providing a slightly more involved example than \trace{M_1M_2M_1M_2} in order to help the interested reader.
- A lot of details from the proof of the main result are missing, with the text often pointing to the Pennington and Worah NIPS 2017 paper (in particular for lemmas 4, 5, and 6). While I trust that the maths are correct, I think that the article should be self-sufficient, especially since there is no limit on the size of the supplementary material.

3) Minor comments and typos

- l. 81: Eq. (3): I think there is a missing Y in the law under which the expectation is taken.
- l. 83: J^\top J should read \expec{J^\top J}.
- l. 103: Eq. (5) vs Eq. (8): please be consistent with your convention for the integral.
- l. 111: Eq. (8): the far right-hand side can be confusing. I would suggest adding a pair of parentheses.
- l. 160: ``it's'' -> ``it is''.
- l. 189: Eq. (27): missing punctuation.
- l. 335: the definition of \mu_1 and \mu_2 is a bit far away at this point.
- l. 388: Figure S1: missing index, M should read M_2 on the right side of the figure.
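
Regarding the major comment on mean / variance structure in the weights, a small numerical check along the following lines might be a cheap first step before attempting a proof. This is only a sketch under assumptions that are not taken from the paper: the network is y = W2 tanh(W1 x) with Gaussian inputs, only the Jacobian block with respect to the first-layer weights W1 is considered, and the function name `fisher_spectrum`, the sizes, and the shift parameter `mean` are hypothetical choices made for illustration.

```python
import numpy as np


def fisher_spectrum(n0=10, n1=10, n2=10, m=1000, mean=0.0, seed=0):
    """Eigenvalues of an empirical J^T J for y = W2 tanh(W1 x).

    Only the Jacobian block w.r.t. the first-layer weights W1 is used.
    `mean` shifts every entry of W1 (hypothetical knob, not from the paper).
    """
    rng = np.random.default_rng(seed)
    # Gaussian weights with 1/sqrt(fan-in) scaling (assumption); a non-zero
    # `mean` breaks the zero-mean assumption discussed in the review.
    W1 = mean + rng.standard_normal((n1, n0)) / np.sqrt(n0)
    W2 = rng.standard_normal((n2, n1)) / np.sqrt(n1)
    X = rng.standard_normal((m, n0))      # Gaussian inputs (assumption)
    D = 1.0 - np.tanh(X @ W1.T) ** 2      # tanh'(z) per sample, shape (m, n1)
    # J[s, k, j, i] = d y_k / d W1[j, i] = W2[k, j] * tanh'(z_j) * x_i
    J = np.einsum("kj,sj,si->skji", W2, D, X).reshape(m * n2, n1 * n0)
    H = J.T @ J / m                       # sample average over the inputs
    return np.linalg.eigvalsh(H)          # real eigenvalues, ascending order


# Compare a zero-mean draw with a shifted-mean draw of the first-layer weights.
for mean in (0.0, 0.5):
    ev = fisher_spectrum(mean=mean)
    print(f"mean={mean}: largest eigenvalue {ev[-1]:.3f}, trace {ev.sum():.3f}")
```

Comparing histograms of the returned eigenvalues for several values of `mean` (or for weights saved at different points of training) would indicate empirically whether the spectrum is sensitive to the zero-mean assumption. It would not, of course, replace an analysis of the proof.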