NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 1459 How regularization affects the critical points in linear networks

### Reviewer 1

The paper analyzes a particular setup of deep linear networks. Specifically, the depth is treated as a continuous variable, and the loss function is the mean squared error. The goal of the paper is to study the behavior of the critical points (number and type) as the regularization parameter changes. Due to the continuous depth setup, the authors can leverage tools from optimal control to characterize the solution of the problem. While the technical aspects of the paper are interesting, the conclusion of the paper seems to be less surprising (stated in lines 222-225), e.g. that the type of the critical points change as a function of lambda. This is expected without relying on such sophisticated analysis: at big lambda we get a convex loss, and thus the index of the critical point of the nonregularized function must eventually turn to zero (when the convex term dominates) from whatever value it had in the nonregularized function. Therefore, it is apparent that the changing lambda is able to affect the type of critical points. Please correct me if I am missing anything here. Despite my concern about the conclusion, I think the techniques used in the paper are interesting and could be useful for researchers who work on analysis of deep networks. In particular, the connection to optimal control via continuous depth idea is fascinating. A minor questions: Why Eq 11 and subsequently Eq 12 and Eq 13 do not depend on lambda? Lambda is still present in Eq 10.

### Reviewer 3

I have had some trouble with this paper as to be honest I am not familiar with this particular strand of the literature. i found the idea interesting though: Ie in effect, they study the properties of the learning dynamics by restating it as an optimal control problem, and then effectively use pertaining ideas (Pontryagin's maximum principle) to study equilibria (which exist due to the linearity of the problem). Some factors struck me as very peculiar though: - Why is it interesting to study linear NNs ? We already know that they have a unique optimal solution (as the authors seem to reconfirm ?) which moreover, we can find in closed-form. Furthermore, nonlinear NN might behave very differently, so studying the linear case might not tell us much about the interesting cases? The authors did not do a good job of explaining the motivation and should consider elaborating on these points. - I did not get the relevance of studying continuous networks (which do not exist). Could the authors instead appeal to discrete optimal control methods (why use 'continuation and what is the approximation error so to speak, by approximating with continuous model?). Did I miss anything? - Indeed I am surprised the authors do not cite works in control / system identification at all, where such linear learning systems have been studied (with much related techniques) for decades. Now, what makes it different from standard control of control-affine system is the fact that the control input is a matrix rather than a vector. However, is that not a case control / system ID has studied too? In summary, I am a bit suspicious of the relevance and novelty of this paper. However, as I am unfamiliar with the related literature in depth, I would say 'in dubio pro reo' with low confidence in my 'verdict'. Minor comments: -line 37 : ' minima' - The authors: explain the term 'continuation'.

### Reviewer 4

The authors provide analysis of linear neural networks with and without regularization using the machinery of Hamilton's equations, and supplement their analysis with numerical results. Their work complements the growing body of literature on the behavior of the critical points of neural networks. Pros: The paper, while technically dense, is fairly easy to read, and the conclusions are clear--especially that of regularization drastically altering the landscape of critical points for linear(!) models. Cons: As always for papers concerning linear models, this reviewer would like some discussion of how these methods might be able to be used to tackle nonlinear networks (if at all). This reviewer also found Figure 1 quite difficult to parse, and would appreciate some extra language/figure engineering to make it clearer (a longer figure caption would probably be fine here). Finally, the authors appear to be analyzing solutions with regularization by first solving for the \lambda = 0 solution, and then locally extending this solution to study critical points near'' it with \lambda > 0. Why should one expect critical points near'' the \lambda = 0 solution to be the sorts of critical points found by a gradient descent algorithm? Said another way, this method seems to bias the analysis towards a certain subclass of critical points, but perhaps the reviewer is just confused. The reviewer would gladly increase the score upon further improvement and polishing of the manuscript.