| Paper ID | 3636 |
|---|---|
| Title | Large Scale Structure of Neural Network Loss Landscapes |

The authors propose a phenomenological model of the loss landscape of DNNs. In particular, they model the landscape as a set of high-dimensional wedges whose dimension is slightly lower than that of the full space. The authors first build a toy model in which they show that the assumptions hold. They then run experiments with a simple CNN model to show the behavior of the loss landscape, and they show how the optimizer traverses the loss landscape for common hyperparameter choices.

Overall, I think the paper is well written and the different observations are sound. I like the view of the loss landscape presented in this paper. I also like the experiments that show the behaviors of the different hyperparameters. One thing that would have excited me more is a discussion of how this view of the loss landscape helps with training (I know it is hard to comment on immediately).

Minor Comments:

* Typo in line 8: a similar ways -> similar ways
* Line 191: is it SimpleCNN or a simple CNN model?

After Author Response: I thank the authors for their response. I appreciate that you have run more experiments and that the results look promising. But at the moment I am not going to update my scores without seeing the results.

This paper studies the large-scale structure of the neural network objective function (loss landscape). It uses a new idea not only to confirm some known properties of neural network loss landscapes but also to introduce some new properties. The authors use the idea of wedges to show three previously known properties of loss landscapes, which they call (1) long and short directions, (2) distributed and dense manifold, and (3) connectivity.

I should say that, because of the poor structure of the paper, I could not understand the core part of this paper, in which they construct the wedges. All arguments in this paper are based on the concept of wedges. Although the authors present some nice pictures to make their idea easier to understand, the text and the pictures are not connected. For example, I do not understand why wedges are disks in Figure 1 when the wedges are supposed to be cuboid (maybe I do not understand the meaning of cuboid here). Another assumption on which all other arguments in the paper are built is that the loss value stays constant on a wedge. The authors also talk about long and short linear dimensions, which I cannot connect to the rest of the paper. These terms and concepts need to be explained and defined more carefully.

This paper takes a unique approach and aims high. In deep learning, there are many intriguing empirical observations already known; the road most travelled to understand them is to prove these observations under certain assumptions, while the authors choose to link these observations through a descriptive model that otherwise could have nothing to do with neural networks. I appreciate this unique approach, which is actually a dominant approach in other sciences such as physics. The descriptive model in this paper, if more accurate than not, could potentially simplify the conceptual understanding of optimization for deep learning and motivate new algorithms.

There are two reasons why I cannot recommend this paper more enthusiastically.

1) As I'm sure the authors understand, the approach taken always runs the risk of overfitting to the previously observed phenomena. To borrow a quote, all models are wrong but some are interesting. The authors do a good job of showing that the model fits the previously observed phenomena, but this does not convince me that the model is interesting. The authors are unable to propose improvements to current optimization algorithms, or to make nontrivial and sufficiently interesting new discoveries at the same level as, say, large loss along a linear interpolation vs. low loss along a curved path.

2) The exposition is messy and difficult to understand. I struggle, for example, to understand how the construction in 3.2 gives rise to the structure described in 3.1 and the radial paths in Figure 1. I think this is because the authors mostly use words to describe things that can be described more precisely with mathematics. Expressions such as "Cosines between vector deviations of points on a low-loss connector between optima from a linear interpolation" can be very confusing in words. The experimental figures are also confusing. I am still unclear about exactly what procedures are used to make those plots.

Altogether, I think this work has much potential but is limited by the two problems I outlined. In its current shape, I still think it should be published as a poster just in case it inspires someone.
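
To illustrate the point about wording versus mathematics, here is one possible reading of the quoted expression "cosines between vector deviations of points on a low-loss connector between optima from a linear interpolation", written as a short sketch. It is an assumption about what the phrase means, not the authors' confirmed procedure: each connector point is compared with the straight-line interpolation between the two optima at the same fraction `t`, and cosines are taken between the resulting deviation vectors. All names (`lerp`, `deviation`, `deviation_cosines`) and the toy numbers are hypothetical.

```python
import math

def lerp(a, b, t):
    """Point at fraction t on the straight line from a to b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def deviation(point, a, b, t):
    """Offset of a connector point from the linear interpolation at the same t."""
    return [p - q for p, q in zip(point, lerp(a, b, t))]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def deviation_cosines(path, a, b, ts):
    """Matrix of cosines between the deviations of all connector points."""
    devs = [deviation(p, a, b, t) for p, t in zip(path, ts)]
    return [[cosine(u, v) for v in devs] for u in devs]

# Toy data (made-up numbers, not from the paper): two "optima" in 3-D
# and three points on an assumed low-loss connector between them.
theta_a = [0.0, 0.0, 0.0]
theta_b = [2.0, 0.0, 0.0]
ts = [0.25, 0.5, 0.75]
path = [[0.5, 1.0, 0.0], [1.0, 2.0, 0.0], [1.5, 0.0, 1.0]]

cosines = deviation_cosines(path, theta_a, theta_b, ts)
```

In this toy setup the first two connector points deviate from the straight line in the same direction, while the third deviates in an orthogonal one, so the cosine matrix separates them. A definition at this level of precision in the paper would remove the ambiguity in the quoted phrase.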