NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7751
Title: Semi-flat minima and saddle points by embedding neural networks to overparameterization

Reviewer 1

+++ Post-rebuttal update: I agree with the other reviewers that the clarity of some of the definitions and results in this paper should be improved, but I do not think this is a significant argument in favor of rejection. My own concerns have been addressed successfully in the rebuttal, and if the authors revise their work as promised, I think it is in very good shape. I will thus keep my score as it is. +++

This work generalizes the seminal Fukumizu & Amari [4] paper from 2000 in the following ways:
- Section 3 studies embedding networks of size H_0 into a wider network of arbitrary size H, compared to just H_0+1 in [4]. Contrary to the case in [4], the networks can be of depth >3, and in this case embeddings (ii) and (iii) no longer necessarily give rise to critical points.
- Theorem 5 studies the specific one-hidden-layer (1HL) case and finds that embedding (i) gives rise to a saddle point.
- Section 4.1 generalizes the original embeddings of [4,14] to the case of ReLU activations and shows that two of them yield more flat directions than with smooth activation functions.
- Section 4.2 shows that a minimum is always embedded as a minimum for the inactive unit method and as a saddle for the replication method. Again, I think this holds only for 1HL networks, but the theorems do not specify this.
- Finally, Section 5.2 makes an argument for better generalization properties of ReLU networks (compared to networks with smooth activations) when embedding a zero-residual minimizer. I am not an expert in generalization bounds, so I will not comment on this matter and thus enter a lower overall confidence score.

I consider these results to be interesting contributions. Yet the generalization to deep nets is made only for the embeddings of critical points in general, not for the embeddings of minima. The ReLU results are nice but at the same time not really surprising. In summary, I think the contribution is somewhat incremental but still valid, and the presentation of the results is well done.
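For readers unfamiliar with the embeddings discussed above, the following is a minimal sketch of why such an embedding leaves the network function (and hence the training loss) unchanged. It assumes a one-hidden-layer tanh network and uses NumPy. The function names, shapes, and the splitting parameter `lam` are illustrative assumptions, not notation taken from the paper. The second helper adds a unit whose output weight is zero; which of the paper's named embeddings this corresponds to is not confirmed here.

```python
import numpy as np

def forward(W, b, V, c, X):
    """One-hidden-layer tanh network: X of shape (n, d) -> outputs of shape (n, m)."""
    return np.tanh(X @ W.T + b) @ V.T + c

def replicate_unit(W, b, V, j, lam):
    """Duplicate hidden unit j and split its output weights as lam and (1 - lam)."""
    W_wide = np.vstack([W, W[j:j + 1]])                   # copy input weights of unit j
    b_wide = np.append(b, b[j])                           # copy its bias
    V_wide = np.hstack([V, (1.0 - lam) * V[:, j:j + 1]])  # new unit gets (1 - lam) share
    V_wide[:, j] = lam * V[:, j]                          # original unit keeps lam share
    return W_wide, b_wide, V_wide

def add_zero_output_unit(W, b, V, w_new, b_new):
    """Add a hidden unit with arbitrary input weights and zero output weights."""
    W_wide = np.vstack([W, w_new[None, :]])
    b_wide = np.append(b, b_new)
    V_wide = np.hstack([V, np.zeros((V.shape[0], 1))])
    return W_wide, b_wide, V_wide

# Hypothetical sizes: d = 3 inputs, H_0 = 4 hidden units, m = 2 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
V, c = rng.normal(size=(2, 4)), rng.normal(size=2)

y_narrow = forward(W, b, V, c, X)
y_rep = forward(*replicate_unit(W, b, V, j=1, lam=0.3), c, X)
y_zero = forward(*add_zero_output_unit(W, b, V, rng.normal(size=3), 0.7), c, X)
```

Both wide networks compute the same outputs as the narrow one on every input, for any value of `lam` and any choice of the added unit's input weights. This only illustrates function preservation; whether the embedded point is a minimum or a saddle of the wide network is the question the paper addresses.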

Reviewer 2

This paper studies the landscape of the training loss function for neural networks. Specifically, the authors consider three methods for embedding a network into a wider one and investigate how the minima of the narrow network behave on the loss landscape of the corresponding wide network. The theoretical results show that networks with ReLU activation have flatter minima, which suggests better generalization performance.

This paper has the following issues:
1. The contributions and results of this paper are not significant and important enough. The goal of this paper is not clear. Specifically, the three proposed embedding methods cover only a small subset of the functions representable by a wide neural network. In fact, most of the stationary points found by optimization algorithms when training wide neural networks do not correspond to stationary points of some smaller neural network.
2. The presentation of this paper is not clear. The authors propose three embedding methods (unit replication, inactive units, and inactive propagation) but do not clearly state which one is used in the major theoretical results (Theorems 5 and 9). The authors should identify which embedding method is applied in these theorems and briefly discuss the corresponding theoretical results (comparison between different embedding methods).
3. For smooth activation and ReLU activation, the authors consider different embedding methods. The authors should clearly identify the difference and briefly discuss why such a difference is necessary.
4. The comparison between generalization bounds for networks with ReLU activations and smooth activations is not fair, because the results are derived using different choices of the distributions P and Q. The authors should also discuss the generalization performance using the same choice of P and Q.
5. The experimental setting is not consistent with the theoretical results. In the experimental part, the authors set the output dimension to 1; however, the statement of Theorem 5 requires that the output dimension be greater than 1.

After reading rebuttal: The authors have answered my concern regarding the expressive power of the proposed embedding methods. I would like to increase my score to 5.

Reviewer 3

The paper considers three ways to overparameterize neural networks (unit replication, inactive units, and inactive propagation) and analyzes the corresponding loss landscapes. Both ReLU and smooth activation functions are considered. By employing PAC-Bayesian theory, the paper shows that ReLU achieves better generalization than Tanh. The paper is interesting. However, I have the following concerns:
1. What is the definition of "semi-flatness"? It does not seem to be clearly defined in the main text. What is the difference between "semi-flatness" and "flatness"?
2. Employing PAC-Bayesian theory to explain the benefits of flat minima has already been done in previous literature (for example, [1]). The authors may oversell their contribution on generalization error bounds (i.e., Section 5.2). [1] Neyshabur et al. Exploring generalization in deep learning. NeurIPS 2017.
3. The paper seems to show that the unit replication method is not good, since it introduces saddle points. In contrast, the inactive units method is good, since it embeds minima as semi-flat minima. How does the number of added units (replicated or inactive) affect the landscape? For example, how does adding more inactive (replicated) units relate to the flatness (saddle point)? How does this guide us in taking advantage of specific ways of overparameterization?

=======POST REBUTTAL======== I have read the rebuttal and would like to keep the score.
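As a hedged illustration of the kind of flat directions at issue in concerns 1 and 3, the sketch below adds one hidden unit that is inactive on all data points and then perturbs that unit's input weights, bias, and output weight together. It does not reproduce the paper's definition of semi-flatness; the setup (bias-free output layer, inputs in [-1, 1]^2, a bias of -5 for the added unit) and all names are assumptions chosen so that the added unit's pre-activation stays negative on the data.

```python
import numpy as np

def net(act, W, b, V, X):
    """One-hidden-layer network with activation `act`: X (n, d) -> (n, m)."""
    return act(X @ W.T + b) @ V.T

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))   # inputs bounded in [-1, 1]^2
W, b, V = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=(1, 3))

# Added unit: zero input weights, bias -5, zero output weight.
W_wide = np.vstack([W, np.zeros((1, 2))])
b_wide = np.append(b, -5.0)
V_wide = np.hstack([V, np.zeros((1, 1))])

# Perturb only the added unit: input weights, bias, and output weight.
W_pert, b_pert, V_pert = W_wide.copy(), b_wide.copy(), V_wide.copy()
W_pert[-1] += np.array([0.5, -0.5])
b_pert[-1] += 0.5
V_pert[0, -1] += 2.0

# Pre-activation of the added unit is at most 1 - 4.5 < 0 on this data,
# so the ReLU unit outputs exactly zero before and after the perturbation.
relu_unchanged = np.allclose(net(relu, W_wide, b_wide, V_wide, X),
                             net(relu, W_pert, b_pert, V_pert, X))

# With tanh the added unit's output is nonzero, so changing its output
# weight changes the network function.
tanh_unchanged = np.allclose(net(np.tanh, W_wide, b_wide, V_wide, X),
                             net(np.tanh, W_pert, b_pert, V_pert, X))
```

In this toy setting the ReLU network's outputs are unchanged by the joint perturbation, while the tanh network's outputs are not. Adding more such inactive units would add more parameters of this kind, which is one concrete way to read the question of how the number of added units affects flatness.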