Paper ID: | 5131 |
---|---|

Title: | Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel |

Summary: The paper studies the generalization and optimization aspects of regularized neural networks, and provide two key contributions: (a)they show that a O(d) sample complexity gap between global minima of regularized loss and the induced kernel method. (b). They also establish that in infinite-width two-layer nets, a variant of gradient descent converges to global minimum with of (weakly) regularized cross entropy loss in poly iterations. Detailed comments: 1. The paper studies a natural and important problem and makes fundamental contributions in this direction. Recent results in deep learning theory exploits this neural tangent connection to prove optimization and generalization results. In light of this, it is important to study the limitations of this. This highlights that recently popular neural tangent kernel view has its own drawbacks and perhaps(?) insufficient to explain the success of modern machine learning. 2. The paper is very well-written with the problem setting and intuitions clearly explained. In particular, the authors explain the intuition behind their lower bound: why the induced kernel method will require more samples. 3. The technical content is solid and rigorous, with novel contributions of tools which could be useful in general. Moreover, the high level ideas are well succinctly explained, for example, the key idea they exploit for the upper bound is showing that global minimizer of regularized loss converges to max margin solutions. Another important technical contribution they remark is a new technique for proving lower bound of kernel methods which is explained done exploiting the fact that the predictor in the RKHS lies in the span of the data. I skimmed through the proofs of generalization, and the arguments seemed fine and well written to me. 4. A good thing is that the authors remark the generality of the result presented; for example, the sample complexity gap is showed to hold for multi-class classification and regression. This answers natural questions on how general are the ideas to extend. The authors also remark these results can potentially hold even when the only regularization is the implicit regularization of optimization. 5. The only section that required multiple re-reads for me is Section 3. Perturbed Wassertien Flow. Perhaps this is because I am not familiar with relevant literature, but I had a feeling that it was rushed and should have been explained/presented better. For example, there is no intuition provided how eqn 3.1 came about or what it means. Similarly, for eqn 3.2, the authors explain what U is but what about the first term? A few more lines explaining this will help readability. This is important because it provides more accessibility to a broad audience as NeurIPS. Moreover, the authors don't give intuition/details of proof techniques for the optimization result in the main paper. I hope they polish section 3 in the revision. Minor comment: Please explain somewhere why it is called "weak" regularization.

This paper shows that regularization is helpful for the sample complexity in some settings. Specifically, they consider two-layer ReLU networks, and carefully construct a data distribution such that the optimal regularized neural net learns with O(d) samples while the NTK network requires \Omega(d^2) samples. Basically, their data distribution has very sparse features such that the regularization helps to find informative features while NTK overfits to noise. The authors also show that for infinite-width two-layer networks, noisy gradient descent can find global minimizers of the l_2 regularized function within polynomial time. For the proof, in order to upper bound the sample complexity for regularized neural nets, the authors show that with a small regularizer (goes to zero), the global minimizer of regularized logistic loss will converge to a maximum margin solution. This holds for any positive homogeneous predictor, including multi-layer ReLU nets. The authors develop new techniques to lower bound the sample complexity of NTK. These techniques can be applied to other cases. Overall, this is a good theory paper addressing important problems with novel techniques. The paper is well written. Here are my main comments: 1. The optimization result of the infinite-width nets with regularized loss function requires the activation function to be differentiable, which does not hold for ReLU. I was wondering whether some smoothing techniques can resolve this issue. It might be better to discuss the challenges to extend to ReLU nets and the possible approaches. 2. Theorem 4.3 only means using a wider network won’t hurt the generalization bound. It will be more interesting if the authors can show using a wider network can strictly improve the maximum margin achievable. Of course, this may not be true. The authors can discuss a bit more about the possibility of establishing a gap of generalizations between narrow nets and wide nets. Here are some minor comments: 1. It might be better to move Figure 1 to page 4 since the figure is an illustration of the distribution. 2. Line 44: ``than than’’ 3. Line 126: ``a** ``a****b’’ ------------------------------------------------------------------ I thank the authors' clarifications, which have successfully addressed my questions. I will keep my score as it is.**

Originality: One of the novel contribution of this paper is the proof of a lower bound of the sample complexity of the NTK, which requires to explore the following observation mathematically: NTK could not focus on the informative signal of the inputs. Quality: I think this is a high quality paper, which contains a lot of contents. In my opinion, Theorem 2.1 alone together with some additional experiments would suffice to make an interesting paper. Clarity: The paper is very well-written. Thought it is quite technical, the authors make some effort in explaining the intuition and ideas behind some of the technical proofs. Significance: I think this is a good paper that helps us better understand the NTK and its limitation. -------------------------------------------------------------------------------------------- Update: I will keep my score.