NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1916 The Landscape of Non-convex Empirical Risk with Degenerate Population Risk

### Reviewer 1

I have carefully read other reviewers' comments and the author feedback, and I understood that it's important to have these details in the applications section. I really appreciate the additional work done by the authors to improve the writing and contents of this paper, and these work addressed my concerns. Thus, I have raised my score. -------------------------------------------------------------------- Summary: This paper shows an interesting result which connects the landscape of the empirical risk and population risk. This result is an extension of (Song, Mei et al, 2016). The previous result establishes important connections between critical points, especially local minima, of empirical and population risk. These connections are helpful in understanding the landscape of the empirical risk and the behaviors of optimization algorithms like gradient descent. Previous work requires three assumptions: The strongly Morse property of the population risk, and the proximity for the gradient and Hessian of the population risk. The authors relaxed the first assumption and only requires the minimum eigenvalue of the Hessian is away from zero. Thus, this paper further improves the understanding of the landscape in a more general setting. They have found some examples for this new result to be applied to and done experiments to verify the theoretical results. Detailed Comments: 1. The authors spend half of the paper explaining the applications of the main result, but only use less than one page to show the main result without providing any sketch for the proof. From my point of view, this is not a very well-organized paper. The main result section is the core part of this paper, so I would expect more explanations of the main result, e.g., a proof sketch with some intuition about the significance of this result. 2. The contents in the Applications section contains too much detail. I don't think we need to look at these applications that carefully because they are just the applications of the main result. In this section, there are tons of constants like 8/7, 1.91, 0.06 and so on, which I don't think people will be interested in, and you can hide those constants using asymptotic notations or some letters. There are also details that are too technical, e.g., the division of regions, and formulas like line 196. I would suggest the authors hide this detail and make more space for the main result section. 3. The experiments only cover the case of the examples, i.e., the rank-1 matrix sensing and phase retrieval. It even doesn't cover your applications because you have proved the case for a more general matrix sensing. I think it would be better to do experiments on settings that are more general, e.g., the setting you use for the first application. 4. It would be better to have more figures in this paper to make people understand better. One figure is provided for the partition of regions in matrix sensing, so it's better to have another one for phase retrieval. Also, for the main result section, the authors can also use a figure to illustrate the assumptions or results better.

### Reviewer 2

This work is an extension of a work by S. Mei et al. that looks at the landscape of empirical risk. In that work, the authors established uniform convergence of empirical risk to population risk and one to one correspondence between critical points of the two function under strongly Morse assumption. In this work, the authors relax that assumption and allow the Hessian to be be even degenerate, provided that it has a negative eigenvalue. This allows the authors to apply the theory to cases were strongly Morse assumption is violated. The authors show that the assumptions are satisfied in two examples, and in one, they show how one can use quotient manifolds to circumvent the fact that at the global min, the Hessian is degenerate, and still apply the same theorem. One minor issue with the result of the paper: can authors briefly describe how theorem 2.1 can be used to get estimation error bounds? This theorem is very strong in the sense that it provides a result for every connected compact subset of B(l). Therefore it seems reasonable to assume that getting estimation error bounds using this theorem should be doable, but the details should depend on the specific examples and underlying distributions. Establishing one to one correspondence between critical points of the two functions are definitely useful, but the ultimate goal, at least in the examples provided is to get estimation error bounds, and unfortunately it is not discussed at all. The paper is very well written and easy to follow. I did not check the details of the proofs.

### Reviewer 3

The paper considers the problem of characterizing the landscape of empirical risk using the population risk. Particular attention is devoted to the case of a degenerate population risk. Their main theorem proves a close connection between stationary point and local minimizers. The paper improves on state-of-the-art in the characterization of the above mentioned landscape. Previous work could not explain the situation where the Hessian is degenerate (has zero eigenvalues) near a stationary point. The main assumption in the present paper is that the absolute value of the Hessian's minimal eigenvalue must be bounded away from zero. This includes includes (some) degenerate saddle point and strict local minimizers. It does not include degenerate minima, which is surprising (a drawback). The presentation of the paper with a separate, clearly written section on the "main result" is good, since the details are rather technical. In the section "main results", comments on the parameters eps and eta are missing. Need the conditions hold for any positive eps and eta? I think, you require only existence of eps and eta such that the assumptions are satisfied. In that case, the reader would like to see an intuitive description of the meaning of these parameters in the presented examples of matrix sensing and phase retrieval and an explanation of the proof strategy for obtaining the parameters. Is there a standard way to compute the parameters eps and eta?