NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5317
Title:Efficiently avoiding saddle points with zero order methods: No gradients required

Reviewer 1

Strengths: A well-written paper, with natural flow of reasoning. First, problems with black-box approaches are pointed out. Then it is shown that if the algorithm for well-behaved functions converges, then it converges to the local minimum a.s. Then it’s shown that the algorithm converges a.s. Appendix: boxes with the current goal really help to follow the paper. Weaknesses: -- Experimental section seems redundant. It doesn’t contain precise data which one can use for comparison with other approaches. Convergence to a local minimum is proved, there is no need to show this experimentally. -- Why is parameter \beta needed? It is always present together with either B or g, and therefore can be removed. -- Section B.2 of the appendix is devoted to the proof that a choice of h_0 can be arbitrary, not necessarily random. It would be helpful to provide a motivation (why is this fact useful), since from practical point of view it’s not a problem to initialize h randomly. -- Line 260: “In a typical zero order approach, one could resort to expensive line searches to determine the appropriate value of the gradient approximation accuracy in each step which guarantees the decrease of f”. A reference or other clarification should be given. Suggestions: -- Briefly describe black-box approaches (FPSG, ZPSG) in appendix, before proving Lemmas 6 and 7. -- For proofs and theorems in appendix, which are similar to [JGN+17], it may be potentially better to specify this similarity (e.g. “Similarly to [JGN+17] we establish the following result”) -- Algorithm 1: why special symbol \bot is needed? It can be replaced with x_t.

Reviewer 2

The paper proposes a zero-order optimization method that avoids saddle points efficiently. This is a practically important problem since many of the practical problems are non-differentiable of it is too costly to compute the gradient. Previously published methods of this category are not efficient and scale as O(d^4) with respect to the number of dimensions. The proposed algorithm is is a medley of different finite difference methods combined to maintain converance and efficiency, I really liked the related work session that covers many of the most resent works on the second-order defivative and dereivative free optimization. Generally, overall the paper flows very well and it a pleasure to read. The problematic part for me is virtual lack of the experimental secion. The only real experiment is done in the two dimensional Rastrigin function, which is clear toy example. More importantly, this function is differentiable, which defeats the purpose of zero-order methods. In fact the paper only very briefly mentions the application of the zero-order methods (line 33-41). It is suprizing that none of these applications made it as a experiments for the proposed very practical algorithm.

Reviewer 3

This paper provides interesting results for zero-order methods in non-convex optimization. Under some common assumptions, it shows that the Approximate Gradient Descent algorithm leads us to a second-order stationary point asymptotically with probability 1. It also shows that the PAGD algorithm in this paper can speed up the convergence by escaping strict saddle points efficiently. The convergence guarantee matches that of first-order methods, and experiments were done to verify their theoretical findings. The results in this paper are improvements to previous results, e.g., [JGN+17], [LPP+17] and [JLGJ18]. Specifically, this paper extends previous results to zero-order methods where one do not have access to the gradient. The results are interesting because the zero-order case is harder than the previous ones. This paper is generally written in a clear way and it's easy to understand. However, the writing style of the proof sketch in part 4 and part 5 seems a bit different. In part 4, the authors seem to use a lot of theorems and lemmas, but in lemma 5 there are mostly words. I would suggest the authors hide some details of section 4. Typo: line 270, "i.e. that is" -> "i.e., ".