NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center

### Reviewer 2

The paper pulls together a few important recent ideas on optimizing deep neural networks (as an alternative to popular SGD variants). The derivation of the Tikhonov-regularized problem in Eq. (4) and (5) from the recursive objective in Eq. (2), by relaxing the ReLU outputs as a convex projection, is very clean and convincing. The decomposition into an inverse problem (activations), a least-squares problem (weights), and a final classification problem (soft-max weights) is illuminating. This formulation provides a novel perspective on deep learning and opens many interesting avenues, e.g. generalizations of ReLUs, more general connectivity patterns through shortcuts (as suggested in the architecture used), sparsity in network connectivity, etc.

When it comes to optimization, the paper points out that it is difficult to guarantee convergence of a naive alternating optimization scheme (e.g. ADMM) and thus resorts to an easier-to-analyze block-coordinate descent algorithm, presented in Algorithm 1 and analyzed in Section 4. On the plus side, the authors are able to derive a convergent algorithm. On the minus side, there seems to be a lot more to explore here, and the paper is merely a first (somewhat preliminary) step towards exploiting the derived problem formulation.

A proof of concept on MNIST data is given in Section 3. One has to say that the experimental section is relatively weak: one data set, MATLAB vs. Python code, very limited analysis. This calls for a more extensive quantitative (run-time, solution quality) and qualitative investigation. Figure 4, a percentage pie chart, is really confusing. For the final version, a more extensive evaluation would greatly improve the quality of the paper.
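The alternating scheme this review describes can be sketched on a toy two-block problem. The following is a hypothetical simplification (single hidden layer, no ReLU projection, all names mine), not the paper's Algorithm 1: each block is minimized exactly in closed form, which is what makes the objective monotonically non-increasing and the analysis tractable compared with ADMM-style splitting.

```python
import numpy as np

# Toy block-coordinate descent, in the spirit of the decomposition the
# review mentions (inverse problem for activations, least squares for
# weights). Hypothetical simplification, NOT the paper's Algorithm 1:
#   J(W2, a, W1) = ||Y - W2 a||_F^2 + ||a - W1 X||_F^2
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))    # inputs
Y = rng.standard_normal((3, 100))    # targets
W1 = rng.standard_normal((4, 5))     # first-layer weights
W2 = rng.standard_normal((3, 4))     # output weights
a = W1 @ X                           # hidden activations

def objective(W2, a, W1):
    return np.sum((Y - W2 @ a) ** 2) + np.sum((a - W1 @ X) ** 2)

j_start = objective(W2, a, W1)
prev = j_start
for _ in range(20):
    # activation block ("inverse problem"): solve (W2^T W2 + I) a = W2^T Y + W1 X
    a = np.linalg.solve(W2.T @ W2 + np.eye(4), W2.T @ Y + W1 @ X)
    # weight blocks: ordinary least-squares updates
    W2 = Y @ np.linalg.pinv(a)
    W1 = a @ np.linalg.pinv(X)
    cur = objective(W2, a, W1)
    assert cur <= prev + 1e-6        # exact block minimization: J cannot increase
    prev = cur
```

Because every block update is an exact minimizer of its subproblem, each sweep can only decrease J; this monotonicity is the basic ingredient behind convergence arguments for block-coordinate schemes.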

### Reviewer 3

This manuscript proposes a modification of the feed-forward neural network optimization problem when the activation function is the ReLU. My major concern is the convergence proof for the block coordinate descent algorithm. In particular, Theorem 1 is incorrect for the following reasons:

A. It assumes that the sequence generated by the algorithm has a limit point. This may not be the case, as the set U is not compact (closed but not bounded); the sequence may therefore fail to converge to a finite point.

B. It states that the sequence generated by the algorithm has a unique limit point (line 263: "the" limit point). Even if the algorithm has limit points, they may not be unique: consider the sequence x_n = (-1)^n - 1/n, which clearly has two limit points, +1 and -1.

For the rest of the paper, my comments are as follows.

1. The new formulation seems interesting, and it can be discussed separately from the block coordinate descent part; to me the major novelty is the new formulation, not the algorithm. The major concern with the new formulation, however, is that it has many more variables, which can lead to memory infeasibility as the number of data instances grows. This hinders applying the formulation to large-scale problems and thus limits its usefulness.

2. I do not see that the problem in line 149 is always convex with respect to W. This requires further elaboration.

3. In the introduction, some effort is put into describing the issue of saddle points. However, this manuscript does not handle that problem either. I suggest the authors remove the related discussion.

4. The format of paper and equation references is non-standard. Usually [1] is used for a paper reference and (1) for an equation reference.

5. The comparison with Caffe solvers in terms of time is meaningless, as the implementation platforms are totally different. I also do not understand the sentence "using MATLAB still run significantly faster", especially the word "still", as MATLAB should be expected to be faster than Python.

6. The comparison of the objective values in 3(a) is comparing apples to oranges, as the two methods are solving different problems.

=== After feedback ===

I did notice that in line 100 the authors mention that U, V, W are compact sets, but as we see in line 124, U is set to be the nonnegative half-space, which is clearly unbounded and thus non-compact. So while it is correct that, when U is compact, there is at least one limit point and the convergence analysis may hold (see details below), this is not the case for problem (4) or (5).

Regarding the proof, I do not agree with the notion of "the" limit point when t = \infty; I do not think such a thing exists in general. As stated in my original review, there may be several limit points, and by definition all limit points are associated with t = \infty, just along different subsequences. The authors said in the rebuttal that because their algorithm converges, "the" limit point exists, but this uses a result proved under the assumption in order to establish the assumption itself; I do not think that is a valid proof. In addition, in the most extreme case, suppose the sets U, V, W each consist of a single point, say u, v, w, and the gradient at (u, v, w) is not zero. In this case there is one and only one limit point, (u, v, w), but it is clearly not a stationary point, as the gradient is non-zero. Therefore, I will keep the same score.
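The non-unique-limit-point counterexample raised in point B can be checked numerically; a minimal sketch (the function name `x` is mine, for illustration only):

```python
# The sequence x_n = (-1)^n - 1/n is bounded but has two distinct limit
# points, +1 (along even n) and -1 (along odd n), so "the" limit point
# of a bounded sequence need not be well defined.
def x(n):
    return (-1) ** n - 1 / n

even_tail = [x(n) for n in range(2, 10_001, 2)][-1]  # subsequence -> +1
odd_tail = [x(n) for n in range(1, 10_001, 2)][-1]   # subsequence -> -1
```

The even-indexed subsequence approaches +1 and the odd-indexed one approaches -1, so both values are limit points of the full sequence even though the sequence itself does not converge.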