Paper ID:1089
Title:QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models
Current Reviews

Submitted by Assigned_Reviewer_15

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: The authors propose a new Newton-like method to optimize the sum of a smooth (convex) cost function and multiple decomposable norms. Their contributions are (1) an active subspace selection procedure that speeds up the solution of the quadratic approximation problem, and (2) a proof that solving the quadratic approximation problem over the (changing) active subspace still leads to convergence. The authors also provide numerical results showing that, for two important problems, their method gives a 10x speedup over state-of-the-art methods, and, in the appendix, they give numerical results illustrating which fraction of the speedup is due to the quadratic approximation technique and which fraction is due to the active subspace selection method.
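For concreteness, and in generic notation that may differ in its exact form from the paper's eq. (1), the objective being minimized is of the dirty-model type
$$\min_{\theta^{(1)},\dots,\theta^{(k)}} \; \mathcal{L}\Big(\sum_{r=1}^{k}\theta^{(r)}\Big) + \sum_{r=1}^{k}\lambda_r \big\|\theta^{(r)}\big\|_{\mathcal{A}_r},$$
with a smooth loss $\mathcal{L}$ and decomposable norms $\|\cdot\|_{\mathcal{A}_r}$; the quadratic approximation is applied to $\mathcal{L}$, and the inner solve is restricted to the active subspace.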

Quality: The amount of critical information in the appendix makes this paper more suited for a journal than a conference.
Although I did not check all of the proofs, the ones I did check are correct and well written (apart from a few minor typos).

Clarity: The paper is very well written and organized. A few minor suggestions and questions follow. The references could be numbered in order. Including a curly bracket inside the $\min$ in eq. (1) would make it clearer. In line 114, should the subscript of $\|x\|$ not be $(1,\alpha)$, similar to what is written in line 116? In Prop. 1, line 236, it would be helpful to clarify that the orthogonal subspace is a function of $\theta$. In line 259, the over-bar on top of $Q$ seems to be notation that was not introduced. In line 266, it would be helpful to clarify that $D$ does not need to be diagonal. Line 299 refers to eq. 11, while Th. 1, line 303, refers to eq. 8; the appendix also refers to eq. 8. This is confusing, since in eq. 11 the optimization is constrained and in eq. 8 it is not. It would help to clarify that it is because the problem is quadratic that optimizing over the "free" subspace is equivalent to optimizing over the whole space. In line 308, "gives" should be "give". In line 354, it could be useful to remind the reader again that $\Delta_D$ does not need to be diagonal. In Fig. 1, the authors should explain what the percentages mean as well as what the times mean.

Originality: Although there are ideas in the literature related to this work (and the authors clearly point these out), their specific active selection method is new, and so is their proof of convergence.

Significance: The 10x speed up is very encouraging and I am certain that the algorithm in this paper will be significant for several practitioners.
Q2: Please summarize your review in 1-2 sentences
A well-written paper with significant theoretical and practical contributions. Unfortunately, the amount of critical information in the appendix makes this paper more suited for a journal (so that reviewers can read all the proofs) than for a conference with an 8-page limit on the part of the paper that reviewers are supposed to review.

Submitted by Assigned_Reviewer_32

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses the question of estimating so-called dirty models, in which the parameter is a sum of "simple parameters" whose complexities are individually regularized. The paper proposes a subspace selection strategy that makes quadratic approximation feasible. The theoretical development builds on the notion of decomposable norms.

I have to confess that I have not followed this line of work very closely, so my comments are necessarily those of a generally knowledgeable person in machine learning and model selection.
It appears that the work is a neat combination of ideas that have been published before, such as minimizing the nuclear norm via active subspace selection (ICML 2014) and using quadratic approximation for sparse inverse covariance estimation (NIPS 2011), but the work is now carried out in the more general setting of decomposable norms. To me, this appears elegant, but some questions remain.

The algorithmic achievement of a 10-fold speedup does not sound very impressive; such factors are often achieved simply by optimizing the implementation, so the question becomes how easy it is to optimize (say, parallelize) the most demanding parts of the algorithm. To a very practically oriented person the achievement may not appear great.

The presentation of the theory is in general rather clear, even if somewhat compact, possibly due to the page limit. The writing is not impeccable: some of the language errors are distracting to the point of hampering comprehension. The use of citations as parts of sentences is generally discouraged and creates new grammatical problems: should we write "[7, 5] consider the estimation ..." or "[7, 5] considers the estimation ..."? Sometimes poor punctuation creates problems, as in "[14] in turn use a superposition ..." or "we consider instead a proximal ...".
At worst the sentences are ungrammatical to the point of being incomprehensible, like the sentence "Overall, our algorithmic ..." on page 2.

Writing formulas inside the text creates problems of its own, and sometimes the result is awkward: too often we see a line ending in an equals sign (say, 0 = \newline). On page 7, a formula breaches the margin, as do Figure 1 and Table 1 on page 8.

In general, it is customary to have some kind of conclusion or discussion in a paper; here there is none. For example, a discussion of future directions would be welcome, since it is not clear to me where to go from here.

In Algorithm 1, line 7, the superscript (t) should probably be (r), and updating the sum on line 8 looks strange (one should update variables rather than expressions).

Q2: Please summarize your review in 1-2 sentences
The theoretical development of quadratic approximation for dirty models with decomposable regularizers appears interesting, but it is not clear to me how big a step this is in practice. To some extent, the work appears to me to be the end of a branch of development rather than an interesting new opening. The editing is not quite good enough.

Submitted by Assigned_Reviewer_33

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors extend the results in [11] and [12] to more general settings. The proposed computational framework works for
(a) strongly convex and twice-differentiable loss functions, and
(b) decomposable regularization functions.
The theoretical analysis sounds reasonable, and the numerical results are also convincing. (b) is natural for sparse learning problems, but (a) seems a little restrictive, since many sparse learning problems are formulated as non-strongly convex problems.
Here are my two concerns:
(1) The assumption in Eq. (16) seems strong. Could the authors provide some examples?
(2) Block coordinate descent algorithms are also competitive for solving the optimization problem in Section 5.2. Could the authors provide some detailed comparison?

Response to the rebuttal: If we also consider the statistical error, then we usually do not need a "too accurate" solution. To compare the proposed algorithm with first-order block coordinate descent algorithms more comprehensively, the authors may need to conduct more experiments, e.g., timing vs. classification error in multi-task learning. My conjecture is that the local quadratic convergence might not be so competitive in terms of statistical error. But I agree that it would not hurt to get a "very accurate" solution if the computational resources are affordable.
Q2: Please summarize your review in 1-2 sentences
Overall, I am positive about the quality of this paper. I would like to see it appear at NIPS.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful comments and suggestions.

Reviewer_15:

1. Thanks for the suggestions. We will fix the suggested typos in the final version.

2. The percentages in Table 1 are error rates, and the time is for solving the optimization problems.

Reviewer_32:

1. Regarding the significance of the 10x speedup:

For the graphical model selection problem, we compare with the implementation in [16], where the code is already optimized using the LAPACK and BLAS libraries. For the multi-task learning problem, we implemented all the algorithms ourselves within the same framework, so the comparison is fair in that it is against reasonably optimized solvers.
For the graphical lasso problem there are at least 20 different solvers; all of the efficient ones use active variable selection, and the top two additionally use block coordinate descent. Based on this evidence for the graphical lasso problem, and also for the matrix completion problem, there is good reason to believe that our approach is very competitive in terms of algorithmic efficiency.

It is possible that practitioners can further speed up the code by tweaking the implementation (e.g., via parallel and distributed computation); however, such tweaks are complementary, since they can be applied on top of our algorithmic innovations. For example, if we want to speed up the algorithm using multiple cores:
Steps 3, 4, and 8 of our algorithm (Alg. 1) are embarrassingly parallel. The main computation in Step 9 is evaluating the objective function value, which also appears in most of the competing algorithms and can usually be parallelized. The subproblem in Step 7 is a quadratic optimization problem with one structured regularizer (trace norm, L1 norm, etc.), to which parallel coordinate descent algorithms [1, 2] or proximal gradient methods can be applied; a minimal sketch of such a proximal gradient solver follows.
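As an illustration only (this is not the authors' implementation; the Hessian approximation H, gradient g, current iterate x0, and regularization weight lam are generic placeholders), a minimal proximal gradient (ISTA) sketch for a Step-7-style subproblem with an L1 regularizer could look like:

# Minimal sketch: minimize 0.5*d'H d + g'd + lam*||x0 + d||_1 over d
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_quadratic_l1(H, g, x0, lam, n_iter=200):
    d = np.zeros_like(x0)
    step = 1.0 / np.linalg.norm(H, 2)   # 1/L, with L the largest eigenvalue of the PSD matrix H
    for _ in range(n_iter):
        grad = H @ d + g                # gradient of the quadratic part
        # proximal step applied to the shifted variable x0 + d
        d = soft_threshold(x0 + d - step * grad, step * lam) - x0
    return d

For a block with a trace-norm regularizer, the soft-thresholding step would be replaced by the corresponding proximal operator (singular-value thresholding), and the inner updates can be parallelized as in [1, 2].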


2. We will fix the suggested typos. We will also add a brief conclusion in the final submission; we would actually like to point to some interesting future directions. For example, developing an efficient Quic & Dirty algorithm on a distributed system would be of great practical use. Deriving performance guarantees and optimization approaches in the online setting is also interesting.

Reviewer_33:

1. For the L1-regularization case, eq. (16) means there does not exist an index i such that $|\nabla_i f(x^*)| = \lambda$, an event that happens with very low probability. We have checked our datasets, and none of them violates this assumption. A similar condition is also used for optimization problems with bound constraints (see Assumption 2 in [3], used in the convergence analysis of SMO-type methods for SVMs).
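Restating in generic notation (and noting that coordinates in the support of $x^*$ always satisfy $|\nabla_i f(x^*)| = \lambda$ at optimality), the condition amounts to strict dual feasibility on the zero coordinates of the solution:
$$|\nabla_i f(x^*)| < \lambda \quad \text{for all } i \text{ with } x^*_i = 0.$$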

2. The comparison of block coordinate descent and our algorithm for the GMRF learning problem is in Figure 2 of the supplementary material. Note that the block coordinate descent approach in Figure 2 is not a trivial implementation -- we also apply our active set selection, and we had to develop algorithms for solving the subproblems (they are no longer quadratic). Please see the details in Section 6.7. Without the active set selection, block coordinate descent would be much slower than our algorithm in Figure 2.

We have not developed a block coordinate descent algorithm for multi-task problems, but we believe that if we apply active set selection and develop efficient subproblem solvers, the comparison with Quic & Dirty will be similar to Figure 2 -- the methods should be competitive in the early optimization stage, but the convergence speed of block coordinate descent is slower than that of Quic & Dirty (since Quic & Dirty can achieve superlinear convergence, while the convergence rate of block coordinate descent is at most linear).


References:

[1] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel Coordinate Descent for L1-regularized Loss Minimization. ICML 2011.

[2] J. Liu, S. J. Wright, C. Re, and V. Bittorf. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. ICML 2014.

[3] P.-H. Chen, R.-E. Fan, and C.-J. Lin. A Study on SMO-type Decomposition Methods for Support Vector Machines. IEEE TNN, 2006.