NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:958
Title:Fast Sparse Group Lasso

Reviewer 1

In this work the author(s) focus on block coordinate descent as a method for optimizing the block-sparse version of LASSO. The main contribution is to reduce the computational cost of the iterative thresholding performed when optimizing over each block (dissected as the groups). This reduction is done via bounds on the thresholding decision boundary that are cheaper to compute, but indicate when a block is confidently not zero, or confidently all zero. I have a number of questions about the method as outlined below: 1) This method seems to only be applicable to disjoint groups of coefficients. Many applications require overlapping groups (e.g., Group sparse optimization by alternating direction method by Deng, Yin, and Zhang). Does this method trivially extend? It does not seem to, at least to me. 2) This method focuses only on increasing the efficiency of the block-coordinate descent. Other methods, such as iterative thresholding for group-sparse inference, which does not use a full internal block iteration and instead only performs one step of thresholding per block, can be used instead (e.g., Iterative Thresholding for Sparse Approximations by Blumensath and Davies and others since then). Do any of the benefits transfer? Depending on the number of iterations it takes to compute, these methods could be comparable, or be much slower. It would be useful for the authors to compare to these other methods Minor comments: Line 60: Solution \hat{\beta} is obtained --> The solution \hat{\beta} is obtained Line 122: this norm is broken up over 2 lines: ||K^(g,l)[i,:]||_2

Reviewer 2

Originality: this paper proposes a new bound on the screening criterion that help to reduce the computation of the group Lasso. Combined with block coordinate descent, experiments show a drastic improvement compare to recent screening methods. Quality: the mathematical analysis relies on previous results on screening methods. Though no sketches are given for the most important lemmas and theorem (they all are in the supplementary material), the guiding line is clear and motivated. Clarity: the paper is easy to read. All the idea are motivated and the algorithm come easily. Significance: as group Lasso is a very used and useful methods when dealing with high dimensional data and small dataset, such work is significant as it propose a pretty fast method.

Reviewer 3

Summary: This paper presents a fast block coordinate descent algorithm for the sparse-group lasso problem. Two strategies are proposed to improve the computational efficiency. The first strategy is quickly identifying the groups of inactive features. The idea is to use an easy-to-compute upper bound when checking if the inactive-group condition holds. The second strategy is to select a set of candidate groups and update the feature vectors inside those groups first before iterating over all groups. Empirical comparisons are conducted to show that the proposed method has faster running time than other methods. Strength: Overall the paper is well-written and easy to follow. The proposed algorithm seems new and promising. Weakness: The proof of Theorem 2 is not very clear to me. Theorem 2 says that the proposed algorithm converges to a global minimum if using the same sequence of regularization constants as the original BCD algorithm. Its proof (shown in the appendix) simply uses the fact that the proposed algorithm safely skips the groups that must be zeros, and that each group is updated at least once in every loop. This proof seems to ignore a major difference in the group updating order between the proposed algorithm and the original BCD algorithm: the proposed algorithm first updates the candidate groups and then iterates over all groups, while the original BCD algorithm directly iterates over all groups. Note that the candidate group set does not contain ALL groups that must be non-zeros (it only contains a subset of the those groups, see my detailed comments below). Because the difference in the group updating order may cause a difference in the convergence behavior, the proof of Theorem 2 needs to justify that the proposed algorithm can still converge if the same sequence of regularization constants is used as in the original BCD algorithm. Besides the convergence property in Theorem 2, this paper does not give a convergence rate of the proposed algorithm, and compare that with the original BCD algorithm. One related question: the statement of Lemma 4 is confusing to me. Does Lemma 4 mean that the candidate group set contains *ALL* groups whose parameters must be nonzeros? However, this is not true because it can happen that a nonzero group (i.e., condition (4) does not hold) is *NOT* in the candidate set (i.e., condition (9) does not hold). Therefore, the candidate group set only contains a subset of the nonzero groups. -------------After reading the rebuttal-------------- The authors have addressed my concerns on the proof of Theorem 2 and the statement of Lemma 4. The analysis of the convergence rate is probably hard because of the bias in the update rule. The techniques are practical and I am happy to increase my score.