Paper ID: 557
Title: A Comparative Framework for Preconditioned Lasso Algorithms
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
QUALITY:
This paper shows some theoretical results concerning preconditioned LASSO algorithms. In these procedures, the data are linearly projected or transformed into a subspace before the usual LASSO variable selection procedure is applied to the transformed variables. The paper sheds light on the advantages and disadvantages of doing so. However, there is no practical application showing how to apply the methodology.


ORIGINALITY:
The main result is that by appropriately transforming the data, the success of the LASSO depends less on the choice of the penalty parameter. That is, there is a wider range of penalty parameter values that give rise to correct variable selection. The authors also show that a bad choice of preconditioning may lead to the failure of the LASSO. Their theoretical results allow the authors to suggest a new preconditioning method. The authors corroborate their theoretical results with some simulations. They also suggest a probabilistic framework for future study of preconditioned LASSO methods.


CLARITY:
The paper is a bit difficult to follow without reading back and forth. Figure 2 is unintelligible: the axes and the legend cannot be read. The authors should enlarge the figures. What is the meaning of [[ eta_j > 0 ]] in the formulas for the thresholds?


SIGNIFICANCE:
The results are interesting, but since the bounds found in the theoretical results are random and depend on unknown quantities, it is not clear how, or even whether, they can be used in practice.
Q2: Please summarize your review in 1-2 sentences
The paper shows that preconditioning LASSO may improve variable selection in the sense that the choice of the LASSO penalty parameter becomes less critical. The result is interesting, but it is not clear how it will improve variable selection in practice.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper, the authors analyze a particular class of lasso variants which they call preconditioned lasso algorithms. In these algorithms, both the design matrix and the response are pre-multiplied by some arbitrary matrices. Some specific choices of these matrices are claimed to improve the feature selection capacity of the lasso, but these claims have so far been insufficiently verified. The main challenge to theoretical analysis here is that the tuning parameters for the preconditioned setup and the "vanilla" setup may no longer be easy to match, and so it is a challenge to show results for particular choices of the tuning parameter \lambda.

The authors focus on the signed support recovery properties of the lasso for linear regression, that is, whether the lasso can correctly recover, for each feature, whether its linear coefficient is positive, negative, or identically zero. They study two preconditioning techniques that are based on singular value decompositions of the design matrix. The main tool presented here is to find conditions under which signed support recovery is possible and to compare these conditions in the preconditioned and non-preconditioned settings. Specifically, they prove two theorems relating the bounds on the penalty parameter values that allow signed support recovery under the vanilla and preconditioned techniques. Following this, the authors prove a third theorem for a third preconditioning technique, after an exposition of a generative lasso model based on the SVD.

Preconditioned lasso problems are not a major topic of research, so the significance of this paper relies on it providing a major reason to use preconditioning. The authors further limit themselves to linear regression lasso, and to support recovery. The technique of bounding the penalty term is original, but its usefulness is unclear from the text.

The paper is fairly readable, but the main results and points are not immediately clear. The authors should state more clearly how and why their theorems may or may not demonstrate signed support recovery. Further, it is not clear how preconditioning affects the difficulty of the signed support recovery problem (and whether this is, in effect, what their result shows anyway?). In any case, the magnitude of the performance gain (if any) is left ambiguous. The reader should be able to understand more clearly what properties of the three preconditioned estimators have been exposed. Again, as preconditioned approaches are not standard and their advantages are poorly defined, the theoretical results of this paper seem insufficient to justify their use and thereby elevate the paper's impact.

EDIT after rebuttal: in light of the discussions following the rebuttal period, I am more convinced that the results in this paper are useful and novel. I still feel overall that the work addresses a somewhat obscure topic, but I appreciate the qualities of the work more fully after further reading. My main issues are reasonable topics for future work and should not block this paper. Specifically: the use of ratios of penalty parameters outside of the degenerate cases (infeasible bounds, infinite or null bounds) should be investigated (is there any gain from a wider "correct" interval, even when scaled by the maximal penalty?). Performance issues beyond the existence of a suitable penalty parameter should also be discussed.
Q2: Please summarize your review in 1-2 sentences
This paper focuses on the analysis of preconditioned lasso problems, a rather obscure set of techniques whose advantages are not well understood. The authors claim to contribute to clearing this up, but their results are difficult to interpret clearly.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:

This paper introduces and develops an interesting framework for comparing various methods for preconditioning the Lasso. Their framework gets around the difficulty of deciding which tuning parameter value to examine by asking which method would be easiest to tune in finite samples (as measured by how large the range of good parameter values is for each method).

Quality & Originality:

This paper addresses a well-motivated problem in a precise and complete (given the page constraints) way. Their framework makes intuitive sense and then they show how it may be fruitfully applied to understanding the preconditioned lasso(s). The work is thoughtful and original. The authors provide empirical support in Figures 1 and 2 for their theoretical findings, supporting that this work was conducted in a thorough, careful manner.

Significance:

The authors demonstrate how their framework allows them to make precise, insightful statements about three existing approaches to preconditioning the lasso. This suggests that the framework is quite significant, especially since it is easy to see that it may be used even in more general problems than preconditioning.

Clarity:

This paper's writing was excellent. The introduction describes the problem extremely well. Proofs (which, of course, are relegated to supplementary materials) are described at a high level in the text, a practice that really helps the reader understand what is happening (whereas many papers, unfortunately, hide away the main ideas, leaving the reader only aware of the theorem statements but with no clue where they came from). The manuscript was clearly very carefully proofread! This reviewer wishes more papers he encountered were written this well.

Comments:
- Lemma 1: can you give some intuition for what goes wrong when lambda is either too small or too big? In particular, can you say why the expressions in (10) have the forms they do in the context of this intuition?
- Line 343: "on" should be "in"?
Q2: Please summarize your review in 1-2 sentences
An interesting theoretical framework is developed for comparing various preconditioned lasso proposals. The authors, in an extremely well-written and carefully conducted paper, demonstrate that their framework can lead to a better understanding of how these various proposals compare to the lasso and to each other.

Submitted by Assigned_Reviewer_10

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses preconditioned lasso estimators and studies their utility within a non-asymptotic theoretical framework. Their main result is a set of bounds (lower and upper) on lambda that are necessary and sufficient for exact support recovery---they then apply this result to study the effects of different preconditioning methods. I like the general perspective of the paper (though I think the authors could have given better motivation for their approach). The work is generally well written and, as far as I can tell, novel. I do have two major comments, given below. After that, I give several more minor comments.

==========
Major comments:

- Lemma 1 seems like a novel and interesting result. But at the same time, I'm suspicious of the "if and only if" here. For this result to be true, it would need to hold that \lambda_l and \lambda_u, as you've defined them, are knots of the lasso solution path. From the LARS (homotopy) algorithm for the lasso path, we know that the signed support of the lasso solution is piecewise constant and only changes at data-dependent knots \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_r. Hence if either \lambda_l or \lambda_u did not coincide with one of these knots, then we could slide it to the closest knot (to the left for \lambda_l, and to the right for \lambda_u), and we would know that the constructed support would not have changed.

Generally, the formulae for these knots, \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_r, are complicated since they depend on the constructed support \hat{S} and the signs of the lasso solution along the solution path. You've written down formulae for \lambda_l and \lambda_u that depend on S, not \hat{S}, so at a high level I don't understand how this can be right.

But at a lower level, skimming through the proof of your Lemma 1 in the supplementary material, I don't see any mistakes. Some comments/clarification on my concerns here would be very helpful.

UPDATE: I now believe that this result is correct. Given the assumed conditions, we have \hat{S}=S (perfect recovery) so my initial intuitive concerns are remedied. Furthermore, what you have written down are actually knots in the lasso path, i.e., these can be shown to coincide with "hitting times", the values of lambda at which variables enter the model.
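To illustrate the knot structure invoked above, here is a minimal numerical sketch (synthetic data; scikit-learn's lars_path, whose alphas are reported on a lambda/n scale, so the absolute values need not match the paper's): the signed support of the lasso solution is piecewise constant and changes only at the data-dependent knots.

```python
# Minimal sketch: the signed support along the lasso path is piecewise constant
# and changes only at the knots returned by LARS. Synthetic data; only the
# structure (not the numerical values) is the point here.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 50, 10
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                      # true signed support on the first 3 features
X = rng.standard_normal((n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)

alphas, _, coefs = lars_path(X, y, method="lasso")   # knots and coefficients at each knot
for lam, coef in zip(alphas, coefs.T):
    print(f"knot lambda/n = {lam:8.4f}   signed support = {np.sign(coef).astype(int)}")
# Any interval of penalties yielding the true signed support must therefore be
# delimited by knots, which is the point the UPDATE above confirms for (10).
```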

- Section 3.2: It seems that the ratio of the upper and lower lambda bounds, used to compare the usual and preconditioned lasso estimates, is not the right metric---instead you would want to compare their relative widths: ( \lambda_u - \lambda_l ) / \lambda_max, where \lambda_max is the smallest value of \lambda for which all components of the estimate are zero. This has a closed form: \lambda_max = \|X^T y\|_\infty, i.e., the largest absolute component of X^T y. Clearly \lambda_max is going to be different between the two problems (as X becomes P_X X and y becomes P_y y for the preconditioned lasso).

A simple example shows that using the ratios can give somewhat undesirable answers: if \lambda_u = 4 and \lambda_l = 2 for the lasso, and \lambda_u = 2 and \lambda_l = 1 for the preconditioned lasso, then your metric would say they perform about the same. But if in this case \lambda_max = 10 for the lasso and \lambda_max = 3 for the preconditioned lasso, then the story is very different: with respect to the full range of effective regularization parameter values, only 1/5 of this range recovers the support for the lasso, and 1/3 of it recovers the support for the preconditioned lasso.
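For concreteness, a small sketch of the suggested relative-width metric (assuming the standard objective 0.5*||y - X b||_2^2 + lambda*||b||_1, and taking lambda_l and lambda_u as given from the paper's Lemma 1, which is not reproduced here), applied to the numbers in the example above:

```python
# Minimal sketch of the proposed relative-width metric, assuming the lasso
# objective 0.5 * ||y - X b||_2^2 + lambda * ||b||_1.
import numpy as np

def lambda_max(X, y):
    """Smallest lambda for which the lasso solution is identically zero."""
    return np.max(np.abs(X.T @ y))

def relative_width(lam_l, lam_u, lam_max):
    """Fraction of the effective range [0, lambda_max] that recovers the support."""
    return (lam_u - lam_l) / lam_max

# Numbers from the example above:
print(relative_width(2.0, 4.0, 10.0))   # lasso:                0.20
print(relative_width(1.0, 2.0, 3.0))    # preconditioned lasso: 0.33
# The ratio metric lambda_u / lambda_l equals 2 in both cases and would call this a tie.
```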

==========
Minor comments:

- Page 1, line 53: I think you mean sufficient (not necessary) for support recovery

- Page 2, line 80: I'm not really sure what you mean by "asymptotic analyses ... provide limited theoretical insight". Of course, an asymptotic analysis is entirely theoretical, and so yes, it does provide such insight. You mean something like satisfactory insight?
- In motivating your approach on page 2, it seems to me that your method is actually closely aligned with comparing the entire regularization paths of (2) and (4), as done empirically in reference [4]. You are basically answering the question: "does the regularization path contain the true model at any point along the path?" But you are doing this in an efficient way by deriving upper and lower bounds on lambda.

- Page 3, line 119: It seems strange to take the q *smallest* singular vectors of U (by this, I assume you mean the singular vectors corresponding to the smallest singular values). Can you comment on this? This is especially relevant in Theorem 1, where you impose the condition span(U_S) \subseteq span(U_A). With this construction of A, we would probably not at all expect the sets S and A to be close, so this condition seems quite unreasonable ...

- I'm not sure why you don't present the deterministic result for JR (only one based on a generative model)? For completeness in comparing the preconditioning methods, this would have been helpful.

- Equation (21) is confusing: I think I know what you mean here, but this equation is somewhat confusing without seeing context for \hat{\Sigma} (just calling \hat{\Sigma} "unconstrained spectra" is not that informative) ...

- I'm confused about the setup in 5.3. Why have you changed the number of observations to m, and what does it mean for m \rightarrow \infty while n stays fixed?
Q2: Please summarize your review in 1-2 sentences
The authors study preconditioning in lasso estimators. They derive lower and upper bounds on lambda that ensure exact support recovery in the lasso estimator, and use these bounds to evaluate different preconditioning methods. I generally liked the ideas and approaches in the paper, but I had some fairly major issues with it (described in detail in the comments section).
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their comments and will make the necessary
changes to our paper. We answer some specific questions below.

R8

We apologize for not explaining this in greater depth in the main text, but variable selection under the \ell_0 norm is a classical NP-hard problem whose solution is desperately needed in all of the data-driven
sciences. To date, the Lasso is the most computationally tractable
approximation to this goal. Correlated variables are a perennial
problem for the Lasso and frequently lead to systematic failures. This
deficiency has motivated the use of Preconditioned Lasso
algorithms. Prior to our work, the performance gain of Preconditioned
Lasso algorithms could be evaluated only empirically or
asymptotically. Our theory is useful in making these comparisons on
finite datasets. Furthermore, we view our framework as a stepping
stone towards building improved (possibly iterative) algorithms for
approximating the true \ell_0 solution.

R5

The Preconditioned Lasso algorithms we analyzed (HJ, PBHT, JR) were
already applied to a number of practical applications by the original
authors. Our goal was to develop and apply a new framework in order to
theoretically compare them. We demonstrate this framework in Theorems
1--4. We apologize for not fully explaining our notation; the double square brackets denote an indicator (Iverson bracket) that equals 1 if and only if the statement inside the brackets is true, and 0 otherwise. In keeping with other theoretical developments (e.g. [1]), our theory depends on certain unknown quantities. Despite this, our theory gives intuition into when
algorithms succeed or fail and allows us to suggest changes to
existing preconditioning algorithms.

R9

We appreciate the reviewer's positive comments that our framework can
lead to a better understanding of Preconditioned Lasso variants. We
also hope that the theory can act as a useful tool for developing more
advanced algorithms. We will augment the paper and build more
intuition for the interpretation of Lemma 1.

R10

Lemma 1 of [10] gives necessary and sufficient conditions to show that
an arbitrary vector \hat \beta (which has support \hat S) is optimal,
given a particular \lambda. The PDW (primal-dual witness) construction used for Lemma 2 of
[10] in effect checks if any vector with the true signed support could
be a solution to the Lasso problem. To do this, Lemma 2 constructs a
potential solution \hat \beta with correct support S and then checks
if this vector is a unique, optimal solution for the choice of \lambda
and if it also has the correct signed support. If it does, then
running Lasso under suitable conditions with the same \lambda would
recover the true signed support. With this logic, the Lemma does not
need to mention \hat{S} to certify signed support recovery.
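For concreteness, here is a minimal sketch of a check in this primal-dual-witness style (assumptions: the objective 0.5*||y - X b||_2^2 + lambda*||b||_1 and the standard PDW recipe; this is not the exact statement or proof of Lemma 2 of [10]):

```python
# Minimal PDW-style sketch: build a candidate solution supported on the true S,
# then verify the KKT conditions (stationarity on S, strict dual feasibility
# off S) and the correct signs for a given lambda.
import numpy as np

def pdw_check(X, y, S, sign_S, lam):
    """Return True if the candidate restricted solution certifies signed support recovery."""
    X_S = X[:, S]
    X_Sc = np.delete(X, S, axis=1)
    # Stationarity on S:  X_S^T (y - X_S beta_S) = lam * sign_S
    beta_S = np.linalg.solve(X_S.T @ X_S, X_S.T @ y - lam * sign_S)
    resid = y - X_S @ beta_S
    dual_feasible = np.all(np.abs(X_Sc.T @ resid) < lam)   # strict feasibility off S
    signs_correct = np.all(np.sign(beta_S) == sign_S)      # correct signed support on S
    return bool(dual_feasible and signs_correct)

# If the check succeeds for a given lambda, running the lasso at that lambda
# (under suitable uniqueness conditions) recovers the true signed support.
```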

Other comparison metrics involving \lambda_l, \lambda_u could be
considered, and the reviewer's suggestion could be a useful
addition. However, the algebraic degeneracy outlined by the reviewer
would require relatively specific dependence of \lambda_{max} on
\lambda_u, \lambda_l. We feel that the ratio statistic is an intuitive
and useful first step when focussing on signed support
recovery. Future work will consider the alternatives in greater depth.

Section 1.1 in the supplementary material shows that HJ can be
rewritten as a Preconditioned Lasso algorithm that uses the q left
singular vectors with smallest singular values. Huang and Jojic
proposed their algorithm HJ as a technique to avoid selecting highly
correlated proxy variables that could be introduced by certain latent
variable arrangements. They argue that in these cases removing a
small number of high-energy directions of X can be useful (see Section
4 of [4]). Theorem 1 shows that HJ is useful for improving signed
support recovery of Lasso if U_A is properly estimated. Future work will need to investigate theoretically under what practical circumstances this can be expected to be case.

By "unconstrained" spectrum matrices we mean unnormalized spectrum
matrices. That is, the matrices \hat \Sigma_S, \hat \Sigma_{S^c} can
have arbitrary positive elements on the diagonal.

In 5.3 we consider what happens when we replace the orthogonal left
basis U of Eq. (20) with a general Gaussian basis W in Eq. (26). While
a direct comparison of this model with the generative model (20)
cannot be made, we show that if the number m of rows in W grows with
the noise variance, the induced ratio of upper and lower penalty
parameter bounds for (26) can be compared with that of the generative
model (20). To do this, we show in the supplementary material (Section
4.4) that if m is large, the columns of W are almost orthogonal
(i.e. (1/m) W' W -> I, see Eqs. (155), (156)) and that for large m,
the inner product (1/m) W' w^m (where w^m is a noise vector) behaves
as a Gaussian vector due to the Central Limit Theorem (see Lemma
3). For both of these limits we need m -> infinity, while n is fixed.
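As a numerical illustration of these two limits (synthetic data; not the authors' code), the sketch below shows that (1/m) W'W approaches the identity and that (1/m) W'w shrinks toward zero like a Gaussian vector as m grows while n stays fixed:

```python
# Numerical illustration: for a Gaussian W with m rows and n columns,
# (1/m) W'W -> I, and (1/m) W'w (w an independent noise vector) behaves like a
# Gaussian vector whose norm shrinks as m grows, with n held fixed.
import numpy as np

rng = np.random.default_rng(0)
n = 5
for m in (100, 10_000, 1_000_000):
    W = rng.standard_normal((m, n))
    w = rng.standard_normal(m)                    # noise vector of length m
    gram_err = np.linalg.norm(W.T @ W / m - np.eye(n))
    proj = W.T @ w / m                            # approximately N(0, I/m) by the CLT
    print(f"m = {m:9d}   ||W'W/m - I||_F = {gram_err:.4f}   ||W'w/m||_2 = {np.linalg.norm(proj):.4f}")
```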