NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 198
Title: Optimal Sparse Linear Encoders and Sparse PCA

Reviewer 1

Summary

The paper considers the problem of designing "sparse linear encoders", i.e., a linear transformation that obtains a low-dimensional representation of the input data preserving as much information as possible while satisfying additional sparsity constraints. In particular, consider a d-by-k matrix H whose columns have at most r nonzero entries. Every d-dimensional data point x is represented by a k-dimensional (encoded) vector z = H'x. The matrix H is the "encoder". For a given H, we can also specify a "decoder" G that maps points from k dimensions back to the original d-dimensional space while minimizing the loss over all data points: ||X - XHG||_F (points are rows of X). The question is then how to design the optimal sparse encoder H. The main result is an algorithm for designing a sparse linear encoder H, using an existing algorithm for the Column Subset Selection (CSS) problem as a black box. The authors establish a reduction and give approximation guarantees for the linear encoding problem based on the guarantees of the CSS algorithm. They also propose an alternative approach based on the same algorithm, but computing the components (columns) of the encoder one by one, and they provide the first theoretical analysis (i.e., approximation guarantees) for this case. They conclude with experiments comparing their approach against existing sparse PCA approaches, which focus on the objective of maximizing variance, and show that the two objectives can result in different linear encoders, which is interesting.
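For concreteness, here is a minimal sketch of the objective just described (my own NumPy illustration, not code from the paper; the helper reconstruction_error and the toy data are my additions):

    # Sketch of the loss being minimized: given data X (n x d, points as rows)
    # and a column-sparse encoder H (d x k), the best decoder G is a
    # least-squares fit, and the quantity of interest is ||X - X H G||_F.
    import numpy as np

    def reconstruction_error(X, H):
        Z = X @ H                      # n x k encodings, z_i = H' x_i
        G = np.linalg.pinv(Z) @ X      # optimal decoder for this fixed H
        return np.linalg.norm(X - Z @ G, 'fro')

    # Toy example: an encoder supported on the first r features only.
    rng = np.random.default_rng(0)
    n, d, k, r = 100, 20, 3, 5
    X = rng.standard_normal((n, d))
    H = np.zeros((d, k))
    H[:r, :] = rng.standard_normal((r, k))   # each column has at most r nonzeros
    print(reconstruction_error(X, H))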

Qualitative Assessment

The paper has an interesting theoretical contribution. The objective of obtaining a low-dimensional representation of the data that minimizes a Frobenius representation error is common. The paper draws connections between this problem and sparse PCA, showing that it can in fact be seen as a generalization of the latter. The authors also have a useful discussion explaining how the above "loss minimization" objective differs from (and is perhaps of greater interest to the ML community than) the "explained variance maximization" viewpoint for PCA, even though the two objectives are equivalent in the case of PCA.

As mentioned above, the authors propose an algorithm for designing a sparse linear encoder utilizing a reduction to the CSS problem and using an existing CSS algorithm as a black box. The obtained linear encoder has the restriction (or property) that all k dimensions (columns of H) have the same r nonzero coordinates. The theoretical guarantees are interesting. The main theorem (Thm 7) gives a multiplicative-factor guarantee for the expected achieved error, with a factor that depends on the choice of the sparsity parameter r (a paraphrase of its shape appears after the minor comments below). For instance, the theorem implies that for sparsity r = Omega(k/epsilon), the linear encoder achieves a low-dimensional representation with error comparable to the best (unconstrained) rank-k approximation. The lower bound in Thm 8 also nicely supplements the previous result. The subsequent segment sketching the proof of Thm 8 could be improved; among other things, what are B and Y in Lemma 9?

In Section 3, the authors describe how their approach can be used to compute the k columns of H sequentially, and they also provide theoretical approximation guarantees. Of course, the analysis of the sequential/greedy approach is more challenging, and those guarantees are much weaker than those for the batch algorithm. For example, the error bound in Thm 11 includes an additive term that depends on the sum of the top singular values sigma_2^2 + ... + sigma_\ell^2. This error term would scale proportionally with the dimension of the input data. Does the "small additive term" in the subsequent comments refer to this error term?

They conclude with experiments comparing their algorithm with algorithms for sparse PCA. They show that their algorithm gives low-dimensional representations with lower reconstruction (Frobenius) error, but performs worse with respect to the "variance" metric, supporting the claim that the two objectives result in different encoders.

Minor:
- Line 48: the sentence ends at the comma.
- In the comments under Thm 7, there is a stray "2."
- Lines 159-160 seem to repeat the comment in line 155.
- "Matrrial" in line 212.
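As promised above, my paraphrase of the shape of the Thm 7 guarantee (paraphrase only; the precise constants and conditions are in the paper; X_k denotes the best rank-k approximation of X):

    % paraphrase of the type of bound discussed, not the exact statement
    \mathbb{E}\big[\,\|X - XHG\|_F^2\,\big] \;\le\; \Big(1 + O\big(\tfrac{k}{r}\big)\Big)\,\|X - X_k\|_F^2,
    \qquad \text{so } r = \Omega(k/\epsilon) \text{ yields a } (1+\epsilon)\text{-relative-error guarantee.}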

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper addresses the problem of sparse PCA. It argues that the right metric to optimize when fitting a sparse low-rank model is the reconstruction error, as opposed to the explained variance. It proposes a method that produces a sparse low-rank approximation to the data, given an algorithm that selects a subset of columns providing a good low-rank approximation. In addition, it explains how to adapt the method to compute sparse components iteratively. The paper derives guarantees for both algorithms and provides a lower bound implying that they are optimal in a worst-case sense.
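To make the reduction concrete, here is a minimal sketch of one natural way to turn column-subset selection into a column-sparse encoder (my own illustration, not necessarily the authors' exact construction; select_columns is a placeholder for whatever CSS routine is used as the black box):

    import numpy as np

    def select_columns(X, r):
        # Placeholder CSS step (my assumption): pick the r largest-norm columns.
        return np.argsort(np.linalg.norm(X, axis=0))[-r:]

    def sparse_encoder_from_css(X, k, r):
        # Assumes k <= r.
        S = select_columns(X, r)        # indices of the r selected features
        Xs = X[:, S]                    # data restricted to those features
        _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
        H = np.zeros((X.shape[1], k))
        H[S, :] = Vt[:k, :].T           # top-k loadings embedded back into d dims
        return H                        # every column of H is supported on S

Note that, as with the encoder analyzed in the paper, all k columns of this sketch share the same support of size r.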

Qualitative Assessment

This paper makes a very interesting point about what metric should be used when computing sparse low-rank models (minimizing l2-norm loss as opposed to maximizing variance) and provides non-trivial results on how to adapt previous work on column-subset selection to the sparse PCA problem. The paper draws from previous papers on approximating matrices with a subset of their columns, but the application of these results to sparse PCA is novel to the best of my knowledge. I should emphasize that the contribution of the paper is theoretical rather than methodological, because it relies on the column-subset selection technique for producing a good low-rank approximation, and this could be a strong assumption in practice. Also, it is debatable whether worst-case performance is the right metric for evaluating dimensionality-reduction algorithms (when applying this type of model we are very seldom interested in approximating datasets that are adversarial), but it is definitely an indication that the method may be of practical interest.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

The premise of the paper is that variance maximization and loss minimization (which are equivalent in linear PCA) lead to different results for non-linear PCA algorithms. Variance may not be well defined (when the factors are not decorrelated) or may be very different (in the sparse case). In previous work (e.g., the truncated power method, the generalized power method) only explained variance was optimized. The present paper is the first to optimize for loss, and it gives several algorithms to do so, providing optimality and convergence guarantees for these methods.

Qualitative Assessment

My main issue is that the authors never give a convincing explanation of why loss is a superior metric. If they criticize previous work for using a less useful metric, this should be clearly explained by defining and comparing both metrics (perhaps with an example where variance fails). The structure of the paper should also be improved. The Introduction reads like a Methods section and does not put the work in any context, and related work is only introduced on the penultimate page under "Demonstration". Similarly, the table with results (line 49) should not be in the Introduction but should be moved to a Results section. Algorithms and tables should be captioned and numbered just like figures. All technical terms, for example "(1+epsilon) relative approximation", should be defined or referenced.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

This paper develops polynomial-time algorithms for computing near-optimal sparse linear autoencoders. One algorithm operates in a batch manner, in that all k factors are constructed simultaneously. Another algorithm iteratively computes a sparse factor from a residual matrix that is updated at each step. A theoretical guarantee is provided for this iterative algorithm. Numerical comparisons with other sparse PCA methods are also presented.
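For concreteness, the iterative scheme can be pictured as the following residual-update loop (my own generic sketch, not the authors' algorithm; sparse_component is a placeholder for whatever routine extracts a single r-sparse direction from the current residual):

    import numpy as np

    def sparse_component(R, r):
        # Placeholder (my assumption): truncate the top right singular vector
        # of the residual to its r largest entries and renormalize.
        _, _, Vt = np.linalg.svd(R, full_matrices=False)
        v = Vt[0]
        keep = np.argsort(np.abs(v))[-r:]
        h = np.zeros_like(v)
        h[keep] = v[keep]
        return h / np.linalg.norm(h)

    def iterative_encoder(X, k, r):
        R, cols = X.copy(), []
        for _ in range(k):
            h = sparse_component(R, r)   # one r-sparse factor from the residual
            z = R @ h                    # scores of the new factor
            R = R - np.outer(z, h)       # update (deflate) the residual
            cols.append(h)
        return np.column_stack(cols)     # d x k encoder, built one column at a time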

Qualitative Assessment

I found the iterative algorithm for constructing sparse linear encoders interesting. The proof showing that this polynomial-time algorithm achieves near-optimality is potentially significant. The authors could improve the presentation in some ways. For example, it would be nice if they clearly distinguished the contributions of the paper from what was previously known. More numerical experiments could also provide more informative comparisons between the new algorithms and existing ones.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper proposes a "sparse PCA" problem whose formulation differs from that of traditional sparse PCA. The goal is to find an optimal sparse linear encoder that minimizes the reconstruction loss. The paper gives a polynomial-time sparse encoder algorithm with a theoretical guarantee, and the authors also prove a matching lower bound on the sparsity of the encoder. The paper additionally provides an iterative algorithm, which also has better empirical performance.

Qualitative Assessment

The paper provides a new perspective on sparse PCA, i.e., minimizing loss rather than maximizing variance. The problem is clearly defined and the authors provide sufficient details of the algorithms. The overall structure is good. One drawback is the practical value of this concept. Since the goal of supervised learning is to obtain small error on test data, the authors should give more evidence to justify that the new concept can achieve this goal. The authors only compare the cross-validation error with other sparse PCA methods. The paper would be more solid if the authors compared this method with other dimensionality reduction methods, such as a neural-network autoencoder.

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 6

Summary

The authors develop a method for sparse PCA of vector data based on minimization of information loss. They provide theoretical bounds on the level of sparsity needed to achieve an epsilon approximation of the information loss obtained by PCA, the optimal linear encoder. They develop both a 'batch' version, computing k leading components at once, and an iterative version, sequentially computing the sparse components. The method is evaluated on a number of datasets.

Qualitative Assessment

Overall, the paper was an interesting read and easy to follow, despite some problems with notation and some grammatical errors (for example, what is the matrix A in Problem 1?). The authors present a new algorithm for sparse PCA along with theoretical guarantees on sparsity vs. approximation to PCA and on computational complexity. The analysis of the trade-off between sparsity and approximation is thorough and valuable. The fact that the components have the same locations for their loadings is useful for the interpretability of the extracted components.

A key argument made by the authors is that minimizing information loss (or reconstruction error) is a more adequate objective than maximizing explained variance. They give three main reasons for this. The first is a rather vague claim that variance has no inherent value and that optimizing with respect to loss is more favorable. Next, they claim that an algorithm that approximately minimizes loss can be converted into one that approximately maximizes variance (in Theorem 2 of the supplementary material, what is rho?), but that the converse is not true; in this linear setting, I am not certain that I understand why this is the case. Finally, they rightfully point out that when the components are constrained to be sparse, and hence generally not uncorrelated, the explained variance is undefined. However, given the deflation methods suggested by, for example, Mackey (2009), this issue can be resolved. So, I am not convinced that this objective is superior to the maximization of variance.

In terms of the experimental results, the authors demonstrate that their algorithm minimizes the information loss compared to other methods. Given that this is the explicit objective of their approach, this is to be expected; so, while the method indeed shows promise, these results are insufficient evidence to favor it. More instructive are the results in the table at line 49, which show that the method obtains a smaller classification error using an SVM classifier. Nevertheless, more details on the experimental setup are necessary in order to understand the significance of these results.
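For reference, the two objectives under discussion, in my own notation (standard formulations; X has data points as rows, H is the d-by-k encoder subject to sparsity constraints):

    % loss (reconstruction error) minimization:
    \min_{H \,\text{sparse}} \;\; \min_{G \in \mathbb{R}^{k \times d}} \;\; \|X - XHG\|_F^2
    % explained-variance maximization:
    \max_{H \,\text{sparse},\; H^\top H = I_k} \;\; \operatorname{tr}\!\big(H^\top X^\top X H\big)

Without the sparsity constraint both are solved by the same H (the top-k principal directions), but with it they generally are not, which is the crux of the disagreement.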

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)