NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7188 Learning step sizes for unfolded sparse coding

### Reviewer 1

The proposed method of step size learning for ISTA with fast convergence, as well as the extension to LISTA step size approximation is novel and interesting. The proposed methods are all theoretically grounded, with detailed analysis. The presentation is also clear. The paper can be further improved with more numerical experiments to demonstrate the effectiveness of the learned model.

### Reviewer 2

Update: I feel the authors satisfactory addressed my comments: they now included comparisons with back-tracking, showing indeed that it is on par (or perhaps better) than their Oracle-ISTA. This is an honest clarification that I appreciate. While they did not write the entire proof of prop. 4.1, I can now see how this could be made correct and clearer, and I look forward to seeing the full detailed proof. I also appreciate their other smaller modifications (numerical experiments). ---------------------------------------------------------------------------------- While this is an interesting paper, easy to read and with new contributions, there are a few points that are somewhat troubling. My main concerns regard the following points: 1) The improvement of convergence speed by increasing the step-size due to the observation that not all atoms are used is natural. Yet, there exist other alternatives to do the same: one could simply employ ISTA with back-tracking. If there are larger step-sizes to be employed, such an alternative would find them too. Why then should someone use OISTA instead of simply back-tracking? To my understanding, this point deserves a discussion and numerical comparison when appropriate (say, for Figure 2). 2) The conclusion of the authors seems to be that all is needed for acceleration is to learn better step-sizes. However, by greatly limiting the number of parameters that are learned, one must be certainly incurring in some restrictions. For instance, the work in [2] shows that when a particular factorization of the dictionary Gram is possible, acceleration is obtained. Such an acceleration would not be possible in the proposed SLISTA. Could the authors comment on this? 3) It is interesting that the authors adopt an unsupervised approach for learning the step-sizes. But then again: why not simply set them through backtracking? This would not require any learning, it is conceptually simpler, and would likely be faster. 4) Proposition 4.1 is not clear to me. In the proof, the authors employ the convergence of ISTA. However, this refers to the convergence of the iterates, and has little to do with the convergence of the learning problem in Eq (15). More generally, the authors mention (line 182) that the learning converges, while the loss (15) is a non-convex function of the parameters (\alpha^t). Could the authors explain? Further comments: - The authors seem to refer (most times) to the overcomplete case where m>n. However, this is not mentioned anywhere, and it is not a minor detail. Note, for instance, that if n>>m, the iterates will not be sparse. Consider perhaps just adding a comment on this in the beginning. - In Eq (8) (and beyond), shouldn't $supp(z)\subseteq S$? - line 136: saying that $\gamma \gg 1$ is a bit of a stretch: in a typical redundant case, $\gamma = 2-5$. In addition, consider mentioning that when n>m, k might be equal to m and so Ls/L \approx 1. - In Figure 4, the authors show that the learned step-sizes are larger than the oracle ones obtained by computing Ls. This seems to imply that the learned approach provides even faster convergence than the Oracle ISTA (because \alpha(t)>1/Ls). I might be missing something, but how come this is possible without diverging? - In the experiments section (5), the authors study a semi-real case by creating a dictionary of (8x8) digit images. The input signals seem to be constructed as (normalized) Gaussian noise samples (line 244). This makes no sense to me - digit atoms will not be able to sparsely represent noise. In fact, the constructed dictionary will likely not even span R^{64}. A more sensible approach would have been to take as input other samples from the dataset not included in the dictionary. - line 188: "any compact ..." domain? - line 237: normalized to 'unit norm'? [1] Moreau, Thomas, and Joan Bruna. "Understanding trainable sparse coding via matrix factorization." ICLR

### Reviewer 3

The paper utilizes the fact that when the support of sparse code is identified, larger step size can be used to speed up the convergence. If I understand it correctly, as alpha/beta*W = D, then alpha*W in (11) is replaced by beta*D, and the authors claim that only tuning beta (which is replaced to alpha in (14)) is enough. However, in the experiments, there're no results on the effects of different step sizes. Another issue is that in section 5, I cannot tell why SLISTA is better. And I am curious about why ALISTA is nearly not working at all. Also, the authors mention that this LISTA style model can be trained unsupervised or supervised, are there any empirical evaluation on that? In fact, the claim that ISTA does not converge for the supervised problem in general seems odd. Is there any support for this claim? Finally, in comparison to ALISTA, this paper is less general and comprehensively analyzed. For example, the convolutional LISTA and robust version are interesting to consider.