NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2399
Title: Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms

Reviewer 1

I've read the response, and the justification for Assumptions A and B is nice. The score has increased to 7.

----------------------------------------------------------------------

1. Originality: This paper devises a new distance function for use in the BPG algorithm. The framework itself is existing work, but the authors make a number of improvements and, for the first time, give a convergence theorem under certain conditions.
2. Quality: The theoretical derivations look sound and the experiments are rather comprehensive.
3. Clarity: The text is mostly clear, but the statement of the BPG-MF algorithm is a little confusing: one needs to read through many sections to piece together the whole algorithm.
4. Significance: The results are theoretically valuable as a non-alternating optimization method with a thorough analysis. However, in the real-data experiments the algorithm does not show a significant performance advantage over traditional methods, especially when computing resources are limited.

Reviewer 2

In this study, inertial Bregman proximal gradient algorithms (BPG and CoCaIn BPG) were considered for matrix factorization. The obstacle here was establishing the L-smad property for matrix factorization. To tackle this issue, the authors proposed a novel Bregman distance, proved the L-smad property with respect to this distance, and inferred convergence of BPG and CoCaIn BPG. I think the contributions in this paper are novel and significant. In the experiments, three problems (matrix factorization, non-negative matrix factorization, and matrix completion) were solved with the proposed optimization algorithms, and their effectiveness was demonstrated against state-of-the-art algorithms such as PALM and iPALM. It would have been nice if the convergence behavior were evaluated statistically, because the stationary points obtained by non-convex optimization typically vary with the initialization. For this reason, I find the experiments slightly weak.

Reviewer 3

First, some background. The Bregman Proximal Gradient (BPG) method was introduced in general form, with theory, by Bolte et al. in [8] for nonconvex optimization, and an inertial version with convex-concave backtracking (CoCaIn BPG) was introduced by Mukkamala et al. in [45]. In Section 6.6 of the same paper, Mukkamala et al. apply CoCaIn BPG to the problem of structured matrix factorization and show good performance, but leave the theory and computational efficiency open. Additionally, BPG without inertia was applied successfully to the problem of symmetric non-negative matrix factorization (SNMF) by Dragomir et al. in [20], which in its most recent version appears to also cover more general (symmetric) matrix completion. The submitted paper is concerned with applying BPG and CoCaIn BPG to the non-symmetric matrix factorization problem, essentially picking up where [45] left off and providing work complementary to [20]. The paper first restates the results of [8, 45] for the specific objective (1.1) of matrix factorization, then makes its two primary contributions. First, it introduces a kernel generating distance function h that is appropriate for matrix factorization; this is related to the universal function of [20] but is extended non-trivially to the non-symmetric case, and the paper proves convergence of the method with this distance function to a stationary point. Second, it derives closed-form solutions for the updates corresponding to a variety of variants of (1.1) with this kernel. This makes the methods usable in practice, makes the theory more interesting, and is a welcome contribution. The quality and clarity of the paper are in general high. It is well written (barring the need for a once-over to fix a few typos) and easy to understand, and the results are well presented.
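For readers unfamiliar with the kind of closed-form update discussed above, the following is a minimal sketch of one non-alternating BPG step for f(U, V) = ½‖A − UVᵀ‖², using a quartic kernel of the general shape used in [45] and the submission. The function name `bpg_mf_step` and the specific coefficients c1 and c2 are illustrative assumptions on my part, not the paper's verified implementation; the point is only that the mirror-descent equation for such a kernel collapses to a single scalar cubic, which is why per-iteration cost stays comparable to gradient methods.

```python
import numpy as np

def bpg_mf_step(A, U, V, c1=3.0, c2=1.0, tau=1.0):
    """One non-alternating Bregman proximal gradient step for
    f(U, V) = 0.5 * ||A - U V^T||_F^2 with the quartic kernel
        h(U, V) = (c1/4) * s**2 + (c2/2) * s,   s = ||U||_F^2 + ||V||_F^2,
    whose gradient is grad h(x) = (c1 * s + c2) * x.  The mirror step
    grad h(x_next) = grad h(x) - tau * grad f(x) then reduces to a
    single scalar cubic equation (illustrative sketch, not the
    paper's exact update)."""
    R = U @ V.T - A                          # residual
    gU, gV = R @ V, R.T @ U                  # grad f w.r.t. U and V
    s = np.sum(U * U) + np.sum(V * V)        # ||x_k||^2
    scale = c1 * s + c2
    qU = scale * U - tau * gU                # q = grad h(x_k) - tau * grad f(x_k)
    qV = scale * V - tau * gV
    nq2 = np.sum(qU * qU) + np.sum(qV * qV)  # ||q||^2
    # x_next = t * q, where t > 0 solves c1*nq2*t^3 + c2*t - 1 = 0.
    # That polynomial is increasing and convex for t > 0, so Newton's
    # method started above the root converges monotonically.
    t = 1.0 / c2
    for _ in range(50):
        g = c1 * nq2 * t ** 3 + c2 * t - 1.0
        t -= g / (3.0 * c1 * nq2 * t ** 2 + c2)
    return t * qU, t * qV
```

Taking c2 proportional to ‖A‖_F mimics the role the data norm plays in the kernels proved L-smad in the literature, but the constants here are taken on trust; the sketch is meant only to show that both factors are updated jointly from the same iterate, in contrast to PALM-style alternating updates.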
This is not a critique of the paper itself, but I would like to note that there is something wrong with the citation system that made this paper slightly arduous to review. For example, on L29 it appears that [4] is cited twice in one citation, but on further inspection it appears that half the time [4] is cited, the appropriate reference is actually [45]; in other words, CoCaIn BPG was introduced by Mukkamala et al., and not by Bauschke et al. (Mathematics of Operations Research, 42(2):330–348, 2017). In fact, every citation is truncated to the first digit of its citation number. It is possible that this is a rendering error on my machine, but I wanted to bring it to the authors' attention. Additionally, as a general style comment, the citations are copious when it comes to repeating citations for the same paper at multiple places in a paragraph. I did not check all the math in the supplementary material, but I trust that the results the authors give are correct. The form of the Bregman generating function in Section 2.2 seems to make sense in light of its relation to the universal kernel of [20], and the closed-form updates likewise seem morally correct. I do wish that more time were spent discussing the overall time complexity of the algorithm beyond what is mentioned in line 241; in particular, an explicit discussion of whether the cost of an iteration is really comparable between methods would be great (and the timing results in the appendix seem promising!). The results seem original: there is clearly recent and parallel related work, but the non-symmetric case treated here is more general in some ways. What I am not entirely sure about is the significance of the results.
In particular, it is difficult to understand what point the numerical results in Section 3 are trying to make ("We aim to empirically illustrate the differences of alternating minimization based schemes and non-alternating schemes[.]"). Many of these results seem to show competing methods converging to a different stationary point than the methods of this paper. However, that is the nature of non-convex optimization. I do not believe that the authors have theory saying that the non-alternating schemes should converge to a "better" stationary point, so it does not seem that we can draw much of a conclusion from Figure 1 or Figure 2. In other words, why should we choose the schemes of this paper over other schemes? Is there any reason to expect the results of these tests to generalize? What makes BPG / CoCaIn BPG a better choice of optimization algorithm than PALM-type methods in general? That said, not every method has to be immediately "SOTA", and I think the paper is a nice contribution.

-- Having read the authors' response and the other reviewers' comments, I am quite happy with the direction of the final paper. The authors clearly outlined a number of substantial benefits of their approach over competing methods and promised to add these to an extended discussion section. Further, they have put together an additional numerical experiment to demonstrate more clearly the improved empirical performance over competing methods, and the initial results they report are promising. I am raising my score, as the authors have addressed my primary points from (5) below.