NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1610 On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

### Reviewer 1

This paper has provided global convergence analyses, with convergence rates, of stochastic EM-algorithms which include incremental (iEM) and variance reduced (sEM-VR) versions of EM-algorithms. Especially, the paper has given a convergence rate of $O(n/\epsilon)$ for iEM by applying the theory developed by Miral(2015) and a convergence rate of $O(n^{2/3}/\epsilon)$ for sEM-VR by showing sEM-VR is a special case of stochastic scaled-gradient methods. In addition, a new variance reduced EM-algorithm named fiEM based on SAGA has been proposed with its convergence analysis as well as sEM-VR. Finally, the superiority of variance reduced variants (sEM-VR and fiEM) has been shown via numerical experiments. Clarity: The paper is clear and well written. Quality: The work is of good quality and is technically sound. Originality: Although proof techniques are not so novel from the optimization literature, this type of the convergence result of stochastic EM-algorithm is first in this context except for a few studies. The following paper is missing, which also provided non-asymptotic convergence analysis of stochastic EM-algorithms as a special case of their results. I want the author to discuss the relationship with this study. - B. Karimi, B. Miasojedow, E. Moulines, and H. Wai. Non-asymptotic Analysis of Biased Stochastic Approximation Scheme. COLT, 2019. Significance: A convergence rate of iEM is not surprising; indeed, it corresponds to that of vanilla gradient descent. However, a convergence rate of variance reduced variants is interesting because the dependency on $n$ (the number of examples) is much improved by the variance reduction. Thus, I think this paper makes a certain technical contribution to the machine learning community. A minor concern is a mismatch between this convergence rate and empirical convergence rate. Concretely, an empirical convergence rate of sEM-VR and fiEM seems linear as shown in Fig. 1. ----- I have read the authors' response which well addressed my question, so I raise the score to 7

### Reviewer 2

This paper focuses on incremental EM methods for large datasets. The authors discuss the relationships between some previous incremental EM methods and propose a new incremental EM algorithm, the fiEM algorithm. The authors analyze the non-asymptotic global convergence rates of these algorithms, and compare them empirically. Originality: 1. The new algorithm fiEM is an interesting extension of the original algorithm iEM [1]. It is a good extension since it is not complicated but efficient. Quality: 1. The global convergence non-asymptotic rate analysis for the incremental EM algorithms is interesting. It showed us that the sEM-VR algorithm [3] and the new algorithm fiEM are faster than the previous algorithm iEM [1] (in expectation), which is an interesting result. It would be more interesting if the authors can provide a more detailed comparison between the sEM-VR algorithm and the fiEM algorithm since they all require O(n^{2/3}/\epsilon) number of iterations (in expectation). 2. Most of the assumptions that the authors make for the proof are reasonable, except for that I am not sure if H1 is satisfied in many settings. The sets S and Z may not be compact in many settings. 3. Empirically, the fiEM algorithm and the sEM-VR algorithm work well compared to other methods. However, on one of the real datasets in Section 4.2, the fiEM algorithm was outperformed by the previous algorithm sEM-VR. Clarity: 1. The notation of this paper is clear, but not very easy to follow since the authors used some notations before definition. It would be better if the authors can adjust the text for easier understanding. Significance: 1. The global convergence rate analysis is interesting and useful, although the experiment results may need to be improved. Also, it will be better if the authors can work on more models and datasets on the experiments. References: [1] Neal, Radford M., and Geoffrey E. Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants." Learning in graphical models. Springer, Dordrecht, 1998. 355-368. [2] Mairal, Julien. "Incremental majorization-minimization optimization with application to large-scale machine learning." SIAM Journal on Optimization 25.2 (2015): 829-855. [3] Chen, Jianfei, et al. "Stochastic Expectation Maximization with variance reduction." Advances in Neural Information Processing Systems. 2018.

### Reviewer 3

Summary: This paper studies the convergence of incremental and stochastic Expectation-Maximization (EM) algorithms. Authors establish non-asymptotic bound for the averaged EM iterates by building on the framework developed by Mairal 2015. Authors use the term 'global convergence' yet the rates are given for a first-order critical point and assuming that there is a unique minimum. ** Major comments: - Under H4, the problem already has a unique global minimum with positive definite Hessian. Any first-order critical point will be a global minimum. The result is only global because of this assumption. - Authors use the MISO framework; yet no description of MISO is provided in the paper. - In the theorems, it is not clear what the expectations are over. For example in E[\bar{L}(\hat\theta_0)], which random variable is this expectation over? - The bound is given for E[\nabla L (\theta^K)] where K is a uniform random variable between 0 and the last iteration K_max. This means that the bound is simply on 1/K_max \sum_k E[\nabla L (\theta^k)]. Using their theorems authors can give an upper bound on the averaged iterates or by using the inequality 1/K_max \sum_k E[\nabla L (\theta^k)] \geq min_k E[\nabla L (\theta^k)] the best iterate. The current way of writing the bound seems a bit opaque. At least a discussion or a brief remark on the implications of the result is needed. - Theorem 1 of Paper 1610 is a special case of Theorem 1 in Paper 1613. ** Minor comments: - line 1, algorithm -> algorithms - line 20, comma after g - line 65, to find to - line 156, tailored %%%%%%%%%%%% I thank the authors for answering my questions. Authors clarified my concern on assumption H4. But my concern on Theorem 1 of submission 1613 still remains. Based on this I am updating my score to 5.