Reviews: Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona

Paper ID:	2518
Title:	Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes

Reviewer 1

Summary

This paper studies the dynamics of the nonconvex optimization associated with ICA, when the latter is formulated as a tensor decomposition problem. The key technicality of the paper is to connecting stochastic gradient descent (SGD) to stochastic differential equations (SDE). Analysis of the latter identifies three consecutive phases of SGD dynamics: 1. initialization at a saddle point, 2. Convergence toward a local minimum, 3. Oscillation around the local minimum. The implications of analysis of each phase on the convergence rate of SGD is discussed.

Qualitative Assessment

The paper indeed proposes an interesting and important direction toward analysis of the dynamics of SGD in a nonconvex setting. However, I am worried that the derivations might not have been worked out carefully. For example, there seems to be discrepancy about the iterates generated by SGD for the ICA problem. The right hand side of Eq (3.6) reads as |psi-3| v_k (v_k^2-\sum_i^d v_i^4). Later when analysis for the case d=2 is performed in Eq(3.7), we see 2 |psi-3| v_1^2 (v_1^2-v_1^4-(1-v1^2))^2. Note that the exponent of v appearing after |psi_3| has changed from one to two between the two equations. Which one is correct? Again, examining Eq (3.7) the authors state that: 1. this ODE has a complicated closed form, 2. as t goes to infinity, the solution (i.e. v^2) may go to zero or one, depending on the initial condition of v_1 at time zero. I could be wrong, but it seems none of these are true. Regarding your first point, the ODE in (3.8) has the following closed form solution: v^2[t]=1/(2+1/(-1+c*exp(-a*t/2))) where a=|psi-3| and c is a constant to be determined by the initial condition. Regarding your second point, when t goes to infinity, the solution simplifies to v^2[t]=1 regardless of the value of c. I do not see how you could get v^2[t]=0 when t goes to infinity. Am I missing something? Please clarify.

Confidence in this Review

1-Less confident (might not have understood significant parts)

Reviewer 2

Summary

This paper considers solving the ICA problem in the online setting, through the standard Tensor formulation (Ge et. al., 2015). This problem even though non-convex is known to have no bad local minima and SGD with additional noise will recover the factors of the tensor. This paper analyzes standard SGD without this additional noise. The main challenging part is analyzing the method around saddle points (where this additional noise was used before). This papers considers a stochastic differential equation (SDE) based framework to handle this. In particular the technique is to cast SGD as solution of an SDE and analyze the evolution of this diffusion process in three different regions, 1) Near unstable equilibrium (saddle points, local maximum), 2) Near local minima and 3) Intermediate traverse. Using this the paper guarantees convergence to a local minima with O(d^2 log(T)/T) error in T iterations.

Qualitative Assessment

Analyzing standard optimization algorithms around saddle points is an interesting open question , due to the popularity of non convex optimization in the recent works. This paper provides a new technique to do this for SGD. The achieved convergence rates seem optimal. However the main limitations seems to be that the method seems not easy to generalize to different problems as it uses the specific structure of ICA formulation heavily (for e.g., construction of W_{kk'} in section 4.2). Results of Ge et. al., 2015 are applicable to much broader class of functions (strict saddle functions). So a discussion regarding extending this method will be extremely helpful in the paper. In equation 3.8, why is there no V_2 term? How does it vanish when going from 3.6 to 3.8 for d=2? I am also lost at times in the proof when there is a mention of time/space rescaling. It might be a standard trick, but a little discussion about this too will be nice. What is the intuition for the qualitative analysis statement in lines 182, 183? Regarding the assumption that the tensor is full rank, does the analysis carry over for tensor factorization in the low rank setting? Line 118, Y_i's are used instead of X_i's. Typo in Line 123.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 3

Summary

Qualitative Assessment

This paper aims to study the global dynamics of nonconvex statistical optimization based on diffusion processes. As an example of non-convex problems, this paper discusses in details the SGD applied on the tensor decomposition formulation of independent component analysis (ICA). An interesting finding in this paper is a three-phases phenomenon of the global dynamics of SGD to capture the transition of SGD solution from unstable initialization to finally stable local minimum. Overall, this paper is well written and it addresses a very challenge problem. This paper serves as a first step in the theoretical understanding of global dynamics of SGD, and is believed to stimulate many work on the global dynamic of non-convex optimizations. Q1: Discuss how to generalize it other non-convex models? As a concrete example, this paper only discussed SGD for ICA. Can the authors briefly discuss how to apply their framework to other non-convex optimization problems? Q2: discuss the relationship to a series of literature which also provides global rate of convergence of non-convex optimzation problems. e.g. (1) Zhang et al. (2014), Spectral Methods meet EM: A Provably Optimal Algorithm for Crowdsourcing (2) Anandkumar, A., Ge, R. and Janzamin, M. (2014), Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. (3) Sun et. al. (2016), Provable sparse tensor decomposition. Q3: Assumptions: In (iii) of Assumption 1, Y_i is assumed to have up to 8-th moment, i.e., k = 1,2,...,8. However, in Proposition 2, it requires the assumption for k = 1, 2,...,d, where d is the dimension of X. Can the authors discuss in details why such condition is required? When d diverges, is this condition strong? Q4: Please further polish the paper. Below are some examples of typos: (1)line 118. ``Y_1,\ldots,Y_n" should be X_1,\ldots,X_n. (2)line 123: ``...whose the number linearly grows... (3)line 139: ``...the well-known coupon collector problem [18] it takes..." (4)line 146: ``To work on the approximation we first conclude the following". This is not a complete setence.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

Reviewer 4

Summary

The authors analyse convergence of stochastic gradient descent using stochastic differential equations model

Qualitative Assessment

The authors analyze convergence of SGD in diminishing stepsize mode using stochastic differential equation as a model. This approach may have certain limitations: It is like using PDE for analysis of deterministic gradient descent. Artificial time scale may not reflect actual number of iterations needed for convergence to given accuracy. Still the approach is interesting in stochastic case and may be worth poster presentation.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

Reviewer 5

Summary

This paper introduces a new analytic framework based on diffusion processes to characterize the global dynamics of nonconvex stochastic gradient descent (SGD). The paper focuses on independent component analysis (ICA), identifying three consecutive phases of SGD for ICA: a first phase, where the dynamics are that of an unstable Ornstein-Uhlenbeck process, a second phase, where the SGD behaves as the exact solution to an ODE, and a third and final phase, where the dynamics of SGD are those of a stable Ornstein-Uhlenbeck process.

Qualitative Assessment

Although the topic of this paper is not one about which I am very knowledgable, the paper seems to be technically solid and the derivations all seem to make sense. The paper is of theoretical nature, and it would be interesting if the authors could discuss (even if briefly), what implications the findings of the paper may have in practice.

Confidence in this Review

1-Less confident (might not have understood significant parts)