NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 280
Title: Implicit Reparameterization Gradients

Reviewer 1

EDITED AFTER REBUTTAL: I appreciate the authors' effort to make the notation consistent and to improve the experiments by taking computational complexity into account. A couple of comments were not addressed in the rebuttal, but I still recommend acceptance. For completeness, I am listing these comments below:
+ Experiments without optimizing the hyperparameters in the LDA experiment.
+ Experiments to showcase truncated distributions.
+ Comparison / discussion of the approach of Roeder et al. for mixtures (they also integrate out the discrete component).

------------------

This paper proposes a method to reparameterize distributions that are considered "non-reparameterizable", such as the Gamma or von Mises, as well as truncated distributions and mixtures. The proposed approach is compared to other baselines such as RSVI and is shown to outperform them. The paper addresses a relevant problem, and the proposed solution is clear and elegant.

My first comment is that this method has already been proposed in the paper "Pathwise Derivatives Beyond the Reparameterization Trick" [Jankowiak & Obermeyer, ICML 2018]. However, I still think there is value in Implicit Reparameterization Gradients, for three reasons. First, this paper handles the CDF of the Gamma without resorting to Taylor approximations. Second, this paper has more experiments, including experiments with the von Mises distribution. Third, accepted ICML papers were not available at the time of the NIPS deadline, so [Jankowiak & Obermeyer, 2018] should not be taken into account. That said, the citation should be included.

My two main criticisms about this paper are:
1. The notation is not consistent.
2. The experiments can be improved, e.g., by showing the evolution of the marginal lower bound.

1. The notation in the paper is not consistent; please revise and improve it. For example, the parameterization of a distribution (say, q) in terms of some parameters (say, phi) is sometimes written as q(z|phi) and sometimes as q_phi(z). Please use one. More comments on the notation:
+ Eq. 9: Replace u with epsilon; the variable u has not been defined anywhere else.
+ In the mixture example (line 116), it should be q_phi(z) instead of p(z|phi).
+ In the LDA experiment, alpha denotes both the hyperparameter of the Dirichlet (line 214) and the topics (Fig. 2a). Please remove the "alpha" in Fig. 2a.

2. Regarding the experiments:
+ In both the LDA and VAE experiments, can you show the evolution of the variational lower bound w.r.t. wall-clock time? The paper states that each RSVI step is 15% slower, but information about the convergence of each method is missing.
+ In Section 6.1, the paper states that "The results presented on Fig. 1 show that the implicit gradient is faster and has lower variance than RSVI". In Fig. 1, however, the computational time is reported only for the von Mises distribution, not for the Dirichlet. Is the implicit gradient also faster in the Dirichlet case? Does that hold for any value of the shape augmentation? I think this should be added to the paper.
+ In the LDA experiment, I think it would be better not to optimize the hyperparameters, because the goal is to assess how good the different variational approximations are for a fixed model. If the model is not fixed (because alpha is fitted), the comparisons become blurrier. That said, I would not expect any significant difference from the results reported in the paper, but it would be nice to check.

Finally, I also have two more comments.
The paper explains how to reparameterize mixtures and truncated distributions, but it does not include experiments with either; it would be nice to see an experiment with truncated distributions. In addition, mixtures can be reparameterized by following the approach of "sticking the landing" [Roeder et al., 2017]; is there any reason why either method should be preferred?
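For concreteness, here is my reading of how the paper's implicit gradient would apply to the truncated case, as a minimal runnable sketch with a truncated Normal (the standardization function below and the finite-difference check are my own illustration, not the authors' code):

from scipy.stats import norm, truncnorm

# A distribution truncated to [a, b] admits the standardization function
#   S_phi(z) = (F(z; phi) - F(a; phi)) / (F(b; phi) - F(a; phi)),
# so implicit differentiation gives dz/dphi = -(dS/dphi) / (dS/dz),
# with dS/dz = pdf(z; phi) / (F(b; phi) - F(a; phi)).
a, b, mu, sigma = -1.0, 1.0, 0.3, 1.5
z = truncnorm.rvs((a - mu) / sigma, (b - mu) / sigma,
                  loc=mu, scale=sigma, random_state=0)

def S(m):  # standardization function, viewed as a function of the location m
    Z = norm.cdf(b, m, sigma) - norm.cdf(a, m, sigma)
    return (norm.cdf(z, m, sigma) - norm.cdf(a, m, sigma)) / Z

h = 1e-6  # finite differences stand in for automatic differentiation here
dS_dmu = (S(mu + h) - S(mu - h)) / (2 * h)
dS_dz = norm.pdf(z, mu, sigma) / (norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))
print(-dS_dmu / dS_dz)  # pathwise gradient dz/dmu for the truncated Normal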

Reviewer 2

This work proposes an implicit reparameterization trick for gradient estimation through stochastic random variables in latent variable models. The main contribution of this paper is Eq (6), which requires only differentiating the standardization function, with no (approximate) inversion of any kind. Compared with other generalized reparameterization methods (e.g., RSVI), the proposed approach is more efficient. From the perspective of reproducibility, TensorFlow has already enabled this feature in its nightly builds. In general, this is a simple but practical idea.
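To make Eq (6) concrete in the simplest univariate setting, where the standardization function is just the CDF, here is a minimal sketch (my own illustration, not the paper's code); explicit reparameterization of the Normal provides a known answer to check against:

from scipy.stats import norm

# Implicit reparameterization in the univariate case: the standardization
# function is the CDF, S_phi(z) = F(z; phi), and Eq (6) reduces to
#   dz/dphi = -(dF/dphi) / (dF/dz) = -(dF/dphi) / pdf(z; phi).
mu, sigma = 0.5, 2.0
z = norm.rvs(loc=mu, scale=sigma, random_state=0)

h = 1e-6  # finite differences stand in for automatic differentiation here
dF_dmu = (norm.cdf(z, mu + h, sigma) - norm.cdf(z, mu - h, sigma)) / (2 * h)
dF_dsigma = (norm.cdf(z, mu, sigma + h) - norm.cdf(z, mu, sigma - h)) / (2 * h)
pdf = norm.pdf(z, mu, sigma)

# Explicit reparameterization z = mu + sigma * eps gives dz/dmu = 1 and
# dz/dsigma = (z - mu) / sigma; the implicit estimator recovers both.
print(-dF_dmu / pdf)                       # ~ 1.0
print(-dF_dsigma / pdf, (z - mu) / sigma)  # these two should match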

Reviewer 3

This paper proposes a new class of reparameterized gradient estimators based on implicit differentiation of one term in the expected gradient. Explicit reparameterization depends on the existence of an analytically tractable standardization function. Every distribution with an analytically tractable CDF has such a standardization function, since the CDF is invertible by construction and maps samples to Uniform[0,1], but many continuous distributions of research interest (von Mises, Gamma, Beta, Dirichlet) do not have tractable inverse CDFs. Prior approaches have approximated this intractable term or applied a score function estimator to all or parts of the objective to avoid reparameterization. This work derives an implicit differentiation estimator for this term: the gradient of the latent variable reparameterized through an invertible, differentiable function of an independent random variable and the parameters of the distribution to be learned. The implicit gradient estimator sidesteps the need to invert the CDF and/or standardization function, and thereby expands the class of distributions to which reparameterization can easily be applied.

The mathematical derivation is pleasingly straightforward, the paper very well written, and the empirical results convincing. The conclusion contains some interesting pointers to future work on extending implicit reparameterization gradients to distributions for which no numerically tractable inverse CDF or other standardization function is known. I recommend publication and expect this technique to expand the class of commonly used variational approximating distributions. The scope and novelty of the contribution are not huge since, at its core, it only expands the set of reparameterizable distributions, but the insights and presentation are very valuable and interesting.
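For reference, my reading of the implicit differentiation step at the heart of the paper: writing the standardization function as S_phi(z) = epsilon, where epsilon does not depend on phi, and totally differentiating both sides with respect to phi gives

\nabla_\phi S_\phi(z) + \frac{\partial S_\phi(z)}{\partial z} \, \nabla_\phi z = 0
\quad \Longrightarrow \quad
\nabla_\phi z = -\left( \frac{\partial S_\phi(z)}{\partial z} \right)^{-1} \nabla_\phi S_\phi(z),

which requires only derivatives of S_phi, never its inverse.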