NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3852
Title: VIREL: A Variational Inference Framework for Reinforcement Learning

Reviewer 1

This paper brings a novel perspective on probabilistic frameworks for new reinforcement learning algorithms, and the adaptive temperature reweighting may lead to more insightful exploration built into our RL algorithms. The paper is written clearly, well-organized, and easy to understand, and the appendix is structured clearly as well, although the combined length of the paper and appendix makes it a little unwieldy to read. The authors have clearly put a lot of work into developing the theory and presentation in this paper, and although empirically the performance of the derived algorithms does not show significant improvement over max-ent RL methods (with twin Q functions as in TD3), the approach is interesting and I believe this paper would be well-suited for NeurIPS.

Some specific comments:
- In the definition of the residual error on L147, over what distribution is the L^p norm taken? (The reading I have in mind is sketched after these comments.)
- Instead of e_w being a global constant, have the authors considered parametrizing e_w as a function of h? This would allow for state-adaptive uncertainty and exploration, and I believe a majority of the results would still hold.
- On L96 and L227-229, the paper claims that MERLIN relies "on a variational distribution to approximate the underlying dynamics of the MDP for the entire trajectory". However, most works in the max-ent framework parametrize the variational distribution only through the action distributions, and fix the variational distribution over dynamics to the actual dynamics model. The empirical evaluation on the Gym environments does not validate this hypothesis very strongly, and it would be interesting to see a more carefully designed test of it. Perhaps results on environments that require long-horizon planning (where algorithms modelling full trajectories will be less performant) may be illuminating.
- Why were the experiments with twin Q functions run on different domains than those with a single Q function?
- Putting an algorithm box or a more extensive description of the evaluated algorithm in the main text would be useful, instead of only in the appendix.
- How were hyperparameters chosen for all the algorithms?
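For concreteness, the reading of the residual error I have in mind is roughly the following sketch; the sampling distribution d over state-action pairs h is exactly the unspecified part I am asking about, and the scaling constant c and the choice p = 2 are my own notation rather than anything taken from the paper:

    \epsilon_w \;=\; \frac{1}{c}\,\big\| \mathcal{T}\hat{Q}_w - \hat{Q}_w \big\|_{L^2(d)}^{2}
               \;=\; \frac{1}{c}\,\mathbb{E}_{h=(s,a)\sim d}\Big[\big(\mathcal{T}\hat{Q}_w(h) - \hat{Q}_w(h)\big)^{2}\Big],

where \mathcal{T} denotes the (possibly softened) Bellman operator used in the paper. Whether d is the replay/sampling distribution, the on-policy visitation distribution, or something else is the question.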

Reviewer 2

Originality: The probabilistic model and variational inference approach is new and interesting. Related work is cited, and the paper differs from previous work through a new adaptation of the temperature of the policies.

Quality: I am not sure I understand the definition of \epsilon_w in line 147; it seems recursive to me. Are you claiming that for any flexible Q there exists one (or many) softmax temperature \epsilon_w > 0 such that the L^2 temporal difference error is exactly the softmax temperature? I see it in the limit, but not for \epsilon_w > 0 (proof?). Also, do you mean that the error is the same for any L^p norm? I have the same issue with the projected residual errors in Section F1. Further, the paper claims in line 231 that the "function approximator is used to model future dynamics, which is more expressive and better suited to capturing essential modes than a parametrised distribution". Can you explain why and give an example? Possibly related: the Boltzmann distribution might be more flexible than the simple tanh transformation of a Gaussian used in the Soft Actor-Critic paper. Have you tried SAC with more flexible distributions, say with multiple modes, to see whether the performance is due to the adaptation of the regularisation or just to more flexible policies? Furthermore, as you consider some adaptation of the regulariser, can you compare your approach to entropy-regularised approaches that either reduce the entropy penalty via some fixed learning rate (see for instance Geist et al., A Theory of Regularized Markov Decision Processes, 2019) or optimise it via gradient descent (Haarnoja et al., Soft Actor-Critic Algorithms and Applications, 2018)? A minimal sketch of the gradient-based variant I have in mind is given after the minor points below. Notwithstanding all these points, the submission seems technically sound in general, with claims supported by theory (and with more technicalities than some related work) and experiments!

Clarity: The paper is generally well written. I find the residual error in line 147 not so clear, and I find the introduction of the residual errors in Appendix F2 clearer and more plausible. There also seems to be a bit of a disconnect between the exposition of all the gradients in the main paper through a variational inference perspective and the algorithm pseudocode in the appendix, which more or less uses policy improvement and Q-TD-error minimisation. Can you elaborate more on those loss functions, like J_virel(\theta)? Why do you have a constant \alpha there, and does \epsilon_w depend on t in J_beta(\theta)?

Significance: The idea of a variational EM scheme to address RL is useful, and I expect that others can build on these ideas, either theoretically (e.g. what can be said about performance errors) or empirically. The approach appears to be competitive with state-of-the-art approaches.

Minor points:
- You say in line 969 that "minimising \eps_{w,k} is the same as minimising the objective \eps_w from Section 3.1". Why? Is this not contradictory to lines 993-995?
- missing gradient after 843
- w' versus \tilde{w} in 920
- line 929
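For reference, a minimal sketch of the gradient-based temperature adaptation I am referring to (Haarnoja et al., 2018); the target-entropy heuristic, learning rate, and optimiser below are illustrative assumptions on my part, not details taken from the submission:

import torch

action_dim = 6                        # e.g. a MuJoCo locomotion task
target_entropy = -float(action_dim)   # common heuristic: target entropy = -|A|

log_alpha = torch.zeros(1, requires_grad=True)             # optimise log(alpha) so alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_prob_actions):
    # log_prob_actions: log pi(a|s) for a sampled batch of actions; alpha is increased
    # when the policy entropy drops below the target and decreased otherwise.
    alpha_loss = -(log_alpha * (log_prob_actions + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()     # current alpha used to weight the entropy bonus

# usage with a dummy batch of log-probabilities
current_alpha = update_temperature(torch.randn(256))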
POST AUTHOR RESPONSE: I thank the authors for their feedback. Having read it along with the other reviews, I keep my initial score of 7. The rebuttal provides some clarification on the definition of \epsilon_w and indicates that, for the Bellman operator, further theoretical work might be worthwhile. The authors have also given some clarification concerning the flexibility of the parameterisation used for the policies. They also intend to reference additional related work that considers different types of adaptation for the entropy coefficient/penalty. While it would be nice to have an empirical comparison with such work, even without it I think this is still a complete, sufficiently thorough, and interesting paper, and I vote to accept it.

Reviewer 3

The article casts the control problem as a probabilistic inference one, in a novel formulation based on a Boltzmann policy that uses the residual error as temperature. Since this policy has an intractable normalization constant, variational inference is used by introducing another variational policy. The authors derive actor-critic algorithms through an expectation-maximization strategy applied to the variational objective. The authors offer extensive proofs for several useful properties of their objective: convergence under some assumptions, recovery of deterministic policies, etc. The current work is also very well integrated with the existing literature, being motivated by the limitations of existing variational frameworks (the limited expressivity of the variational policy over trajectories and the difficulty of recovering deterministic policies in maximum entropy approaches, and the risk-seeking policies that pseudo-likelihood methods arrive at). The proposed method addresses all these limitations. Experiments with the derived algorithms validate the approach by achieving state-of-the-art results on a couple of continuous RL tasks. The baselines are relevant: a state-of-the-art algorithm (SAC) and another algorithm that naturally discovers deterministic policies (DDPG), the latter being closely related to one of the main claims in the article: that in the limit of the maximization of the objective, the learned policy is deterministic (see the sketch at the end of this review). Considering the originality of the proposed objective, the strong theoretical treatment, the empirical validation, and also the nice exposition that places the article among related works, I propose for this paper to be accepted.

Quality: I consider the current article to be a high-quality presentation of a solid research effort. It does a good job of covering both theoretical and practical aspects (e.g. the convergence proofs make some strong assumptions (2, 4) that might be hard to meet in real setups, but relaxations are discussed in the supplementary material).

Originality: The article builds on prior work, as it starts by addressing some problems of the existing variational frameworks for RL, but it proposes an original Boltzmann policy that uses the residual error as an adaptive temperature. This permits the derivation of a strategy for exploration until convergence, based on the uncertainty in the optimality of the state-action value function. To the best of my knowledge, this is an original approach.

Clarity: The article is an excellent example of scientific writing. It does a good job of balancing the formal aspects (supported by detailed proofs in the supplementary material) with the intuition behind the different choices and the connections with previous work (pseudo-likelihood and maximum entropy approaches). I think that in the current state one needs to have the supplementary material close at hand in order to understand the proposed algorithms. I suggest moving into the article details such as the practical simplifications in Appendix F3 (not in full detail, only enumerated in Section 5). Section 5 mentions a claim from Section 3.3 regarding soft value functions harming performance, but there is no such claim there; it is mentioned in Section 2.2 though.

Significance: I consider the work to be important in the landscape of variational approaches to reinforcement learning, as it solves known limitations of previous approaches and is both theoretically and empirically validated. Also, the empirical results show that the proposed algorithms might outperform existing algorithms on high-dimensional inputs.
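To make the deterministic-limit claim concrete, the Boltzmann policy being referred to has, up to the paper's exact notation, roughly the form

    \pi_w(a \mid s) \;=\; \frac{\exp\!\big(\hat{Q}_w(s,a)/\epsilon_w\big)}{\int_{\mathcal{A}} \exp\!\big(\hat{Q}_w(s,a')/\epsilon_w\big)\, da'},

so that as the residual error \epsilon_w \to 0 the distribution concentrates on \arg\max_{a} \hat{Q}_w(s,a) and the policy becomes deterministic (assuming a unique maximiser).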