NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2543
Title: Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning

Reviewer 1

Originality: The paper proposes to address exploration in reinforcement learning using a combination of posterior sampling, successor features, and Bayesian linear regression. To the best of my knowledge, the proposed combination is novel. The authors also do a good job of contextualizing the proposed method within the related literature.

Quality: The paper is well written and seems to be technically correct. The claims are supported by both theoretical and empirical results. The authors are also upfront about the limitations of the proposed approach (Section 4.4).

Clarity: Although the paper is well written, the presentation could perhaps be slightly improved in two respects. First, with the exception of Fig. 1, there is a lack of intuitive explanations, which may make it difficult for a reader less familiar with the subject to grasp the ideas at first. Second, the narrative behind SU seems somewhat entangled with the use of neural networks (e.g., line 19), although some of the theoretical arguments in its favor actually rely on a tabular representation. I wonder whether it would be possible to present the SU concept in isolation and later argue that one of its benefits is that it can be easily combined with complex function approximators.

Significance: The paper proposes a method to perform "deep exploration" in RL. The method is simple and has low computational cost (as discussed in line 188, it can be seen as a small modification of previous methods resulting from the structure imposed in (5)). As such, it seems to me that it has the potential to be adopted by the community and also to serve as inspiration for future research.

Post-rebuttal update: As stated in my rebuttal, I think this paper is quite strong content-wise. It could benefit from a clearer presentation, though, as indicated in the reviews. I hope the authors put some effort into making the next version of the paper more accessible.
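[Editor's note on the suggestion to present SU in isolation from neural networks: a minimal tabular sketch of the combination summarized above (posterior sampling, successor features, and Bayesian linear regression on rewards) might look as follows. The Gaussian prior, noise variance, and all variable names are illustrative assumptions, not the paper's exact formulation.]

```python
import numpy as np

# Illustrative tabular sketch (not the authors' exact model): a Bayesian linear
# regression posterior over reward weights w, successor features psi learned by TD,
# and Q-values formed as Q(s, a) = psi(s, a)^T w with w drawn by posterior sampling.

n_states, n_actions, gamma = 10, 2, 0.95
n_sa = n_states * n_actions

def one_hot(s, a):
    """One-hot state-action feature phi(s, a)."""
    phi = np.zeros(n_sa)
    phi[s * n_actions + a] = 1.0
    return phi

# Reward model (assumed for illustration): r ~ N(phi^T w, sigma2), prior w ~ N(0, I).
sigma2 = 0.1
precision = np.eye(n_sa)   # posterior precision matrix
b = np.zeros(n_sa)         # accumulates phi * r / sigma2

def observe_reward(s, a, r):
    """Rank-one Bayesian linear regression update for one observed reward."""
    global precision, b
    phi = one_hot(s, a)
    precision += np.outer(phi, phi) / sigma2
    b += phi * r / sigma2

# Successor features under the current policy, learned by temporal differences.
psi = np.zeros((n_sa, n_sa))

def td_update_successor(s, a, s_next, a_next, lr=0.1):
    """One-step TD update: psi(s, a) <- phi(s, a) + gamma * psi(s', a')."""
    i, j = s * n_actions + a, s_next * n_actions + a_next
    target = one_hot(s, a) + gamma * psi[j]
    psi[i] += lr * (target - psi[i])

def sample_q():
    """Posterior sampling: draw w from the reward posterior and form Q = psi w."""
    cov = np.linalg.inv(precision)
    w = np.random.multivariate_normal(cov @ b, cov)
    return (psi @ w).reshape(n_states, n_actions)

# Exploration would then act greedily with respect to one sampled Q per episode.
```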

Reviewer 2

This paper proposes using Bayesian linear regression to obtain a posterior over successor features as a way of representing uncertainty, from which they sample for exploration.

I found the characterization of Randomised Policy Iteration to be strange, as it only seems to apply to UBE but not to bootstrapped DQN. With bootstrapped DQN, each model in the ensemble is a value function pertaining to a different policy, so there is no single reference policy. The ensemble is trying to represent a distribution over optimal value functions, rather than value functions for a single reference policy.

Proposition 1: In the case of neural networks, and function approximation in general, it is very unlikely that we will get a factored distribution, so this claim does not seem applicable in general. In fact, in general there should be very high correlation between the Q-values of nearby states. Is this claim a direct response to UBE? Also, the analysis fixes the policy in order to consider the distribution of value functions, but this does not seem to be how posterior sampling is normally framed; rather, it is only how UBE frames it. A straightforward approach to posterior sampling would consider a distribution over optimal value functions, rather than being tied to any specific policy. It is confusing that this analysis is presented as being standard in the posterior sampling literature.

Proposition 2: While yes, you can easily change your value function to maintain the same policy, and thereby break any constraint that your value function may satisfy, this does not mean that it is a good idea to get rid of propagation of uncertainty. The key issue to consider is what happens as your posterior changes. Is there a consistent way to update your distribution of value functions that keeps it matching the posterior sampling policy as the posterior is updated with new data? This proposition only holds for a fixed policy, but PSRL is not sampling from the posterior of models according to a fixed policy. So even if the sampling policies match, they only partially match the true sampling policy of PSRL, which seems quite limited. This limitation is mentioned later in the paper, but it is unclear why Proposition 2 is a desirable quality to have in the first place. Wouldn't one want to match the true sampling policy of PSRL, even at the cost of violating Proposition 2? Also, the proposed Successor Uncertainties algorithm still propagates uncertainties, so it is even more unclear what the purpose of this proposition is.

The experimental results on the chain and tower MDPs are quite promising for SU. They seem to show that SU scales gracefully down to tabular settings. The results on Atari do seem to show improvement over bootstrapped DQN.

Overall, the idea of using Bayesian linear regression on successor features seems to be a promising direction for research, and the experimental results back this up. However, the theoretical analyses are confusing and not well motivated.

**Update after rebuttal** Thanks to the authors for being willing to clarify the definitions and propositions. With more clarity and motivation, I am willing to move to a 7.
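[Editor's note: to make the contrast drawn above concrete, a toy sketch of the two notions of posterior sampling might look as follows: evaluating a fixed reference policy on each sampled MDP versus solving each sampled MDP for its own optimal policy, as PSRL does. The Dirichlet/Gaussian posterior and all names here are illustrative assumptions, not anything specified in the paper or the review.]

```python
import numpy as np

# Toy illustration (assumed posterior, illustrative names): the difference between
# sampling value functions for a FIXED reference policy and PSRL-style sampling,
# where each sampled MDP is solved for its OWN optimal policy.

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)

def sample_mdp():
    """Draw one MDP (transitions P[s, a, s'] and rewards R[s, a]) from a stand-in posterior."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.normal(0.0, 1.0, size=(n_states, n_actions))
    return P, R

def q_for_fixed_policy(P, R, pi, iters=200):
    """Q^pi on the sampled MDP: a distribution over value functions of one reference policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)                       # V(s) = sum_a pi(a|s) Q(s, a)
        Q = R + gamma * np.einsum('sap,p->sa', P, V)
    return Q

def q_optimal(P, R, iters=200):
    """Q* of the sampled MDP itself: what PSRL acts greedily with respect to."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = R + gamma * np.einsum('sap,p->sa', P, Q.max(axis=1))
    return Q

pi_ref = np.full((n_states, n_actions), 1.0 / n_actions)  # fixed uniform reference policy
P, R = sample_mdp()
print(q_for_fixed_policy(P, R, pi_ref).argmax(axis=1))    # greedy actions under Q^pi_ref...
print(q_optimal(P, R).argmax(axis=1))                     # ...can differ from PSRL's greedy actions
```

[In the first case the randomness enters only through Q^pi for the single reference policy; in the second, the induced greedy policy itself varies with the sampled model, which is the behaviour the review associates with PSRL and with bootstrapped DQN's ensemble.]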

Reviewer 3

The work starts by highlighting potential limitations of PSRL-inspired model-free exploration methods via intuitive propositions. To overcome these drawbacks, a framework for decoupling the reward and transition uncertainties by modelling the reward function via BLR is proposed. This architecture is used along with TD successor-feature learning and an additional Q-value constraint. They conclude with experiments on tabular MDPs and Atari 2600 games.

Originality - The decoupling of uncertainty via BLR on rewards is an interesting direction for driving exploration.

Quality - The first half of the paper is very clearly written, but the second half, from Section 4 onward, lacks adequate motivation for the method and adequate structure.

Clarity - The part from Section 4 onward, while a proposal for the algorithm, lacks adequate motivation for why the approach would not suffer from the drawbacks highlighted in the first part. It is unclear whether SU is an interesting exploration algorithm because it overcomes the limitation of Proposition 2, or because it satisfies Definition 2. While the authors acknowledge some limitations of the method in Section 4.4, the two parts of the paper seem rather incongruent. Further:
(1) Algorithmically, it is unclear how the uncertainties drive exploration: greedily or stochastically?
(2) There is a disconnect between the pseudocode in the appendix and the last paragraph of Section 4.2.
(3) While the theoretical analysis in Section 5.2 with respect to Figure 1 is interesting, the failure of BDQN and UBE is surprising; if the constants are high, I do not see how they fail so badly.
(4) Section 5.3 seems unnecessary and is rather unclearly presented.
(5) Y-axis in Figure 4: clipped, or in between?
(6) Why do the embeddings need to satisfy the stated properties in Section 4.1?
(7) Successor Uncertainties seems a confusing name, considering the proposal explicitly models only the uncertainty in the reward function. How does modelling reward uncertainty compare to modelling transition uncertainty? I understand this is briefly discussed in Section 4.4, but the discussion seems to say "we can benefit from modelling successor uncertainty", which mostly renders the name a misnomer.
(8) Section 5.2 is rather unclear: do tied actions mean stochastic transitions?
(9) Does the neural network model the action as just another input?

Significance - I think the work is significant in parts, but the complete paper could be better organized, and the contributions of Part 2 better placed in the context of Part 1. While modelling the reward uncertainty seems promising for propagating uncertainty in a more robust manner, the presentation of the actual algorithm obfuscates a lot of details in the main paper.

PS: I have not reviewed the proofs for Section 5.1.

Post-rebuttal: Thank you for your clarifying remarks, and sorry about the confusion regarding the Chain MDP. I have read the rebuttal and have updated my score.