Paper ID: 1247

Title: Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

The paper proposes episodic backward updates to improve data efficiency in RL tasks; it further introduces a soft relaxation of this update to combat the overestimation that typically arises when applying backward updates with neural network models. Overall the paper is very clearly written. My main concerns are with the experimental details and with the literature review; taking the existing literature into account, the novelty of the work is quite limited. The idea of backward updates is quite old and goes back at least to the 1993 paper "Prioritized Sweeping" by Moore and Atkeson, which in fact demonstrates a method very similar to what the authors propose and which the authors fail to cite. Furthermore, quite a few recent papers operate in a similar space of ideas, using a backward view in ways similar to the authors', e.g.:

- Fast deep reinforcement learning using online adjustments from the past, https://arxiv.org/abs/1810.08163
- RUDDER: Return Decomposition for Delayed Rewards, https://arxiv.org/abs/1806.07857
- Is prioritized sweeping the better episodic control?, https://arxiv.org/abs/1711.06677
- Efficient Model-Based Deep Reinforcement Learning with Variational State Tabulation, https://arxiv.org/abs/1802.04325

To name a few. These works should be included in the related work section, and the precise differences/advantages relative to the authors' proposal should be discussed.

The experimental section is overall well written, but there are a few points of concern. It is not clear how the authors picked their hyperparameters: were those selected to give good performance after 20M frames? Did the authors apply the same hyperparameter selection criteria to the baselines? A recent paper (https://arxiv.org/abs/1906.05243) has shown that DQN can be made vastly more data efficient at 10M frames if hyperparameters are optimised to that end.
It is therefore very important, in a work focusing on data efficiency, to select the hyperparameters of the baselines and of the proposed algorithm according to the same criteria, and it is not clear to me that the authors did so. For example, from the supplementary material the authors seem to be training their model on a batch every agent step (or 4 frames), while anything DQN-based, like their prioritized replay baselines, will only do so every 4 transitions (or 16 frames), which is in fact one of the changes the authors of https://arxiv.org/abs/1906.05243 advocated to make DQN more data efficient. Also, the training procedure for the adaptive diffusion parameter is confusing: doesn't the procedure the authors advocate make training K times less data efficient?

Otherwise I believe the paper to be well written and interesting, so my main concerns are the lack of novelty given recent research and some missing details regarding the experiments.

Post Rebuttal: Overall the authors addressed my main concerns. However, I found the authors' response overly combative, especially since one of my main concerns with the paper was a factually incorrect statement regarding the frequency of updates to the policy. I would encourage the authors to take more care in the future when formulating responses to reviewers. Nevertheless, the authors addressed my main concern, and accordingly I have changed my score to 6.
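To make the update-frequency arithmetic behind this concern explicit, here is the bookkeeping I have in mind. The schedules are my reading of the supplementary material, not confirmed numbers:

```python
# Gradient updates within a fixed frame budget under two training schedules.
# Assumption (my reading of the supplementary material): EBU trains on a batch
# every agent step, while standard DQN-based baselines train every 4th step.
FRAMES_PER_AGENT_STEP = 4  # standard Atari frame skip

def gradient_updates(total_frames, agent_steps_per_update):
    """Number of gradient updates performed within a frame budget."""
    agent_steps = total_frames // FRAMES_PER_AGENT_STEP
    return agent_steps // agent_steps_per_update

ebu = gradient_updates(20_000_000, 1)  # train on a batch every agent step
dqn = gradient_updates(20_000_000, 4)  # standard DQN: every 4th agent step
print(ebu, dqn)  # EBU gets 4x as many updates for the same data budget
```

If this reading is correct, the proposed method performs four times as many gradient updates as the baselines on the same 20M-frame budget, which confounds any data-efficiency comparison.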

The paper is well written and organized, and the appendix is extensive. A new heuristic, called episodic backward update (EBU), is presented. The heuristic aims at increasing data efficiency in sparse-reward environments. To mitigate the instability that this less random approach introduces, a diffusion parameter is proposed together with EBU, and a method for adaptively selecting the diffusion parameter is presented. This replaces a more general approach by a heuristic that even needs another parameter to be robust. This is not a good idea in general, but in the domain of sparse rewards the paper shows that the heuristic has its advantages, so it might well become a method that is used in these domains. The plots in Figure 5 are very interesting, showing some problems of the methods (which I believe are common in Q-function-based methods, but not so often discussed).

A few suggestions/comments:
- The comparison of "39 days" with "a couple of hours" is quite unfair, as in the former case vision has to be learned as well.
- In subsection 3.1 the Q is not bold; you could use \boldmath{Q}
- The font of Figures 2, 4, 5, and 6 is very small, practically unreadable.
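For concreteness, my reading of the backward target generation with the diffusion parameter is roughly the following toy sketch. This is a tabular approximation with made-up argument shapes, not the authors' implementation:

```python
def ebu_targets(rewards, q_next, actions, gamma=0.99, beta=0.5):
    """One-episode backward target generation (my reading of EBU).

    rewards[t] : reward of transition t                 (t = 0..T-1)
    q_next[t]  : target-network Q-values at state s_{t+1}, one per action
    actions[t] : action taken at state s_t
    beta       : diffusion parameter; beta=1 propagates targets fully,
                 beta=0 reduces to ordinary one-step targets
    """
    T = len(rewards)
    q_tilde = [list(q) for q in q_next]  # temporary target table (copied)
    y = [0.0] * T
    y[-1] = rewards[-1]                  # terminal transition: no bootstrap
    for t in range(T - 2, -1, -1):
        a = actions[t + 1]               # action actually taken at s_{t+1}
        # diffuse the freshly computed target into the table for that action
        q_tilde[t][a] = beta * y[t + 1] + (1 - beta) * q_next[t][a]
        y[t] = rewards[t] + gamma * max(q_tilde[t])
    return y
```

The interesting property, and the source of the instability discussed above, is that with beta close to 1 a terminal reward reaches every transition of the episode in a single backward sweep, at the cost of bootstrapping from highly correlated, freshly updated values.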

General comments: the paper presents a theoretically sound algorithm which also boosts the practical performance of deep RL algorithms. The motivation of the original n-step Q-learning update is to ensure that reward signals can propagate effectively throughout a sampled trajectory, which is aligned with the method proposed in the paper. To address the issues of correlated states and overestimation, the authors propose a diffusion scheme that mediates the learning.

Detailed questions:
1. The method seems simple yet effective. A possible issue with the implementation is that sampling a whole episode is more complicated than sampling a fixed-length trajectory. Is this an issue in practice?
2. In Section 3.3, the authors introduce an adaptive beta scheme, which could potentially remove the need for an additional hyperparameter search. However, my understanding is that, in practice, the algorithm trains K networks with varying beta in parallel and picks the best one to proceed. Doesn't this amount to carrying out a hyperparameter search on the fly, rather than being 'adaptive'?
3. If n-step Q-learning has a large n (such that n > T, with T the episodic horizon), does this correspond exactly to the episodic backward update?
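Regarding the relation between full-episode n-step targets and the backward update, a self-contained toy sketch of my understanding (the beta=1 recursion here is my reading of the method, and `q_other` is a hypothetical summary of the off-trajectory Q-values, not a quantity from the paper):

```python
def full_episode_nstep(rewards, gamma):
    """n-step Q-learning target with n >= T: the discounted episode return,
    with no bootstrapping past the terminal state."""
    y = [0.0] * len(rewards)
    y[-1] = rewards[-1]
    for t in range(len(rewards) - 2, -1, -1):
        y[t] = rewards[t] + gamma * y[t + 1]
    return y

def backward_targets_beta1(rewards, q_other, gamma):
    """Backward update with beta = 1 (my reading of the method).
    q_other[t] is the best Q-value among actions *not* taken at s_{t+1};
    unlike the n-step return, the max can pick an off-trajectory value."""
    y = [0.0] * len(rewards)
    y[-1] = rewards[-1]
    for t in range(len(rewards) - 2, -1, -1):
        y[t] = rewards[t] + gamma * max(y[t + 1], q_other[t])
    return y

rewards, gamma = [0.0, 0.0, 1.0], 0.9
# When the propagated target dominates the off-trajectory values,
# both recursions give [0.81, 0.9, 1.0]:
print(full_episode_nstep(rewards, gamma))
print(backward_targets_beta1(rewards, [0.0, 0.0], gamma))
# But a large off-trajectory Q-value breaks the equivalence:
print(backward_targets_beta1(rewards, [5.0, 5.0], gamma))
```

So, if my reading is right, the two coincide only when the max at every step is achieved along the sampled trajectory; in general the backward update can switch to off-trajectory bootstrap values, which the full-episode n-step return cannot.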