NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 1679 Data-Efficient Hierarchical Reinforcement Learning

### Reviewer 1

Summary
=======

The authors present a hierarchical reinforcement learning approach that learns at two levels: a higher-level agent that learns to perform actions in the form of medium-term goals (changes in the state variable), and a low-level agent that aims to achieve (and is rewarded for achieving) these medium-term goals by performing atomic actions.

The key contributions identified by the authors are that learning at both the lower and higher levels is off-policy and takes advantage of recent developments in off-policy learning. The authors say that the more challenging aspect is the off-policy learning at the higher level, because the actions (sub-goals) chosen during early experience are not effectively met by the low-level policy. Their solution is to replace (or augment) high-level experience with synthetic high-level actions (sub-goals) that would have been more likely to occur under the current instantiation of the low-level controller.

An additional key feature is that the sub-goals are given in terms of relative states (deltas) rather than absolute (observed) states, and there is a mechanism to update each sub-goal appropriately as the low-level controller advances. As a side note, I suggest the authors look at the work of George Konidaris, in particular (Konidaris and Barto, 2007), in which agent-centric formalisms are championed as the most suitable for transfer learning (something the authors say they are interested in investigating in future work).

My concerns are relatively minor, and a brief overview is as follows:

* I think that the cited paper [24] (though unpublished) is slightly closer to this work than the authors allow.
* There are some minor technical details, including those relating to prior work (e.g. the DDPG method [25], UVFA [34]), which could be made clearer and more explicit (see below).
* Some notation could be described more fully, and the reader guided a little to anticipate the meaning and use of certain constructions (e.g. the notation for subsequences of time series and sums over such series should be explicit, as should the fact that the raw tuples stored for the higher-level controller will later be processed into conventional off-policy RL tuples).
* Some ideas are reiterated more than once, or their descriptions could be tightened up a bit (e.g. Sections 3.2 and 3.3).

While I have a few comments on the details of the paper, I think the paper is well written and the work reproducible. The experiments, although in a limited domain, are varied and the results convincing. Moreover, I believe it presents a significant and interesting contribution to the field, and I am recommending it for acceptance.

Local references
----------------

George Konidaris, Andrew G Barto (2007). Building Portable Options: Skill Transfer in Reinforcement Learning. IJCAI.

Details
=======

### p3, line 104, minor point

> ...for fixed σ...

I assume this is the standard deviation, but it is fairly common practice to parametrise normal distributions by variance.

### p3, line 111

> ...specify a potentially infinite set of lower-level policies...

To be precise, I would suggest *unbounded* rather than *potentially infinite*.

### p3, line 122

> a fixed goal transition function $g_t = h(s_{t-1}, g_{t-1}, s_t)$

This is quite an unusual feature and is necessary here because goals (being relative) are not static. It would help to tell the reader that here.

### p4, line 130-1

> the higher-level transition $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$

The reader might be guided here too.

### p4, line 139-40

> To maintain the same absolute position of the goal regardless of state change, the goal transition model h is defined as

Being a little picky maybe, but what assumptions on your state space ensure that a relative goal can (always) be defined? And that, when updated in the way suggested, it will always be meaningful and consistent?

### p5, lines 154-8, clarity

> Parameterized rewards are not a new concept, and have been studied previously [34, 19]. They are a natural choice for a generally applicable HRL method and have therefore appeared as components of other HRL methods [43, 22, 30]. A significant distinction between our method and these prior approaches is that we directly use the state observation as the goal, and changes in the state observation as the action space for the higher-level policy...

The (unpublished) work [24], cited elsewhere, does use states as sub-goals, and also as actions in the higher-level policy, but not changes in the state as done here.

### p5, line 159, typo

> This allows the lower-level policy to begin receiving reward signals immediately, even before the lower-level policy has figured out how to reach the goal...

I think the second *lower-level* should be *higher-level*.

### p5, line 164, citation needed

> While a number of prior works have proposed two-level HRL architectures

### p5, line 169-70, citation needed

> off-policy algorithms (often based on some variant of Q-function learning) generally exhibit substantially better sample efficiency than on-policy actor-critic or policy gradient variants.

### p5, line 191 and Equation (5)

> log probability...computed as...

I think the proportional-to sign should be an equals sign, and this relies on an assumption about the sampling policy that should be made explicit, e.g. a mean action with isotropic Gaussian noise. This part of the method seems the most tied to a specific formulation of the learner, and I wonder whether it could be defined in more general terms first, with this as the example (I realise there are other approaches in the supplementary file). Also, the authors should be clear at the outset that this is an approximate optimisation.

### p5, line 194

> ...eight candidate goals sampled randomly from a Gaussian...

What is the Gaussian's covariance matrix?

### p6, line 223-4, clarification when referring to ref [24]

> ...[24]. In this work, the multiple layers of policies are trained jointly, while ignoring the non-stationarity problem.

The work cited is unpublished (arXiv), but I happen to have read it. In the cited work, the authors adapt a published method, Hindsight Experience Replay (HER), which in their own work is used to replace the goals at lower levels with the actually achieved states. This contrasts with the currently submitted paper, in which the replaced entity is the action of the higher-level agent (rather than the lower-level goal), but in other respects there is some degree of similarity that could be remarked upon.

### p6, line 231

> In our experiments, we found this technique to under-perform compared to our approach, which uses the state in its raw form. We find that this has a number of benefits.

This could be clearer.

### p7, Figure 4 caption

> extremely fast...

Use "quickly" or "rapidly".

Post Rebuttal
=============

Reviewer #3's question regarding the significance of one of the results in Table 1 is valid. I still think the paper is interesting and worthy of publication, but I have moved my recommendation down to a 7 to reflect this concern. If the paper is accepted for publication, I would encourage the authors to make clear in the camera-ready version whether the values are standard deviations, standard errors, or something else. The authors should also make clear their justification for the significance claim.
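To make the comments on p3, line 122, p4, lines 139-40 and p5, line 191 concrete, the following is a minimal sketch of the relabelling step, not the authors' implementation. It assumes that the goal transition takes the form $h(s_{t-1}, g_{t-1}, s_t) = s_{t-1} + g_{t-1} - s_t$ (which keeps the absolute target fixed), that the low-level behaviour policy is a deterministic mean action `mu_lo(s, g)` plus isotropic Gaussian noise with standard deviation `sigma`, and that the candidate goals are supplied by the caller. The names `goal_transition`, `sequence_log_prob`, `relabel_goal` and the toy policy are hypothetical.

```python
import numpy as np

def goal_transition(s_prev, g_prev, s_next):
    # Assumed form of h: the absolute target s_prev + g_prev stays fixed,
    # so the relative goal shrinks as the state approaches the target.
    return s_prev + g_prev - s_next

def sequence_log_prob(states, actions, goal, mu_lo, sigma):
    # Log-density of the observed low-level actions given an initial goal,
    # assuming a_t ~ N(mu_lo(s_t, g_t), sigma^2 I). The normalising constant
    # is kept, so this is an equality rather than a proportionality.
    g, total = goal, 0.0
    for t in range(len(actions)):
        diff = actions[t] - mu_lo(states[t], g)
        total += -0.5 * np.dot(diff, diff) / sigma**2
        total += -0.5 * diff.size * np.log(2.0 * np.pi * sigma**2)
        g = goal_transition(states[t], g, states[t + 1])
    return total

def relabel_goal(states, actions, candidates, mu_lo, sigma=1.0):
    # Approximate maximisation: pick the best of a finite candidate set.
    scores = [sequence_log_prob(states, actions, g, mu_lo, sigma)
              for g in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example (hypothetical): the policy steps towards the goal, clipped to [-1, 1].
toy_mu = lambda s, g: np.clip(g, -1.0, 1.0)
states = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
actions = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
candidates = [np.array([0.0, 2.0]), np.array([2.0, 0.0]), np.array([-2.0, 0.0])]
best = relabel_goal(states, actions, candidates, toy_mu)
print(best)  # the candidate [2, 0] reproduces both observed actions exactly
```

Because the search runs over a finite candidate set, the result is only an approximate maximiser, which is the point raised about stating the approximation up front.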

### Reviewer 2

Summary of the paper: This paper presents a hierarchical reinforcement learning method for training a two-layer hierarchical policy. The higher-level policy determines the sub-goal and the lower-level policy tries to achieve the given sub-goal. Both policies can be trained in an off-policy manner.

Relation to previous work: The policy structure is similar to the one in the Feudal Network (FuN), but the sub-goals generated by the higher-level policy in this work are in raw form. Most of the prior work on HRL is on-policy, but this paper presents off-policy training of a hierarchical policy. The policy update in this work is based on TD3.

Strength:

- Since most prior work on HRL is on-policy, presenting an off-policy method for HRL is one of the contributions of this work.
- The experimental results show that the proposed method outperforms existing HRL methods in fairly complex tasks with continuous control.

Weakness:

- The theoretical contribution is minor. There is no novelty in the policy update or the policy structure. The proposed method looks like a minor modification of FuN to make it work in an off-policy manner.
- It seems that the authors have employed various tricks to make the goal re-labeling work.
- The benefit of the off-policy correction is not sufficiently discussed in the experiment section.
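The contrast drawn above between raw-form sub-goals and FuN can be illustrated with a small sketch of the two kinds of low-level reward. This is a hedged illustration only: it assumes the raw-state reward is the negative Euclidean distance to the target state `s + g`, and that a FuN-style reward is the cosine similarity between the state change and the goal direction. Neither formula is spelled out in these reviews, and the function names are hypothetical.

```python
import numpy as np

def distance_reward(s, g, s_next):
    # Assumed raw-state reward: negative Euclidean distance between the
    # reached state and the absolute target s + g.
    return -float(np.linalg.norm(s + g - s_next))

def cosine_reward(s, g, s_next, eps=1e-8):
    # Assumed FuN-style reward: cosine similarity between the state change
    # and the goal direction; eps guards against division by zero.
    move = s_next - s
    denom = np.linalg.norm(move) * np.linalg.norm(g) + eps
    return float(np.dot(move, g) / denom)

s = np.array([0.0, 0.0])
g = np.array([3.0, 4.0])
small_step = np.array([0.3, 0.4])  # correct direction, far from the target

r_dist = distance_reward(s, g, small_step)
r_cos = cosine_reward(s, g, small_step)
print(r_dist, r_cos)
```

Under these assumptions a small step in the right direction already earns nearly the maximum cosine reward, while the distance reward still reflects how far the target is.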

### Reviewer 3

The paper reports an improvement in the data efficiency of hierarchical RL. This improvement is achieved by a novel method, which is a modification of existing methods. The modification consists of two parts. The novel method

1. is capable of off-policy learning, which in turn increases data efficiency, and
2. uses the proximity to raw states instead of internal states as the low-level reward function.

The method is tested on four variants of a grid world benchmark and compared to several other methods as well as systematic alternatives. The paper is written very clearly and is well structured. It's of very high quality. Still, there are some aspects where clarification would help.

I think it should be made clearer which assumptions have to be met in order to apply the method or parts of it:

- Is the method (or parts of it) limited to deterministic MDPs?

I think the reader would benefit if the authors made it explicit that the considered benchmark variants are deterministic. ... that the method makes the assumption that introducing goal states can help in solving the problem, and some kind of motivation for doing this.

On page 2, line 77, there is a very strict statement: "these tasks are unsolvable by non-HRL methods". I think this statement is simply not true in this generality. Please consider making this statement more precise, like "none of the tested non-HRL methods was able to solve the task for the investigated amount of interactions".

There is an inconsistency between the reported results of Table 1 and the text: the text says "significantly out-performed", while the table states for the "Ant Gather" benchmark "FuN cos similarity" 0.85 +/- 1.17 compared to "Our HRL" 3.02 +/- 1.49. With reported uncertainties of 1.17 and 1.49, these results are "equal, given the uncertainties". It might be a good idea to reduce the uncertainties by repeating the experiments not just 10 times but 100 times, thus being able to make the results significant.

As the text is really well written, I assume that the missing capital letters (maxq, Vime, laplacian, ...) in the literature list will be fixed for the final version.

Choosing the title is of course up to the authors, but I think that the reader would benefit from a title that is less general and more specific to the presented method, e.g. "Data-Efficient Hierarchical Reinforcement Learning in Grid World Environments" or "Data-Efficient Feudal Reinforcement Learning in Deterministic Environments".

Post Rebuttal
=============

Unfortunately, the authors did not respond to the concerns that I expressed in the review, starting with "There is an inconsistency". This reduces my trust in the correctness of the experiments, the presentation of the results, and the conclusions drawn from them. I have therefore lowered the overall score to 6, and expect that the authors either

- increase the number of trials to reduce the uncertainties, thus increasing significance, or
- come to the conclusion that the values given after the +/- are in fact the standard deviations of the ten trials, and correct this to the standard error, or
- explain how the values after the +/- sign have been derived, and on what grounds the claims of significance have been made.
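How much the reading of the +/- values matters can be checked with a short calculation on the two Table 1 entries quoted above. This is a rough sketch, assuming independent runs and ten trials per method. It computes a Welch-style statistic (difference of means divided by the combined standard error) under two readings of the +/- values; the variable and function names are hypothetical.

```python
import math

def welch_statistic(mean_a, se_a, mean_b, se_b):
    # Difference of means in units of the combined standard error.
    return (mean_b - mean_a) / math.sqrt(se_a**2 + se_b**2)

n_trials = 10                    # trials per method, as stated in the review
fun_mean, fun_pm = 0.85, 1.17    # "FuN cos similarity", Ant Gather
hrl_mean, hrl_pm = 3.02, 1.49    # "Our HRL", Ant Gather

# Reading 1: the +/- values are already standard errors of the mean.
t_if_se = welch_statistic(fun_mean, fun_pm, hrl_mean, hrl_pm)

# Reading 2: the +/- values are standard deviations over the ten trials,
# so each standard error is the standard deviation divided by sqrt(n).
scale = math.sqrt(n_trials)
t_if_sd = welch_statistic(fun_mean, fun_pm / scale, hrl_mean, hrl_pm / scale)

print(round(t_if_se, 2), round(t_if_sd, 2))
```

The statistic comes out at roughly 1.1 under the first reading and roughly 3.6 under the second, so whether "significantly out-performed" is defensible depends on what the +/- values denote.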