NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:1735
Title:Strategic Attentive Writer for Learning Macro-Actions

Reviewer 1

Summary

This paper presents a way to learn options without having to specify when they should start and for how long they should extend. The model has a ConvNet that operates on the input state and produces an intermediate (hidden) representation z. From z, the model produces 1) an action-plan (A), which is a set of sequences of actions; 2) a commitment-plan (c), which says if/when the actions must be re-planned or whether the current plan can keep being executed without looking at the input; 3) an attention vector (\psi) that selects some parts of the action-plan. The authors propose to do structured exploration with macro-actions by adding Gaussian noise to z before computing A, c, and \psi. The model is trained either with the negative log-likelihood (in supervised learning) or with asynchronous advantage actor-critic (A3C, in reinforcement learning). The authors show experiments on character-level language modelling, 2D maze navigation, and 8 ATARI games.
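To make the data flow concrete, here is a minimal sketch of how I read this pipeline. Everything in it is hypothetical: the layer names, shapes, and the simple linear heads are mine, not the authors' (the paper writes into the plans with attention, which is omitted here).

```python
import numpy as np

# Toy, hypothetical dimensions: number of primitive actions, planning horizon N,
# and the size of the latent feature vector z.
NUM_ACTIONS, HORIZON, LATENT = 4, 10, 32
rng = np.random.default_rng(0)

def feature_extractor(frame):
    # Stand-in for the ConvNet that maps the input state to z.
    return rng.standard_normal(LATENT)

# Hypothetical linear heads producing the action-plan A, the commitment-plan c,
# and the attention parameters psi from z.
W_A   = 0.1 * rng.standard_normal((NUM_ACTIONS * HORIZON, LATENT))
W_c   = 0.1 * rng.standard_normal((HORIZON, LATENT))
W_psi = 0.1 * rng.standard_normal((3, LATENT))   # grid position, stride, width

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def plan(frame, exploration_std=0.1):
    z = feature_extractor(frame)
    # "Structured exploration": Gaussian noise is injected into z before the
    # plans are computed, so one perturbation changes a whole macro-action.
    z = z + exploration_std * rng.standard_normal(LATENT)
    A = softmax((W_A @ z).reshape(NUM_ACTIONS, HORIZON))   # action-plan
    c = 1.0 / (1.0 + np.exp(-(W_c @ z)))                   # commitment-plan
    psi = W_psi @ z                                        # temporal attention
    return A, c, psi

A, c, psi = plan(frame=None)
action = rng.choice(NUM_ACTIONS, p=A[:, 0])   # execute the head of the plan
replan_now = rng.random() < c[0]              # commitment bit: replan or keep going
```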

Qualitative Assessment

This is a good paper that blends ideas from the recent literature on attention models for vision (e.g. DRAW) into reinforcement learning, with the aim of learning options. It is quite verbose at times but easy to follow. It could have impact by removing the need to explicitly subsample actions (in timesteps) in "high-frequency" games, and pave the way for learning options. The main limitations of the paper are in the experiments:
- The text task (character-level LM) has no quantitative results. I get that this is a toy experiment, but situating STRAW w.r.t. n-grams and LSTMs would be better.
- On the 2D maze, it seems to me the only conclusion one can draw is that learning on bigger mazes can be mitigated by STRAW learning to repeat the same action several times. A baseline that should be tested is an LSTM with (precomputed) n-grams of actions, or an LSTM with some caching of frequent sequences of actions. That would help check whether STRAW is only doing this or more.
- The ATARI experiments are the most complete and interesting. Table 1 should have a line with the current (previous) state of the art for each of the games (I had to check other papers to confirm that STRAW/STRAWe has very good performance).
- The ablative analysis (5.5) is good to see, but it could be more complete (and the experiments are limited in epochs). A proper ablative analysis would test which of A, c, and \psi really need to be learned and which could be "random".
Form: the figures are way too small to be read on paper; I had to load the PDF and zoom in quite a lot. Please at least increase the font size in the figures.
Overall I would like to see this accepted at NIPS, in particular if the authors address the points above.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper presents an application of the (previously published) attentive DRAW system to reinforcement learning. The system draws the upcoming N actions on a small action slate, and an extra gating mechanism is added that determines whether a draw is performed at the current time step. This gate allows the system to switch off drawing, and thus commit to the current plan. The main argument of the paper is that the system naturally learns to commit to a sequence of actions and thus forms macro-actions, in a fully end-to-end differentiable system.
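For illustration, a toy sketch of the gated commit/redraw loop as I understand it. It is entirely my own: the draw functions are made-up stand-ins for the network outputs, and the real system writes into the slate attentively rather than overwriting it.

```python
import numpy as np

rng = np.random.default_rng(0)
N, NUM_ACTIONS = 10, 4   # hypothetical slate length and action count

def draw_action_slate(state):
    # Stand-in for drawing the next N action distributions onto the slate.
    logits = rng.standard_normal((NUM_ACTIONS, N))
    return np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

def draw_commitment(state):
    # Stand-in for the probabilities of re-drawing at each of the next N steps.
    return 1.0 / (1.0 + np.exp(-rng.standard_normal(N)))

slate, commit = None, None
for t in range(20):                                   # dummy environment loop
    state = None                                      # placeholder observation
    exhausted = slate is None or slate.shape[1] == 0
    gate = exhausted or rng.random() < commit[0]      # the gate g: redraw or commit
    if gate:
        slate = draw_action_slate(state)              # draw the next N actions
        commit = draw_commitment(state)               # and when to consider redrawing
    action = rng.choice(NUM_ACTIONS, p=slate[:, 0])   # execute the head of the slate
    slate, commit = slate[:, 1:], commit[1:]          # shift the plan forward one step
```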

Qualitative Assessment

Summary of Recommendation: The paper introduces an original idea. Committing to a plan has been introduced before in RL, e.g., in Sutton's options literature (where no learning occurs), Schmidhuber's hierarchical RL systems of the early 1990s, and Wiering's HQ-learning, but the new approach is different. However, the formalisation and the experimental section lack clarity and raise several questions. In particular, the experiments don't show very convincingly that the attentional mechanism is needed (although it seems like a very nice idea), and the actual behaviour of the attention is not explored at all. I don't see this as a fatal flaw, but it is definitely problematic since the title and main thrust of the paper rely on it. However, I would be inclined to accept the paper under the condition that the issues below are addressed in the revised version, which I'd like to see again. Please fix the following issues:
Notation/Presentation:
- Inconsistent notation between the paper and the schematics: e.g., the schematic uses 'C' for the action plan and 'd' for the commitment plan, while the text uses 'A' and 'c' respectively. Please be consistent.
- Algorithm 1: the if statement is superfluous, since the writing step is already gated by 'g'.
- Eq. 2: 't' is used both as superscript and subscript, which is inconsistent. Also, what is 'e', and why did 'phi' lose its subscript 't'?
- The panels in Fig. 3 are obviously too small and are generally explained only vaguely (both in the text and the caption). This should be fixed.
- Figure 4 is supposed to show the macro-actions by presenting the game states at the planning steps. However, since it is unclear how the game works, it is almost impossible to tell whether these indeed represent macro-actions. I would rather see this result for Pacman or Amidar, where there are clear macro-actions (climbing a ladder, walking down a corridor).
- Figure 6 is again way too small.
Terminology:
- Why use the term 'structured exploration'? It is just adding noise to the internal representation; there is nothing structural about it. I think the term is confusing and should be altered or clearly justified.
Related work: The paper says: "Options are typically learned using subgoals and 'pseudo-rewards' that are provided explicitly [7, 8, 28]." But there have been papers that learned the entire plan and the decomposition by themselves. In particular:
- Closely related seems Hierarchical Q-Learning by Wiering & Schmidhuber (Adaptive Behavior, 1997). Occasionally the higher-level RL controller of HQL transfers control to a plan in the form of a lower-level RL agent. This works even in partially observable environments. Please explain the differences to the present system!
- Closely related in a different way is Schmidhuber's fully differentiable learner & planner based on two interacting RNNs called the model and the controller, where the plan of the controller is continually improved through gradient descent using the model: "An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. IJCNN 1990." Please explain the differences to the present system!
Methodology/Experiments: These are the most important issues.
- The maze experiment is supposed to use a CNN, but in the experiments an LSTM suddenly appears. This is confusing. And is it the original LSTM of 1997, or the one with forget gates by Gers et al. (2000), which most people seem to use now?
- The results are 'averaged over top 5 agents'. What does this mean? How many runs were made to select these top 5? Without this explanation it is hard to estimate the expected variance of the results and thus their significance. E.g., is a score of 6902 for Pacman substantially better than 6730, or is the difference explained by variance? Please elaborate.
- 'Replan at random' seems to do as well as regular STRAW. A clearer acknowledgement of this fact is needed.
- What I miss most is a discussion of the behaviour of the attention mechanism. Where in time does it write? Does it change at all?
Grammar:
- "[an] adequate temporal abstrations"
- "[resotres]"

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper proposes a way for an agent to learn so-called "macro-actions" (where one macro-action is a sequence of low-level actions) to help long-term planning in reinforcement learning tasks. The core idea consists in predicting the sequence of actions to perform over the next N steps. The agent follows this sequence until a new prediction is made, which occurs at "re-planning" steps that are also decided by the model. More precisely, a re-planning step triggered at time t consists in predicting, for each future time step t+1, ..., t+N, both a probability distribution over actions and a probability of triggering another re-planning (using an attention mechanism so that predictions are actually updated only for a small number of time steps). The main experiments are on reinforcement learning tasks (a 2D maze and the Arcade Learning Environment (ALE)), showing the benefits of learning such macro-actions. Importantly, better results are obtained by adding noise to the state representation, making the agent's predictions more robust.
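To make the attention part concrete, here is my own toy sketch of a DRAW-style 1D write over the temporal axis, parameterized by a grid position, stride, and width; all function names and shapes are hypothetical and do not come from the authors' code.

```python
import numpy as np

def temporal_filterbank(position, stride, width, K, T):
    """DRAW-style 1D Gaussian filterbank over the time axis: K filter centres
    spread by `stride` around `position`, each with standard deviation `width`."""
    centres = position + (np.arange(K) - K / 2.0 + 0.5) * stride        # (K,)
    t = np.arange(T)                                                    # (T,)
    F = np.exp(-((t[None, :] - centres[:, None]) ** 2) / (2.0 * width ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)                    # (K, T)

def attentive_write(plan, patch, position, stride, width):
    """Scale and position a small K-column patch into the full T-column plan,
    so only a handful of future time steps are meaningfully updated."""
    K, T = patch.shape[1], plan.shape[1]
    F = temporal_filterbank(position, stride, width, K, T)
    return plan + patch @ F

# Toy usage: write a 3-step patch around t = 2 into a 10-step plan over 4 actions.
plan = np.zeros((4, 10))
patch = np.random.default_rng(0).standard_normal((4, 3))
plan = attentive_write(plan, patch, position=2.0, stride=1.0, width=0.7)
```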

Qualitative Assessment

The paper is clearly written (up to a few notation issues detailed below), with proper references to related work. It addresses a very relevant problem in reinforcement learning: automatically learning high-level abstractions of actions to allow better long-term planning. To the best of my knowledge the approach taken here is original, and the ideas presented are sound, making use of several "hot" methods in recent machine learning applications (Asynchronous Advantage Actor-Critic, attention mechanisms, LSTMs, Convolutional Neural Networks...). Experimental results are promising, indeed demonstrating that the proposed algorithm helps the agent perform better in most tasks. Overall I believe this is very interesting research that might eventually lead to significant advances in the field. It does not, however, in my opinion, reach that point in the current state of this submission. Although there are many relevant experimental results shown here (and no room left for more due to the 8-page limit), I still feel after reading the paper that it is not entirely clear why and how it works. Here are some examples of questions that are not answered in the current state of the paper:
- How useful is the attention mechanism? It looks like the agent is systematically re-planning only the next few steps anyway (see e.g. Fig. 5 and the supplementary video).
- In the maze experiments, is the improvement over the LSTM brought by the long-term planning part, the noisy state representation, or both?
- Is the "action repeat of 4" in ALE still useful for such a model, which is supposed to be able to learn by itself if and when it is good to commit to multiple consecutive actions? (I know it is standard practice in ALE, but it would be interesting to see if your algorithm can do better.)
- Extra computational cost aside, would the agent perform better or worse if it re-planned at each time step at evaluation time (but not during training, as in 5.5)?
- What would Fig. 6 look like over 500 epochs instead of 100? (There is no justification for looking only at the first 100 epochs.) What would it look like across more games?
- What are typical macro-actions learned in ALE?
Other small remarks / questions:
- Are actions also sampled during evaluation, or is the highest-scoring action used?
- Similar question for the noise added in STRAWe: is it also added during evaluation?
- "a matrix D of the same size as At, which contains the patch p scaled and positioned according to ψ": with the rest set to 0?
- Please clarify earlier (in the paragraph at l. 145) how the gradient flows through g_t (instead of waiting until the last sentence of section 4): it is confusing in Alg. 1 because it looks like a useless multiplication by 1, and the link to c_1^t-1 is unclear at that point.
- The formatting of eq. 2 makes it hard to read.
- Eq. 3 is confusing: should it be "t=1" below the sum? Could you please define 1_gt? Should it be zeta instead of phi? What is c_t[t]? (By the way, it should be c^t, not c_t.)
- "The most natural solution for our architecture would be to create value-plan containing the estimates": the wording "would be" makes it sound like you are doing something different, but as far as I can tell it is what is being done here. Just use "... is to create a..."?
- l. 254: "Notice how corners and areas close to junctions are highlighted" => not all corners are; maybe add "some"?
- l. 263: "every 2 million training steps" => better say "every two epochs" since Fig. 3b is plotted in epochs.
- What are the blue / black horizontal lines in Fig. 5?
- Why does replanning occur in Fig. 5 even though it looks like the next two re-planning bits (last row in the figure) are off in all frames?
Typos and notation issues:
- l. 7: "re-planing"
- l. 43: "it is committed its action plan"
- l. 57: "in genral"
- l. 60: "enviroment"
- C^t and d^t are used in some places instead of A^t and c^t: in Fig. 1, Fig. 2, l. 143 and Fig. 5
- l. 111: "it used to update"
- l. 117: "A grid of K × A of"
- l. 118: "co-ordinates"
- l. 131: "h is two-layer"
- l. 138: the last c_t should be c^t, and the last psi^c should be psi_t^c
- l. 219: "STRAW(e)" should be STRAWe
- l. 236: "it's"
- l. 258: "uses a larger 15 x 15 random mazes"
- Fig. 4: "From top to bottom" should be "From left to right"
- Fig. 5: "drasticly" and "resotres"
- l. 300: "and is beneficial to re-plan often"
Update after author feedback: Thanks for the response. I hope some of the questions I asked above can be answered in future work. Regarding the attention mechanism, thanks for pointing out the differences in planning horizon in the video. I still believe, though, that it would be interesting to better understand how it contributes to the overall agent behavior: do you really need three attention parameters (position / stride / width), or would a single planning length be enough? (Maybe even fixed for a given game?) Also, do the read & write attention parameters need to be the same? Thank you also for mentioning the results when re-planning at each step. It would be interesting to add a remark about this in a final version (but please clarify why the internal states differ, as I am not sure why this is the case).

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

The authors propose a new neural network architecture that automatically learns macro-actions in reinforcement learning. They present results in various domains, including a number of ATARI games.

Qualitative Assessment

The paper addresses one of the central problems in reinforcement learning: the automatic discovery of high-level behaviors (temporal abstraction). The approach proposed by the authors is quite different from those proposed earlier in the literature. Experimental evaluation is carried out on a large set of problems, and the results are promising.
While I understand the approach at a high level, and found it interesting and novel (within the context of temporal abstraction in RL), I do not fully understand all facets of the algorithm. The authors could describe the algorithm and analyze how it works more clearly and in greater depth. For instance, I do not have a good understanding of what type of macro-actions are produced and why these types of macro-actions are particularly useful. In the maze domain, the corner states seem to be the locations with the highest rates of replanning. Why is that? And how is that useful? In the Frostbite domain, I do not follow the authors' description of the macro-actions produced. The frames shown in Figure 4 seem like randomly selected frames from the game. It would be useful to first describe the game briefly, including the set of primitive actions available. It would also be useful to show the trajectory between the frames where the algorithm replanned.
Inconsistencies in notation: the action plan is sometimes denoted by A, other times by C (e.g., in Figure 1); the commitment plan is sometimes denoted by c, other times by d. Figure 3a requires a scale showing the replanning rates denoted by the different shades of red.

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 5

Summary

This paper proposes a new deep LSTM network architecture to learn temporally abstracted macro-actions of varying lengths. The authors apply the differentiable attentive reading and writing operations of [10], defining attention over the temporal dimension instead of the spatial one.

Qualitative Assessment

The novelty of this paper is incremental, since it follows [10], applying temporal attention instead of spatial attention. The symbols in Figures 1 & 2 are not consistent with the text, which causes confusion.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

This paper proposes a recurrent model that builds internal plans, which are continuously updated and committed to at each time step. The proposed model can learn to segment this internal representation into sub-sequences of actions by learning how long the plan should be committed to. The whole model is trained end-to-end using A3C.
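For readers less familiar with the training signal, a generic n-step advantage actor-critic objective looks roughly like the sketch below. This is my own simplification, not the paper's exact loss; the usual entropy bonus and any additional terms in the paper's Eq. 3 are omitted.

```python
import numpy as np

def a3c_style_losses(log_probs, values, rewards, gamma=0.99):
    """Generic n-step advantage actor-critic objective over one rollout.
    log_probs: log pi(a_t | s_t), values: V(s_t), rewards: r_t, all of length n."""
    n = len(rewards)
    returns, R = np.zeros(n), 0.0
    for t in reversed(range(n)):                                 # discounted n-step returns
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - np.asarray(values, dtype=float)
    policy_loss = -(np.asarray(log_probs) * advantages).sum()    # actor term
    value_loss = 0.5 * (advantages ** 2).sum()                   # critic term
    return policy_loss, value_loss

# Toy usage on a 3-step rollout.
pl, vl = a3c_style_losses(log_probs=[-0.7, -1.2, -0.3],
                          values=[0.5, 0.4, 0.6],
                          rewards=[0.0, 0.0, 1.0])
```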

Qualitative Assessment

This is a very interesting paper that proposes a novel recurrent model. The uniqueness of the model comes from the fact that it can construct its own plans and dynamically update and commit to them. The model is generic and could be applicable to several different tasks. However, the text could be clearer, which would make it easier to understand how the model works; in particular, the terminology and notation become confusing. Figure 1 is a bit confusing: in that figure, I think the action-plan matrix is represented as C instead of A, and the commitment plan as d_t instead of c_t.
I would expect this kind of model to be useful for games that are more sensitive to planning, such as Montezuma's Revenge; however, the authors only provide results on 8 Atari games. How did you choose these games? Are they cherry-picked? Do you have any explanation for why STRAW performs worse on the game Hero?
Minor comments: It is quite interesting that the structured exploration brings such a large improvement. Is it just a regularization effect, or does it also influence the optimization? The authors use an attention mechanism inspired by [10]; have you tried different types of attention mechanisms?
To summarize, the strengths of this paper are: 1) it proposes a novel model that can construct internal plans in order to accomplish a particular task; 2) end-to-end training of the model with reinforcement learning; 3) interesting experimental results. Weaknesses are: 1) the text could be more accessible and clear.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)