NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 1183: Generalized Off-Policy Actor-Critic

### Reviewer 1

This paper is largely well presented and is original work on an important problem. It contributes a new actor-critic approach to dealing with covariate shift in an off-policy objective. Empirically, this seems to result in some performance gains, although the method would need to be evaluated on a wider range of tasks and settings to determine whether it is more broadly applicable.

### Reviewer 2

**Originality:**
- The paper is the first to propose the counterfactual objective, which lets users smoothly interpolate between the alternative-life objective and the excursion objective.
- The paper takes the gradient of this objective directly and, drawing on the same ideas as previous methods, derives a set of updates that forms a new algorithm for this objective.

**Quality:**
- The mathematical arguments are convincing and reasonable.
- The paper is clear that it does not necessarily outperform OPPG algorithms like DDPG, but instead shows the effectiveness of emphatic algorithms in a difficult domain with nonlinear function approximators.
- Potential instability in estimating F, C, and V is mentioned, which could prevent the algorithm from converging as well as the theory suggests.
- It is unclear how research into GOPPG can improve applications of OPPG. Since the paper does not lead to significant empirical gains, a longer analysis of this point and suggestions for improving OPPG methods would benefit the work.

**Significance:**
- This is the first emphatic algorithm to make significant progress on the MuJoCo domain.
- The results are not that significant over the state of the art for these environments, and there is no comparison to more recent OPPG algorithms like SAC and TD3.

**Clarity:** The paper is fairly clear and well organized. The math is understandable, and the longer proofs are in the appendix. The paper is upfront about its downsides.

**Nits:** L111: "w.p." => "with probability"; "w.p." is not a common acronym.

### Reviewer 3

**Originality:** My main concern is the novelty of the proposed method. The key part of the proposed counterfactual objective and its gradient is the covariate-shift term d_gamma. However, the way of correcting for the covariate shift is taken directly from [Gelada and Bellemare, 2019], and the rest simply plugs it into the policy gradient theorem of the ACE paper.

**Clarity:** The paper is written clearly and is easy to follow.

**Significance:** The comparison between DDPG and Geoff-PAC shows good practical performance and the potential of the proposed method.

I have another minor concern about the implementation of the method. It is not clear to me how the term g(s) in Thm 2 can be computed efficiently, especially since it is later mentioned that the algorithm uses a replay buffer.

=== After Rebuttal ===

After reading the authors' response about the novelty, I would like to increase my score accordingly. I hope the original contributions of this paper relative to Gelada's work will be presented more clearly in the final version, particularly: 1) explain the necessity of the new counterfactual objective and its advantage over the direct gradient re-weighting method; 2) highlight that the novel part is how to compute F_t^{(2)}.

### Reviewer 4

**Originality:** I think the method is new and the counterfactual objective is studied here for the first time. The objective is related to [Imani, Graves, and White, 2018] and uses the methods of [Gelada and Bellemare, 2019] to tackle the new objective.

**Quality:** My main concern is the intuition for the objective function, and the difference between equation (4) and equation (5). Without the interest term i(s), the main difference between the two objectives is the sampling distribution. The original on-policy objective takes the distribution to be $d_0(s)$, the initial distribution of $s_0$. The excursion objective can be read as: first execute the behavior policy $\mu$ until convergence, and then execute the target policy $\pi$ (since we are using the value function $v$ under policy $\pi$). The excursion objective makes sense when $P^{\pi}$ is ergodic, in which case the initial distribution does not matter. Under the same reading, objective (5) is almost identical, except that we first run enough steps to reach the stationary distribution of policy $\pi$ and then execute $\pi$ again. If (4) and (5) do not differ for this reason, then the counterfactual objective in this paper, which can be thought of as a shrinkage between (4) and (5), should not be much different from the excursion objective, which is already well studied in [Imani, Graves, and White, 2018]. I would need more explanation from the authors on this main concern. On the experimental side, Figures 2 and 3 do not seem to show convincing results that Geoff-PAC outperforms OPPG and DDPG. And from Figures 4 and 5 it seems all the methods fall short of results reported in other papers (e.g., https://arxiv.org/pdf/1812.02900.pdf).

**Clarity:** I think the paper is easy to follow and well organized.

**Significance:** As mentioned under Quality, the empirical results seem not as good as those of other off-policy optimization papers.

I would also like to see more discussion of the choice of $\hat{\gamma}$; I believe it can be chosen in a clever way using a shrinkage method, which could be discussed in the paper as well. In sum, I think this is an interesting paper, but given my main concern about the proposed objective, I tend to reject the paper at this moment.
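As a reading aid for the interpolation Reviewer 4 discusses (and that Reviewer 2 credits as the paper's main proposal), one standard construction, following the discounted distributions of Gelada and Bellemare (2019), is a state distribution obtained by starting from the behavior policy's stationary distribution $d_\mu$ and following the target policy $\pi$ with continuation probability $\hat{\gamma}$, i.e. $d_{\hat{\gamma}} = (1-\hat{\gamma})(I - \hat{\gamma} P_\pi^\top)^{-1} d_\mu$. This is an assumption about the paper's exact definition, not a quote from it; the sketch below only illustrates how such a distribution interpolates between the excursion endpoint ($\hat{\gamma}=0$, giving $d_\mu$) and the alternative-life endpoint ($\hat{\gamma} \to 1$, giving $d_\pi$) on a toy chain. The function and variable names are illustrative, not the paper's.

```python
import numpy as np

def discounted_distribution(d_mu, P_pi, gamma_hat):
    """d_gamma_hat = (1 - gamma_hat) * (I - gamma_hat * P_pi^T)^{-1} d_mu.

    Interpolates between d_mu (gamma_hat = 0) and the stationary
    distribution of P_pi (gamma_hat -> 1, for ergodic P_pi).
    Requires gamma_hat < 1 so the linear system is nonsingular.
    """
    n = len(d_mu)
    return (1 - gamma_hat) * np.linalg.solve(
        np.eye(n) - gamma_hat * P_pi.T, d_mu)

# Toy 3-state chain: target-policy transition matrix (rows sum to 1).
# It is doubly stochastic, so its stationary distribution is uniform.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.9, 0.1],
                 [0.1, 0.0, 0.9]])
d_mu = np.array([0.6, 0.3, 0.1])  # behavior policy's stationary distribution

print(discounted_distribution(d_mu, P_pi, 0.0))   # recovers d_mu exactly
print(discounted_distribution(d_mu, P_pi, 0.99))  # close to uniform d_pi
```

This also makes concrete why the choice of $\hat{\gamma}$ matters, as Reviewer 4 notes: it controls how far the sampling distribution is shrunk from $d_\mu$ toward $d_\pi$.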