NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4117 Constrained Reinforcement Learning Has Zero Duality Gap

### Reviewer 1

The paper studies a form of constrained reinforcement learning in which the constraints are bounds on the value functions for auxiliary rewards. This allows a more expressive formulation than the common approach of defining the reward as a linear combination of multiple objectives. The authors show that under certain conditions, the constraint optimization problem has zero duality gap, implying that a solution can be found by solving the dual optimization problem, which is convex. The authors also extend this analysis to the case for which the policy is parameterized. Theorem 1 assumes that Slater's condition holds, which is problematic for two reasons. Slater's condition is usually defined for convex constraints, but the authors specifically state that the constraints in PI are non-convex. Moreover, according to the common definition, the duality gap is zero when Slater's condition holds, which is precisely what the theorem claims to prove. I think there are important details missing from the assumption that the authors need to spell out explicitly. In fact, I think a more interesting contribution would be to show precisely for which reward functions that Slater's condition holds. As it stands I have poor intuition regarding the types of reward functions that will satisfy Slater's condition. The decision problem PI considers a special case in which constraints are all formulated as a bound on a given value function. What would happen if you allow other types of constraints as well? To what degree does the zero duality gap depend on this specific form of constraint? Top of page 4: "Let [...] P(xi) be the optimal value of ~PI for the perturbation xi" -> should this not be PI rather than ~PI? Same question at the top of page 5: "linear program equivalent to ~PI". I cannot find any mistakes in the derivations of Sections 4 and 5. However, I think it is an exaggeration to say that the parameterization comes at almost no price. First of all, the discount factor is usually close to 1, so the quotient epsilon / ( 1 - gamma ) could grow quite large. Second, although the result holds in theory, it is well known that the bound from the universal approximation theorem is hard to achieve in practice, even if the representation is expressive enough. How is the projection onto R_+^m done? Just taking the maximum of 0 and x? Assumption 1 does not appear realistic to me, for the same reason as outlined above: it is difficult to achieve hard performance bounds using e.g. deep learning. POST-REBUTTAL: I appreciate the authors' effort to clarify the underlying assumptions. The assumption regarding the existence of a feasible policy is still strong, since presumably one is allowed to freely define the reward specifications s_i. However, I agree that it is non-trivial to determine that a zero duality gap holds when the assumption holds. Moreover, the bound on the total variational norm is indeed milder than that for the uniform norm, but the bound is still worst-case over the states. The title of Section 4 gives an impression that in my view is too good to be true. Even the unconstrained RL problem has few theoretical guarantees when combined with function approximation. As a side note, I think your definition of the value function in the derivation of the proof of Theorem 1 (in terms of occupation measures) is similar to that of Lewis and Puterman in "Bias Optimality". Also I just noticed that you use "s_i" both for states and reward specifications which is confusing.

### Reviewer 2

The paper looks at reinforcement learning with constraints and provides an analysis of the problem. The authors prove that under common assumptions, the problem has zero duality gap and can be solved exactly in the dual domain. This finding doesn’t necessarily hold when the learned policies are function approximators but the authors provide further arguments to show that under the right type of parameterization, convergence to the optimal solution can still be shown to hold. The paper also describes an algorithm for optimizing the constrained RL objective. I haven’t discovered any gaps, mistakes or other problems in the proofs and argumentation. The paper cites relevant prior work but could be improved by discussing some of this work in more detail. The proposed algorithm differs very little from previous approaches based on primal-dual algorithms and it would have been insightful if there was more discussion about how this algorithm differs and why it would be better or at least complementary. Without this discussion, the presentation of the algorithm only serves as a prototype for which the analysis of the required number of updates is done without getting more insight in the efficiency and expected behavior of similar algorithms that have already been shown to be useful empirically. I find the novelty of some of the individual arguments hard to judge but I do think that the main result is important and that the analyses provide useful insights. I couldn’t find any other work on constrained RL that even mentions the duality gap, so even without the result that the gap doesn’t exist for general policies under certain conditions, the paper points out an important issue that seems to have been glossed over so far. Since the paper mostly shows that the issue is actually not expected to be a problem in practice, the impact is mostly a firmer understanding of why previous approaches work but it also gives some new intuitions about why they might fail if function approximators are not flexible enough. The presented algorithm doesn’t seem to differ much from prior work and without empirical results, it is not possible to judge its value as a contribution on top of its usefulness as an illustrative example. The paper is mostly easy to follow but could be more clear about the main contributions of the paper and how they relate to previous work.

### Reviewer 3

This paper proposes a dual approach for the constrained reinforcement learning (CRL) problem. The authors show that the CRL problem has zero duality gap and thus can be solved by solving its dual problem which is convex. The optimality results of parametrized policies and primal-dual algorithms are also given in the paper. The constrained reinforcement learning problem is interesting and important in practice. This paper gives a good theoretical result on this problem. It would be better if the authors could add an experimental section in the paper. UPDATE：After reading the author’s rebuttal, I have chosen to increase my score for the following reason: The authors provide a numerical section in the rebuttal.