NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2298 Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints

### Reviewer 1

This paper formalizes the problem of inverse reinforcement learning in which the learner’s goal is not only to imitate the teacher’s demonstration, but also to satisfy her own preferences and constraints. It analyzes the suboptimality of learner-agnostic teaching, where the teacher gives demonstrations without considering the learner’s preferences. It then proposes a learner-aware teaching algorithm, where the teacher selects demonstrations while accounting for the learner’s preferences. It considers different types of learner models with hard or soft preference constraints. It also develops learner-aware teaching methods for both the case where the teacher has full knowledge of the learner’s constraints and the case where the teacher does not know them. The experimental results show that learner-aware teaching can achieve significantly better performance than learner-agnostic teaching when the learner needs to take into account her own preferences in a simple grid-world domain.

I think the problem setting considered by the paper is quite interesting and well formalized. I like Figure 1, which clearly shows the problem of learner-agnostic teaching when the learner has preferences and constraints. To the best of my knowledge, the algorithm presented is original. It builds on previous work, but comes up with a new idea for adaptively teaching the learner, assuming the teacher does not know the learner model in the RL setting.

The paper is generally clear and well-written, though it is somewhat notation-heavy. While I think the authors have done a good job of explaining most things, there are still some places that could be improved. Here are my questions and suggestions:

- In equation 1, what exactly are $\delta_r^{hard}$ and $\delta_r^{soft}$? What is $m$? Is it $d_c$?
- In equation 2, what are $\delta_r^{soft, low}$ and $\delta_r^{soft, up}$? How do you balance the values of $C_{r}$ and $C_{c}$ when using soft preference constraints? It would be interesting to see how different combinations of these two parameters affect the experimental results.
- In all the experiments, how are the teacher demonstration data generated? When human teachers provide demonstrations, they can be suboptimal. Will this affect the results significantly?
- Why does AWARE-BIL perform worse as the learner’s constraints increase (shown in Table 1)? Any hypothesis about this? Does this mean the algorithm might not be able to generalize to cases where the learner has a lot of preference constraints?
- When evaluating the learner-aware teaching algorithm under unknown constraints, only the simplest setting (two preference features), similar to Figure 2(a), is considered. I would be curious to know how well the proposed method can perform in settings with more preference features, such as L3-L5 (or maybe provide some analysis about it in the paper).
- In Figure 3(b), what are the standard errors? Are the reward differences between different algorithms significant?
- While the paper focuses on enabling the teacher to optimize the demonstrations for the learner when the learner preferences are known or unknown to the teacher, it would be interesting to address the problem from the learner’s perspective too (e.g., how to learn more efficiently from the given set of demonstrations when the learner has preferences?).

UPDATE: Thanks for the authors' response! It addresses most of my questions well. I have decided to maintain the same score, as it is already relatively high and the rebuttal does not address the significance and scalability issues well enough to cause me to increase my score.

### Reviewer 2

Review unchanged after rebuttal. My comments are high level:

In the introduction, the authors may want to reconsider the example they use to demonstrate the learner's domain conflict. The auto-pilot example makes one believe that the teacher is the bad one: the teacher suggests that the learner should break rules, endanger humans, etc. to achieve some $\pi^*$. It looks contrived and actually does a disservice to the message the paper wants to put across.

Although this is addressed later in Section 5, a learner-agnostic teacher should obviously underperform a learner-aware one due to lack of information [assuming the additional information can actually be used], which makes the results of the paper sound trivial. Again, this undersells the paper because of the way the authors propose and set up the problem in the introduction. Only after reading Section 5 did I realise that the case where the teacher is not aware of the exact learner constraints is also handled in the paper.

I would suggest submitting this paper as a journal paper rather than a conference paper; e.g., Theorem 2 and Section 5.1 [which are the interesting bits of the paper] are entirely in the supplementary material. If I decide to read this paper alone without the supplementary material, I would have very little to take away. I am all for writing detailed proofs in the appendix, but when algorithms central to the theme are placed in the appendix, I start wondering whether this is a journal paper squeezed into the 8-page limit due to the prestige of NeurIPS.

### Reviewer 3

Summary: The paper considers the problem of learning a teacher agent for IL, where the teacher agent aims to find demonstrations that are most informative to an IL agent. Critically, the IL agent has some preferences on its behavior, which makes learning from the teacher’s optimal demonstrations inappropriate. The paper formulates such an IL agent as maximum causal entropy IRL with preference constraints. The paper then proposes 4 approaches to learn the teacher agent: 2 approaches for known preferences and 2 approaches for unknown preferences. Evaluation on a grid-world task shows that the proposed approaches learn better policies compared to a naïve teacher that is not aware of the IL agent’s constraints.

Comments: Overall, this is an interesting paper; it considers an interesting learning setting that is very different from the usual IL setting, and it presents technically sound approaches that come with performance guarantees. However, I am reluctant to give a higher score due to limited practicality, since the proposed approaches require reward functions to be linear. Also, the experimental evaluation is done on a small grid-world task, which is very different from the three examples given in the introduction.

I read the rebuttal and the other reviews. My opinion of the paper remains unchanged, and I vote to weak accept the paper.