NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4388 -- Provably Efficient Q-Learning with Low Switching Cost

### Reviewer 1

Originality: The notion of the "local switching cost" is novel. The algorithm for obtaining low switching cost and low regret is rather straightforward -- a delayed update to the policy will do. The proof is also similar to that of [15]. The lower bound is simple, but the message is clear.

Quality & Clarity: The paper is well-written and clearly delivers the message.

Significance: The notion of "local switching cost" makes a lot of sense in many real applications. Although the algorithm is rather straightforward, it also shows that low switching cost can be achieved by a simple modification to some other low-regret algorithms. (In fact, I conjecture that every low-regret algorithm can be converted to a low switching cost algorithm.)

Minor: Line 36: "closed related" ==> "closely related"
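For concreteness, the delayed-update idea the reviewer describes ("a delayed update to the policy will do") can be sketched as follows. This is an illustrative sketch only, not the authors' code: the environment interface, function names, and the power-of-two redeployment schedule are assumptions, and the learning rate is the (H+1)/(H+t) rate from optimistic Q-learning.

```python
import numpy as np

def delayed_q_learning(env_reset, env_step, S, A, H, K):
    """Tabular optimistic Q-learning that redeploys the greedy action at a
    (step, state) pair only when its visit count hits a power of two.
    Hypothetical interface: env_reset() -> initial state,
    env_step(h, s, a) -> (next_state, reward)."""
    Q = np.full((H, S, A), float(H))      # optimistic initialization
    N = np.zeros((H, S, A), dtype=int)    # visit counts
    policy = np.zeros((H, S), dtype=int)  # currently deployed policy
    switches = 0                          # local switching cost incurred
    for _ in range(K):
        s = env_reset()
        for h in range(H):
            a = int(policy[h, s])
            s2, r = env_step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            lr = (H + 1) / (H + t)        # optimistic Q-learning step size
            v_next = 0.0 if h == H - 1 else float(Q[h + 1, s2].max())
            Q[h, s, a] += lr * (r + v_next - Q[h, s, a])
            if (t & (t - 1)) == 0:        # doubling schedule: t in {1, 2, 4, ...}
                g = int(Q[h, s].argmax())
                if g != policy[h, s]:     # a "local switch" at (h, s)
                    policy[h, s] = g
                    switches += 1
            s = s2
    return policy, switches
```

Each (h, s, a) count crosses a power of two at most about log2(K) + 1 times, so under this schedule the number of local policy switches is O(HSA log K) regardless of K, which is the flavor of guarantee the review is describing.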

### Reviewer 2

i) Originality: Introduction of the notion of “local switching cost,” which extends the notion of switching cost in online learning to RL. They also present (two flavours of) a Q-learning algorithm that achieves regret matching the previous work, however with the added benefit of having lower local switching cost.

ii) Quality: The motivation behind considering the problem of RL with low switching cost is clearly explained. They provide a lower bound on the switching cost of any algorithm with sublinear regret. The proposed algorithm follows a conservative approach to changing policies, as is the requirement for low switching cost; however, their analysis manages to recover the same regret bound as the previous work. They also show applications of their algorithms in concurrent RL.

iii) Clarity: The paper is well written. Both the text and the math are clear except for a few minor issues:

a. Line 128: “...,so that the agent needs only play”
b. Line 149: “Our algorithm maintains wo sets…”
c. Line 254: “…,and it is a priori possible whether such a mismatch will blow up the propagation of error.”
d. Line 274: “.., at most off by a factor of O(H^2 log K)”

iv) Significance: This paper addresses an important problem in RL which has strong practical motivations. They provide algorithms which are provably better at the considered problem than the previous work.

### Reviewer 3

1. Technically, this paper is interesting, but I just fail to understand the practical relevance of the problem. And since it puts together a bunch of well-understood ideas, it seems rather incremental.

2. Switching cost in online learning was introduced in the late 1980s, but in a general way, as opposed to the very specific cost function here. Furthermore, finite-horizon MDPs are considered (as opposed to infinite-horizon), which somewhat negates the argument that it can be costly to switch policies: the optimal policies in the finite horizon are non-stationary anyway, so it may hardly matter that these change due to exploration. In my opinion, the problem would be much more interesting in an infinite-horizon setting and with both the local and the global switching cost functions.

3. In Theorem 2, it would be better to present the exact bound on the regret (if it is known) instead of just in O(.). I presume that, from their analysis, the authors are able to obtain it.

4. I am not able to appreciate the application to the concurrent setting; perhaps the authors can elaborate. We already know asynchronous Q-learning converges under suitable recurrence conditions, so what is new here?

5. Why is the lower bound in Section 7 interesting, since it holds only when the switching cost is constrained?
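For readers weighing the reviewer's point 2, the distinction between the two cost functions can be made concrete with a small counting helper. This is a hypothetical illustration (names and the dict-based policy representation are assumptions, not from the paper): the local cost sums, over consecutive episodes, the number of (h, s) pairs whose deployed action changed, while the global cost counts episodes where the policy changed at all.

```python
def switching_costs(policies):
    """Local vs. global switching cost of a sequence of deployed policies.
    Each policy is a dict mapping (h, s) -> action, all over the same keys."""
    local_cost, global_cost = 0, 0
    for prev, cur in zip(policies, policies[1:]):
        # number of (h, s) entries whose action changed between episodes
        changed = sum(1 for hs in prev if cur[hs] != prev[hs])
        local_cost += changed            # local: charge every changed entry
        global_cost += 1 if changed else 0  # global: charge any change once
    return local_cost, global_cost
```

Note the local cost is always at least the global cost, and the two coincide exactly when every switch changes a single (h, s) entry.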