NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 691 Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

### Reviewer 1

This work gives a gap-dependent regret analysis for tabular MDPs. Since most recent works concentrate on the minimax regret bounds of MDPs, it is important but hard to analysis the problem-dependent performance of algorithms. I think this work gives one step for this analysis. The final result involving the sub-optimal gaps is also natural. The paper is clearly written and easy to read. Further, the definition of $gap(x, a)$ seems to be strange for me. For finite-horizon MDPs, it is possible that $x$ has different optimal action at different $h$. Is it possible that there exists a worst case where the $gap(x,a)=0$ for all $x,a$? If so, what will the regret like? The work is organized in a clear way, although there are many notations. ----- The feedback responds to my questions.

### Reviewer 2

This paper provides with a theoretical analysis of optimistic algorithms in the setting of episodic MDPs. The authors provide with a problem-dependent and gap-dependent log(T) regret bound which is a finite-time bound, that unlike previous bounds does not take into account other problem properties such as diameter or mixing times. The paper includes several theoretical results, and introduces a novel 'clipping' trick used to prove the bounds. I found the paper interesting, however, it is constructed in a way which is hard to follow and not very readable. First, many concepts are referred to before they are defined (for instance, the notion of optimistic algorithms). Furthermore, the paper is packed with notations and it is very difficult to keep track of all of them, it might be useful to remind the notations when they are used in some critical places in the paper. Lastly, I found that intuition is very lacking in the paper. Many technical details are provided but not much intuition regarding the analysis, why is it possible to use the clipping trick and why with the specific clipping threshold that is given, etc. I believe the contributions of this work are important and relevant for the community, and the novelty in analysis can be used in future research in the field, however, I feel that an improvement of the presentation, by adding some intuition, defining the concepts and notations before they are referred to, and clear explanations of the steps to get the bound, will significantly improve the readability of this paper. General comments: line 97 - what does 'average reward of \pi_*' mean? Average over what? Is it an average over states? Do you mean the average return or average over immediate reward? line 168 - You are noting that the regret is bounded by KH = T. It is not clear why is this true without assuming any assumptions on the reward distributions.. if rewards are not bounded, this is not necessarily true. line 200 - you define gap_h - for what h is this defined? all h? Technical comments: line 39 - 'worse-cast' should be 'worst-case' line 100 = 'stage' should be 'state' line 110 - 'upper bounds' should be 'lower bounds' line 228 - 'futher' should be 'further' line 311 - 'that that' - remove one 'that' --Added after author feedback-- I have read the authors' feedback. Given that they will reorganize the paper to improve readability and clarity, and lighten the notation overhead, I vote for an accept.