Reviews: Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

This paper provides with a theoretical analysis of optimistic algorithms in the setting of episodic MDPs. The authors provide with a problem-dependent and gap-dependent log(T) regret bound which is a finite-time bound, that unlike previous bounds does not take into account other problem properties such as diameter or mixing times. The paper includes several theoretical results, and introduces a novel 'clipping' trick used to prove the bounds. I found the paper interesting, however, it is constructed in a way which is hard to follow and not very readable. First, many concepts are referred to before they are defined (for instance, the notion of optimistic algorithms). Furthermore, the paper is packed with notations and it is very difficult to keep track of all of them, it might be useful to remind the notations when they are used in some critical places in the paper. Lastly, I found that intuition is very lacking in the paper. Many technical details are provided but not much intuition regarding the analysis, why is it possible to use the clipping trick and why with the specific clipping threshold that is given, etc. I believe the contributions of this work are important and relevant for the community, and the novelty in analysis can be used in future research in the field, however, I feel that an improvement of the presentation, by adding some intuition, defining the concepts and notations before they are referred to, and clear explanations of the steps to get the bound, will significantly improve the readability of this paper. General comments: line 97 - what does 'average reward of \pi_*' mean? Average over what? Is it an average over states? Do you mean the average return or average over immediate reward? line 168 - You are noting that the regret is bounded by KH = T. It is not clear why is this true without assuming any assumptions on the reward distributions.. if rewards are not bounded, this is not necessarily true. line 200 - you define gap_h - for what h is this defined? all h? Technical comments: line 39 - 'worse-cast' should be 'worst-case' line 100 = 'stage' should be 'state' line 110 - 'upper bounds' should be 'lower bounds' line 228 - 'futher' should be 'further' line 311 - 'that that' - remove one 'that' --Added after author feedback-- I have read the authors' feedback. Given that they will reorganize the paper to improve readability and clarity, and lighten the notation overhead, I vote for an accept.

The paper investigates an interesting problem in RL in the episodic setup, which is of significance in practical domains. The proposed algorithm, StrongEuler, simultaneously achieves the worst-case optimal regret bound of \sqrt{HSAT} and the problem-dependent regret bound established in the paper, which depends on the gap quantities gap(x,a) for various (x,a) pairs. This is a nice contribution to the filed, and is of practical significance for RL applications. From the technical perspective, this paper provides a solid contribution: In order to derive the logarithmic regret bound with the right dependence on the gap parameters, the authors leveraged novel proof techniques, which could be independent interest for the analysis of other algorithms as well. I have read most of the proofs, and think they are correct (though due to time constraints, I couldn’t check all the details). About weaknesses. Despite the great technical material covered in the paper, its choice of organization makes the paper hard to follow; see detailed comments below. In addition, some related and recent literature on regret minimization in RL seem to be missing. Besides, I have some technical comments, which I detail below. 1. Organization: As said earlier, despite solid theoretical content, one may find the paper hard to follow due to the choice of organization. In particular: (i) one may expect that the algorithm StrongEuler to be fully specified in the main text rather than the supplementary; (ii) The proofs in the supplementary are not well-organized, and are hard-to-follow for most readers; and (ii) there is no conclusion section. 2. About Definition 3.1. The optimism should hold “with high probability”, right? Without this, I am not sure if one can guarantee that \bar Q_{k,h} \ge Q*_{k,h} for all x, a, k, and h. Could you explain? 3. Line 218, the statement “all existing optimistic algorithm”: the aforementioned collection of algorithms is not defined precisely. Since you are making a claim about the regret of such algorithm (which seems to be a strong one), you must specify this set clearly. Otherwise, the statement *should* be relaxed. Another vague statement of this type appears in line 212: “… is unavoidable for the sorts of the optimistic algorithms that we typically see in the literature”: again, I may ask to make the statement more precise and specific; otherwise, the claim needs to be accordingly relaxed. Minor comments: a. Theorem 2.4: \epsilon is not introduce. Does it have the same range as in Theorem 2.3? b. Line 31: The logarithmic regret bound in Jaksch et al. is an exception, as it is non-asymptotic (though it depends on the “wrong gap” gap_*). c. Line 48: the term “almost-gap-dependent” does not appear to be precise enough. d. Line 96: what is \tilde S? Shouldn’t it be S? e. Line 107: P(s’|s,a) is not introduced here yet, and later it seems you use p(s’|s,a). f. Line 107: that can be made optimal --> … **uniquely** optimal (Note the uniqueness here is crucial for the proof in [15] based on change-of-measure argument to work). e. Line 897: I couldn’t verify the second inequality \sqrt{a} + \sqrt{b} \precsim \sqrt{a+b}. Could you explain? Typos: Line 20: number states --> number of states Line 39: worse-case --> worst-case Line 93: discounted regret --> undiscounted regret (Isn’t it correct?) Line 95: [8] give … --> [8] gives Line 120: et seq. --> etc. (??) Line 129: a_{a’} --> a_{h’} Line 166-167: sharped --> sharpened Line 169: in right hand side --> in the right-hand side Line 196: we say than … --> we say that … Line 230: In the regret bound, remove extra “}”. Line 269: \sum_{t=1}^K … \omega_{k,h} --> …\omega_{t,h} Line 268: at the end of the sentence, “)” is missing. Line 275 (and elsewhere): Cauchy-Schwartz --> Cauchy-Schwarz Line 438: the second term does not dependent --> the rest of the sentence is missing. Line 502, 590, and elsewhere: author? --> use correct referencing! Line 525: similar to … --> similarly to … Line 634: Then suppose that … --> is it correct to begin the lemma with “then”? Line 656: proven Section … --> proven in Section … Line 668: We can with a crude comparison … --> The sentence does not make sense. Line 705 – Eq. 19: gap_h --> gap_h(x,a) AND “)” is missing Line 703 – iii : “)” is missing. Line 814: similar to --> similarly to Algorithm 1: Input is not provided. Algorithm 2, Line 4: rsum --> rsum_k AND the second rsum should perhaps be rsumsq Line 1085: Lemma 6 in [6] in shows --> remove the second “in” -- Updates (after author feedback) -- I have read the other reviews and have gone through authors' response. The response satisfactorily clarified my raised comments. In particular, the authors provided a precise plan to revise the organization of the paper. I therefore increase my score to 8.

Paper ID:	691
Title:	Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs

Reviewer 1

Reviewer 2

Reviewer 3