Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	333
Title:	Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Review after rebuttal:
- I agree that it is not exactly the same, but it seems close enough to me: if the horizon is large enough, the coupling of the contexts vanishes. Given that, I am not sure why each occurrence of a context cannot be treated as a separate instance of a bandit problem.
- To me, this paper would be much better with a clean upper bound in terms of arm gaps for the problem-dependent case (the bound in the supplementary material has to be reworked), and ideally matching lower bounds in the problem-dependent and problem-independent cases.

----------------------

The authors consider the problem of a budgeted and time-constrained contextual bandit. Their main finding is that if the numbers of contexts and actions are finite, and if the time constraint T is sufficiently large (with respect to the problem at hand), then a regret of order \log(T) is achievable, which is a new result for contextual bandits.

The results seem thorough, and the writing is precise. The results are interesting but not very surprising. Indeed, since the number of contexts J and the number of arms K are finite, this bandit problem can be seen as a time-constrained and budgeted bandit problem with JK arms, which is also action-constrained, since at each time not all JK arms are accessible (only the K arms corresponding to the current context). Since the number of arms is finite, for T large enough the regret is \log(T), as in action-constrained bandit problems. So I am not sure how innovative this paper is; maybe the authors can clarify this in their rebuttal.
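To make the JK-arm view above concrete, here is a minimal sketch of it. It is an illustration of the reduction only, not the authors' algorithm: it ignores the budget and the time constraint entirely, and the function names (available_arms, ucb1_index, select_arm) and the flat indexing context * K + arm are assumptions introduced for this example.

```python
import math


def available_arms(context, num_arms):
    """Flat indices of the K (context, arm) pairs playable under `context`."""
    start = context * num_arms
    return list(range(start, start + num_arms))


def ucb1_index(mean, pulls, t):
    """Standard UCB1 index; unpulled arms get +inf so they are tried first."""
    if pulls == 0:
        return math.inf
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def select_arm(context, num_arms, means, pulls, t):
    """Pick, among the K arms allowed by the context, the largest UCB1 index.

    `means` and `pulls` are flat lists of length J * K (hypothetical layout).
    Budget and cost are deliberately not modelled here.
    """
    candidates = available_arms(context, num_arms)
    return max(candidates, key=lambda a: ucb1_index(means[a], pulls[a], t))
```

With J = 2 and K = 3, context 1 exposes only the flat arms 3, 4 and 5, so the statistics of each context are updated independently of the others; this is the sense in which each context behaves like its own bandit instance once the budget coupling is ignored.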

I have a few questions regarding this paper:
- In Theorem 2, the regret in case 2) is O(\sqrt(T) + ...). But is it really O(\sqrt(T) + ...)? Shouldn't it be O(\sqrt(KT) + ...) at least?
- Still in Theorem 2, the regret is either O(KJ \log(T)) or O(\sqrt(T) + KJ\log(T)), depending on the configuration of the arms. In general, problem-dependent bounds are expressed in a more refined way, with the sum of the inverses of the arm gaps. Isn't it possible to do something similar here? I understand that you have KJ instead of a certain sum of gaps because your proof is based on events characterizing the correct ordering of the UCBs, and not on concentration bounds on their gaps. But is this the best approach?
- It would be interesting to have lower bounds for this problem, ideally both problem-dependent and problem-independent. Trivial lower bounds are those of the classical bandit problem, which corresponds to all costs being equal to 0 and the context always being the same. Can they be refined to take into account the complexity of the context?
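For reference, the gap-dependent form alluded to in the second question is, in the classical unconstrained K-armed stochastic bandit with rewards in [0, 1], the UCB1 bound of Auer, Cesa-Bianchi and Fischer (2002). The notation below (\mu_k for the mean of arm k, \mu^* for the best mean, \Delta_k for the gap) is introduced here for illustration. This is not a bound from the paper under review, and whether an analogue holds in the constrained contextual setting is exactly what the question asks.

```latex
\[
  \mathbb{E}[R_T] \;\le\; \sum_{k:\,\Delta_k > 0} \frac{8 \ln T}{\Delta_k}
  \;+\; \Bigl(1 + \frac{\pi^2}{3}\Bigr) \sum_{k=1}^{K} \Delta_k,
  \qquad \Delta_k = \mu^* - \mu_k .
\]
```

In the same classical setting, the problem-independent minimax lower bound is of order \sqrt{KT}, which is the reason for asking in the first question whether the dependence on K can really be absent from the \sqrt(T) term.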