NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at Vancouver Convention Center
Paper ID: 8294
Title: Nonparametric Contextual Bandits in Metric Spaces with Unknown Metric

Reviewer 1

The paper is clear and makes an effort to highlight the behavior of the proposed algorithm (the value of the regret bound in some specific settings, experiments measuring the impact of the metric). The comparison to other settings could still be strengthened. In particular, I would appreciate knowing the theoretical upper bound on the regret of Approx-Zooming-With-No-Arm-Similarity. In the experimental part, I would also appreciate a comparison with some state-of-the-art algorithms. What would be the empirical results of a Gaussian-process-based bandit? It would also be interesting to have results on datasets used by other contextual/similarity-based bandits (except that these datasets use a context in R^d). Finally, it is surprising to have a context space of dimension 1. Extending the algorithm to the R^d setting seems straightforward. It would blow up the time before sub-partitioning, but would it have any other disadvantages? Would the R^d setting be more difficult to analyze?

__________________________

POST REBUTTAL

I thank the authors for their precise and clear rebuttal; it strengthens my positive opinion of the paper. Regarding the Gaussian-process-based bandit, I was expecting a Gaussian process per arm, but the version requiring oracle access to a metric among the arms would also be informative with respect to the efficiency of sub-partitioning. Regarding dimension-d contexts, given their necessity for most applications, the more results the paper includes for that setting, the better. I understand experiments may not be feasible within the limited timeline, and the regret bound in the rebuttal gives a preliminary answer on the impact of dimension. A next step could be to also give the regret bound with dimension-d contexts for both concrete examples: finite types, and Lipschitz with respect to a continuous arm metric space.

Reviewer 2

Originality: This work extends the contextual zooming algorithm to handle the case of an unknown similarity measure over the arm space. This is achieved by clustering the arms according to an estimate of the integrated squared distance between two arms, based on k-nearest-neighbor estimators of the reward functions. The work is a nice combination of existing ideas.

Quality: I think the paper is technically sound, although I did not verify all the proofs in the appendix.

Clarity: The paper is well written and the key ideas are nicely presented.

Significance: Adapting to unknown regularity parameters and hidden structure in a data-driven manner is an important topic in nonparametric statistics, and this paper presents a method of adapting to unknown structure in the arm space for nonparametric contextual bandits. I think this is a well-executed incremental contribution.

Minor Comments: (i) Line 100: 'et al.' instead of 'et al'. (ii) Line 104: has d_c been defined? (iii) Line 160: 'Nature could apply a measure preserving transform ...' -- can you please illustrate this point with an example?

___________________

POST AUTHOR RESPONSE

I have read the authors' response as well as the other reviews. I thank the authors for clarifying the generalization of their algorithm to higher dimensions. I understand that it may not be possible to modify the algorithm to make it adaptive to the Hölder smoothness by incorporating techniques of Qian & Zhang (2016) within the limited time. I will retain my previous score of 6 for this submission.
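The distance estimate Reviewer 2 summarizes can be illustrated with a minimal sketch: fit a k-NN regressor to each arm's observed (context, reward) pairs, then average the squared gap between the two estimated reward functions over a grid of contexts. This assumes a 1-D context space; all names, constants, and the toy reward functions are illustrative, not the authors' implementation.

```python
import numpy as np

def knn_estimate(contexts, rewards, x, k=5):
    """k-NN estimate of an arm's mean reward at context x (1-D contexts)."""
    nearest = np.argsort(np.abs(contexts - x))[:k]
    return rewards[nearest].mean()

def integrated_sq_distance(data_a, data_b, grid, k=5):
    """Approximate the integral over contexts of (f_a(x) - f_b(x))^2,
    plugging in k-NN estimates of each arm's reward function."""
    (ctx_a, r_a), (ctx_b, r_b) = data_a, data_b
    gaps = [(knn_estimate(ctx_a, r_a, x, k) - knn_estimate(ctx_b, r_b, x, k)) ** 2
            for x in grid]
    return float(np.mean(gaps))  # Riemann average over the context grid

# Toy check: two arms sharing a reward function should be estimated close,
# while an arm with a different reward function should be far.
rng = np.random.default_rng(0)
n, grid = 500, np.linspace(0, 1, 50)
ctx = rng.uniform(0, 1, n)
f = lambda x: np.sin(2 * np.pi * x)
arm1 = (ctx, f(ctx) + 0.1 * rng.standard_normal(n))
arm2 = (ctx, f(ctx) + 0.1 * rng.standard_normal(n))   # same mean reward as arm1
arm3 = (ctx, -f(ctx) + 0.1 * rng.standard_normal(n))  # different mean reward
d_same = integrated_sq_distance(arm1, arm2, grid)
d_diff = integrated_sq_distance(arm1, arm3, grid)
print(d_same < d_diff)  # True: similar arms get a smaller estimated distance
```

Arms whose estimated distance falls below a threshold would then be clustered together and treated as sharing a reward function.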

Reviewer 3

This paper modifies the contextual zooming algorithm (reference [22]) to learn nonparametric contextual bandits. When the metric among the arms is unknown, a partitioning in the space of estimated reward functions is used. Theoretical guarantees for the regret are provided. The result does seem to make sense, although I must admit that I did not check all the proofs. One fact that seems curious to me is that the overall regret bound can get better when the estimate of the Lipschitz constant of the reward function is looser: for example, in Eq. (9), the overall regret scales inversely with L^2. I wonder whether this artifact of the subpartitioning step in the contextual zooming algorithm can be easily removed?