Reviews: Doubly-Robust Lasso Bandit

interesting approach to contextual bandits by combining Lasso type methods with doubly-robust technique to handle missing data. The presentation of the paper needs some improvement, e..g, the relation between the sparse linear model used in Section 3 to the orginal bandit model (1) should be sharpened. It would be nice to put the obtained regret bounds into context of fundamental lower bounds, e.g. obtained in the context of adaptive compressed sensing E. Arias-Castro, E. J. Candes and M. A. Davenport, "On the Fundamental Limits of Adaptive Sensing," in IEEE Transactions on Information Theory, vol. 59, no. 1, pp. 472-481, Jan. 2013.

Reviewer 2

The exposition is quite unclear, and the paper seems hastily written. The main contributions of the paper is not outlined clearly, and not compared rigorously with existing results (see below). I believe there are enough theoretical contributions in the paper, but do not recommend publication as is. Ideally, I would recommend a "revise and resubmit". Even when presented in the best light, the main regret bound gives a significant deterioration in terms of its dependence on T, which is concerning. A thorough empirical evaluation is necessary to see if this deterioration is real, and whether it indeed scales better with dimension. 0) The role of double-robustness is quite unclear. It can act as a control variate and reduces variance, but a simple truncation would do the same thing. Is double-robustness fundamental to the regret bounds given by the authors? This point needs to be clarified, as the current manuscript seems to emphasize the doubly robust form very much. 0') Why is regressing on the doubly robust estimator \hat{r} better? Furthermore, the actual "double robustness" property never seems to appear in the paper, which is a bit concerning, and alarming. 1) The setting of paper (stochastic linear contextual bandits), and its main contributions should be clearly outlined much earlier. 2) The compatibility assumption required for Lasso seems to be used in the authors' analysis. The wording in S1 seems to imply that authors do not require this, which is misleading. 3) Bastani & Bayati studied a different setting where the contexts does not depend on arms. Therefore, a naive regret bound comparisons seems to be off. If we use a naive adaptation of authors' setting to theirs, then we get a O(s_0 log (dNT) \sqrt{T}) regret bound. This needs to be clarified in Section 1. 4) The substantially worse dependence on T needs to be clearly stated, and justified. Is it real? Is it from the analysis? Is the resulting improvement over N and s_0 worth the deterioration in practice? 5) Numerical experiments should prove/disprove the claims made in the above point (4). Authors also claim the fact their algorithm has one fewer hyperparameter than that of Bastani and Bayati as one of their main contributions. Does this actually lead to better practical algorithms? Minor comments: - S1, "competitive performance": What does this mean? Does it mean in terms of regret bounds? Or practical performance? - S1, Li et al (2010): what was the dimension here? - S1, "reinforcement learning": replacing this with bandit settings would strengthen the statement, although this whole paragraph seems misleading at best. - S1, "not utilized by previous methods": does this refer to previous contextual bandit algorithms? The wording is quite confusing, and depending on what the authors meant, I would disagree.

Reviewer 3

The authors propose a double robust bandit algorithm for solving the contextual linear bandit problem. In particular, their algorithm is independent of N, the number of arms and instead depends on the sparsity of the context. This is achieved by dipping into the missing data literature and adapting the doubly robust technique (Bang and Robbins 2005). Experimental results are shown on synthetic datasets with varying correlation patterns between the context vectors. Comments: Overall, the paper is well-motivated and written in a clear manner. Contextual settings are of vital interest in many real-world problems such as advertising and recommender systems. Reducing the dependence of regret to the sparsity of the context vector is of immense value since side information is usually very high-dimensional and sparse.

Paper ID:	3162
Title:	Doubly-Robust Lasso Bandit

Reviewer 1

Reviewer 2

Reviewer 3