Paper ID: 425 Title: Efficient learning by implicit exploration in bandit problems with side observations
Current Reviews

Submitted by Assigned_Reviewer_23

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper considers the multi-armed bandit problem with side information, previously studied by Alon et al. [1] and Mannor and Shamir [17], and a generalization of this problem called the combinatorial semi-bandit problem with side observations. In the first problem, we have a standard multi-armed bandit, and we observe the reward generated by the arm we pull, as well as rewards generated by some other arms, where the observations are determined by an observation graph chosen by an adversary. In the second problem, we may pull many arms at a time, observe the rewards for all pulled arms, as well as some additional arms, again determined by the adversarially chosen observation graph. The paper develops an algorithm Exp3-IX for the first problem that improves over previous algorithms (it is easier to compute, is anytime, and does not need to know the observation graph before each decision). It then develops another algorithm FPL-IX for the second problem with similar properties. Both algorithms are provided with regret bounds.

Quality:

Overall, I think this is a good paper. The problems studied are interesting and broad, the algorithms are reasonable, and the regret bounds seem well-constructed. I do have some suggestions to improve the clarity of the paper below, but in terms of quality, I just have two points on which the paper could be improved:

1. Even though the algorithms don't need to know G_t in order to make the decision I_t or V_t, they do need to know them in order to update the weights. This felt hidden to me in the discussion in the introduction, and is an important limitation of the algorithm.

2. For the m=1 case, one could also apply FPL-IX. It would be helpful to provide a discussion of what, if anything, is lost when applying FPL-IX instead of EXP3-IX.

Clarity:

The paper overall is reasonably clear, but the paper could be made substantially more clear through some additional work to correct two issues:

First, there are a number of small errors and omissions listed below. Second, the separate sections of the paper could flow more smoothly together.

In reading section 3, I was left wondering why we went to the trouble of developing Exp3-IX in section 2. While reading section 2, I had the impression that we were developing Exp3-IX in the simple case where we pull one arm per period, and then would generalize in section 3 to the combinatorial case, but then I was surprised when I got to section 3 and found that we were abandoning Exp3-IX in favor of FPL-IX. I believe that I was given this impression by the first paragraph in section 2, which should more clearly state that Exp3-IX will not be developed further for the combinatorial case, and explain briefly why not.

Also, in reading footnote 1 on page 3, and in the discussion of Exp3-IX in previous literature on lines 310-312, it was unclear whether Exp3-IX is being newly proposed in this manuscript, or whether this method already exists in the literature. Based on examining these papers, it appears that Exp3-IX is new to this manuscript, but still the reader is left to wonder.

Similarly, there was some notation defined in section 2 that was redefined in section 3 (\mathcal{F}_t on line 325), and some other notation that was changed or added which could have also been used in section 2 (O_{t,i} on line 326, the vector notation for L and \ell on line 317).

Small errors and omissions:
1. Reading the proof of Theorem 1, it seems that \ell_t is assumed bounded above by 1. I assume it is also bounded below by 0. I did not see this explicitly stated. While of course this is a standard assumption, it should still be stated.

2. The language defining \mathcal{F}_{t-1} on line 149 and (to a lesser extent) line 325 is vague. I assume that what is meant is that \mathcal{F}_{t-1} is the sigma algebra generated by

( \ell_{s,j}, G_s : 1\le s < t, (I_s -> i) \in G_s ),

i.e., all observable losses and observation structures from interactions up to and including time t-1, but then in a number of conditional expectations in the paper, when conditioning on \mathcal{F}_{t-1} it seems that the paper is additionally conditioning on G_t and \ell_{t,i} for all i, so perhaps \mathcal{F}_{t-1} is meant to include these as well.

This specifically seems to be true for equation (2), the unnumbered equations on lines 249, 253, 342, 329, and the statement on line 232 about the dependence of Q_t (Q_t also depends on G_t through o_{t,i}.)

Similarly, the conditional expectation in the unnumbered equation on line 165 seems to additionally require conditioning on G_t (and could condition on \ell_{t,i} without affecting the result).

If \mathcal{F}_{t-1} is meant to include G_t and \ett_{t,i}, then this should be explained more clearly. If not, then the conditional expectations should be corrected.

Also, the expectation on line 191 should be a conditional expectation.

3. \gamma on line 203 should be \gamma_t.

4. In Algorithm 1, steps 11 and 23 use the notation E(G_t) while the rest of the paper just uses G_t. In step 5, why is the 1/d term included, while it isn't on line 214?

5. On line 285, this assumption is made for all v \in S.

6. On line 334, I believe that the expectation of U_{t,i} should be 1/\gamma_t, not \gamma_t.

7. In step 9 of Algorithm 2, the algorithm also requires observing G_t itself, in addition to the observed losses.

8. In step 3 of Algorithm 2, I think this should set \hat{L}_0 to 0.

9. On line 097, "others" should have an apostrophe. Also, on line 102, there are parens missing after sqrt{T} and sqrt{dT}.

10. line 084, "the losses" is repeated. Line 085, "thus" should be "this".

11. The motivating application could be described more clearly.

Significance: This paper represents an improvement over the previously developed EXP3-SET algorithm for the bandit with side-observations problem, both in terms of the convenience of the algorithm, and the quality of the regret bounds obtained. Also, the second problem studied in section 3 is a significant generalization over previously considered models.

Originality: This paper builds on several previous papers, both in terms of the problem studied and the algorithms and analysis developed, but the results presented are novel, and the model is a significant generalization of previously studied models.
This paper represents an improvement over the previously developed EXP3-SET algorithm for the bandit with side-observations problem, both in terms of the convenience of the algorithm, and the quality of the regret bounds obtained. Also, the combinatorial problem studied in section 3 is a significant generalization over previously considered models.

Submitted by Assigned_Reviewer_30

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper analyzes a bandit model that is in between stochastic optimization with expert feedback and standard multi-armed bandits. The model is a natural extension of previous work by Mannor and Shamir (2011) and Alon et al. (2013). The authors propose and analyze novel algorithms inspired by EXP3 and FPL but introduce a new, implicit exploration strategy which helps in addressing significant limitations in the previous work on bandits with side information.

I really enjoyed reading the paper as it is well written and organized. The related work is properly cited and the authors make clear the relation of the paper with previously published results. The key algorithmic idea introduced in section 2.1 is novel to the best of my knowledge, the analysis is non-trivial and the idea could be of independent interest for other bandit settings. It is very interesting that the authors successfully circumvent the need to know the observation graph.

I have a question for the authors: in the definition of eta_t and gamma_t in the statement of Theorem 1, there are no initial conditions set for eta_0 and gamma_0. What are those values? Also, in the description of Algorithm 1, the series of gamma_t and eta_t are given as inputs, whereas they should implicitly change if they are to be set according to Theorem 1 as a function of Q_t. Could you please clarify?

While this is relatively minor issue, I found the presentation of the motivation for the model relatively weak and I had to go back to the paper by Mannor and Shamir to be convinced that the model is not artificial. The paper should stand on its own feet so I do think that the presentation should be improved from this perspective.

-> The example in the Introduction is a bit confusing in my opinion perhaps due to the overlap between the notions of “news” and “news feed”. Also, the description in line 48 doesn’t actually match the example in Figure 1a.
-> There are several notation errors: line 101-102 (no closed paranthesis), line 212-213 \hat{l}_{s,i} instead of \hat{l}_{t,i}
-> The statement in the proof of Theorem 1 at line 250-251 is confusing (“By lemma … and \emph{the parameters}”) as "the parameters" themselves don’t imply anything.
The paper introduces an interesting algorithmic idea to the bandit literature which addresses several limitations of previously published results.

Submitted by Assigned_Reviewer_41

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper considers the problem of bandits with side exploration, and targets a major issue in the existing solutions -- most algorithms need to know and use some parameters like dominating set of the observability graph. They are able to eliminate this requirement by instead using some estimates from the past observations, and get near-optimal bounds. In fact their bound depends only on the average of independence numbers of the observability graph, as opposed to the previous guarantees which depended on the worst independence.

I think the idea in the paper is interesting and useful. This also seems to remind of the idea of using empirical variance in UCB like algorithms to make the algorithms more robust, I wonder if the authors have thought about that.

The paper is well written, the proofs are simple, and look correct.

Some typos
* Second last para of page 4, "Note that slight modification ... p_{t,i} + \gamma" should be \gamma_t
* First para of page 7, what is "F_{t-1,i}" should it be "F_{t-1}"?
I think the paper is interesting, well written and the results are useful.
Author Feedback
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their positive feedback and very useful comments. We will incorporate all suggestions made by the reviewers in the final version of our paper. Below, we address some specific remarks made by each reviewer.

Reviewer 1
----------
1. On the knowledge of G_t
This point is well taken and we will stress that knowledge of G_t is indeed required for the update step of our algorithms. It is important to note however, that full knowledge of G_t (or at least the second neighborhoods of the chosen actions) is absolutely necessary for *any* algorithm to construct low-bias loss estimates, which are of key importance for learning in adversarial settings. Thus, this assumption cannot be lifted unless we make statistical assumptions about how the observation graphs are generated. We will include this discussion in our final version for improved clarity.

2. On FPL-IX vs EXP3-IX
Indeed, FPL-IX could be applied in the m=1 case, and EXP3-IX could also be applied in the m>1 case. The main advantage of EXP3-IX is that its tuning does not require computing (approximate) independence numbers, and thus its guarantees are somewhat tighter than those of FPL-IX. We also believe that as the analysis of EXP3 is better-known and more transparent than that of FPL, it allows for the purest possible presentation of the implicit exploration trick. However, the disadvantage of EXP3-IX is that it usually does not have efficient implementations for combinatorial decision sets. On the other hand, FPL-IX can always be efficiently implemented whenever the offline combinatorial optimization problem admits an efficient implementation. We will clarify that both algorithms can be used for both settings, and include this discussion to explain the need for both of them.

3. On the history \mathcal{F}_{t-1}
Thanks for pointing this out. As usual in games against adaptive adversaries, \ell_{s,j} and G_s are assumed to be measurable functions of {I_k : k < s} (resp., {V_k : k < s}), thus it is sufficient to define the history \mathcal{F}_{s} as the sigma algebra generated by {I_k : k < s} (resp., {V_k :k < s}). (As our bounds hold for *any* deterministic mapping from histories to losses and observation systems, they also hold for randomized adversaries by a standard argument.)

We are very thankful for your more technical comments and will address them in our final version.

Reviewer 2
----------
1. On parameter-tuning
The initial settings are obtained by defining the empty sum to be 0. We will fix this. Also, eta_t and gamma_t are actual parameters of the algorithm. Besides the choice presented in Theorem 1, one could also get meaningful guarantees when setting eta_t=gamma_t as constants for all t, although this would require prior knowledge of the average of the independence numbers generated by the adversary. (Actually, prior work on the problem followed this path.) We will briefly discuss this to clarify.

2. On the motivating example
We have designed our motivating example to clearly present issues of combinatorial decision sets and the possibility of leveraging side observations. We will provide some more examples in our final version, including combinatorial extensions of the examples of Mannor and Shamir and elaborate more on the routing example we mention in our submission.

Reviewer 3
----------
1. On UCB with empirical variance
There is indeed some similarity as our parameters are also tuned as functions of random observations, even though the underlying analytic ideas are quite different, and we do not know how to generalize either approach across the two problem settings. Nevertheless, this is an interesting connection and we thank the reviewer for pointing it out.