Paper ID: 880
Title: Combinatorial Cascading Bandits
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors investigate a class of combinatorial cascading bandit problems. The archetypal problem in this class seems to be that of routing packets in a network whose edges are subject to random independent failures. I will restrict to this problem to simplify the description (though the authors also discuss the disjunctive version of the reward function). At each time step the agent chooses a path and receives unit reward only if none of the edges along the path failed. If some edge failed, the agent gets a reward of zero and is also given, as feedback, the identity of the first edge that failed. The algorithm they design for this problem, called CombCascade, essentially builds UCB estimates of the edge failure probabilities and picks the path with the highest estimated non-failure probability. They prove two upper bounds on the regret of their algorithm: one that is gap-dependent and one that is not. They argue that the upper bound matching the previously known tight upper bound for the stochastic semi-bandit problem (a more informative feedback setting) perhaps indicates that their algorithm is tight in settings that are "close" to the semi-bandit setting. The paper concludes with experiments on synthetic and real datasets that show sub-linear growth of regret for their algorithm, and that it performs better than CombUCB1, a stochastic combinatorial semi-bandit algorithm.
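The conjunctive feedback model summarized above can be sketched in a few lines (a minimal simulation; the edge ids and `fail_prob` values are illustrative, not from the paper):

```python
import random

def cascade_feedback(path_edges, fail_prob):
    """Simulate conjunctive cascading feedback for one chosen path.

    path_edges: ordered list of edge ids; fail_prob: dict edge -> failure prob.
    Edges fail independently, as assumed in the paper. Returns
    (reward, first_failed_index): reward 1 and None if every edge survives,
    else reward 0 and the 0-based index of the first failing edge.
    """
    for i, e in enumerate(path_edges):
        if random.random() < fail_prob[e]:
            return 0, i  # only the position of the first failure is observed
    return 1, None
```

Note that the agent learns nothing about edges after the first failure, which is what distinguishes this feedback from the semi-bandit setting.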

There are a couple of limitations of the study. The first, as the authors also point out, is the rather strong assumption that the graph edges fail independently. [Aside: I appreciate the authors' honest comments pointing out the strongest and weakest parts of their paper; I wish more papers did that, not so much to simplify the referee's task as to stimulate any reader of the paper.] The second is that their work points to tantalizing connections to the semi-bandit setting, but these are not clarified well. This is perhaps food for future work. For instance, the weight estimation stage of their algorithm hints at a possible reduction between the two settings. On the other hand, the presence of interesting practical problems in their setting argues for acceptance of the paper despite these issues.
Q2: Please summarize your review in 1-2 sentences
The authors design bandit algorithms in a new combinatorial cascading bandit setting. They prove regret bounds for the algorithm they design. They evaluate their algorithm on synthetic and real examples.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper formulates the combinatorial cascading bandit problem, where the decision space consists of sets of items and the reward is defined in a conjunctive or disjunctive manner. This problem is different from the well-known combinatorial semi-bandit setting, whose reward is defined as a linear function of the decision. The authors propose an algorithm for the bandit problem and prove gap-dependent and gap-independent regret bounds (the former O(log n); n: # of trials). Finally, the authors show experimental results for three real data sets.

Comments: Quality: The technical contribution of the paper is presented very well in terms of both theoretical and empirical aspects. The regret bounds are reasonable, but there is no lower bound yet.

The time complexity of the proposed algorithm depends on maximizing the objective f over the decision set A. It would be nice if the authors could discuss in more detail how efficiently this optimization problem can be solved for particular combinatorial decision sets.

Clarity: The paper is concisely organized and well-written.

Originality: An interesting contribution of the paper is to propose a bandit problem with a nonlinear objective (disjunctive or conjunctive) and its solution. This contribution seems new, whereas previous approaches consider linear objectives. However, curiously, the analysis of the paper heavily depends on the previous work of Kveton et al.

Significance:

The significance of the paper lies in the bandit algorithm for optimizing conjunctive/disjunctive objectives and in its analysis.

Q2: Please summarize your review in 1-2 sentences
This paper investigates a new combinatorial stochastic bandit problem with a nonlinear objective. The proposed solution for optimizing the nonlinear objective is significant.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors study a setting called combinatorial cascading bandits, where the utility is the minimum utility of a set of actions, and the feedback is the observed utility and the first action that causes the utility to be 0 (utility is either 1 or 0).

Algorithms are proposed with theoretical guarantees and empirical validation.

As a whole, this paper is solid, with rigorous problem definitions, algorithm development & analysis, and empirical validation.

However, the research contribution of this paper compared to the CascadeUCB paper seems a bit marginal, and I'm having a hard time determining the magnitude of the contribution.

I hope the authors can more clearly state the technical novelty of this work compared to previous work.

Right now, it appears to just be a relatively modest twist, with a slightly modified algorithm and basically the same type of analysis.

So I'm not sure if the contributions justify a NIPS paper.

For instance, it's possible I'm under-estimating the technical challenges in adapting the theoretical analysis from previous work to this setting.

Minor comments:

-- The paper is missing a discussion of the relationship with ranked bandits, especially with the disjunctive objective version.

*** RESPONSE TO AUTHOR REBUTTAL *** I thank the authors for their rebuttal.

The rebuttal sharply describes the difference from previous work, and has suitably addressed my main concerns.
Q2: Please summarize your review in 1-2 sentences
This paper is solid and well-written, but the magnitude of the research contribution relative to previous work is unclear.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Dear reviewers,

We wanted to thank you for your reviews, praising the technical quality of our paper and its presentation, and suggesting acceptance in 4 out of 6 cases. Our rebuttal addresses two major issues raised by your reviews: novelty with respect to [9] and [11], and tightness of our analysis.

*** Novelty ***

Our work is novel in several aspects:

1) We propose a new learning setting for stochastic partial monitoring with combinatorial action sets that can model many interesting real-world problems. One of these problems is learning a routing path that fails with the lowest probability. The feedback is the index of the first failing path segment. Another problem is learning to recommend a list of items subject to diversity constraints that satisfy the user with the highest probability. The feedback is the index of the first satisfactory item. We evaluate our learning algorithm on two real-world datasets and show that it is practical.

2) Our algorithm, CombCascade, is not that of [9] and [11]. CombUCB1 [11] chooses solutions with the largest *sum* of the UCBs of items. CascadeUCB1 [9] chooses the K items with the largest UCBs. CombCascade chooses solutions with the largest *product* of the UCBs of items. In the setting of [9], where any list of K items is feasible, both CascadeUCB1 and CombCascade can learn the optimal solution. However, under general constraints, it is critical to maximize the product of the UCBs; CombUCB1 may lead to suboptimal solutions, as we show in Section 4.1.
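The difference between the two selection rules can be illustrated with a small sketch. This is not the exact algorithm of [9] or [11]; the UCB constant and the toy weights below are hypothetical, chosen only so that the two rules disagree:

```python
from math import log, sqrt, prod

def ucb(mean, pulls, t):
    # A standard UCB1-style index, clipped to [0, 1]; the constant 1.5
    # is illustrative, not the exact constant used in [9]/[11].
    return min(1.0, mean + sqrt(1.5 * log(t) / pulls))

def best_by_sum(feasible, means, pulls, t):
    # CombUCB1-style rule: maximize the sum of item UCBs.
    return max(feasible, key=lambda A: sum(ucb(means[e], pulls[e], t) for e in A))

def best_by_product(feasible, means, pulls, t):
    # CombCascade-style rule: maximize the product of item UCBs,
    # i.e. an optimistic estimate of the probability that no item fails.
    return max(feasible, key=lambda A: prod(ucb(means[e], pulls[e], t) for e in A))

# Toy instance: solution (0, 1) has two medium-weight items, solution
# (2, 3) pairs a strong item with a weak one. The sum prefers (2, 3)
# (1.1 > 1.0), the product prefers (0, 1) (0.25 > 0.18).
means = {0: 0.5, 1: 0.5, 2: 0.9, 3: 0.2}
pulls = {e: 10**9 for e in means}  # many pulls, so UCBs are near the means
feasible = [(0, 1), (2, 3)]
```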

3) The proof of Theorem 1 is completely different from those of Theorems 2 and 3 in [9]. The proofs in [9] are based on counting the number of times that each suboptimal item is chosen instead of any optimal item. This argument requires that the items in the solutions are exchangeable. Therefore, it can only be applied to special combinatorial sets, such as matroids. We utilize a part of the analysis of [11] (pages 12 and 13) to achieve linear dependency on K in Theorem 1. The rest of our analysis, including all technical lemmas and the proof of the gap-free bound, is completely novel.

Our main technical contribution is the reduction of our analysis to that of stochastic combinatorial semi-bandits. Roughly speaking, the analysis of CombCascade can be trivially reduced to semi-bandits by conditioning on the event of observing all items (lines 231-237), and increasing the expected regret for choosing any solution A from \Delta_A to \Delta_A / p_A, where p_A is the probability of observing all items in A. This reduction is problematic because it overestimates the regret when p_A is small and yields a regret bound that depends on a potentially huge \max_A (1 / p_A).

We address this issue by formalizing the following insight into our problem. If the expected reward for choosing A is small, the learning agent can distinguish A from A^\ast without knowing the expected weights of all items in A (lines 238-243). Technically speaking, the learning agent acts implicitly on the prefixes of suboptimal solutions, and we choose them in our analysis such that the probability of observing all items in these prefixes is "similar" to that of observing all items in A^\ast, and the gaps are "similar" to those of the original solutions. Then we count the number of times that the prefixes are chosen instead of A^\ast when all items in the prefixes are observed. Our analysis relies heavily on several structures in our problem, such as that our reward and observation models are closely related, [expected reward for choosing A] = p_A x [expected weight of the last item in A]. We are unaware of any prior work that exploits such structures in combinatorial learning. Our reduction yields an extra factor of 8 / [expected reward for choosing A^\ast], which is typically a significant improvement over \max_A (1 / p_A).
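The identity relating the reward and observation models is easy to verify numerically (the expected weights below are hypothetical, chosen only for illustration):

```python
# For a solution A with expected item weights w, the expected reward is
# the product of all weights, and p_A (the probability of observing all
# items in A) is the product of all weights but the last. Hence
# [expected reward for choosing A] = p_A * [expected weight of last item].
w = [0.9, 0.8, 0.7]  # hypothetical expected weights along A

p_A = 1.0
for x in w[:-1]:
    p_A *= x

expected_reward = 1.0
for x in w:
    expected_reward *= x

assert abs(expected_reward - p_A * w[-1]) < 1e-12
```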

*** Tightness of our analysis ***

The lower bound in Theorem 4 of [9] applies to the disjunctive variant of our problem and is \Omega((L - K) (1 / [K \Delta]) \log n) for p = 1 / K. This indicates that our upper bound in Theorem 1 is off by a factor of at most O(K^2). We believe that this factor is intrinsic to CombCascade, and we show this on the following class of problems. The ground set consists of L = K^2 items. The feasible set consists of K tuples:

\Theta = {(1, ..., K), (K + 1, ..., 2 K), ..., (L - K + 1, ..., L)}.

The expected weights of the first K items are 1. The expected weights of the remaining L - K items are (1 - K \sigma)^{1 / K}. Therefore, the expected rewards for the optimal and suboptimal solutions are 1 and 1 - K \sigma, respectively. The gap is K \sigma. Our empirical results are reported at:

https://dl.dropboxusercontent.com/u/43441773/RegretCombCascade.pdf

For all \sigma, the regret in n = 10^6 steps is linear in L - K. This is suggested by our upper bound, which is O(K (L - K) (1 / [K \sigma]) \log n). We do not know yet if the gap between the upper and lower bounds can be closed.
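The expected rewards and the gap in this construction can be checked directly (the values of K and \sigma below are hypothetical, chosen only for illustration):

```python
# Hard instance from the rebuttal: L = K^2 items, K disjoint K-tuples.
K, sigma = 4, 0.05
L = K * K

# The first K items have expected weight 1; the remaining L - K items
# have expected weight (1 - K*sigma)^(1/K).
weights = [1.0] * K + [(1 - K * sigma) ** (1.0 / K)] * (L - K)

# Feasible set: the K disjoint tuples (1,...,K), (K+1,...,2K), ...
feasible = [tuple(range(j, j + K)) for j in range(0, L, K)]

def expected_reward(A):
    # Conjunctive reward: product of the expected weights of the items in A.
    r = 1.0
    for i in A:
        r *= weights[i]
    return r

rewards = [expected_reward(A) for A in feasible]
# The optimal tuple has expected reward 1; every suboptimal tuple has
# expected reward 1 - K*sigma, so the gap is K*sigma, as claimed.
```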