NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 657 Causal Bandits: Learning Good Interventions via Causal Inference

### Reviewer 1

#### Summary

The paper presents a bandit framework that incorporates causal aspects. Given a structural equation model (SEM) and a list of possible interventions, the agent chooses one intervention and observes all variables in the SEM, including the reward. The additional observations (besides the reward) provide additional information and an opportunity to outperform bandit algorithms that do not use these observations. The most convincing analysis is for the case where the reward has multiple causes that the agent cannot simultaneously control. For instance, if the agent chooses to fix one cause, the observation of the other causes provides the means to reduce the reward noise. The extension to general graphs relies on the assumption that one knows the probabilities of the causes of the reward under each intervention. I believe this is too strong an assumption to be meaningful.

#### Qualitative Assessment

Overall I find it interesting to see how causal knowledge and additional observations lead to improved bandit algorithms. This has important implications because, in all practical problems that have been advertised as perfect for bandits, there are opportunities to observe additional variables connected to the arm and the reward in known or unknown causal ways. I regret however that the authors have to make excessively strong assumptions to treat the general case.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

The paper studies a variant of the multi-armed bandit problem, in which the goal consists in identifying good interventions when information about the causal dependencies is available. The paper presents two algorithms, one intended for the special case where the factors have causally independent contributions to the effect variable, and one for a "general" case. In particular, the general case is conceptually simplified to a multi-armed bandit problem with stochastic actions and correlated arms (= the conditional distributions over the direct causes of the effect variable given an intervention), which allows leveraging some of the insights gained in the analysis of the parallel bandit. Simple regret bounds are provided. Finally, the authors evaluate the two proposed algorithms on the parallel bandit (comparing to another bandit algorithm).

#### Qualitative Assessment

Pros:

- The problem of leveraging causal information in bandits is relatively unexplored and important. The authors provide insight into the challenges of exploiting additional causal structure. In particular, I liked the measure $m$, which may be thought of as the "effective number of arms" in a parallel bandit problem.
- The mathematical analysis is clean and sound.

Cons:

- I fail to understand exactly in what way the causal information has been used to derive the analysis presented in the paper. Some of the major pitfalls in causal inference, such as the presence of confounders, seem to have been bypassed by the requirements for Algorithm 2.
- The experiments were not as thorough as they could be. Is there an easy way to compare the performance against some version of UCB or Thompson sampling? Furthermore, why isn't there an experiment with a bandit that isn't parallel?
- There are some missed opportunities. In particular, I would have preferred a more thorough analysis of the graph structure's impact on the bandit problem. It appears that some of the insights and major technical difficulties are unrelated to the causal nature of the problem.

POST-REBUTTAL: Thank you very much for addressing my comments. I have one comment regarding the Ortega & Braun citation in line 88. The point of that paper (and also of the JAIR 2010 paper by the same authors) is that Thompson sampling is a *consequence* of treating actions as causal interventions in a causal dynamical model, not an "extension of Thompson sampling" as you wrote in the text.
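To make the "effective number of arms" intuition concrete, here is a minimal sketch of how $m(q)$ might be computed for a parallel bandit. The thresholded definition used below, m(q) = min{tau : |{i : min(q_i, 1-q_i) < 1/tau}| <= tau}, is my reading of the paper's construction; the exact form in the paper may differ.

```python
def m_of_q(q):
    """Sketch of the 'effective number of arms' m(q) for a parallel bandit,
    assuming m(q) = min{tau : |{i : min(q_i, 1-q_i) < 1/tau}| <= tau}."""
    N = len(q)
    for tau in range(1, N + 1):
        # Causes whose value is too rarely observed to estimate passively.
        hard_to_observe = [qi for qi in q if min(qi, 1.0 - qi) < 1.0 / tau]
        if len(hard_to_observe) <= tau:
            return tau
    return N

# Balanced causes are cheap to estimate from observation alone: m collapses to 2.
print(m_of_q([0.5] * 10))   # 2
# Skewed causes must be set by direct intervention: m grows toward N.
print(m_of_q([0.01] * 10))  # 10
```

Under this reading, m interpolates between 2 (all causes balanced, passive observation suffices) and the full number of variables N (all causes rare, every arm must be tried directly).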

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

The bandit problem is one of the most interesting topics in artificial intelligence. In my understanding, similar to Bareinboim et al. (2015), the main contribution of this paper is to connect causal inference with the traditional stochastic bandit problem.

#### Qualitative Assessment

My concerns are as follows:

First, the discussion of this paper is based on the "intervene-then-observe ordering of events within each round". According to the authors' statement that "the context is observed" in Bareinboim et al. (2015), that work mainly focuses on the stochastic bandit problem in the presence of unmeasured confounders. However, in my understanding, the formulation by Bareinboim et al. (2015) can also be applied to the case where the context is observed after the intervention. Thus, the problem can be solved by Bareinboim et al. (2015) without the proposed approach. Please give a detailed discussion of the difference between Bareinboim et al. (2015) and the proposed approach.

Second, I wonder what can be solved by the parallel bandit algorithm but cannot be solved by the existing algorithms. Since P(Y|do(X_i=j)) = P(Y|X_i=j) holds, there seems to be no significant difference between the traditional bandit algorithms and the proposed algorithm. In addition, since it is assumed that we can conduct an intervention in this paper, in my understanding, the problem of the choice probability P(X_j) of X_j does not occur (i.e., "the action for which P(X_i=j) is small" is not a problem).

Third, the authors mention that "we study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment", but I could not see where the authors use the property "online" in the proposed approach. I wonder whether the existing approaches are not also based on online learning. Please clarify what "online" means in this paper and where the property is used in the proposed approach.

In addition, the authors mention that Wu et al. (2015) "need to know the likelihood of each factor in advance", but the authors assume that the conditional interventional distributions are known. I think the proposed approach has a similar deficiency to Wu et al. (2015).
Overall, the authors emphasize the differences between the proposed approach and existing approaches in Section 1, but it seems to me that such differences are small. Please clarify the merits of the proposed approach compared to the existing approaches.
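The identity P(Y|do(X_i=j)) = P(Y|X_i=j) invoked above holds in the parallel graph precisely because the causes are mutually independent (no confounding). A toy simulation illustrating this, with all probabilities invented for the example:

```python
import random

random.seed(0)

def sample(do_x1=None):
    # Parallel graph: X1 and X2 are independent causes of Y, no confounding.
    x1 = do_x1 if do_x1 is not None else (random.random() < 0.3)
    x2 = random.random() < 0.7
    y = random.random() < 0.2 + 0.5 * x1 + 0.2 * x2
    return x1, x2, y

n = 200_000
# Observational estimate of P(Y = 1 | X1 = 1): condition on having seen X1 = 1.
obs = [y for x1, _, y in (sample() for _ in range(n)) if x1]
p_obs = sum(obs) / len(obs)
# Interventional estimate of P(Y = 1 | do(X1 = 1)): force X1 = 1.
intv = [y for _, _, y in (sample(do_x1=True) for _ in range(n))]
p_do = sum(intv) / len(intv)
print(round(p_obs, 3), round(p_do, 3))  # the two estimates agree (both near 0.84)
```

Adding a common hidden cause of X1 and Y would break the agreement, which is the confounded regime Bareinboim et al. (2015) target.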

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

The paper examines a bandit scenario, where the actions are interventions and the agent gets the causal DAG of the environment as additional information. First, a simple special case is considered where there is no confounder. Then the general scenario is considered, although, for simplification, it is assumed that a certain conditional is known. Algorithms, theoretical analysis and simulation experiments are presented.

#### Qualitative Assessment

Strengths and weaknesses: The clarity of the main parts has clearly improved compared to the last version I saw as an ICML reviewer. Generally, it seems natural to investigate how causal models can help autonomous agents. The authors present an interesting proposal for how this can be done in the case of simple bandits, delivering scenarios, algorithms, mathematical analysis and experimental analysis.

However, the contribution also has strong limitations. The experiments, which are only on synthetic data, seem to show that Algorithms 1 and 2 outperform what the authors consider as a baseline ("Successive Rejects") in most cases. But how strong a result is that, given that the baseline does not use the extensive causal knowledge? In the supplement, I only read the proof of Theorem 1: the authors seem to know what they are doing, but they spend too little time making the non-trivial implications clear, while spending too much time on trivial reformulations (see more detailed comments below). In Section 4, assuming the conditional for the objective variable Y to be known seems pretty restrictive (although, in principle, the results seem to be generalizable beyond this assumption), limiting the usefulness of this section.

Generally, the significance and potential impact of the contribution strongly hinge on whether (1) it (approximately) makes sense that the agent can in fact intervene, and (2) it is a realistic scenario that strong prior knowledge in the form of the causal DAG (plus the conditionals P(PA_Y|a) in the case of Algorithm 2) is given. I'm not sure about that, since this is a problem the complete framework proposed by Pearl and others is facing, but I think it is important to try and find out! (And at least informal causal thinking and communication seem ubiquitously helpful in life.)
Further comments: Several notes on the proof of Theorem 1 in the supplement:

- Typo in l432: a instead of A.
- The statement in l432 of the supplement only holds with probability at least 1 - \delta, right?
- Why don't you use equation numbers and simply say which equation follows from which other equations?
- How does the inequality after l433 follow from Lemma 7? It seems to follow somehow from a combination of the previous inequalities, but please facilitate the reading a bit by stating how Lemma 7 comes into play here.
- The authors seem to know what they are doing, and the rough idea intuitively makes sense, but for a reader who is not an expert in concentration inequalities the proof is too difficult to follow; this needs to be improved.

Generally, as already mentioned, the following is unclear to me: when does a physical agent actually intervene on a variable? Usually, it only has perfect control over its own control output; any other variable "in the environment" can only be controlled indirectly, and so the robot can never be sure whether it actually intervenes (in particular, outcomes of apparent interventions can be confounded by the control signal through unobserved paths). For me it is hard to tell whether this problem (already mentioned by Norbert Wiener and others, by the way) is a minor issue or a major one. Maybe one has to see the ideal notion of intervention as just an approximation to what happens in reality.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

I have read the author feedback, and stand by my recommendation for accepting the submission as a poster. The paper studies a novel problem formulation that can apply to several real world problems (e.g., many applications that are currently modeled as contextual bandits). After reading the author feedback, I am suitably convinced that this formulation is not trivially solved by partial monitoring or Bareinboim et al, 2015 etc. I can see several follow-ups that study the connections between graph properties of the underlying causal graph and simple/cumulative regret and jointly learning causal structure online, etc. more thoroughly. Alg2 and its assumptions are unfortunately too limiting, but the basic ideas can spur further research. --Earlier review follows below-- The paper studies best arm identification in multi-armed bandits, where on each round, a reward as well as a bunch of covariates are revealed to the learner (like in partial monitoring games). These covariates influence the reward through a known causal model, and this additional information is used to develop an algorithm with minimax optimal simple regret (up to log factors) for "parallel" causal graphs (covariates independently influence reward). Moreover, an algorithm that requires additional information about conditional intervention distributions (to estimate a quantity m < #actions) provably achieves \tilde{O}(sqrt(m/T)) simple regret. Simulations show that these methods can indeed perform better than the Successive Reject algorithm by knowing the correct causal model.

#### Qualitative Assessment

The paper was clear to follow and introduces a novel model in multi-armed bandit problems with well-analyzed algorithms. I appreciate that code implementing the experiment was available in the supplementary material. My comments below focus only on the weaknesses.

The paper motivates a nice model of reward-feedback in the multi-armed bandit problem. However, please elaborate on the differences between this model and combinatorial partial monitoring (e.g. Tian Lin et al., Combinatorial Partial Monitoring Game with Linear Feedback and Its Applications, ICML 2014). In particular, are there suitable choices of encoding the outcome vector, action matrices, etc. in CombPM that recover every causal model instance? Or the other way around? Or are the two incomparable (with a counter-example in each direction)?

The simulated experiment was a weak point of the paper. First, I don't understand why lil'UCB was not simulated (rather than, or in addition to, Successive Reject). Secondly, I do not understand the comment about Alon et al.'s algorithms for adversarial bandits with feedback graphs.
Yes, a malicious adversary can set q = 0, but I don't see why Algorithm 1 or 2 (or their variants) would do any better than their algorithm in that case (m(q) = N then). Thirdly, the poor performance of Algorithms 1 and 2 in Fig 2(a) only convinces me that the asymptotic regret analysis gives an incomplete picture of these algorithms' performance (for instance, being reward-oblivious achieves nearly minimax-optimal simple regret!). I would like to see a discussion of how better to characterize their performance (e.g. would cumulative regret be the right way to tell these algorithms apart? Would the idea sketched in Section 6 then be nearly minimax-optimal too?).

Assuming that the conditional intervention distributions are known is, as the paper admits, quite limiting. I can also intuitively see that we need some way to link the action we take to how it affects the outcomes for the parents of Y in the causal graph (as well as the reward outcome Y). Is this assumption equivalent to saying that the results provably work only for causal models where all non-Y nodes are parents of Y (so, a parallel graph with additional edges between the X_i), and intervening on some subset of the non-Y nodes are the allowed actions? If so, I find the chain graph example distracting (yes, multiple intervention rewards can be observed there, but we cannot really exploit that without additional knowledge).

Interesting choice of truncation, and I found the comment about empirical results when B_a is set higher intriguing. Am I correctly understanding that clipping as Y_t min{B_a, R_a(X_t)} rather than Y_t R_a(X_t) I{R_a(X_t) < B_a} will introduce less bias and could work better (although the worst-case analysis does not change)? On line 280: no truncation was performed for Algorithm 2, so could it also be unbounded variance hurting Algorithm 2?

Minor comments:

- 201: P{X_1 = 1} = 0.
- 321: "reach round" -> "each round".
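The two truncation schemes contrasted above can be compared on a toy importance-weighting setup (all numbers invented; this is not the paper's exact estimator). The indicator form Y_t R I{R < B} discards a sample whose weight reaches B entirely, while the min form Y_t min{B, R} keeps it at the clipped weight, retaining more of the signal:

```python
import random

random.seed(1)

# Toy setup: the target arm is logged with small propensity p; the importance
# ratio R is 1/p when the logged action matches the target, 0 otherwise.
p, B, n = 0.05, 10.0, 500_000
true_mean = 0.6  # E[Y | target arm], the quantity both estimators try to recover

zeroed, clipped = 0.0, 0.0
for _ in range(n):
    match = random.random() < p
    R = (1.0 / p) if match else 0.0
    y = 1.0 if random.random() < true_mean else 0.0
    zeroed  += y * R * (1.0 if R < B else 0.0)  # indicator: drops the sample when R >= B
    clipped += y * min(B, R)                    # min-clipping: keeps it at weight B
print(zeroed / n, clipped / n)
```

With these numbers R = 20 > B whenever the arm matches, so the indicator estimator returns 0 (every informative sample is discarded), while the clipped estimator recovers roughly half the true mean: strictly less bias, as the reviewer conjectures, while both remain bounded by B so the worst-case analysis is unchanged.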

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

The paper introduces a new class of bandit (sequential decision making) problems called Causal Bandit problems, set in the context of a known causal graph. In each round of a sequence, an action is taken that sets the values of a subset of the variables in the graph. The values of all other variables (including a designated reward variable) are then observed, conditioned on the intervention made. These steps are then repeated. The goal is to minimize the regret (the expected reward lost relative to the best action). The framework is similar to that of stochastic contextual bandits, but the order of action and observation is reversed. The paper introduces algorithms for a special case and the general case of the problem, and provide regret bounds for each. The algorithms are (briefly) evaluated empirically.

#### Qualitative Assessment

Addressing questions of causality in machine learning is important. The paper makes a very natural generalization of bandit problems, placing it in the context of a causal graph. I think the work takes good initial steps but doesn't go very deep into causal inference, given that the causal graph is known and that Algorithm 2 relies on knowledge of the conditional interventional distributions. The paper is well structured and well written. It was a pleasure to read.

As you identify, the major drawback of Algorithm 2 is that it requires knowledge of the conditional distribution. On the other hand, Algorithm 1 relies on the assumption that the causal graph is "parallel". In practice, this assumption is often not true. Does Algorithm 1 perform better or worse than standard bandit algorithms in this setting? What happens if there is confounding as in Fig 1b? It seems to me that you have only evaluated empirically the setting that corresponds to the parallel case. While this is natural as it conforms with theory, it would be interesting also to see what happens when practice diverges from theory.

I would like to understand the relation to contextual bandits a bit better, and I would have liked to see an example problem that highlights the need for causal bandits. It seems that in the farmer example, you could potentially observe the environment before taking action, making it a contextual bandit problem. Which would perform better? Regarding the difference to contextual bandits, you say in the related work section that "the extra observations are only revealed after selecting an intervention and hence cannot be used as context". While this is true, your framework does admit the empty do() action, and hence these "context" observations could be collected in the previous time step. Clearly, the context could change between steps, but I think the relation is more than "superficial" as you put it.
In Section 6, you identify several important open problems related to your work. I get the feeling from your text that you already have ideas on how to approach some of them. For example, when discussing use of the reward signal, you mention a suggested modification and its pros and cons. Is there a reason why this is not included in the main body of the paper?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)