Paper ID: 573 Title: Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper develops a model for multifurcating trees with edge lengths and observed data at the tree leaves; the model is based on the beta coalescent from the probability literature. The authors develop an MCMC inference scheme for their model, in which they draw on existing work that uses belief propagation to perform inference for the Kingman coalescent (a special case of the beta coalescent in which all trees are binary). The particular challenge for inference here is that there are many more possible parent-child node relationships when parents can have multiple children (not just two). The authors seem to use a Dirichlet process mixture model (DPMM) at each node to narrow down the space of possible children subsets to consider. As the authors note, even inference with the Kingman coalescent is a hard problem. In experiments, they compare to the Kingman coalescent and hierarchical agglomerative clustering.

The Kingman coalescent is a popular modeling tool, so it is great to see a practical extension of the Kingman coalescent to the multifurcating case being explored for inference. The authors give a convincing discussion of the interest in multifurcating trees. The DPMM trick to narrow the space of children to consider for a parent is another neat idea here and seems to be validated experimentally. That being said, there seem to be some issues with readability and notation in the paper (in particular, it is difficult to read on its own without the Teh, Daume III, Roy paper and also its own supplementary material, and even then there are some confusing points). And while I think that it is useful to have a beta coalescent extension to the Kingman coalescent regardless, it seems surprising that the Knowles, Ghahramani 2011 work on multifurcating trees (which comes from the diffusion, rather than coalescing, perspective for building trees) is not included in the experimental comparison (or at least discussed more). Moreover, given that the main selling point here seems to be the use of multifurcating trees in modeling, it would be useful to see comparisons of the trees returned by the Kingman and beta coalescent approaches.

Some further major points:
* Page 2, lines 56-57: My understanding is that this sentence is saying that the Dirichlet diffusion tree (DDT) and Pitman-Yor diffusion tree (PYDT) do not include a generative model for edge lengths. Such a statement would be false. An integral component of both models is the distance between child and parent nodes, which is determined by a divergence function. These trees also typically come with a (Brownian-motion-derived) generative model for leaf node locations in some (typically Euclidean) space. That the DDT and Kingman coalescent are in some sense inverses has been exploited for modeling (see the recent work of Teh et al on fragmentation-coagulation processes). From this perspective, it seems disingenuous to claim that this paper is the first to provide a model of edge lengths for multifurcating trees since the PYDT does so. The contributions of the current work seem, rather, to be the development of inference with the beta coalescent; these presumably have their own merits over, and separate from, the PYDT (just as the Kingman coalescent and DDT have their own merits).
* A number of points in Algorithm 1 are unclear.
---- * Step 8: Is i local to s? Should it have an s subscript? Or should this incrementation be outside the for loop?
---- * Step 14: pi doesn't really seem to be defined in the main text. Page 2, line 67: The "tree structure" pi is mentioned but not defined, and Page 4, line 166: the "tree structure" theta_i is mentioned but not defined. How does pi differ from theta? Without definition, it's unclear what the subtraction and addition in Step 14 mean.
* There are other notational issues:
---- * It is unclear what the role of p_0(y_i) is or how it is chosen. It is not clear why the current authors diverge from Teh et al, who seem to use a base distribution at the root, and here instead use a distribution at each node. How does this distribution interact with the transition kernel? Z_{-\infty}(x) does not seem to be defined explicitly (Eq. 6).
---- * The superscript s seems to be introduced on page 4, line 168 without being defined or discussed. There later seems to be some confusion about what is particle-specific and what is not. E.g., line 241, page 5: what does Omega_i restrict? All particles? The notation for Omega_i later switches to an s dependence. If a DPMM is run for each (s,i) pair, it would be helpful to be explicit in the text about this. Conversely, in Eq. 11, why does lambda_{n_i - 1}^s have an s superscript? That seems to at least overload the notation on page 2, line 83.
* Some details and choices in the experiments were confusing, e.g.:
---- * Why was alpha = 1.2 chosen for the beta coalescent? This seems somewhat arbitrary. Likewise, why the choice of 5 chains with 5 particles per chain (page 7, line 364) vs 5 chains and 80 particles each later (page 8, line 401)? Also on line 364, how many dimensions were kept after using PCA?
---- * In the definition of the purity score, is the purity score averaged over all pairs?
---- * Page 7, line 338 suggests that the synthetic data comes from a beta coalescent generative process, but the appendix seems to demonstrate this is not true. That the generating process is not a beta coalescent should be clearer in the main text. Moreover, in the supplementary material, it's not clear how the actual nodes (as opposed to the number of nodes) for merging are chosen at each round. Presumably the true model is not used for synthetic data generation because it is too computationally complex; this is worth mentioning in the text.
---- * It's unclear why the effects described on line 345 on page 7 are evident. Does increasing the number of observations have no effect simply because the maximum (and best) score is already attained?
---- * How were the sample hierarchies in Figs. 3 and 4 chosen? Were they, e.g., just the final samples in the chain?

* Page 1, line 38 (and elsewhere): The term "bushy" is often used to describe binary trees (in particular, balanced binary trees). The authors here seem to use it to describe the multifurcating-ness of trees. If the authors wish to use the word "bushy", I would suggest defining their desired usage early on (at least before the first use in the main text besides section headings).
* Page 2, line 80: Be explicit about the Kingman coalescent representation as a Lambda coalescent. Namely, the measure Lambda is the Dirac delta measure at 0 in this case.
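For concreteness, the general Lambda-coalescent merger rate (standard in the probability literature; the notation here may differ from the paper's) could be stated as:

```latex
% Rate at which a fixed set of k of the b current lineages merges:
\lambda_{b,k} \;=\; \int_0^1 x^{k-2}(1-x)^{b-k}\,\Lambda(\mathrm{d}x),
\qquad 2 \le k \le b.
% Kingman: \Lambda = \delta_0 gives \lambda_{b,2} = 1 and \lambda_{b,k} = 0
% for k > 2, i.e., only binary mergers.
% Beta coalescent: \Lambda = \mathrm{Beta}(2-\alpha,\,\alpha), so
% \lambda_{b,k} = B(k-\alpha,\; b-k+\alpha) \,/\, B(2-\alpha,\;\alpha).
```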
* Page 2, line 103: Can anything specific be said to support the statement that this approach is "more scalable"? E.g., some asymptotic bounds/analysis?
* Page 4, line 187: Is this meant to be the computational complexity just for the ith coalescent event? If so, n_i - 1 or n_i should be used instead of n. If it is for all events, then since the Kingman coalescent considers every pair of nodes after each coalescent event in the tree, the cost of growing a Kingman coalescent would seem to be something like a constant times n^2 + (n-1)^2 + (n-2)^2 + ..., which is asymptotically like n^3, not n^2. And the corresponding result for the beta coalescent would not be obvious in this case.
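This asymptotic claim is easy to check numerically. A quick sketch (counting only the pairs examined per coalescent event, which is a simplification of the actual algorithm's cost):

```python
def kingman_pair_cost(n):
    # After each coalescent event the number of extant nodes drops by one;
    # considering every pair among m nodes costs m*(m-1)/2 comparisons,
    # so the total is sum over m = n, n-1, ..., 2.
    return sum(m * (m - 1) // 2 for m in range(2, n + 1))

# The total grows like n^3 / 6, i.e. cubically, not quadratically.
for n in [10, 100, 1000]:
    print(n, kingman_pair_cost(n), kingman_pair_cost(n) / n**3)
```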
* Page 5, line 217: Consider using a different notation for the identity matrix and the indicator function (cf. Eq. 9). Define the identity matrix notation at first use (here) instead of line 92, page 2 in the supplementary material. I'm not sure why ref [10] is referenced on page 5, line 217 or why this relationship between the mutation rate and Dirichlet concentration parameter is chosen.
* Eq. 12: Should this be t_b instead of t_{b_i}?
* In supplementary material Section 2, x_n is overloading notation with the main paper. Also, n_k should be explicitly defined, and also z_i^k (which appears to be 1\{ z_i == k\} ).

=== Update after Author Feedback ===

The Author Feedback addressed my concerns as much as seems possible in the constrained space. I am optimistic this will be a solid paper after a round of editing.
This paper provides practical inference for the beta coalescent, an extension to the popular Kingman coalescent that allows parents in learned hierarchies to have multiple children instead of just two. However, it is somewhat difficult to read on its own due to notational confusion, errors, and missing definitions, and it would benefit from a visual comparison of the binary and multifurcating trees returned by the Kingman and beta coalescent frameworks.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors of this paper propose a new inference algorithm for the beta coalescent when applied to discovering hierarchical structure in data. Their work generalises the belief propagation framework of [1] to the beta coalescent. Belief propagation and sequential Monte Carlo (SMC) techniques have been combined before ([1]) to infer the hidden tree structure in coalescent models. During inference, however, all possible sets of children to coalesce must be considered, and the number of such sets grows prohibitively for the beta coalescent, rendering inference intractable. To address this, the authors use a novel approach: a Dirichlet process mixture model (DPMM) as part of the SMC proposal. SMC selects a children set to coalesce from the DPMM; the DPMM thus effectively restricts the number of possible children sets, making inference tractable. The results on synthetic data, comparing the proposed algorithm with a greedy one that enumerates all possible children sets, illustrate that for small datasets the two algorithms are comparable. For bigger datasets, the greedy algorithm's prohibitive cost makes it inapplicable, underlining the advantage of the proposed one. Moreover, the authors compare the beta coalescent to Kingman's coalescent and hierarchical agglomerative clustering (hac) [2] on real datasets, showing that the beta coalescent outperforms both.
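As a toy illustration of why such a restriction helps (hypothetical function names and a fixed cluster assignment; the paper actually fits a DPMM per particle at each coalescent event, which this sketch does not attempt):

```python
from itertools import combinations

def all_children_sets(nodes, k):
    """Naive proposal: every size-k subset of the current nodes may coalesce."""
    return list(combinations(nodes, k))

def restricted_children_sets(nodes, cluster_of, k):
    """DPMM-style restriction: only propose subsets that fall entirely
    within a single cluster of the current mixture assignment."""
    clusters = {}
    for v in nodes:
        clusters.setdefault(cluster_of[v], []).append(v)
    return [s for members in clusters.values()
            for s in combinations(members, k)]

nodes = list(range(8))
cluster_of = {v: v // 4 for v in nodes}  # toy assignment: two clusters of 4
print(len(all_children_sets(nodes, 3)))                     # 56 candidate triples
print(len(restricted_children_sets(nodes, cluster_of, 3)))  # 8 candidate triples
```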

The authors present a novel idea for improving inference in beta coalescent models which, in my opinion, can effectively be used for $\Lambda$-coalescent models in general. However, I am confused about the focus of the paper. Both in the abstract and in Section 1, they make clear that their contribution is in inference. However, they also mention that they present results of the beta coalescent on real datasets. The beta coalescent is a known model and, to my knowledge, has been used extensively; I would not regard it as a main contribution of this paper. That said, I am pleased to see a comparison of the beta coalescent with Kingman's and hac. In their experiments on real datasets, they look at the structures found by the models and examine them qualitatively. Perhaps the authors could also include results on the predictive performance of the three models; that would be an interesting addition. Moreover, comparing the performance of the proposed inference algorithm to the simple enumeration one on real datasets would add to the paper.

--- Section 2 ---
line 055: The authors mention that the Pitman-Yor diffusion tree (PYDT) model of [3] does not distinguish edge lengths. I am not sure how correct this is. The edge lengths in the PYDT are determined by divergence times, in a similar way to how they are determined in the beta coalescent model the authors discuss.

Typos:
line 038: nodes --> node

The writing of the paper is clear, with only a couple of confusing points which I have already discussed. The proposed inference algorithm builds on an existing inference algorithm for coalescents; it is an incremental contribution. The paper also provides a qualitative comparison of the beta coalescent to Kingman's coalescent and hac, which is interesting.

[1]: Yee Whye Teh, Hal Daumé III, and Daniel M. Roy. Bayesian Agglomerative Clustering with Coalescents. In NIPS, 2008

[2]: Eads, D. Hierarchical clustering. Scipy, 2007.

[3]: Knowles D., Z. Ghahramani. Pitman-Yor diffusion trees. In UAI, 2011.

**************************************
***UPDATE after author's response***
**************************************
I want to thank the authors for their detailed rebuttal.
The paper presents an interesting inference algorithm for learning a Bayesian model of a beta coalescent. However, I am not sure whether this is supported enough by the actual presentation in the paper. If the authors added a comparison to other algorithms or models, I would be more convinced; either a comparison to other multifurcating models (like Knowles, Ghahramani PYDT) or other inference methods for models with beta-coalescent such as the ones in [19]. These would add to the paper. I do understand though, that the code for these models might not be publicly available, and asking for an implementation of these from scratch might be beyond the scope of this paper.
The current paper proposes a nice inference idea that contributes incrementally to the inference of coalescent models. The idea in itself is nice and could be incorporated into many similar models, but I am not sure whether it is significant enough. A clearer presentation of the contribution would be appreciated.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
An extension to the Kingman’s coalescent, the “beta coalescent,” is proposed. The beta coalescent allows for non-binary branching of the tree. The authors derive belief propagation and sequential Monte Carlo methods for inference. Experiments with synthetic and two real world datasets are performed comparing the beta coalescent to Kingman’s coalescent and hierarchical agglomerative clustering.

This paper presents a nice---if fairly straightforward---generalization of Kingman’s coalescent to the non-binary case. The usage of a DPMM as a proposal distribution for SMC to make inference tractable was clever and the justification provided for why the DPMM would be good proposal was strong. The experimental results provide further support for the usage of the DPMM.

While the paper was technically strong, the need for a “bushy” hierarchical clustering model was not well-motivated in the introduction. The authors should expound on at least 2 of the examples cited there. The authors claim “it is crucial to model hierarchies that are non-binary.” While I am personally inclined to agree with them, they provide no justification for the statement.

The two real data examples given were quite illuminating. I was intrigued by Fig 3: in the human tissue example only three non-binary branches were present, and in the newsgroup example there was only one non-binary branch. I am thus surprised that the beta model significantly outperformed the Kingman model. An explanation of this in the paper would be appreciated.

One other concern is that in the synthetic data example the beta model did not outperform the competitors, and the authors fail to provide an explanation as to why this might be.

UPDATE (based on author feedback):
The authors are thanked for their thorough response. The key points are for the authors to clarify some of the technical details pointed out by the other reviews, the conditions used in the experiments, and the interpretation of the experimental results.
A technically strong (if incremental) contribution with many potential applications. The paper would benefit from stronger motivation and further exposition in the results section.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the anonymous reviewers for their detailed comments on our paper presenting belief propagation applied to the Beta coalescent. In our response, we will review our goals in presenting our work, discuss the tough choices we made in exposition, comment on related work, and then cover remaining questions.

=== Motivation ===

Reviewer_6 is right; we do not claim that we are proposing the Beta coalescent; it is an established model. However, efficient inference methods (which have been applied to other Bayesian clustering algorithms) have not yet been applied to the Beta coalescent. Doing so would allow us to model bushy data more effectively.

Reviewer_7 points out that the need for bushy structures is not well motivated. We will better motivate the need for bushy structure and explain the challenges of modeling it in the introduction and background, in addition to the biological examples we explore later in the paper.

=== Exposition ===

Reviewer_5 thought that this paper was difficult to read without Teh et al’s paper. To conform to NIPS space limitations, we chose to assume that readers were familiar with Teh et al’s Kingman coalescent. We thank the reviewers for their concrete suggestions on how to ameliorate this dependence to make both the paper and the supplemental material more self-contained.

Because we are applying our method to larger trees, it is difficult to visually compare the output of the beta coalescent and Kingman's coalescent. Under NIPS space constraints, we chose to show the output of the beta coalescent because (a) it is the novel model and (b) a binary tree is harder to render. However, we offer many numerical comparisons between the methods, and we sympathize with Reviewer_5's preference for qualitative comparisons; we will gladly add them to the supplemental information.

To use space efficiently, we also represented trees using compact notation. This caused some confusion. Reviewer_7 reasonably inferred that the trees discovered in the newsgroup data were mostly binary; they are not, however, because we lumped subtrees containing only leaves into a single supernode (e.g., one node has 20 children, all of them leaf nodes).

=== Related Work ===

Reviewer_5 and Reviewer_6 correctly point out that diffusion trees also include edge lengths. We misstated this. However, Knowles and Ghahramani completely ignore edge lengths in their evaluations, which prevents us from knowing the quality of their edge lengths. Unlike their paper or other previous clustering algorithms, we do evaluate edge lengths. Thus, while this is not the first paper to model edge lengths, it is the first Bayesian clustering paper to evaluate edge lengths experimentally. We will clarify this distinction and correct our mistake.

=== Experiments ===

* The parameter alpha must be between 1 and 2. We compared different alpha values on the synthetic data set; with the DPMM proposal, alpha values outside of extreme cases do not change the outcome much, so we did not include this analysis in the paper.

* We use 5 chains throughout; for synthetic data we use 5 particles per chain, and on real data 80 particles per chain. We keep the first 10 dimensions of the features after dimensionality reduction.

* The purity score is computed as follows: for any pair of leaf nodes with the same class label, we find the smallest subtree containing these two leaves and measure the fraction of the leaves in this subtree with the same class label. The purity score is the expected value of this fraction over such pairs. This score is also used in Teh et al's paper and in Heller and Ghahramani's 2005 paper.
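A minimal sketch of that computation (the tree is encoded as parent pointers; all names and the toy example are illustrative, not from the paper):

```python
from itertools import combinations

def purity(parent, labels):
    """Average, over same-label leaf pairs, of the fraction of leaves sharing
    that label under the pair's smallest common subtree.
    `parent` maps each node to its parent (root maps to None);
    `labels` maps each leaf to its class label."""
    def ancestors(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    # Leaves under each node (a leaf counts as under itself).
    leaves_under = {}
    for leaf in labels:
        for a in ancestors(leaf):
            leaves_under.setdefault(a, []).append(leaf)

    scores = []
    for u, v in combinations(labels, 2):
        if labels[u] != labels[v]:
            continue
        anc_v = set(ancestors(v))
        lca = next(a for a in ancestors(u) if a in anc_v)  # smallest common subtree
        sub = leaves_under[lca]
        scores.append(sum(labels[w] == labels[u] for w in sub) / len(sub))
    return sum(scores) / len(scores)

# Toy tree: root 'r' with internal nodes 'a' (leaves 1, 2) and 'b' (leaves 3, 4).
parent = {1: 'a', 2: 'a', 3: 'b', 4: 'b', 'a': 'r', 'b': 'r', 'r': None}
labels = {1: 'x', 2: 'x', 3: 'x', 4: 'y'}
print(purity(parent, labels))  # (1 + 3/4 + 3/4) / 3 ≈ 0.8333
```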

* The sample hierarchies in Figure 3 and Figure 4 are the final sample from one chain.

Reviewer_6 suggested comparing exhaustive enumeration and the proposed method on real data sets. However, even on synthetic data, enumeration is only tractable for datasets with about ten observations; it would be impossible on real data, so we conducted these experiments on synthetic data.

Reviewer_7 correctly points out that the beta coalescent is better than existing clustering algorithms on real data but only comparable on synthetic data. Synthetic data are simple, and so inference can build binary trees that more closely approximate the true bushier structure. This is more difficult on real data.

=== Other issues ===

We heartily thank Reviewer 5 for their careful reading:

* Incrementing i should indeed be outside the for loop in Algorithm 1.

* We defined \pi on line 94: \pi_i is a partition of nodes at time t_i, while \theta_i is a subtree at time t_i. We will ensure that readers are reminded of these definitions.

* p_0(y) is defined as in Teh et al, except that we change notation from q(y) to p_0(y); both are an initial distribution at each node. This initial distribution, together with the messages from children, determines the local normalizer Z at that node. We also follow Teh et al in using the notation Z_{-\infty}(x) to denote the root normalizer; it is the same as Eq 5 in our paper.

* Superscript s was introduced implicitly on line 168: the notation without s introduces the concept, while adding the superscript s distinguishes between different particles, for example for \Omega_i and \lambda_{n_i - 1}.