Paper ID: 293
Title: Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Reviews

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: The paper presents a hard clustering algorithm for batch sequential continuous data. The algorithm is derived by performing a low variance asymptotic analysis of the Gibbs sampling algorithm for the dependent Dirichlet process Gaussian mixture model (DDPMM).
=====
This is a well written paper that is easy to follow. The work is technically sound and I only have a few minor concerns.

Originality: Moderate. The authors perform a small variance asymptotic analysis for the DDPMM Gibbs sampler, and although a similar analysis has previously been performed on the Dirichlet process mixture Gibbs sampler, the model and algorithm considered here are sufficiently different.

Significance: The paper is interesting in two ways: 1) It demonstrates that it is possible to perform small variance analysis on models more sophisticated than Dirichlet process mixtures, and 2) It develops an efficient hard clustering algorithm for time-evolving data.
=====

Detailed Comments:
1) It appears that the analysis depends critically on the transition and observation distributions being Gaussian. Are extensions to models using discrete observation and transition distributions straightforward?
2) It is a little concerning that the small variance analysis is performed on a particular inference algorithm (Gibbs sampling) for the DDPMM. Can something be said about the model directly, independent of the inference algorithm?
3) Selection of the free parameters of the algorithm seems cumbersome; performing an exhaustive parameter search is often not feasible for large datasets. How were the parameters chosen for the aircraft trajectory data?
4) In the toy experiment, how were the Gibbs sampler and VB initialized? Particularly poor initializations, e.g., all data being assigned to the same cluster, may be partly responsible for the large performance gaps. Were multiple runs of VB performed?
5) It is difficult to judge the aircraft trajectory clustering results from the visualization alone. Quantitative numbers would be useful. How about L2 error on held-out trajectories? This error would obviously be minimized if each trajectory is its own cluster, but it can still be useful in comparing two clusterings with comparable numbers of clusters.
6) How were the Gibbs sampling results unreliable?
Q2: Please summarize your review in 1-2 sentences
Overall, this is a solid piece of work and a well written paper. There are some minor concerns with the experimental section.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: The paper presents a deterministic inference algorithm for a clustering
problem for batch sequential data (data in each batch exist in clusters, and
clusters across batches are related as per some evolving structure). The Dependent
Dirichlet Process Gaussian mixture model (DDP-GMM) is a nonparametric Bayesian method
for dealing with such data. The paper uses the idea of doing low-variance asymptotics
on the Gibbs sampling update equations in a DDP-GMM model (a similar approach was
recently proposed by Kulis and Jordan for DP-GMM in their DP-means algorithm) and
derives a deterministic algorithm that (1) is more efficient to implement than Gibbs
sampling for the DDP-GMM, and (2) can learn the number of clusters just like the DDP-GMM.
Experimental results on a synthetic and a real dataset show that the proposed algorithm
runs faster
than other inference methods for DDP-GMM such as Gibbs sampling, variational inference,
and sequential Monte Carlo.

Quality: The technical quality is rigorous and the details appear to be correct.

Clarity: The paper is well-written and the exposition is easy to follow.

Originality: The paper is based on the recently proposed idea of using low-variance
asymptotics on the Gibbs sampling update equations in the case of DP mixture models.

Significance: The paper addresses an important problem (scaling up nonparametric
Bayesian methods for large datasets).

Paper Strengths:

- The proposed algorithm is intuitive, seems easy to implement, and is scalable to
large datasets.

- Although the basic recipe is borrowed from the DP-means work of Kulis and Jordan
(taking the Gibbs sampling updates and applying the low-variance asymptotics), the
ideas in section 3.2 (parameter updates) are rather novel in this context.

Paper Weaknesses:

- The algorithm has 3 free parameters. For an unsupervised algorithm (with no option
of doing cross-validation), this is a bit worrying. Although the paper gives some
suggestions on how to choose the parameters, it is not clear how well they will work
in practice.

- There should have been a comparison with a simple variant of the DP-means algorithm
(Kulis and Jordan, 2012) for batch-sequential data (see the suggestion in the comments
below on how to do it).

- On the real data, there is no comparison with variational inference or particle learning.


Comments:

- It would have been nice to compare the proposed dynamic-means algorithm with
a very simple adaptation of the DP-means algorithm: Run DP-means on one batch and
initialize means for the next batch using the results of this run. I would be
curious to see how dynamic-means compares against this alternative.
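
A rough sketch of the baseline I have in mind (the dp_means helper and names below are
only illustrative, assuming the standard squared-Euclidean DP-means rule with penalty lambda):

import numpy as np

def dp_means(X, lam, init_means=None, n_iter=50):
    # Standard DP-means (Kulis & Jordan): a point opens a new cluster when its
    # squared distance to every existing center exceeds lam. init_means
    # warm-starts the centers, e.g. from the previous batch.
    X = np.asarray(X)
    means = [] if init_means is None else [m.copy() for m in init_means]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            if not means:
                means.append(x.copy())
                labels[i] = 0
                continue
            d = [np.sum((x - m) ** 2) for m in means]
            j = int(np.argmin(d))
            if d[j] > lam:
                means.append(x.copy())
                labels[i] = len(means) - 1
            else:
                labels[i] = j
        for k in range(len(means)):
            pts = X[labels == k]
            if len(pts) > 0:
                means[k] = pts.mean(axis=0)
    return labels, means

def batch_sequential_dp_means(batches, lam):
    # The simple baseline: run DP-means on each batch, initializing the means
    # with the previous batch's result.
    means, all_labels = None, []
    for X in batches:
        labels, means = dp_means(X, lam, init_means=means)
        all_labels.append(labels)
    return all_labels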

- It turns out that instead of doing small-variance asymptotics on the Gibbs sampling
updates, one could also do MAP-based asymptotics to derive hard-assignment based
variants of nonparametric Bayesian models such as DP-GMM or IBP-based latent
feature models (there is a recent paper by Broderick et al. from ICML 2013:
"MAD-Bayes: MAP-based Asymptotic Derivations from Bayes"). Is it possible to derive
the dynamic-means algorithm using this approach? Please comment.

- There is work on temporal DP mixture models based on the recurrent Chinese restaurant
process, which is similar in spirit to dependent DP mixture models. Please see:
"Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant
Process: with Applications to Evolutionary Clustering" by Ahmed and Xing (2008).
It would be nice to have a discussion about this line of work. Please also comment on
whether doing low-variance asymptotics on these methods would lead to algorithms
similar to dynamic-means.

- In the DP-means algorithm of Kulis and Jordan, the order of data points mattered.
In the cluster assignment step of the dynamic-means algorithm, does the order of data
points (within a single batch) matter? Please comment.
Q2: Please summarize your review in 1-2 sentences
It is a reasonable paper, although somewhat incremental (based on the DP-means algorithm
by Kulis and Jordan). There might be some issues with its usefulness in practice (it
requires choosing 3 parameters), and the experimental evidence in the paper is somewhat limited.
But I still consider the paper to be taking a step (albeit small) in an important research
direction (leveraging the flexibility of nonparametric Bayesian models for large-scale datasets).

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors provide a novel small variance asymptotic analysis of the Dependent Dirichlet Process (DDP). The authors first show how to derive the small variance limits and then provide a deterministic algorithm for inference. They show that their algorithm monotonically decreases a (bounded) cost function, proving that their algorithm will converge. The authors next apply the new algorithm, Dynamic Means, to synthetic data, comparing it to several Monte Carlo based inference methods for the DDP and to the DP-Means algorithm. Finally, the authors apply their method to a real world problem of tracking aircraft trajectories, showing improved performance versus DP-Means and Gibbs sampling of the DDP.

This is a reasonably interesting paper. The idea of using small variance asymptotics to derive hard clustering algorithms is not novel. However, the specific application to the DDP does raise some novel technical challenges. Deriving an algorithm which works for correlated Dirichlet processes is interesting.

The main weakness of the paper currently is the lack of adequate description of the alternative methods used for benchmarking. In particular it is not clear:

1) How was hard clustering done for the Monte Carlo samplers? Was a maximum a posteriori estimate taken, or was a more sophisticated approach based on the posterior similarity matrices of the data points used?

2) What particle filtering or VB algorithms do the authors use?

3) The authors mention that they enforced consistency across time points when computing the accuracy metrics. Some clarification on this would help.

For the synthetic data it would be useful to plot the accuracy of the methods against run time. I am extremely surprised that the Dynamic Means method performs so much better than the sampling based approaches, and suspect this is largely due to under-sampling for the Gibbs sampler and PL. I realise the main point of the algorithm is that it performs much better than sampling methods at a similar computational budget, so I think an exploration of time vs accuracy is important.

Typos:

- Line 44 "in the both the" should be "in both the"

- Line 102 $n_{k t}$ should be $n_{k \tau}$

- Lines 405-406 I am not sure why a 10 x 20 grid creates a 400 dimensional feature vector; I would think it's 200.

Minor Comments:

- It would be nice if the elements in equations (3) and (4) were in the same order.

- The text on the figures is difficult to read. A larger font would be better.
Q2: Please summarize your review in 1-2 sentences
A reasonable paper which could pass the publication threshold with some revisions.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
The authors kindly thank the reviewers for their feedback. We have identified a few main points from the reviews that should be addressed: 1) The algorithm has too many free parameters, resulting in difficult tuning; 2) There is a lack of comparison with a modified DP-means algorithm; 3) Whether generalizations of the algorithm or alternate derivations are possible; 4) There is insufficient description and results in the experimental sections; and 5) Minor clarifications about aspects of the paper and the algorithm itself.

Concerning the number of parameters: A strong advantage of the present algorithm is that it possesses the combination of low computational cost and relatively smooth variation in performance with respect to its parameters. Due to these properties, automatic optimization methods are particularly applicable for parameter adjustment (e.g. gradient-based schemes[1] or Bayesian optimization[2]). There exists a large body of literature dedicated to addressing the issue of model selection[3], and as such these techniques were not mentioned in the paper; the authors will make a note of this in the final draft. Further, the existence of these parameters was not a design choice made by the authors. They are a direct result of each of the underlying processes (birth, transition, death) of the DDP mixture; any asymptotic analysis would yield a set of three parameters, as is required to describe those three actions. Finally, the authors provided a transformation to make parameter selection based on expected data characteristics more intuitive.
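
As a purely illustrative sketch of such automatic tuning (the run_clustering and score
functions are hypothetical stand-ins for the algorithm and any clustering-quality
objective; scikit-optimize's gp_minimize is used here only as one example of an
off-the-shelf Bayesian optimization routine):

from skopt import gp_minimize

def tune_parameters(run_clustering, score, bounds, n_calls=30):
    # run_clustering(lam, t_q, k_tau) returns a clustering of the data;
    # score(clustering) returns a quality value to maximize (e.g. accuracy on a
    # small labelled subset, or a held-out cost). gp_minimize minimizes, so the
    # score is negated.
    def objective(params):
        lam, t_q, k_tau = params
        return -score(run_clustering(lam, t_q, k_tau))
    result = gp_minimize(objective, bounds, n_calls=n_calls)
    return result.x  # best (lam, t_q, k_tau) found

# Example bounds for the three parameters (purely illustrative values):
# bounds = [(0.1, 10.0), (0.01, 5.0), (0.0, 1.0)]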

Regarding the proposed modified DP-means algorithm, the authors contend that the method has drawbacks which prevent devoting attention to it in a short paper. Essentially, as the cluster transition variance or death probability increases, the proposed method degrades. Two simple examples (among many) of this degradation: First, if a cluster moves further than lambda in one time step, it will always create a new cluster, and the parameter from the past time step will essentially never be used at all, preventing useful tracking; second, as the cluster death probability approaches 1, any new cluster near a recently deceased cluster will always be incorrectly linked to that deceased cluster. Correctly accounting for death and transition is what fundamentally sets the Dynamic Means algorithm apart from DP-means. In addition, the authors preferred to compare to more well-known methods, to demonstrate advantages over the state-of-the-art.
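
To make the first failure mode concrete, consider a small illustrative example
(hypothetical numbers, using the standard DP-means rule that compares squared distance
to lambda):

import numpy as np

lam = 1.0                          # DP-means penalty
old_mean = np.array([0.0, 0.0])    # warm-started center from batch t-1
x = np.array([2.0, 0.0])           # datum from the same cluster, which has moved at batch t

# Warm-started DP-means assignment: since the squared distance (4.0) exceeds
# lam (1.0), a brand-new cluster is opened and the old center is never linked
# to, so the track is broken.
if np.sum((x - old_mean) ** 2) > lam:
    print("new cluster created; old parameter unused")
else:
    print("assigned to the tracked cluster")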

It is possible to extend this work to non-Gaussian distributions, based on recent work on small-variance asymptotics for Dirichlet process mixtures of general exponential family distributions[4]. In terms of alternative derivations, there were two approaches mentioned by the reviewers: the Recurrent CRP, and MAD-Bayes. The former does not allow for the death of clusters, and the same holds true upon taking the low-variance asymptotic limit of the Gibbs sampler. Otherwise, a similar algorithm would result from a low-variance asymptotic analysis. A MAD-Bayes-like analysis of the DDP-GMM is possible; one would wrap the transition into the parameter prior, and account for the death probability in the exchangeable partition probability function. However, based on the low-variance MAP analysis of the DP/DP-means, the low-variance MAP and Gibbs sampler analyses are expected to be similar to one another. The authors therefore decided upon the Gibbs sampler-based approach for ease of understanding.

Regarding the experimental section, the authors agree that a time budget vs. accuracy comparison would be an insightful inclusion, and will add it in the final draft. In the ADS-B section, PL and VB timing results can be added in the final draft. The authors understand the desire for quantitative analysis; however, L2 error isn't a good comparison metric. Indeed, comparison of unsupervised clustering algorithms without labelled data is an open problem. Thus, the authors will hand-label a subset of the data, and provide accuracy results on that subset.

One of the reviewers was uncertain about the term "enforcing consistency" - this just means that data have to be clustered with the correct old parameter. This is of utmost importance in tracking the evolution of clusters over time. For Gibbs sampling/VB/PL, the results show reasonable clustering in each time step, but the data are being linked to the incorrect old clusters (poor tracking). Note that "enforcing consistency" is not a part of any of the algorithms; it is simply how accuracy was computed.
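
For concreteness, one way such a consistency-enforcing accuracy can be computed is
sketched below (a simplified illustration, not necessarily the exact procedure used in
the paper; labels are assumed to be 0-indexed integers):

import numpy as np
from scipy.optimize import linear_sum_assignment

def consistent_accuracy(true_labels, pred_labels):
    # true_labels / pred_labels: lists of per-time-step integer label arrays.
    # A single true-to-learned cluster matching is shared across all time
    # steps, so a learned cluster must track the same true cluster over time.
    t_all = np.concatenate(true_labels)
    p_all = np.concatenate(pred_labels)
    C = np.zeros((t_all.max() + 1, p_all.max() + 1))
    for t, p in zip(t_all, p_all):
        C[t, p] += 1                        # contingency counts over the whole sequence
    rows, cols = linear_sum_assignment(-C)  # one global matching, maximizing matched points
    return C[rows, cols].sum() / len(t_all)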

The comparison algorithms used in the experimental section were "PL for general mixtures" and "mean field variational inference" (as cited in the introduction). The authors agree that this was not clear in the experimental section, and citations will be added there in the final draft. Furthermore, comparisons for the sampling methods were performed conservatively (as described on lines 369-372): for these techniques, the highest-accuracy sample was selected, assuming knowledge of the true labeling. Further, the algorithms were initialized as follows (based on highest performance): Gibbs sampling was given "no initialization" (i.e. parameters and labels are introduced as necessary), and VB/PL were initialized randomly. Finally, the Dynamic Means algorithm's performance does depend on the order in which the data are assigned labels. The authors acknowledge that this may not have been clear in the exposition of the paper, and will clarify this in the final draft.

1 - Yoshua Bengio. "Gradient-Based Optimization of Hyperparameters", 2000.
2 - Jasper Snoek et al. "Practical Bayesian Optimization of Machine Learning Algorithms", 2012.
3 - Joseph Kadane et al. "Methods and Criteria for Model Selection", 2004.
4 - Ke Jiang et al. "Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models", 2012.