Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	1805
Title:	Large-Scale Bayesian Multi-Label Learning via Topic-Based Label Embeddings

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

A Bayesian approach depending only on positive labels is presented for multi-label learning problems. I'm not so familiar with the framework but assumptions (6)-(9) should be more explained even if such priors are usual priors. Is there any restriction if D

> L, L > D or

L small

?

The authors focus on Gibbs sampling and EM, but other inference algorithms could be discussed.

Please give the timings in Table 1 for each algorithm.

Q2: Please summarize your review in 1-2 sentences

The Bayesian framework is well comprehensive and the paper is clearly written. The method is satifyingly justified, and the objective is clear.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

This paper addresses the problem of large-scale multi-label classification. The key feature of the proposed algorithm is the construction of a new likelihood model for multi-label data, through the introduction of auxiliary latent variables.

Inference in this model has a nice property: it scales linearly with the number of positive examples. Making the algorithm particularly suited for the sparse positive labels case. The generative model is simple enough that it is possible to compute all expectations required for the E-step in closed form, allowing for efficient optimization. The authors then compare their algorithm to several concurrent state-of-the-art algorithms in different datasets, obtaining very competitive results.

The paper is very well written, the derivations are clear and the experiments show compelling results.

Some specific comments:

The authors stress in section 2.2 that the resulting link function is asymmetric - contrasting this to the logistic link-function, which is symmetric. However, can't we make it asymmetric by merely introducing bias parameters, as commonly used in GLMs/MLPs?

It seems that this is an interesting remark, but it does not deserve a whole section dedicated to it.

The proposed generative model for labels introduces an intermediate non-linearity between the features and the labels, Eq. (8). This is also pointed-out as one of the innovations of their model (Sec. 6). This extra non-linearity makes the model similar to an MLP with one hidden layer. It would be interesting if the authors had discussed this relationship in more details or even provided an empirical comparison on the same datasets.

Q2: Please summarize your review in 1-2 sentences

The paper is very well written, the derivations are clear and the experiments show compelling results.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

This paper is about a new Bayesian method for multi label learning. The goal is to classify accurately in settings where there are many potential labels but only a few of them apply to each data point.

The basis of the new results is a new generative model for the label vector of each example. Specifically the label vector y_n of the n-th example is generated as

y_n = f(V(\sigma(Wx_n)),

where Wx_n is a lower dimensional projection of the n-th instance x_n, followed by an element-wise sigmoid activation \sigma. The final operation f corresponds to drawing Poisson random variables with rates given by V(\sigma(Wx_n)) and thresholding these so-called latent counts by taking the minimum with 1.

The inference problem boils down to (heuristically) estimating various parameters of the model using Gibbs sampling and EM. In doing so, the authors observe that the latent counts are sufficient statistics of the model parameters.

Evaluation: I found the generative model interesting and natural as it combines low-rank approximations with non-linearities in an interesting way. It can also be interpreted as a 3 layer neural net.

I'm not sure however that inference in this model is really feasible on larger problems. The authors evaluate the approach on four tiny data sets. The largest one has fewer than 20k samples and one has as few as 40 test samples. It's not clear to me that these benchmarks are representative of real problems. The authors have yet to demonstrate that this approach will be competitive with, say, WSABIE on even medium size data sets (doesn't fit in memory but on a single machine). In particular, I don't follow the narrative in this paper which repeatedly emphasizes the "large-scale" or "massive data" aspect when there isn't even a medium data benchmark.

At any rate, the multi label problem is also interesting for small data and in this regime the approach appears to be competitive with or better than state of the art.

Complementing stronger experimental results, it would also be interesting to have more theoretical support for the inference method. Currently the approach is heuristic.

Small remark:

- The g() function in (10) appears to be a typo. If it's supposed to be there, please define it.

Q2: Please summarize your review in 1-2 sentences

A new generative model for multi-label learning via Bayesian inference. Experimental results on small data sets are encouraging.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

The authors present a novel learning framework for multi-label prediction problems.

The framework (BMLPL) is a generative approach, which utilizes a nonlinear feature extraction of observed features to generate label vectors.

The approach depends only on positive labels, which proves useful in contexts where the absence of a label could either mean it is negative or unobserved.

Importantly, the authors show that their method scales only with the number of positive labels - a useful feature for sparse label matrices.

The authors then propose

efficient inference methods based on Gibbs sampling and EM, respectively.

The paper provides a robust explanation of the proposed method, including the presentation of an asymmetric link function that more rapidly approaches 1, making it less reliant on negative labels.

Furthermore, the connection to topic models is an interesting one and provides the added benefit of interpreting the mapping of labels into lower dimensional space.

The author(s) conclude with experimental comparisons to other label-embedding approaches.

Results show superior AUC to prior work, although the comparison to LEML (Yu et al.) does not appear to be significant.

It would be helpful if the authors explore this comparison in more detail.

Furthermore, while the authors claim their method is more robust to label imbalances, there is very little explanation of how this arises from their framework, and no justification for the claim in the experimental results.

Among the 4 datasets utilized, the authors should be able to extract this information.

The experiments further show that BMLPL performs well when labels are removed, showing the benefit of not relying on negative labels.

Lastly, the results show that EM using conjugate gradient approximations efficiently arrives at accurate solutions compared to complete EM and Gibbs sampling.

One criticism is that the computational run time of BMLPL, a major advantage according the authors, is not compared to other methods either analytically or experimentally.

Regardless, the paper provides an intriguing new direction in multi-label prediction.

Q2: Please summarize your review in 1-2 sentences

The paper is interesting from a multi-label prediction perspective but requires more work in terms of experiments.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Nice understandable paper that I can imagine enjoying in person. The use of a Bayesian model with a truncated Poisson makes sense to me.

I don't find AUC tables entirely convincing, though. Even more so here because the differences in AUC, especially for LEML and BMLPL, aren't huge. Plus AUC doesn't give any insight into the errors and misclassification rates.

Q2: Please summarize your review in 1-2 sentences

nice paper

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

The paper describes a generative approach towards multi-label classification that uses only positive data, while making no further assumption about unknown versus negative labels. This leads to fast learning algorithm for sparse positive data and avoids the problem of balancing classes.

The method exploits correlations between classes through a latent embedding using a poissonian factor model. The core part of the contribution hinges on work from Zhou, Hannah, Dunson and Carin (2012) that study the Poisson-Factor model and suggest a trick for Multinomial sampling and derive a Gibbs sampler.

The paper under review applies the Poisson Factor model to the task of multi-label classification with the following extendsions:

- extension to a cloglog link function

- restriction of Poissonian to exclude zeros

- adaptation of Zhou et al's Gibbs sampling to match the link function using Polya Gamma sampling

- two alternatives to Gibbs sampling: EM algorithm and EM with Conjugate Gradient method.

The paper evaluates the runtime performance versus prediction performance (AUC) on four data sets with varying amounts of classes, documents, and positive annotations. Additionally, the performance under missing is studied, demonstrating state-of-the-art performance

The author compare to recent, state-of-the-art methods, such as WSABIE, BCS, and LEML.

Strong Points:

- Sound method, with sound basis from Zhou et al, with significant extensions

- Important machine learning problem, with state-of-the art results

- An appropriate amount of related work is discussed

Q2: Please summarize your review in 1-2 sentences

Great paper for a relevant problem.

Author Feedback

Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.

We thank the reviewers for the careful reading and useful feedback. We will incorporate the suggestions in the final version. Our response to the reviewers' specific questions/comments is given below.

**Assigned_Reviewer_1**

- Our framework doesn't have any restrictions on the relative values of D and L. In fact, the number of labels L could potentially be very large (D can be very large as well).

- Our model enjoys full local conjugacy which makes it amenable to Gibbs sampling. Furthermore, the analytically available expectations of the latent variables allow efficient EM inference. As we discussed in section 8, it is also possible to extend these inference algorithms to the online setting to scale our framework to even larger data sets.

- We could not report the run times because the implementations of our model and the other baselines weren't directly comparable (MATLAB vs C/C++). However, our models are expected to be considerably more efficient than the other baselines (especially when the label matrix is sparse and massive) because the likelihood only needs to be evaluated on the nonzero labels (please also see the response to Assigned_Reviewer_5).

**Assigned_Reviewer_3**

- In the final version, we will include another experiment on a synthetic data to illustrate the impact of cloglog.

**Assigned_Reviewer_4**

- Following the reviewer's suggestion, we will make the discussion of label imbalance (section 2.2) shorter. Adding a bias parameter to the logistic can model the asymmetry. Our Bernoulli-Poisson model however naturally captures it (and in addition leads to computational advantages).

- As suggested, we will discuss the relationship of our nonlinear embedding based model with MLP with a hidden layer.

**Assigned_Reviewer_5**

- Quantitative improvements over LEML: LEML is one of the state-of-the-art methods for multilabel prediction and our proposed model performs comparably/better than LEML in almost all the cases. Moreover, our model offers several other advantages over LEML: (1) while the label embeddings learned by LEML are linear, our model can learn *nonlinear* embeddings; and (2) our model also produces more interpretable results, e.g., inferring topics over labels, which can be useful in its own right for many applications (e.g., learning a topic model conditioned on document features). Another key difference is that while LEML is an optimization based method, ours is a fully Bayesian approach, which makes it more suitable for other tasks such as active learning on the labels.

- More justification of the robustness against label imbalance: In the final version, we will add a synthetic data experiment and compare with a logistic model, and include a discussion.

- Computational running time: LEML has a C/C++ implementation so we could not directly compare its running time with our model's MATLAB implementation, but we will reimplement LEML in MATLAB and, in the final version, we will include an experiment reporting the timing comparison. Note however that, analytically, as we discussed in section-4, the running time of our model depends not on the label matrix size but only on the number of positive labels (in contrast, a logistic model for the labels will have a cost that depends on the size of the label matrix), and the dependence of the run-time on the other quantities (N, D, K, number of nonzero features in the X matrix, etc.) are similar as LEML.

**Assigned_Reviewer_6**

- AUC is a useful metric (and used in several recent papers on multilabel learning) because in many problems we care more about ranking the positive labels higher than the negative labels. AUC is also less sensitive to label imbalance. In the final version, in a separate appendix, we will also add some other metrics such as precision@k.

**Assigned_Reviewer_7**

- Larger scale problems: Our experiments were on moderately large data sets, if not extremely large. We would like to point out that all our experiments were done using a standard laptop with modest hardware (8GB RAM). Other existing methods can be considerably slow on such hardware (especially if the label matrix is massive). On more advanced hardware, our model is readily scalable, easily parallelizable, and can be applied to much larger data sets. Finally, the EM algorithm we proposed can be easily extended to an *online* EM implementation, which can be another way to make our method scale to much larger problems.

- About the theoretical support for the inference method: Our model admits closed-form Gibbs sampling (which provably will converge to the true posterior). The EM algorithm is also guaranteed to converge to a local optima.

- We used 'g' to denote a nonlinear mapping from the label space to the lower-dimensional space. Here the function 'g' represents a combination of the conditioning on the features and the Bernoulli-Poisson link. We will make this clearer in the final version.