NIPS 2016
Mon Dec 5th through Sun Dec 11th, 2016, at the Centre Convencions Internacional Barcelona
Paper ID: 2174
Title:An urn model for majority voting in classification ensembles

Reviewer 1

Summary

In this paper, the authors describe a new algorithm for dynamic ensemble pruning for classification tasks. The main idea of the algorithm is to stop querying further classifiers of an ensemble once the probability that the current classification result for an instance will coincide with the result obtained by querying all classifiers of the ensemble reaches a preset threshold. This makes it possible to trade off classification accuracy against classification speed. In contrast to previous approaches, this algorithm models the ensemble voting process as an urn problem, where each classifier is represented as a marble in the urn and its classification result is modeled as the color of that marble. Querying a subset of classifiers can then be seen as a draw without replacement from this urn. This model makes it possible to take into account, for each dataset, the prior distribution of the class biases of the different classifiers. In their paper, the authors show that their solution is equivalent to the Statistical Instance-Based Algorithm (SIBA) when a uniform prior distribution over those class biases is assumed. Furthermore, they propose to use Monte Carlo estimation to obtain non-parametric, non-uniform prior distributions and compare their algorithm with SIBA. The authors report that the disagreement rate of their algorithm is closer to the preset threshold than SIBA's, which leads to a significant speed-up, since fewer classifiers need to be queried.
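To make the stopping rule concrete, here is a minimal sketch of how such a confidence might be computed in the binary, uniform-prior case (i.e., the SIBA special case described above). The function name, the prior handling, and the toy vote stream are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of an urn-based stopping rule for a binary ensemble,
# assuming a uniform prior over urn contents (the SIBA special case).
# Illustrative only; not the authors' implementation.
import numpy as np
from scipy.stats import hypergeom

def confidence_majority_unchanged(T, votes_a, votes_b, prior=None):
    """Posterior probability that the class currently leading among the
    queried classifiers is also the majority class of the full ensemble.

    T       -- total number of classifiers (marbles in the urn), odd
    votes_a -- votes observed so far for class A
    votes_b -- votes observed so far for class B
    prior   -- prior over the number of class-A marbles (length T + 1);
               defaults to uniform, which recovers the SIBA setting
    """
    t = votes_a + votes_b
    T_a = np.arange(T + 1)          # possible numbers of class-A marbles
    if prior is None:
        prior = np.full(T + 1, 1.0 / (T + 1))
    # Likelihood of observing votes_a class-A votes in t draws without
    # replacement, for each possible urn content T_a.
    likelihood = hypergeom(T, T_a, t).pmf(votes_a)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    p_a_majority = posterior[T_a > T / 2].sum()   # strict majority for A
    return p_a_majority if votes_a >= votes_b else 1.0 - p_a_majority

# Query classifiers one at a time; stop once confidence reaches 1 - alpha.
T, alpha = 101, 0.01
rng = np.random.default_rng(0)
votes_a = votes_b = 0
for queried, vote_is_a in enumerate(rng.random(T) < 0.7, start=1):  # toy votes
    votes_a += int(vote_is_a)
    votes_b += int(not vote_is_a)
    if confidence_majority_unchanged(T, votes_a, votes_b) >= 1 - alpha:
        break
print(f"stopped after {queried} of {T} classifiers")
```

In this setting, querying a classifier corresponds to drawing a marble, and the stopping rule simply thresholds the posterior probability that the currently leading class will remain the winner once all T votes are in.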

Qualitative Assessment

Technical quality: In my opinion, the proofs provided for propositions 2 and 3 are correct. The experiments performed seem suitable for comparing the proposed approach with SIBA with respect to disagreement rate and classification speed. One particularly nice aspect of the experiments is that the authors also compare their approach with a simpler dynamic pruning algorithm, which stops querying classifiers once the remaining votes can no longer change the result. However, it would have been nice if the authors had evaluated the reported measurements using statistical measures or tests. For instance, the authors argue that Table 1 shows that their approach produces disagreement rates closer to the specified one than SIBA does. This could be illustrated more easily by at least reporting the mean disagreement rate over all datasets. Also, by performing a statistical analysis the authors could potentially show significant differences between the disagreement rates of the two algorithms.

Novelty: This paper improves on SIBA in that it can adjust to the class bias of an ensemble instead of assuming a uniform distribution. It is novel in the way the authors model classification with an ensemble of classifiers as draws without replacement from an urn of marbles, and derive from this model a sound way to determine the confidence that a queried subset of classifiers will agree with the complete ensemble.

Potential impact: I believe this work can be interesting for many people, since it provides a sound way to trade off voting accuracy against speed in an ensemble of classifiers, which can be useful when dealing with large numbers of instances. It also has broad applicability, since the algorithm is not limited to random forests but could also be applied to an ensemble of different classifiers.

Clarity and presentation: Overall, the presentation of the paper is good and the work is clearly described and explained. The main intuition behind the proposed algorithm, in particular, is well explained, and the shift from an urn of marbles to an ensemble of classifiers is handled nicely. The mathematical notation as well as the proofs in section 2 are very clear and rather easy to follow. It is helpful that the authors state the Chu-Vandermonde identity, upper negation, and the symmetry property of the binomial coefficient, which they use to prove propositions 2 and 3. However, there appear to be two minor editing mistakes in the proofs. In the equation under line 124, it is unclear how the authors arrived at the expression on the left-hand side. In my opinion, it would have been enough to copy the right-hand side of the equation under line 123, since applying upper negation to this expression yields the right-hand side of the equation under line 124. Additionally, there is a plus sign instead of a minus sign in the upper bound of the second sum in the equation under line 138. The current upper bound is $T - \sum_{i=1}^{l-2} T_{i}^{*} + t_{l}$, but it should be $T - \sum_{i=1}^{l-2} T_{i}^{*} - t_{l}$ instead.

One issue in the presentation of the paper is the plots. While the plot of the vote distribution is good, the plots depicting the disagreement rates should be revised. According to the values on the x-axis and the description in the paper, the x-axis should be labeled "1 - alpha" instead of "alpha". Additionally, it is impossible to distinguish the different lines in these plots in a black-and-white printout. Therefore, I would recommend using different line types to make it clearer which line belongs to which label, without relying on different marker types and colors.

Throughout the paper, the authors refer to the sampling technique used as "Mortecarlo sampling". However, I believe this should be changed to "Monte Carlo sampling" or properly referenced as a different sampling technique. Other minor spelling issues are:
line 7: "to" missing after "possible"
line 36: "on" instead of "of"
line 71: repetition of the word "marbles"
line 98: "than" instead of "that"
line 108: "infinite" instead of "infinity"
line 131: "by" missing after "simplified"
lines 255, 258, 263: "a" instead of "an"
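For reference, the three binomial-coefficient identities invoked in these proofs are the standard ones (stated here from memory, not copied from the paper):

```latex
% Standard identities used in the proofs of propositions 2 and 3.
\begin{align*}
  \sum_{k=0}^{n} \binom{m}{k}\binom{p}{n-k} &= \binom{m+p}{n}
    && \text{(Chu-Vandermonde identity)} \\
  \binom{r}{k} &= (-1)^{k}\binom{k-r-1}{k}
    && \text{(upper negation)} \\
  \binom{n}{k} &= \binom{n}{n-k}, \qquad n \in \mathbb{N}
    && \text{(symmetry)}
\end{align*}
```

Note that upper negation holds for an arbitrary (non-integer) upper index $r$, which may be why the paper allows $r$ to range beyond the integers.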

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The paper gives a sampling method for speeding up the evaluation of ensembles based on exact computations of probabilities.

Qualitative Assessment

The main problem of speeding up the evaluation of ensembles is a useful one. The proposed approach of using exact calculations and an estimated prior seems sensible, if somewhat incremental and rather standard. The experiments were decent. The presentation was generally clear.

I wonder if the calculations and computations would be easier if the trees were sampled with replacement rather than without. Perhaps there would be a decrease in performance, but possibly not. Also, perhaps in this case everything could be simplified and approximated using Sanov's theorem? I'm afraid I did not understand the significance of Proposition 3, which would seem to be a main result. Is the point that this is helpful in computing these probabilities? The paper should definitely compare to Reyzin ("Boosting on a budget").

Other small comments:
line 71: "marbles marbles" (this seems to happen elsewhere). In general, the paper needs a thorough proofreading for typos, grammar errors, etc.
line 72: It seems that T is meant to be a vector, but the notation indicates that it is instead a set of numbers.
line 105: It seems very odd to use 1/||T|| as notation for the number of possible T vectors, since this notation suggests that the quantity is a function of T, which it is not.
line 116: Why is r allowed to be a complex number?
line 141.5: What does the notation "==...=" mean?
line 147: I'm not sure exactly what is meant by "out-of-bag" in this context. This seems different from what out-of-bag usually means in the context of random forests.
line 152: Sliding window over what?
line 174: I assume the authors mean Monte Carlo? If not, what is meant by "Mortecarlo"?
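As a quick numerical aside on the with-replacement question (my own check, not from the paper): modeling the draw with replacement replaces the hypergeometric vote distribution with a binomial one, and the two are easy to compare directly:

```python
# Compare the without-replacement (hypergeometric) vote model with a
# with-replacement (binomial) approximation. Numbers are illustrative.
from scipy.stats import binom, hypergeom

T, T_a, t = 101, 60, 30               # ensemble size, class-A marbles, votes queried
without_repl = hypergeom(T, T_a, t)   # exact urn model (no replacement)
with_repl = binom(t, T_a / T)         # with-replacement approximation

for k in (12, 15, 18, 21):
    print(f"k={k}: hypergeom={without_repl.pmf(k):.4f}, "
          f"binom={with_repl.pmf(k):.4f}")
```

For small ensembles the hypergeometric distribution is noticeably tighter than the binomial (drawing without replacement reduces variance), which is consistent with the reviewer's suspicion that with-replacement sampling would simplify the calculations at some cost in accuracy.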

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

The paper proposes an urn model for choosing the minimum number of classifiers that must be queried so that the final decision matches that of querying the whole ensemble. The contribution lies in the incorporation of a non-uniform prior for the T vectors (the contents of the urn). The experimental studies demonstrate the advantage of using a non-uniform prior over a uniform one, as well as over querying the whole ensemble of a random forest.

Qualitative Assessment

There is a slight contribution over an existing approach for choosing the number of classifiers to combine. Nevertheless, the approach seems promising and ultimately queries fewer classifiers than the previous approach (SIBA). The paper is sound and well written; however, I have some comments:
- Since the main contribution is the addition of a non-uniform prior for T, section 2.2 should be explained better. For example, how are the T^n_i mapped to the T_i? And what is the 'sliding window' in line 152? What is it sliding over?
- Add a statistical significance test when comparing SIBA and Hyper in Table 2.
- Add the number of classes (and possibly instances) for each of the datasets in Table 1. Are they all binary? If so, could you add some datasets with more classes?
- When using look-up tables for each class (the non-uniform case), what is the stopping condition? Is it when the minimum number of votes for a class is pooled?
- Minor comments and typos: repeated words "marbles marbles" and "where where"; line 98: "that" -> "than"; line 140: "solved by"; line 195: "over all" -> "overall"; the text in Figure 1 is not legible.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

The authors propose an intuitive mathematical model of the voting process in ensembles of classifiers using the hypergeometric distribution. They show speed-up results on various UCI and synthetic datasets.

Qualitative Assessment

There is no mention of how big the datasets are, but I think the UCI datasets are not that big, so it is hard to say anything about the algorithm based on the speed-ups reported. I would like to see results on large datasets for any meaningful comparison.

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 5

Summary

The paper deals with classification ensembles of uniformly weighted classifiers using a voting scheme. The goal of this work is to improve classification speed by omitting the evaluation of some subset of classifiers as long as the final class label is already determined within a specified confidence level. This work improves on a previous publication (Hernandez-Lobato et al., 2009) by adding an informative prior distribution to the model, where previously a non-informative uniform prior was used. The prior is learned from the training set using an empirical Bayes approach. The experiments, performed on 21 datasets (mostly UCI), show a decrease in the average required number of classifiers (standard random forest ~64%, Hernandez-Lobato et al. ~19%, proposed ~16%), while also generally being better at hitting the specified confidence level.
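As a hypothetical illustration of the empirical Bayes step (my own reading of the approach; the function and names below are assumptions, not the authors' code), the non-parametric prior could be tallied from full-ensemble vote vectors on training instances roughly as follows:

```python
# Hypothetical sketch of estimating a non-parametric prior over ensemble
# vote vectors from training data; names and structure are assumptions,
# not the authors' implementation.
from collections import Counter

def estimate_vote_prior(ensemble, X_train):
    """Empirical distribution of full-ensemble vote-count vectors.

    ensemble -- list of fitted classifiers, each with a .predict method
    X_train  -- iterable of training instances
    Returns a dict mapping each observed vote-count vector (a sorted
    tuple of (class, count) pairs) to its empirical frequency, usable
    as a prior over urn contents.
    """
    tally = Counter()
    for x in X_train:
        votes = [clf.predict([x])[0] for clf in ensemble]
        key = tuple(sorted(Counter(votes).items()))
        tally[key] += 1
    n = sum(tally.values())
    return {key: count / n for key, count in tally.items()}
```

A smoothed or Monte Carlo variant of this tally would avoid assigning zero prior mass to vote vectors that happen not to occur in the training set.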

Qualitative Assessment

The paper is well written and clearly laid out. The improvements over the referenced state of the art are shown consistently, but may not be terribly exciting when put in practical terms: on average, the proposed method requires the evaluation of only about 3% fewer base classifiers of the ensemble than the reference method. Yet, the experiments also indicate that the proposed method may perform very close to the theoretical optimum. For the evaluation, I missed some summary statistics over all datasets (Tables 1 & 2). Some spelling errors should also be fixed for the camera-ready version (line 24 "performace", line 27 "off-line" and "online", lines 174, 237, 245 "mortecarlo" -> "Monte Carlo").

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

The paper describes an algorithm to speed up the majority voting process in classifier ensembles. The improvement is achieved by reducing the number of classifiers needed to achieve performance comparable to that of the full ensemble. Results are compared with the SIBA methodology.

Qualitative Assessment

1. Regarding "Generally the different subsystems are combined by means of a voting process...": you are leaving out all papers related to aggregation operators. I think they are relevant enough to warrant avoiding such general, imprecise statements.
2. Figure 1 is barely readable, and its captions are not self-explanatory.
3. Although the results are very interesting, the applicability of the methodology can be questioned, since it has been demonstrated only with one particular type of ensemble, namely random forests. It would be great if the authors could provide comparable results with another type of ensemble.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)