Paper ID: 668
Title: Fast Classification Rates for High-dimensional Gaussian Generative Models
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I don't have many comments, other than to alert the authors to this manuscript: Q. Mai and H. Zou, "A Note on the Connection and Equivalence of Three Sparse Linear Discriminant Analysis Methods," Technometrics, vol. 55, pp. 243-246, 2013. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/00401706.2012.746208

which demonstrates that several "different" sparse discriminant algorithms are, in fact, the same. Thus, this contribution is perhaps even more useful.
Q2: Please summarize your review in 1-2 sentences
The authors provide novel misclassification rates under perhaps the simplest high-dimensional classification problem, comparing several different approaches.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
+ Well-written paper; claims are clear.
+ Best convergence rates for the logistic regression classifier under the high-dimensional sampling regime.
- The usefulness for machine learning is left stranded.
- The end of the paper is sloppy and finishes quickly, while there is room for a conclusion and a discussion.
Q2: Please summarize your review in 1-2 sentences
This paper presents convergence rates for linear discriminant models under zero-one loss for high-dimensional classification problems. The main result, for the logistic regression case (Theorem 2), is a convergence rate that is inversely proportional to the training set size and depends on the number of covariates and the sparsity of the Bayes classifier.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I have read the paper "Fast Classification Rates for High-dimensional Conditional Gaussian Models".

The paper studies the problem of binary classification using a Gaussian model and provides some theoretical results on the convergence of the classification error rates (compared to the Bayes classifier).

The paper presents some nice theoretical results and is interesting to some extent. I am generally positive about the paper, but I have the following concerns.

First, regarding practical relevance. One very important point about classification is that we need a tuning-free trained classifier; when a classifier has tuning parameters that need to be tuned, we usually use data splitting to decide such tuning parameters. Therefore, it is more relevant to study either the error rates of a tuning-free classifier, or those of a classifier with tuning parameters selected by data splitting. This is important because, while the convergence rate under the "ideal tuning parameters" may be fast, the convergence rate of the classifier that is actually used could be very slow, simply because the convergence of the data-driven tuning parameters to the ideal tuning parameters is slow. In terms of this, the paper needs to (a) compare more carefully with recent tuning-free classifiers, and (b) address the effect of selecting the tuning parameters by data splitting.

Second, the paper puts all its focus on the convergence rate and does not contain any insightful comparison of computing speed. The penalization methods the authors present are all computationally slow when $p$ is large. It is desirable to have a detailed comparison of computational complexity or speed.

Third, given that this is a well-studied topic, it is desirable to compare the different methods on at least 1-2 real data sets (say, microarray data sets). The methods that need to be compared include those discussed in the paper [3,7,13,15,17], if easy to implement, as well as the support vector machine (SVM) and random forest.

Fourth, the paper comments that assuming $\Sigma^{-1} (\mu_1 - \mu_0)$ is sparse is weaker than assuming both $\Sigma^{-1}$ and $(\mu_1 - \mu_0)$ are sparse. Mathematically, this is true. However, the more relevant point is whether there is an important application example (say, in genomics or genetics) where $\Sigma^{-1} (\mu_1 - \mu_0)$ is sparse but either $\Sigma^{-1}$ or $(\mu_1 - \mu_0)$ is non-sparse. If we don't have such application examples in areas many people care about, then arguing how the two assumptions differ from each other is largely a matter of mathematical curiosity, not of scientific interest.

Last, the presentation can be improved. The material can be streamlined so that the main contributions are better highlighted.
Q2: Please summarize your review in 1-2 sentences
Good paper in terms of theory, but the authors seem to have very little concern for practical matters (tuning parameter selection, computation speed, real-data analysis). This largely down-weights the importance of the paper.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper establishes a framework for evaluating the error of the linear discriminant approach under conditional Gaussian models. It presents a faster rate of convergence than all existing results, without using standard assumptions on separation or low noise. The paper ends a bit suddenly; a better simulation study and a discussion section would be good additions.

Well written. Clear. Better results.
Q2: Please summarize your review in 1-2 sentences
It presents a faster rate of convergence than all existing results without using standard assumptions on separation or low noise. A better simulation study and a discussion section would be good additions.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their kind comments regarding the novelty and importance of our work.
Some reviewers felt that a Conclusion and Discussion section would greatly add to the paper: we will be sure to add this in the camera-ready version (incorporating, in part, the discussions in the responses below).


AR2:

We thank the reviewer for pointing out the related work by Mai and Zou.
It would be interesting to look into the relationship between those various methods and l1-regularized logistic regression in future work.

AR3:

We thank the reviewer for comments on the presentation of our paper.
We will make relevant edits in the camera ready version.

AR4:

Regarding choosing the tuning parameter (\lambda) in l1-regularized logistic regression: in practice, \lambda is typically chosen using cross-validation or via information-theoretic criteria such as BIC. Providing error bounds for such practical selections of \lambda is still an active area of research, and certainly an interesting line of future work.
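For illustration, a minimal sketch of this standard practice using scikit-learn's cross-validated l1-penalized logistic regression; the toy data, grid size, and fold count are illustrative assumptions, not the procedure analyzed in the paper:

    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    # Toy data standing in for a training sample (X, y).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    y = (X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(200) > 0).astype(int)

    # Cs is a grid of inverse regularization strengths (C = 1/lambda);
    # 5-fold cross-validation selects the value with the best held-out accuracy.
    clf = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear",
                               scoring="accuracy").fit(X, y)
    print("selected lambda ~", 1.0 / clf.C_[0])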

Regarding the tradeoff in computational complexity when using logistic regression with an l1 penalty relative to other related work: we agree that an explicit tradeoff and comparison of computational complexities is an interesting question. It was slightly beyond the focus of this work, which was on statistical rates, but we will definitely expand upon this in an extended version.

Regarding motivating the sparsity assumption from an application perspective: we will do so in the camera-ready version. In particular, many social network and bioinformatics applications have dense connected components in the inverse covariance graph: here, though the inverse covariance graph is not sparse, our less stringent sparsity assumption would still hold.
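As a toy numerical illustration of this point (our own sketch, not an example from the paper): take a dense equicorrelation covariance (so $\Sigma^{-1}$ is dense) and set $\mu_1 - \mu_0 = \Sigma w$ for a sparse direction $w$; then the mean difference is dense as well, yet $\Sigma^{-1}(\mu_1 - \mu_0) = w$ remains sparse.

    import numpy as np

    p, rho, s = 50, 0.5, 3
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))  # equicorrelation: dense inverse
    w = np.zeros(p); w[:s] = 1.0                            # sparse Bayes direction
    mu_diff = Sigma @ w                                     # mu_1 - mu_0 (dense)

    beta = np.linalg.solve(Sigma, mu_diff)                  # Sigma^{-1} (mu_1 - mu_0) = w
    print(np.count_nonzero(np.abs(beta) > 1e-8))                  # 3    (sparse)
    print(np.count_nonzero(np.abs(np.linalg.inv(Sigma)) > 1e-8))  # 2500 (dense inverse covariance)
    print(np.count_nonzero(np.abs(mu_diff) > 1e-8))               # 50   (dense mean difference)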

AR5:

We thank the reviewer for their kind comments on our paper.

AR6:

Regarding the usefulness of our work to machine learning:
(a) Theoretical contributions: we provide a "direct" approach to analyzing 0-1 classification error rates, rather than the traditional approach of bounding the log-likelihood loss and then using transform inequalities to bound the 0-1 classification error.
(b) Methodological contributions: we systematically studied linear discriminant classifiers under a conditional Gaussian model, and showed that the simple approach of logistic regression with an l1 penalty statistically outperforms other linear discriminant approaches, including many *recently proposed* ones. This has strong consequences for practical users of machine-learning-based classification techniques.



AR8:

Our paper focuses on the theory of linear discriminant classifiers under conditional Gaussian models, and experiments were only used to demonstrate our theoretical results. In particular, the experiments section presented synthetic simulations whose main purpose was to show that the bounds we obtain are tight. We plotted the 0-1 error rate against each parameter (sparsity (s), dimension (p), number of samples (n)), and showed the stacking of the curves against the control parameter for different (n, p, s) settings.
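For concreteness, a minimal sketch of this kind of synthetic check (our own illustrative code, not the paper's experimental setup; the identity covariance, class means, and regularization level are assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def excess_error(n, p, s, n_test=20000, C=1.0):
        """Excess 0-1 error of l1-logistic regression over the Bayes rule (identity covariance)."""
        w = np.zeros(p); w[:s] = 1.0        # sparse discriminant direction
        mu = 0.5 * w                        # class means at +mu / -mu, equal priors
        y = rng.integers(0, 2, n)
        X = rng.standard_normal((n, p)) + np.where(y[:, None] == 1, mu, -mu)
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)

        yt = rng.integers(0, 2, n_test)
        Xt = rng.standard_normal((n_test, p)) + np.where(yt[:, None] == 1, mu, -mu)
        err = np.mean(clf.predict(Xt) != yt)
        bayes = np.mean((Xt @ w > 0).astype(int) != yt)   # Bayes rule: sign of w^T x
        return err - bayes

    # Excess error should shrink with n for fixed (p, s), as in the curves described above.
    for n in (100, 200, 400, 800):
        print(n, round(excess_error(n, p=200, s=5), 4))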