
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present an interesting approach for
estimating multiple one-class SVMs together instead of independently from
each other. The paper is well written and structured.
- In (5)-(6)
the coupling between the different submodels is clear within the dual
problem. However, it is less clear in the primal formulation. Please also
explain how in (2) one can see the coupling between the submodels. On
which of the unknown variables does this take place?
- Could the
hierarchy constraint also be interpreted as a form of structural risk
minimization?
- In Figure 1 a nice example is given with 4
hierarchical MV-sets. However, the user must choose the number
4 beforehand. Is it also possible within the framework to treat this
choice as a model selection problem? (as e.g. in clustering: Alzate C.,
Suykens J.A.K., Hierarchical Kernel Spectral Clustering, Neural Networks,
vol. 35, Nov. 2012, pp. 21-30)
- Though the authors use many data
sets, it is not clear how the kernel parameter (or any
other parameter) has been selected in the different methods. This is a rather important
aspect.
- It would be good to also discuss differences with the
following references:
I. Steinwart, D. Hush, and C. Scovel, A
classification framework for anomaly detection, J. Mach. Learn. Res., vol.
6, pp. 211-232, 2005.
A. Rakotomamonjy, M. Davy, One-class SVM
regularization path and comparison with alpha seeding, Proc. of 11th
European Symposium on Artificial Neural Networks, Bruges, 2007

There appears to be an error in the Gaussian kernel in footnote 2: .^2
instead of (.)^2
The authors have carefully responded to the
questions. I have updated my score accordingly.
Q2: Please summarize your review
in 1-2 sentences
Interesting approach for estimating multiple one-class
SVMs in a coupled way. The model selection aspects should be clarified,
including for the comparisons.
Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Title: q-OCSVM: A q-Quantile Estimator for
High-Dimensional Distributions
Summary: This paper proposes an
unsupervised method for learning the quantiles of a distribution. The
idea is based on the one-class SVM. The major contribution of the paper
seems to be the fact that multiple quantile levels can be trained
at once. The experiments on UCI datasets suggest that q-OCSVM works
better than training multiple independent OCSVMs.
The paper is
well-written. Extending the one-class SVM to deal with multiple levels of
"novelty" seems to be a good idea. However, I feel that the
algorithmic contribution is too incremental. The experimental results
seem encouraging.
The bar graphs in Figure 3 and Figure 4 are
difficult to read. Perhaps change them to color graphs?
The choice of the
Gaussian width can be crucial. Fixing hyperparameters to certain
values is not convincing: it is important to perform cross-validation
over the hyperparameters. Q2: Please
summarize your review in 1-2 sentences
An extension of one-class SVMs for learning several
quantiles. The experimental results seem encouraging, but they could have
been obtained more carefully. Submitted by
Assigned_Reviewer_10
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The one-class SVM (OCSVM) algorithm, due to
Scholkopf et al., finds a hyperplane in feature space that maximizes the
separation between the training set and the origin. The original OCSVM
paper showed that, for suitable choices of kernel and slack variable
penalty, the function returned by OCSVM is an indicator of the q-support*
of the data distribution, i.e., a set of minimum volume that contains a q
fraction of the distribution. A good way to characterize a distribution is
to estimate its q-support for many values of q, and one could simply run
OCSVM many times to generate these estimates; this paper calls that
algorithm I-OCSVM. However, these estimates won't necessarily be nested,
while the actual q-supports must be. This paper describes an extension of
the OCSVM algorithm, called q-OCSVM, for computing nested estimates of the
q-supports of a distribution. Other papers have also proposed extensions
of OCSVM with this property, but this paper's approach to enforcing the
nesting constraint is very simple: the learned hyperplanes are required to
be parallel, i.e., they differ only in their bias terms. The paper modifies
the OCSVM optimization problem to include this parallelism constraint, and
shows that the dual is still solvable in polynomial time for very
high-dimensional feature spaces, via the kernel trick. The theoretical
results in Section 4, which prove that the q-OCSVM algorithm
asymptotically estimates the q-supports, are straightforward extensions of
the results found in the original OCSVM paper. The experimental results
show that the q-OCSVM algorithm outperforms existing algorithms, including
I-OCSVM, on several datasets, though sometimes the improvement is small.
Some might regard the simplicity of the q-OCSVM extension as a
reason not to accept the paper, but I view it as a point in the paper's
favor. However, I do wonder whether an even simpler approach might work
too. For example, I accept that forcing all the hyperplanes to be parallel
acts as a kind of regularizer, and thus helps with overfitting. But
couldn't this effect be achieved with the I-OCSVM algorithm by just
increasing the penalty on the norm of the weight vector? Admittedly, this
still would not guarantee that the estimated q-supports are nested. Which
leads to a question: does I-OCSVM intersect the q-supports that it learns,
as suggested in the Introduction? Also, is q-OCSVM faster to train than
I-OCSVM? It seems like it might be, since it is one QP and not several QPs.
If so, that would be worth mentioning. Finally, instead of setting \nu = 1
- \alpha, what if we used a held-out validation set to choose a value for
\nu that optimizes the coverage ratio? It seems that might improve the results
for all algorithms.
The results in Figures 2 and 3 are averaged
over 61 and 21 datasets, respectively, but the plots do not contain error
bars, so it is hard to judge the significance of the results. On the other
hand, the fact that the q-OCSVM algorithm dominates the others for all
values of q is encouraging. Also, I think the left- and right-hand sides of
each figure contain the same information, and thus seem redundant.
The paper is clear and well-written, and the relationship to
previous work is well-explained. Figure 1 is a very helpful illustration.
For completeness, it might be good to cite some work on estimating the
quantiles of a _conditional_ distribution. Search for "nonparametric
quantile regression" or "kernel quantile regression".
* The term
"q-support" is mine, not the paper's. Q2: Please
summarize your review in 1-2 sentences
The paper shows that a very simple extension of the
one-class SVM outperforms existing methods for computing nested estimates
of all the q-supports of a high-dimensional distribution. I think it is an
interesting contribution.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
With regard to the choice of the kernel width:
Parameter tuning for one-class classifiers, in particular for OCSVMs,
is an ongoing research area. Unlike in binary classification tasks, negative
examples are not available to estimate the optimality of the solution.
Consequently, we employed the common practice (Chih-Chung Chang and Chih-Jen
Lin. LIBSVM: a library for support vector machines, 2011; Azmandian et al.,
Local Kernel Density Ratio-Based Feature Selection for Outlier Detection,
JMLR, 2012) of using a fixed width, divided by the number of features
(mentioned in line 297). An approach similar to ours was also used in the
competing HMVE paper; a constant bandwidth was used by Lee and
Scott (2011). In future studies, it would be interesting to consider
alternative optimization criteria that allow tuning parameters with
cross-validation; for instance, as suggested by the reviewers, using the
average coverage ratio over all quantiles as an optimality criterion. A
note about this issue will be added to the paper.
With regard
to the primal program: Coupling between the submodels in the primal
program (Equation 2) is achieved by summing $q$ OCSVM programs (Equation 1)
while enforcing these programs to share the same $w$. As a result, the $q$
half-spaces in the solution of Equation 2 differ only in their bias
terms, and are thus parallel to each other. A clarification note will be added
to the paper.
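As an illustrative sketch (the notation here is assembled from the description above, not copied from the paper's actual Equation 2), such a coupled primal could take the form:

```latex
\begin{aligned}
\min_{w,\;\rho_1,\dots,\rho_q,\;\xi}\quad
  & \frac{1}{2}\lVert w\rVert^2
    + \sum_{j=1}^{q}\Big(\frac{1}{\nu_j n}\sum_{i=1}^{n}\xi_{ij} - \rho_j\Big) \\
\text{s.t.}\quad
  & \langle w,\Phi(x_i)\rangle \ge \rho_j - \xi_{ij},
    \qquad \xi_{ij} \ge 0,\quad i=1,\dots,n,\; j=1,\dots,q,
\end{aligned}
```

where the single shared normal vector $w$ couples the $q$ subproblems, and each submodel contributes only its own bias $\rho_j$ and slack variables $\xi_{ij}$.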
With regard to the questions on structural risk
minimization and regularization: Yes, we believe the hierarchical
constraint acts as a regularizer in the $q$-OCSVM program, and thus the
risk of overfitting in the estimates can be reduced. Note that it is hard to
obtain the same results by increasing the penalty on the norm in the
I-OCSVM method, since the penalties are not linked as they are by the hierarchical
constraint, and thus their outcomes are less predictable. Furthermore, this
still would not resolve the absence of the $\nu$-property in the I-OCSVM
solution.
With regard to the intersections of OCSVM solutions:
The problem of intersecting hyperplanes in the solutions of OCSVMs is
known in the literature: Lee and Scott (2011) discuss it and show an
example that visualizes this problem; in the context of clustering
problems, it was shown theoretically and empirically that OCSVM solutions
are not hierarchically nested (Asa Ben-Hur, David Horn, Hava T. Siegelmann,
and Vladimir Vapnik. Support vector clustering. JMLR, 2002). As suggested
by the reviewers, a note about this issue will be added to the paper.
With regard to the number of quantiles ($q$): In our
experiments we learned 19 quantiles, since this nicely divides the [0,1]
range into intervals of size 0.05, a common statistical unit of measure.
We agree with the reviewer that in some setups, such as
hierarchical clustering, it would be useful to treat this choice of $q$ as a model
selection problem. A note about this issue will be added to the paper.
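To illustrate the two points above with a hypothetical sketch (this is not the paper's code): with parallel hyperplanes, every quantile region is a superlevel set of the same score function, so the 19 regions are nested by construction. The scores below are simulated stand-ins for the kernel decision values $\langle w, \Phi(x_i)\rangle$, and each offset is taken as an empirical quantile of the scores:

```python
import random

random.seed(0)
# Simulated decision scores, standing in for <w, Phi(x_i)> on 1000 points.
scores = [random.gauss(0.0, 1.0) for _ in range(1000)]

# 19 quantile levels dividing [0,1] into steps of 0.05, with nu_j = 1 - alpha_j.
alphas = [round(0.05 * j, 2) for j in range(1, 20)]   # 0.05, 0.10, ..., 0.95
nus = [1.0 - a for a in alphas]

# Offset rho_j chosen so that roughly a fraction alpha_j of points satisfies
# score >= rho_j (an empirical quantile of the shared scores).
sorted_scores = sorted(scores)
rhos = [sorted_scores[int((1.0 - a) * len(scores))] for a in alphas]

# Region j = {i : score_i >= rho_j}; one score function, many thresholds.
regions = [{i for i, s in enumerate(scores) if s >= r} for r in rhos]

# Because the rhos are non-increasing, the regions are nested automatically.
for small, big in zip(regions, regions[1:]):
    assert small <= big
```

Independently trained OCSVMs give each level its own score function, which is exactly why their regions can intersect; sharing the scores and varying only the threshold removes that failure mode.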
With regard to error bars: Note that our q-OCSVM was found to be
significantly better for each quantile separately (Wilcoxon, 0.05), so
error bars would provide no additional information. However, if the
reviewers still find them useful, we will add them to the paper.
We thank the reviewers for the additional references. We intend
to include a discussion of them in the paper.
 