Paper ID: 622
Title: Learning with Noisy Labels
Reviews

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors study the problem of binary classification in the presence of random, class-conditional noise in the training data. They propose two approaches based on a suitable modification of a given surrogate loss function and derive performance bounds. More specifically, they provide guarantees for risk minimization of convex surrogates under random label noise in the general setting, and without any assumptions on the true distribution. Moreover, they provide two alternative approaches for modifying a given surrogate loss function. Experiments are presented on synthetic data and some benchmark datasets.

The paper is well-written and appears to be technically sound. The theoretical results are non-trivial but not extremely profound either, especially in light of existing work on related settings.

What could be called into question is the relevance of the setting. Is the assumption of having constant class noise in the training data indeed a realistic one? This question appears to be legitimate despite the theoretical nature of the paper.
Q2: Please summarize your review in 1-2 sentences
The paper is well-written and appears to be technically sound. The theoretical results are non-trivial but not very profound either, especially in light of existing work on related settings. What could be called into question is the relevance of the setting.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Learning with Noisy Labels

The paper addresses the problem of binary classification in the situation
where the training labels are corrupted by class-conditional random noise.
The authors propose two surrogate-loss-based learning methods to address the
problem: the first exploits a simple symmetry condition on the loss function
to build a simple unbiased estimator of the non-noisy risk, and the second
promotes the use of a weighted 0-1 loss that comes from an appropriate
reduction of the problem of learning from noisy labels.
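
For concreteness, a minimal sketch of how the first construction can be
instantiated when the flip rates \rho_+ and \rho_- are known (the logistic
base loss and all names below are my own illustrative choices, not code
from the paper):

    # Hypothetical sketch: unbiased-estimator correction of a surrogate loss
    # under class-conditional noise with known flip rates
    # rho_plus = P(flip | y = +1) and rho_minus = P(flip | y = -1),
    # assumed to satisfy rho_plus + rho_minus < 1.
    import numpy as np

    def logistic_loss(t, y):
        """Base surrogate loss ell(t, y) for margin t and label y in {-1, +1}."""
        return np.log1p(np.exp(-y * t))

    def corrected_loss(t, y_noisy, rho_plus, rho_minus):
        """Loss on the observed (noisy) label whose expectation over the
        random flip equals the loss on the clean label."""
        rho_y = rho_plus if y_noisy == +1 else rho_minus      # flip rate of the observed class
        rho_other = rho_minus if y_noisy == +1 else rho_plus  # flip rate of the other class
        return ((1.0 - rho_other) * logistic_loss(t, y_noisy)
                - rho_y * logistic_loss(t, -y_noisy)) / (1.0 - rho_plus - rho_minus)

Minimizing the empirical average of such a corrected loss over the noisy
sample then targets the clean surrogate risk in expectation.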

The paper is a very clean and strong contribution. It provides original
theoretical results (e.g., learnability with convex surrogates in the
presence of noise, noise-tolerance of SVMs, and so on) as well as compelling
empirical results. Everything is wrapped up in a nicely written paper.

I essentially have questions on future directions:
- the authors mention adversarial noise: before going to this point, is there something that can be said
about learnability with monotonic noise, as defined by Bylander?
- what about richer noise models like Constant Partition Classification Noise (CPCN)?
Q2: Please summarize your review in 1-2 sentences
Very good paper, providing significant results on learning binary classifiers from noisy labels using convex surrogates. The technical results are important, the writing is good, and the experiments are compelling.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Overall, I found the paper quite well-written.

I think the assertion in the abstract (and then again in the Experiments
section) that you learn with 88% accuracy when 80% of the labels are
flipped is incorrect. You are only flipping 40% of the labels (the
probability of a flip is a convex combination of \rho_+ and \rho_-, not
their sum). In any case, the goal is to get better accuracy as the noise
rate goes to one half, not to one (where the problem is trivially easy:
just flip all the labels back).
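
To spell out the arithmetic (the 0.4 values below are only an illustration
of my point, not taken from the paper's tables): with flip rates
\rho_+ = \rho_- = 0.4 and positive-class prior \pi,

    P(\tilde{y} \neq y) = \pi \rho_+ + (1 - \pi) \rho_- = 0.4\pi + 0.4(1 - \pi) = 0.4,

i.e. 40% of the labels are flipped for every prior \pi, even though
\rho_+ + \rho_- = 0.8.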

I wonder if it is worthwhile to compare your methods with Kearns' SQ
setting. Since minimizing a convex loss can be done by gradient methods
(which have statistical analogues), you would get tolerance to random
classification noise for free. (Kearns does not allow class-conditional
noise, but I think that part can be handled easily.) I'm not sure what
kind of excess risk bounds you would get from such an SQ simulation.
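
To spell out what I have in mind (my own sketch, not a claim from the
paper): each gradient step on a convex surrogate risk, say for a linear
scorer f(x) = w^T x, only needs an expectation over the (noisy) data
distribution,

    \nabla_w R(w) = E_{(x, \tilde{y})} [ \nabla_w \ell(w^T x, \tilde{y}) ],

and an expectation of this kind is exactly what an SQ oracle answers up to
some tolerance, so the optimization could in principle be simulated without
ever touching individual labeled examples. What is unclear to me is the
excess risk bound such a simulation would give.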

Minor quibbles:
--------------
1. You use the term PU learning without ever defining it.
2. You use "so-called" very often, especially for the zero-one loss. Why is it
so-called? If you have an objection to the name, you should state it.

---
Update: Regarding the SQ model: even if you are using a surrogate loss, your optimization problem can be solved by an algorithm that only makes statistical queries rather than accessing individual data points.
Q2: Please summarize your review in 1-2 sentences
This paper considers the problem of learning in the presence of random
classification noise. In contrast with PAC-like models, the main point
here is that if the goal is to minimize some convex loss function (with
respect to the true, non-noisy distribution), this can be done, in many
cases, by suitable modifications even when the labels are noisy.

The paper also contains some experiments studying the proposed methods and
other related techniques.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
---Assigned_Reviewer_6---
We are not claiming that training data generated or obtained in the real world tend to have constant class noise. The noise model (CCN) has already been studied in the literature. While some results are known under certain assumptions on the data distribution, it is quite surprising that even under this simple noise model, no guarantees are known for minimizing surrogate losses with noisy training data, especially given that almost two decades have elapsed since the first work on classification with random noise. The paper not only serves to settle the question in the affirmative, but also helps advance our understanding of known algorithms such as biased SVMs.

---Assigned_Reviewer_7---
We thank the reviewer for the encouraging comments and very pertinent thoughts for further research. The CPCN model assumes that the example space is partitioned such that each partition has its own noise rate. Recent work by Ralaivola et al. establishes that if a function class is PAC-learnable in the uniform noise rate model (CN), it is also PAC-learnable under CPCN (thus showing that the models are equivalent under PAC learning). It would be interesting to see if our CCN learning algorithms can be cleverly used to learn under the CPCN model, akin to the PAC-learnability argument for CN = CPCN. Indeed, studying learnability under the CPCN model seems to be the right question to ask given our current results. On the other hand, the monotonic noise model defined by Bylander, where the noise rate decreases monotonically with the distance of an example from the target hyperplane, is more realistic. The CPCN model is, in some sense, a discrete version (albeit without monotonicity) of the monotonic noise model, and understanding either of the two models better would be a step forward.
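
To fix notation for the noise models discussed above (a rough summary for
convenience; the precise definitions are in the respective papers):

    CN:    P(\tilde{y} \neq y | x, y) = \rho      (a single uniform flip rate)
    CCN:   P(\tilde{y} \neq y | x, y) = \rho_y    (one flip rate per class)
    CPCN:  P(\tilde{y} \neq y | x, y) = \rho_j    for x in the j-th cell of a fixed partition
    Bylander's monotonic model: the flip probability decreases monotonically
    with the distance of x from the target hyperplane.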

---Assigned_Reviewer_8---
Thanks to the reviewer for pointing this out --- only 40% of the labels are flipped, not 80%. The idea of casting gradient-based convex minimization algorithms in Kearns' Statistical Query (SQ) framework to obtain noise-tolerant algorithms 'for free' is an intriguing one and does seem worth pursuing. However, the focus of this work is on modifying *surrogates* to obtain noise-tolerant algorithms. The SQ technique, if successful, would work by modifying *algorithms* (ensuring, for instance, that they only make statistical queries). Also, we use "so-called" only to mean "commonly known as", but the reviewer's question reveals an ambiguity in the usage that we had overlooked -- we will remove this ambiguity in future versions.