Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	11
Title:	Algorithmic Stability and Uniform Generalization

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Note: References present in the paper are referred to by their numerical citations as used in the paper.

Summary of Paper
========================================

The paper seeks to establish a connection between algorithmic stability and generalization performance. Notions of algorithmic stability have been proposed before and linked to the generalization performance of learning algorithms [6,11,13,14] and have also been shown to be crucial for learnability [14].

[12] proved that for bounded loss functions, the generalization of ERM is equivalent to the probabilistic leave-one-out stability of the learning algorithm. [14] then showed that a problem is learnable in Vapnik's general setting of learning iff there exists an asymptotically stable ERM procedure.

This paper first establishes that, for Vapnik's general setting of learning, a probabilistic notion of stability is necessary and sufficient for the training losses to converge to the test losses uniformly for all distributions. The paper then discusses how this notion of stability can be interpreted to give results in terms of the capacity of the function class or the size of the population.

Questions
========================================

- The utility of the main result of the paper is not clear - does the paper want to establish the convergence of training to test errors in cases when uniform convergence does not hold?
- Can the authors demonstrate one example of a learning problem where uniform generalization is impossible but the training errors of a stable algorithm still converge to the test errors?
- Can the results of [14] not be used to establish a similar result? [14] establish that learnability is equivalent to the existence of a stable AERM. Let f* be a population risk minimizer, f_ERM be an empirical risk minimizer, f_AERM be a stable asymptotic empirical risk minimizer as defined in [14]. Let R and R^ denote the population and empirical risk functionals. Then we have

R(f_AERM) <= R(f*) + epsilon (with enough samples since the problem is learnable)

<= R^(f*) + 2*epsilon (point convergence on f* using Hoeffding inequality)

<= R^(f_ERM) + 2*epsilon (f_ERM achieves smallest empirical risk)

<= R^(f_AERM) + 3*epsilon (with enough samples since algorithm is an AERM)

This establishes R(f_AERM) - R^(f_AERM) <= 3*epsilon for any epsilon > 0, given a large enough number of samples. This recovers the result this paper is trying to show.

- In general, why should one care about the difference of training and test error? It is learnability and test error that one is interested in, and the result of [14] already establishes that learnability is equivalent to the existence of uniformly RO-stable asymptotic empirical risk minimizers.

- A thorough discussion detailing the contributions of the paper w.r.t. previous results is thus required.
- The paper does not link its notion of stability to existing notions of algorithmic stability. This is necessary since existing notions like uniform RO-stability [14] have already been used to establish similar results.
- The discussion in Section 5.1 is vague and results are not stated formally. Dropout is a widely used heuristic and recent theoretical analyses of the method have generated some interest. It seems like a lost opportunity to have a discussion on the theoretical merits of dropout but not state anything formally.
- The results in Section 5.2 are not properly instantiated. The notion of ESS is introduced but never used to actually bound the effective size of some observation space, except for the very simple case of finite spaces in Corollary 1.
- The same holds true for Section 5.3 as well. The results are not properly instantiated save the simple case of finite hypothesis classes in Theorem 3. The second result (Theorem 4) establishes a different way of calculating the VC dimension of a function class that is not necessarily a classification class. However, yet again, no instantiations are given to demonstrate whether this notion of VC dimension is indeed useful and can be linked, for example, to other well-known notions of capacity such as uniform entropy, Rademacher averages, etc.
- As such, it becomes difficult to assess if the results in Section 5 are of any interest and if they shed any new light on the topic.
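To make the pointwise-convergence (Hoeffding) step in the chain of inequalities above concrete, here is a minimal sketch. It assumes a loss bounded in [0, 1] and i.i.d. samples, and uses the two-sided Hoeffding bound P(|R^(f*) - R(f*)| > epsilon) <= 2*exp(-2*n*epsilon^2) for the fixed hypothesis f*. The function names are illustrative and do not come from the paper or from [14].

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2*exp(-2*n*epsilon**2) <= delta.

    Assumes a loss bounded in [0, 1] and a single fixed hypothesis
    (pointwise, not uniform, convergence).
    """
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_deviation(n: int, delta: float) -> float:
    """Deviation epsilon guaranteed with probability >= 1 - delta from n samples."""
    if n <= 0 or not (0 < delta < 1):
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    n = hoeffding_sample_size(epsilon=0.1, delta=0.05)
    print(n, hoeffding_deviation(n, delta=0.05))
```

This bound applies only to a hypothesis fixed before seeing the sample, such as f*, which is why it can be used for that single step without any uniform convergence assumption.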

Quality
========================================

The paper may have interesting contributions but they are neither discussed, nor properly put in context of existing results. The goal of bounding the difference of training and test errors is not well-motivated, especially when uniform convergence is not possible.

Clarity
========================================

The paper is written well.

Originality
========================================

The paper mostly uses well established techniques in information complexity such as the data processing inequality. The notion of stability via the total variation distance between distributions seems novel and interesting.
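For reference, the total variation distance mentioned above can be computed for two discrete distributions as half their L1 distance. The sketch below illustrates only that standard definition; it is not the paper's stability definition, and the function name and dictionary representation are assumptions made for illustration.

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions.

    p and q map outcomes to probabilities; outcomes missing from one
    distribution are treated as having probability 0 there.
    """
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in outcomes)


if __name__ == "__main__":
    p = {"a": 0.75, "b": 0.25}
    q = {"a": 0.25, "b": 0.75}
    print(total_variation(p, q))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint supports.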

Significance
========================================

I am not sure if the paper, in its current form, would terribly excite the learning theory community, since algorithmic stability is already known to be equivalent to learnability. The other results in Section 5 are poorly instantiated and it is not clear if they present novel insights into the problem.