
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper investigates luckiness guarantees for online learning that may depend on the data and on the comparator. Deciding achievability of a proposed adaptive bound is cast as a minimax problem. This problem is subsequently relaxed using sequential Rademacher complexity. Probability tail bounds are used to give tractable expressions for achievable bounds in general and in a variety of examples.
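To make the casting as a minimax problem concrete, the achievability condition can be sketched as follows (a reconstruction from the summary above; the symbols $\mathcal{X}$, $\mathcal{Y}$, $\ell$ and the exact operator ordering are assumptions, and the paper's formulation may differ): a proposed adaptive bound $B_n(f; x_{1:n}, y_{1:n})$ is achievable if the value of the sequential game is nonpositive,

\[
\Big\langle \sup_{x_t \in \mathcal{X}} \; \inf_{q_t \in \Delta(\mathcal{Y})} \; \sup_{y_t \in \mathcal{Y}} \; \mathop{\mathbb{E}}_{\hat{y}_t \sim q_t} \Big\rangle_{t=1}^{n}
\left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) \;-\; \inf_{f \in \mathcal{F}} \left( \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \right) \right] \;\le\; 0,
\]

where the angled brackets denote the interleaved repetition of the three operators over rounds $t = 1, \ldots, n$. The relaxation via sequential Rademacher complexity then upper-bounds this value by a more tractable quantity.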
Originality and Significance:
This is a novel extension of the game-theoretic approach to obtain adaptive bounds. Whereas most algorithms are designed bottom-up, this approach starts at the core of the problem and works top-down. Tractability then becomes a trade-off with sharpness of bounds. This method is very powerful, and will undoubtedly result in new algorithms for existing and new adaptivity targets.
Clarity and Quality:
The paper is nicely done. I was a little disappointed that the methods, after plugging in the proposed tail bounds, unfortunately cannot match the state of the art in the worked-out examples. I also feel that the discussion of the results is a little one-sided, as it does not point this out enough.
The bound of Example 4.4 should be compared to Theorem 1 and Corollary 1 from the NIPS'14 paper "Exploiting Easy Data in Online Optimization" by Sani, Neu and Lazaric. Their bound seems strictly better, as it gives constant regret compared to the singled-out strategy, independent of the time horizon and the number of other strategies. I am also a little wary of the current bound's dependence on $L$. Intuitively the whole bound should simply scale with $L$, for this is what should happen if we rescale the loss function. (If not, the results can be improved by pre- and post-scaling appropriately.)
The bound of Example 4.5 should be compared to Theorem 4 (which trivially extends from subsets to mixtures) from the COLT'15 paper "Second-order Quantile Methods for Experts and Combinatorial Games" by Koolen and Van Erven. Doing so, the advantage that the current paper sees over [13] (which is obtained by magnifying lower-order terms) disappears. What remains is that the current paper has a worse linear term (loss of comparator f instead of variance of regret with respect to f) at the cost of a worse multiplicative overhead (ln n at time horizon n instead of ln ln n). It is surprising that the proposed method, which in principle should be much tighter as it is minimax-optimal, ignores all computational issues and is fixed-horizon, loses out on all fronts to a computationally efficient anytime method. It would be interesting to see a dissection of where the advantage is lost.
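For context, the computationally efficient baseline that these quantile methods for experts refine is exponential weights (Hedge). A minimal sketch follows; this is an illustration of the standard algorithm, not the paper's method, and the function name and learning-rate choice are illustrative:

```python
import math

def hedge(losses, eta):
    """Exponential-weights forecaster over N experts.

    losses: list of rounds, each a list of N expert losses in [0, 1].
    eta: learning rate (e.g. proportional to sqrt(ln N / n) for horizon n).
    Returns (total learner loss, total loss of the best expert).
    """
    n_experts = len(losses[0])
    log_w = [0.0] * n_experts          # log-weights, uniform prior
    learner_loss = 0.0
    cum = [0.0] * n_experts            # cumulative expert losses
    for round_losses in losses:
        m = max(log_w)                 # stabilize the exponentiation
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        p = [wi / total for wi in w]   # play the normalized weights
        learner_loss += sum(pi * li for pi, li in zip(p, round_losses))
        for i, li in enumerate(round_losses):
            log_w[i] -= eta * li       # multiplicative update
            cum[i] += li
    return learner_loss, min(cum)
```

The regret, `learner_loss - min(cum)`, is O(sqrt(n ln N)) for a tuned learning rate; the second-order and quantile methods discussed above sharpen this uniform guarantee.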
Minor:
line 161: move the left-hand side of this definition from line 160 into the equation.
lines 272, 372: You need a singleton $\cal X$. If it is empty everything collapses.
line 297: Why use $\tilde{O}$ if you spell out the log factor?
line 304: It looks like an empty $\cal F(1)$ presents problems. For what is the Rademacher complexity of the empty set? If it is $0$, we have a division by $0$. If it is $\infty$, the bound on line 304 is unachievable.
line 311: Why is an infinite set of radii an issue for exponential weights?
line 336: omit "is"
Example 4.4: I was unsure what $N$ means here. Is $N = |\cal F|$?
line 418: Perhaps omit this, it sounds a little embarrassing. You did find the appendix for other material.
page 9: please unify the reference style (full names vs initials) and check for capitals in titles: Hilbert, Banach, ...
*** UPDATE *** I would like to respond to this sentence from the author response:
The bound of Koolen et al is N-independent, but is in terms of the prior (that is, MDL-style).
The suggested difference is purely cosmetic and not intrinsic. A trivial modification of the analysis results in a KL (i.e. PAC-Bayes) bound. Koolen and Van Erven seem to do this for their Theorem 8 but not their Theorem 4. Yet, there is no difficulty whatsoever.
Q2: Please summarize your review in 1-2 sentences
The paper applies game-theoretic methods to obtaining adaptive regret guarantees. The techniques are very general and powerful, but unfortunately undershoot the state of the art in the worked-out examples.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Over the last few years, Rakhlin, Sridharan and collaborators have built up a learning theory for online learning based on sequential Rademacher complexities, which can characterize the achievable minimax rates for many function classes. So far, however, this theory has not been able to capture adaptive rates, which are non-uniform over the model and over "nice" data sequences. This is the contribution of the present paper: it extends the sequential Rademacher complexity framework to prove adaptive rates. As in the non-adaptive case, the theory is non-constructive: it does not immediately give any efficient algorithms to achieve the rates.
Technically, the new results are obtained by using one-sided concentration inequalities to control an empirical process that involves the Rademacher complexities and the adaptive rate of interest.
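As a rough numerical illustration of the quantity being controlled, the Rademacher complexity of a finite class of fixed prediction sequences can be estimated by Monte Carlo. This is a toy sketch with i.i.d. signs and made-up data, not the paper's sequential (tree-based) construction; for static sequences the two notions agree, since there is nothing for the signs to adapt to:

```python
import random

def rademacher_estimate(F, trials=2000, seed=0):
    """Monte Carlo estimate of E[ max_{f in F} sum_t eps_t * f_t ]
    for a finite class F of fixed real-valued sequences, where the
    eps_t are independent +/-1 Rademacher signs."""
    rng = random.Random(seed)
    n = len(F[0])
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(e * ft for e, ft in zip(eps, f)) for f in F)
    return total / trials
```

For example, for the two antipodal constant sequences of length n, the estimate approximates E|sum_t eps_t|, which is about sqrt(2n/pi).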
Significance
Since adaptive rates are recently receiving significant attention, a better understanding of which adaptive rates are possible is an important contribution. A strong point of this work is that it obtains a very diverse selection of bounds.
In this regard, I believe that Example 4.1 (about the spectral norm) and Corollary 7 (an abstract result for model selection) are new. Example 4.2 recovers a bound by [15], and Examples 4.3 and 4.4 are in the spirit of results by [9], but go further.
Example 4.5 is also definitely interesting, but the paper's discussion of its relation to the literature is not quite clear to me. In particular, it is claimed that this example subsumes and improves upon the results of [5], but the term $\sum_{t=1}^n \langle f, y_t \rangle$ in the bound is a small-loss/L*-type bound, which is different from (and probably weaker than) the cumulative magnitude C_T of instantaneous regrets used in [5]. The dependence on n is also worse in Example 4.5 compared to [5], where it is O(log log n).
Presentation and correctness
The paper is well-written, but the proofs sometimes jump over some details that might be important:
Lemma 1:
* The proof says that additional mild assumptions (like compactness) are required to swap the min and the max, so please state these in the lemma.
* The proof of (5) assumes that swapping a derivative and an expectation is allowed. Please provide a reference or state this as an assumption.
Proposition 2:
* In line 780 it seems that you choose $\delta_i = B_i(\theta_i - 1)$. If not, line 780 is not clear to me. If so, please state this explicitly.
* You exclude delta from the final bound based on the reasoning that it can be chosen arbitrarily small. But this is not possible if you fix $\delta_i = B_i(\theta_i - 1)$. And if your reasoning is correct after all, why don't you simply set $\delta_i = 0$ from the start?
Minor comments:
* Throughout the paper, you handle the settings where x is absent by setting $\cal X = \emptyset$. But, formally speaking, nature cannot choose from an empty set, and I don't know how functions from the empty set to the reals are defined. So perhaps you should set $\cal X$ to be a singleton set instead.
* In the displays in section 1.1, I think you should add expectations of $\hat{y}_t$ under $q_t$.
* In section 1.2: quantile bounds also depend on the data, because the epsilon-quantile is the set of experts with cumulative loss within epsilon of the best expert.
* line 221: please also define this for $p = \infty$, because that's what you use it for.
* Lemma 4: $R_n(G) \to 0$ is not possible, because you did not divide by n in its definition in this paper.
* Lemma 5: N(G,delta) is for which p?
* Corollary 6: "small constant" is not a formal statement, so it cannot be part of a corollary.
* line 594 has a typo in the argmin
* line 611: say that you are scaling the difference of derivatives by $1/(2L)$ to obtain $s_t \in [-1, +1]$.
Q2: Please summarize your review in 1-2 sentences
Since adaptive rates are recently receiving significant attention, a better understanding of which adaptive rates are possible is an important contribution. A strong point of this work is that it obtains a very diverse selection of bounds.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper considers the problem of deriving adaptive regret bounds in the online learning framework; such bounds can be categorized into three types: 1) those depending only on the comparator f, 2) those depending only on the data, and 3) those depending on both (quantile bounds).
As claimed, the paper proposes a general framework which recovers the vast majority of known adaptive rates in the literature, including variance bounds, quantile bounds, localization-based bounds, and fast rates for small losses.
The paper is well-written and clearly states the main points.
The paper is a theoretical paper, which is beyond the scope of the reviewer's expertise.
The reviewer is concerned with the following issues:
1) Originality of the claimed results:
It is unclear to the reviewer which specific tools are proposed and devised to derive the adaptive regret bounds, which leads the reviewer to conjecture that some results reinvent the wheel by using existing results on concentration inequalities, the sequential Rademacher complexity, and the offset processes.
2) Significance: It is unknown how the theoretical results connect with concrete online learning algorithms.
The paper seems to assume that a general oracle algorithm exists, and then derives the corresponding adaptive regret bound.
It is unclear how the theoretical results could guide concrete online learning algorithms, or, given the bounds, what can be improved.
3) The lack of empirical evaluation makes it hard to gauge the tightness of the bounds or to give visual results.
Minor comments: 1. On p. 2, line 63: "the the right".
2. On p. 2, line 074: $\cal X = \emptyset$ seems incorrect, as $x_t$ belongs to $\cal X$.
Q2: Please summarize your review in 1-2 sentences
The paper is a theoretical paper to propose a general framework for adaptive regret bounds, including modelselection bounds and datadependent bounds, in the online learning framework.
No empirical evaluation is presented.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for the useful comments and suggestions.
Let us address the following main issues raised by the reviewers: 1. correct scaling in Examples 4.4 and 4.5 (Reviewers 1 and 2); 2. computationally efficient methods (Reviewers 4 and 6); 3. originality and significance (Reviewer 3).
Regarding issue #1, it is indeed the case that we did not match the scaling in some of the regimes in Examples 4.4 and 4.5. This stems from the fact that the main result is very general (Corollary 6 holds well beyond finite or countably infinite F), and we were not careful about the logarithmic factors for ``small classes''. We went back to the proofs and were able to remove these extraneous terms in Example 4.4 (we will make the modification in the final draft).
More specifically:
a. ``The bound of Example 4.4 should be compared to Theorem 1 and Corollary 1 from the NIPS'14 paper `Exploiting Easy Data in Online Optimization' by Sani, Neu and Lazaric ...'': Thank you for pointing out the reference; we were unaware of this work. A simple modification of Corollary 6 (specifically catered to the finite function class case) yields constant regret against f* and O(sqrt(n log N) (log n + log log N)) regret against arbitrary benchmarks.
b. ``The bound of Example 4.5 should be compared to Theorem 4 (which trivially extends from subsets to mixtures) from the COLT'15 paper `Second-order Quantile Methods for Experts and Combinatorial Games' by Koolen and Van Erven'': Thank you for the reference; we only learned about this paper after the submission. Let us mention a few points of comparison of our work with regard to this paper, as well as [13] by Luo and Schapire. The bound of Koolen et al is N-independent, but is in terms of the prior (that is, MDL-style). The bound of [13] is in terms of KL (as in PAC-Bayes), yet has N dependence. Note that the bound of Example 4.5 has no dependence on N and depends on the prior through KL, so it is a bona fide PAC-Bayesian bound; as such, it can be used for uncountably infinite classes. Second, both 4.4 and 4.5 are algorithm-independent regret bounds.
We also remark that in Example 4.5 (specifically, the proof of Lemma 9), we obtained the L* bound starting from a quadratic variation term. Keeping this quadratic term resolves, at least non-constructively, the open question of Cesa-Bianchi, Mansour, Stoltz '06 about an algorithm-independent variance bound for experts. Example 4.3, however, goes well beyond this open question and gives a spectrum of variance bounds for rich classes.
In sum, the extraneous log factors appeared in our original submission because the main result was catered to work well beyond the finite case (this finite case is the focus of the references provided by the reviewers).
Regarding issue #2, it is indeed the case that our results are non-constructive. However, in the last part we outline a generic recipe for obtaining algorithms. The original work on sequential complexities was also non-constructive, but the understanding of the correct complexities led to many novel algorithms in subsequent papers. The development of computationally efficient adaptive methods is surely an exciting area of further investigation.
Regarding issue #3, with respect to the novelty, we respectfully disagree: we introduce online PAC-Bayes theorems, provide variance bounds well beyond the finite class, address an open question by Cesa-Bianchi et al, and, more importantly, show that all of these follow from a unified analysis. Our work lays out a unifying approach for the theoretical analysis of adaptive online learnability, which we term ``achievability''.
