
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a supervised learning algorithm. It uses stochastic gradient descent and periodically expands the hypothesis space by introducing new basis functions and adding corresponding components to the weight vector. As such, as it processes more data, it fits more complex models. The hypothesis space considered here is the space of polynomials, and higher-order monomials are gradually introduced into the model.
The concept of growing the hypothesis space as more data is introduced is not new (training kernel methods with SGD exhibits this behavior), but in the proposed method, choosing which monomials to add to the hypothesis space is very cheap. The method selects monomials by inspecting the components of the weight vector instead of, say, selecting the monomial that would most reduce the training error (as a matching pursuit variant of the method might).
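The weight-inspection rule described above could be sketched roughly as follows (a minimal sketch with hypothetical names and tie-breaking; the paper's exact criterion and bookkeeping may differ):

```python
import numpy as np

def expand(weights, monomials, num_features, k=2):
    """Propose new monomials by pairing the k highest-|weight| monomials
    with each base feature. A sketch of weight-based expansion, not the
    paper's exact rule: names and details here are hypothetical."""
    existing = set(monomials)
    top = np.argsort(-np.abs(weights))[:k]  # indices of largest-magnitude weights
    proposals = set()
    for i in top:
        for f in range(num_features):
            cand = tuple(sorted(monomials[i] + (f,)))  # multiply monomial i by feature f
            if cand not in existing:
                proposals.add(cand)
    return sorted(proposals)
```

The point of the sketch is that the cost of selection is dominated by a scan of the weight vector, rather than by evaluating each candidate monomial on the training data.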
The experiments are thorough and convincing. The method is nearly as fast as fitting a quadratic polynomial, yet just as accurate as fitting a cubic polynomial.
The regret bound and its proof are a treat to read. It is pleasant that the math works out so that more weight is placed on later comparators.
The authors might clarify that the regret bound offers no guidance on how to choose the monomials to add. Nor does it offer guidance on when to expand the hypothesis space. Rather it states that any nested hypothesis spaces of this kind (see my note about "this kind" below) do not have high regret.
Some suggestions:
1) The claim of Theorem 1 can be slightly generalized. [·]_{S(t)} picks the components of its argument in the support set S(t). But the proof goes through if [·]_{S(t)} is more generally a projection onto a linear subspace S(t). So Theorem 1 holds for any sequence of linear subspaces S(t) whose dimensionality expands with t.
2) The two graphs of Figure 2 are informative. But the following combination of these two graphs would have been more helpful to me: draw one curve per algorithm as you currently do; let the x-axis be relative time (exactly as you already compute it in Figure 2(a)), and let the y-axis be the number of datasets on which the algorithm obtains the best score among all algorithms within that running time.
[More detail: all of the algorithms being considered refine their solutions over time; long before they satisfy their stopping criteria, each produces a useful result. I want to know how many seconds apple requires to beat its competition. I do not need to know its relative error, but rather whether it beats all of its competitors by time T. The graph would tell me how many datasets it dominates at time T.]
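The counts for this suggested plot could be computed along these lines (a sketch; the `errors` layout, mapping each dataset and algorithm to its error at each sampled time, is an assumption):

```python
def dominance_counts(errors, algorithms, times):
    """For each time t, count on how many datasets each algorithm attains
    the lowest error among all algorithms by that time.
    `errors[dataset][algorithm][t]` is a hypothetical layout giving an
    algorithm's best error on a dataset within relative time t."""
    counts = {a: [0] * len(times) for a in algorithms}
    for ti, t in enumerate(times):
        for per_alg in errors.values():
            best = min(algorithms, key=lambda a: per_alg[a][t])
            counts[best][ti] += 1
    return counts
```

Plotting `counts[a]` against `times` for each algorithm `a` then gives one curve per algorithm, as suggested.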
[In the interest of reproducible research, I am also upping my score by +1 because I noticed this technique is publicly available in VW now.]
Q2: Please summarize your review in 1-2 sentences
The experiments are thorough and convincing. The regret bound and its proof are a treat to read.
Submitted by Assigned_Reviewer_24
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a scalable algorithm called ``apple'' for nonlinear learning. The authors use a heuristic rule to adaptively expand the feature set with interaction features and empirically show its effectiveness. Moreover, for a specific objective function, under some handcrafted parameter settings, the authors provide a regret bound with a theoretical guarantee.
Although the authors illustrate the feature expansion heuristic with a simple example (see Section 2.3), the reason for the performance of the proposed algorithm is not clear for real-world problems. Also, I cannot see connections between the regret bound (Theorem 1) and the empirical evaluation. There is a lot of previous work in the literature. It would be better to see a comparison between ``apple'' and recently established algorithms, e.g. [16, 17], although they are proposed in a batch manner.
Pros:
- The paper proposes an online algorithm to adaptively learn nonlinear features, making it scalable to large-scale problems.
- Under specific settings, the algorithm enjoys a remarkable regret bound and convergence rate.
- Experimental results show that ``apple'' achieves a good tradeoff between accuracy and efficiency.
Cons:
- Analysis of the feature expansion heuristic for real-world problems is not clear.
- Some recently presented algorithms are not compared in the experiments.
- The connection between the theoretical guarantee on the regret bound and the empirical evaluation is not intuitive.
Quality: It is promising to make nonlinear learning scalable to large-scale problems in an online fashion, but the analysis of the algorithm, especially the feature expansion heuristic, is not satisfactory.
Clarity: The authors should further clarify the difference between their work and previous work in the literature, e.g. [14-18].
Originality: The differences from the previous work are not clearly stated.
Significance: There is a comprehensive study of the proposed algorithm and the baselines. The empirical evaluation justifies the superiority of ``apple'' over the others. However, the authors do not compare their algorithm with recently established work.
Other comments:
- In line 132, ``large but finite maximum degree'': do you predefine the degree? What is the actual degree in the experiments?
Q2: Please summarize your review in 1-2 sentences
The proposed online feature expansion algorithm is suitable for large-scale problems. However, the reason why it works well is not clear.
Submitted by Assigned_Reviewer_41
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a technique to learn a subset of higher-order interaction terms to efficiently add nonlinearities to base linear (monomial-based) representations. They compare their algorithm to linear/quadratic/cubic polynomials in terms of speed as well as accuracy. The idea seems reasonable and straightforward: iteratively add interaction terms between the most significant features (as defined by the highest-weighted coefficients).
The paper is reasonably clear. They motivate the problem well and describe some alternative approaches. The algorithm description is also good, and provides sufficient information to allow the work to be replicated. The section discussing the regret bound could be improved, however: it uses terms (such as shifting comparators) that are not defined and may be unfamiliar to readers who do not know the cited references well. Some of the formulas in Section 2.2 use terms that are not defined previously. It would also be helpful if they included additional details as to why they consider Theorem 1 to be remarkable (I did not understand the phrase "no dependence on the cumulative shifting of the comparators").
The core idea seems to make sense and is quite natural. They produce experimental evidence to suggest that it can be more accurate than not using any higher-order interaction terms, and faster than a fixed-degree polynomial (such as a cubic). However, while it does seem to be better than using a fixed set of 2nd-order interaction terms, the distinction between the apple algorithm and the quadratic polynomial does not seem to be very large, though they show that the difference may be more marked on some of the datasets where the quadratic expansion is more expensive.
Q2: Please summarize your review in 1-2 sentences
The paper describes an algorithm which seems to be a relatively straightforward extension of current practice. The paper is reasonably well written, but Section 2.2 could be improved to avoid using terms without definition that many readers may be unfamiliar with. Finally, while the authors provide summary comparisons of different algorithms across multiple datasets, it would have been helpful to dive deeper in cases where (for example) the fixed quadratic polynomial expansion outperformed apple (in terms of accuracy), and understand why that was the case (larger dataset? more features?).
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their feedback. We have incorporated their suggestions and addressed their comments in our current draft.
Reviewer_2:
Thanks for your suggestion regarding the regret bound. You are also right to observe that it does not, by itself, offer any guidance in basis expansion; we clarify this point in the revision.
Regarding plots, we are not sure we fully follow the suggestion. Can you clarify?
Reviewer_24:
The baselines used in our evaluation were justifiably subject to the same computational constraints we place on our approach. This rules out the alternatives in [16,17] suggested by the reviewer, as well as nearly all others in the literature. Given the computational focus of our work, we hope that the reviewer will recognize the fairness of comparison with similarly constrained alternatives. Without computational concerns, sparse regression on any reasonable polynomial expansion is a statistically sound idea for sparse polynomial learning, making the problem relatively trivial.
Although the original methods from [15,17] are computationally intractable with large feature sets, we did implement an efficient variant inspired by [15] and related to [17] (the SSM heuristic in Sec. 2.3). As is evident in Fig. 2, this heuristic was dominated by our approach and hence omitted from later comparisons for visual clarity.
At present, it is unreasonable to demand a full, rigorous analysis of a practically scalable nonlinear learning algorithm. Indeed, no other approach has reached this bar, and most only provide either empirical justification or theoretical analysis, but not both. Our approach has some theoretical backing, as well as an extensive empirical evaluation. We emphasized the latter because it offers cues to where future theoretical efforts should focus.
The degree of the final polynomials is bounded by the number of epochs, which is six in our experiments.
Reviewer_41:
Most of our terminology is standard in the online learning literature, and we were forced to omit some of these explanations due to the space constraint. The full version of the paper discusses these terms in more detail. A list of other confusing terms would be greatly appreciated.
Simplicity in hindsight is an important feature of our approach. Computational constraints make many complex approaches intractable. Moreover, many other equally simple and straightforward alternatives turn out to fail catastrophically. One such example is the discussion around testing the quality of a monomial before inclusion in the last paragraph of Page 7. Although this is a natural idea, it ends up being computationally intractable. We believe that a simple approach which convincingly beats standard (and strong) baselines is actually quite deserving of publication.
The discussions in Sec. 3.2 that reference Fig. 3 and Fig. 4 provide some answers to your question regarding the performance of nonadaptive expansions. On these larger datasets, our algorithm adaptively finds monomials that reduce the error more economically (in terms of computation time) than the nonadaptive expansions.
 