Paper ID: 1109 Title: Scalable Non-linear Learning with Adaptive Polynomial Expansions
Current Reviews

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a supervised learning algorithm. It uses stochastic
gradient descent and periodically expands the hypothesis space by
introducing new basis functions and adding corresponding components to
the weight vector. As such, as it processes more data, it fits more
complex models. The hypothesis space considered here is the set of
polynomials, and higher-order monomials are gradually introduced into
the model.

The concept of growing the hypothesis space as more data is introduced
is not new (training kernel methods with SGD exhibits this behavior),
but in the proposed method, choosing which monomials to add to the
hypothesis space is very cheap. The method selects monomials by
inspecting the components of the weight vector instead of, say,
selecting the monomial that would most reduce the training error (as a
matching pursuit variant of the method might).

The experiments are thorough and convincing. The method is nearly as
fast as fitting a quadratic polynomial, yet just as accurate as
fitting a cubic polynomial.

The regret bound and its proof are a treat to read. It is pleasant
that the math works out so that more weight is placed on later
comparators.

The authors might clarify that the regret bound offers no guidance on
how to choose the monomials to add. Nor does it offer guidance on when
to expand the hypothesis space. Rather, it states that any nested
hypothesis spaces of this kind (see my note about "this kind" below)
do not incur high regret.

Some suggestions:

1) The claim of Theorem 2 can be slightly generalized. [·]_S(t) picks
the components of a vector in the support set S(t). But the proof goes
through if [·]_S(t) is more generally a projection onto a linear
subspace S(t). So the theorem holds for any sequence of linear subspaces
S(t) whose dimensionality expands with t.
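In symbols, this generalization amounts to replacing coordinate restriction by orthogonal projection (a sketch of the point in the reviewer's notation; the symbols V_t and P_V are introduced here for illustration and are not from the paper):

```latex
% Coordinate restriction as a special case of orthogonal projection:
% [x]_{S(t)} keeps exactly the coordinates indexed by the support set S(t),
% i.e. it is the projection onto V_t = span{ e_i : i in S(t) }.
[x]_{S(t)} = P_{V_t}\, x,
\qquad
P_{V}\, x = \operatorname*{arg\,min}_{y \in V} \|x - y\|_2,
\qquad
V_t \subseteq V_{t+1}.
% The proof only uses that P_{V_t} is a linear projection and that the
% subspaces are nested, so any such sequence V_t works.
```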

2) The two graphs of Figure 2 are informative. But the following
combination of the two would have been even more helpful to me: draw one
curve per algorithm as you currently do; let the x-axis be relative time
(exactly as you already compute it in Figure 2(a)), and let the y-axis be
the number of datasets on which the algorithm obtains the best score among
all algorithms within that running time.

[More detail: all of the algorithms being considered refine their solution over time. Long before they satisfy their stopping criteria, each produces a useful result. I want to know how many seconds apple requires to beat its competition. I don't need to know its relative error, but rather whether it beats all of its competitors by time T. The graph would tell me how many datasets it dominates at time T.]
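The quantity this suggested plot would show could be computed as follows (a sketch with a made-up data layout; the dictionary structure and the strict-dominance rule are my assumptions, not the paper's actual results):

```python
def dominance_counts(errors, T):
    """errors[alg][dataset] is a list of (time, error) pairs, sorted by time.
    Return, for each algorithm, the number of datasets on which its best
    error achieved by time T beats every competitor's best error by time T."""
    def best_by(pairs, T):
        # best (lowest) error reached within the time budget T
        vals = [e for (t, e) in pairs if t <= T]
        return min(vals) if vals else float("inf")

    algs = list(errors)
    datasets = list(next(iter(errors.values())))
    counts = {a: 0 for a in algs}
    for d in datasets:
        best = {a: best_by(errors[a][d], T) for a in algs}
        for a in algs:
            # strict dominance: a beats every other algorithm on dataset d
            if all(best[a] < best[b] for b in algs if b != a):
                counts[a] += 1
    return counts
```

Sweeping T over the x-axis and plotting `counts[alg]` per algorithm would yield one dominance curve per method.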

[In the interest of reproducible research, I am also upping my score by +1 because I noticed this technique is publicly available in VW now.]
The experiments are thorough and convincing. The regret bound and its proof are a treat to read.

Submitted by Assigned_Reviewer_24

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

The paper proposes a scalable algorithm called ``apple'' for non-linear learning. The authors take a heuristic rule to adaptively expand the feature set with interaction features and empirically show its effectiveness. Moreover, for a specific objective function, under some hand-crafted parameter settings, the authors provide a regret bound with a theoretical guarantee.

Although the authors illustrate the feature expansion heuristic with a simple example (see Section 2.3), the reason for the performance of the proposed algorithm on real-world problems is not clear. Also, I cannot see connections between the regret bound (Theorem 1) and the empirical evaluation. There is a lot of previous work in the literature. It would be better to see a comparison between ``apple'' and recently established algorithms, e.g. [16, 17], although they are proposed in a batch manner.

Pros:
- The paper proposes an online algorithm to adaptively learn non-linear features, making it scalable to large-scale problems.
- Under specific settings, the algorithm enjoys a remarkable regret bound and convergence rate.
- Experimental results show that ``apple'' achieves a good tradeoff between accuracy and efficiency.

Cons:
- The analysis of the feature expansion heuristic for real-world problems is not clear.
- Some recently presented algorithms are not compared in the experiments.
- The connection between the theoretical guarantee on the regret bound and the empirical evaluation is not intuitive.

Quality:
It is promising to make the non-linear learning scalable to large scale problems in an online fashion. But the analysis for the algorithm, especially the feature expansion heuristic, is not satisfactory.

Clarity:
The authors should further clarify the difference between their work and previous work in the literature, e.g. [14-18].

Originality:
The differences from the previous work are not clearly stated.

Significance:
There is a comprehensive study on the proposed algorithm and the baselines. The empirical evaluation justifies the superiority of ``apple'' over the others. However, the authors do not compare their algorithm with recently established work.

- In line 132, ``large but finite maximum degree'', do you pre-define the degree? What is the actual degree in the experiments?
The proposed online feature expansion algorithm is suitable for large scale problems. However, the reason why it works well is not clear.

Submitted by Assigned_Reviewer_41

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a technique to learn a subset of higher-order interaction terms to efficiently add nonlinearities to base linear (monomial-based) representations. The authors compare their algorithm to linear/quadratic/cubic polynomials in terms of speed as well as accuracy. The idea seems reasonable and straightforward: iteratively add interaction terms between the most significant features (as defined by the highest-weighted coefficients).
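The selection heuristic described above could be sketched roughly as follows (a minimal illustration of the "heaviest coefficients" idea, not the authors' actual implementation; the function name and the top-k cutoff are my assumptions):

```python
import numpy as np

def expand_features(X, w, feature_names, k=2):
    """Add pairwise interaction monomials between the k features whose
    current weights have the largest magnitude (a sketch of the heuristic:
    expand the hypothesis space around the most significant coefficients)."""
    top = np.argsort(np.abs(w))[::-1][:k]   # indices of the heaviest weights
    new_cols, new_names = [], []
    for a in range(len(top)):
        for b in range(a + 1, len(top)):
            i, j = top[a], top[b]
            new_cols.append(X[:, i] * X[:, j])  # new monomial x_i * x_j
            new_names.append(f"{feature_names[i]}*{feature_names[j]}")
    X_new = np.column_stack([X] + new_cols)
    # new weight components start at zero, as in the SGD setting
    w_new = np.concatenate([w, np.zeros(len(new_cols))])
    return X_new, w_new, feature_names + new_names
```

In the paper's online setting this expansion would be applied periodically between SGD passes, so the model grows as more data is processed.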

The paper is reasonably clear. The authors motivate the problem well and describe some alternative approaches. The algorithm description is also good, and provides sufficient information to allow the work to be replicated. The section discussing the regret bound could be improved, however: it uses terms (such as shifting comparators) that are not defined and may be unfamiliar to readers who do not know the cited references well. Some of the formulas in Section 2.2 use terms that are not defined previously. It would also be helpful to include additional details on why Theorem 1 is considered remarkable (I did not understand the phrase "no dependence on the cumulative shifting of the comparators").

The core idea seems to make sense, and is quite natural. The authors produce experimental evidence suggesting that it can be more accurate than using no higher-order interaction terms at all, and faster than, say, a fixed-degree polynomial (such as a cubic). However, while it does seem to be better than using a fixed set of second-order interaction terms, the gap between the apple algorithm and the quadratic polynomial does not seem very big, though the authors show that the difference may be more marked on some of the datasets where the quadratic expansion is more expensive.
The paper describes an algorithm which seems to be a relatively straightforward extension of current practice. The paper is reasonably well written, but Section 2.2 could be improved to avoid using undefined terms that many readers may be unfamiliar with. Finally, while the authors provide summary comparisons of different algorithms across multiple datasets, it would be helpful to dive deeper in cases where (for example) the fixed quadratic polynomial expansion outperformed apple in terms of accuracy, and understand why that was the case (larger dataset? more features?).
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their feedback. We have incorporated their suggestions and addressed their comments in our current draft.

Reviewer_2:

Thanks for your suggestion regarding the regret bound. You are also right to observe that it does not, by itself, offer any guidance in basis expansion; we clarify this point in the revision.

Regarding plots, we are not sure we fully follow the suggestion. Can you clarify?

Reviewer_24:

The baselines used in our evaluation were justifiably subject to the same computational constraints we place on our approach. This rules out the alternatives in [16,17] suggested by the reviewer, as well as nearly all others in the literature. Given the computational focus of our work, we hope that the reviewer will recognize the fairness of comparison with similarly constrained alternatives. Without computational concerns, sparse regression on any reasonable polynomial expansion is a statistically sound idea for sparse polynomial learning, making the problem relatively trivial.

Although the original methods from [15,17] are computationally intractable with large feature sets, we did implement an efficient variant inspired by [15] and related to [17] (the SSM heuristic in Sec. 2.3). As evident in Fig. 2, this heuristic was dominated by our approach and hence omitted from later comparisons for visual clarity.

At present, it is unreasonable to demand a full, rigorous analysis of a practically scalable non-linear learning algorithm. Indeed, no other approach has reached this bar, and most only provide either empirical justification or theoretical analysis, but not both. Our approach has some theoretical backing, as well as an extensive empirical evaluation. We emphasized the latter because it offers cues to where future theoretical efforts should focus.

The degree of the final polynomials is bounded by the number of epochs, which is six in our experiments.

Reviewer_41:

Most of our terminology is standard in the online learning literature, and we were forced to omit some of these explanations due to the space constraint. The full version of the paper discusses these terms in more detail. A list of other confusing terms would be greatly appreciated.

Simplicity in hindsight is an important feature of our approach. Computational constraints make many complex approaches intractable. Moreover, many other equally simple and straightforward alternatives turn out to fail catastrophically. One such example is the discussion of testing the quality of a monomial before inclusion, in the last paragraph of page 7: although this is a natural idea, it ends up being computationally intractable. We believe that a simple approach which convincingly beats standard (and strong) baselines is actually quite deserving of publication.

The discussion in Sec. 3.2 that references Fig. 3 and Fig. 4 provides some answers to your question regarding the performance of non-adaptive expansions. On these larger datasets, our algorithm adaptively finds monomials that reduce the error more economically (in terms of computation time) than the non-adaptive expansions.