Paper ID: 598
Title:Incremental Local Gaussian Regression
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper extends Locally Weighted Regression (LWR) so that the loss function is not a weighted combination of local models; instead, individual data points are weighted by each model. By putting a Gaussian prior on the coefficients of the models, the local models end up as Gaussian processes (GPs). First, the batch version of the algorithm is presented, where the centers of the local models are assumed to be fixed. Then an incremental version is introduced (in a Bayesian spirit, where the old posterior is considered to be the new prior), in which new models can also be added.
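
(To make the incremental step concrete, here is a minimal sketch of the posterior-as-prior recursion for a single local linear-Gaussian model, with the per-point weighting entering as an effective precision. This is the generic construction, not necessarily the authors' exact variational update; all names are illustrative.)

    import numpy as np

    def posterior_as_prior_update(mu, Sigma, x, y, w, beta=1.0):
        # One incremental update for a single local model: the current
        # posterior N(mu, Sigma) over the local coefficients serves as
        # the prior for the next data point (x, y). The locality weight
        # w of this point scales the noise precision beta.
        prior_prec = np.linalg.inv(Sigma)
        post_prec = prior_prec + w * beta * np.outer(x, x)
        Sigma_new = np.linalg.inv(post_prec)
        mu_new = Sigma_new @ (prior_prec @ mu + w * beta * y * x)
        return mu_new, Sigma_new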

Quality and Clarity

The paper is well written and easy to read. The transition from LWR to LGR is well introduced.

The experiments present results from a robotic setup, learning the inverse dynamics of a SARCOS arm and a KUKA arm. The results show that LGR outperforms LWPR, which is considered one of the best state-of-the-art methods in inverse dynamics learning. Furthermore, better performance is achieved with fewer local models.

The authors mention the article 'Local Gaussian process regression for real time' [16], but the connection to the presented method is only loosely discussed. A comparison with [16] at the level of the experiments would also be very interesting.

Originality and Significance

The new weighting of the local models seems to be original and the experimental results highlight the significance of the presented method.
Q2: Please summarize your review in 1-2 sentences
The paper extends Locally Weighted Regression (LWR) with a different weighting of the local models, leading to local Gaussian models. Results show that the presented method outperforms the state-of-the-art algorithm.

Submitted by Assigned_Reviewer_20

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a regression scheme based on locally weighted regression (LWR). By making links from LWR to Gaussian process (GP) regression, a probabilistic model is formed. The model initially appears to scale like a GP (cubic in N), but a variational scheme rescues the method. The updates in the variational scheme mimic the LWR procedure, recovering the scalability of the algorithm.

Clarity
--
The paper is very well written. It's easy to follow, the notation is simple and consistent, and sufficient detail is presented without being overwhelming. The link from LWR to GP regression is especially neat. Although it's clear that any individual model in the LWR scheme comprises a GP (as does any linear Gaussian model), it seems novel to interpret the entire methodology as such.

If one section of the paper is unclear, it is Section 4. The authors may protest constraints on space, but this section seems somewhat anecdotal in comparison to the thoroughness of the rest of the paper. The presentation of the algorithm helps, I suppose.

Quality and significance
--
The paper is technically sound, is of interest to a large section of the NIPS community, and contains solid experiments. The experiments chosen represent interesting challenges and the proposal appears to make a good improvement in terms of speed whilst maintaining accuracy.

I would really like to see a mention of the availability of the implementation, which would enhance the paper further.

I have one quality-related complaint: the probabilistic nature of the algorithm is not explored in the experiments. I would have liked to see the average log-density of held-out data alongside the MSE. Whilst the LWR method might not provide probabilistic estimates, the proposed method and SSGPR will. Surely in a robotics environment, where decisions have to be made under uncertainty, log p(y*) is a more informative measure than MSE? It is widely known that different GP approximations perform very differently in terms of predictive density (e.g. FITC usually provides a conservative predictive density): perhaps the authors could provide a supplementary table with the log-density scores?

Queries.
--
To make the variational approximation tractable, you have to introduce uncertainties via the parameters beta. In practice, what did these converge to? Does this slight change of model have a strong impact?

The variational updates for the local models are independent, but the beta parameters are global: does this make the method computationally costly? Do you interlace fewer of these updates with the local updates?

Table 2: The SSGPR method was pre-trained with 200 features, but the LGR method was allowed to use around 500 local models. Would it be fair to say that both models are of the same complexity? Does the SSGPR method not do better with more than 200 features? I seem to recall that the method scales cubically in the number of features, so could you not have afforded a few more? In Table 3 the discrepancy is more severe, I guess due to the offline-online differences in the procedures.

Summary
--
A very well presented paper, enjoyable to read. I have a few technical questions, and my overall score may change depending on the authors' rebuttal.
Q2: Please summarize your review in 1-2 sentences
Great presentation, relevant topic, and good experiments, let down by the lack of probabilistic quantities in the results.

Submitted by Assigned_Reviewer_29

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes a method for non-parametric model learning from incremental data. The example motivation is learning manipulator models in robotics, where the models are often high-dimensional and highly non-linear. The approach is to apply Gaussian regression over a series of smaller patches (which are also learned by pruning and greedy addition). The key innovation over current techniques is that the proposed method produces a generative probabilistic model and requires little parameter tuning.

The paper presents convincing test data that their method achieves its goals.

As minor suggestions for improvement: the authors hint at other benefits of using a fully probabilistic/generative model; it would be nice to explicitly state why a fully probabilistic method is important from a controls perspective. Also, it seems that the forgetting and learning rates of LGR are tuning parameters. Is there any way to show or quantify how many fewer parameters the proposed method has, and/or how much less sensitive they are, compared to state-of-the-art techniques?
Q2: Please summarize your review in 1-2 sentences
This paper is well written and organized. The need for efficient non-parametric fitting of robotics data is convincing, as is the argument about less need for tuning.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Many thanks to the reviewers for their constructive comments.

reviewer_1
==========
Many thanks for your positive review. We see [16] ('Local Gaussian Process regression for real time') as connected to LWR in the sense that in [16] the linear models are swapped for GPs. In essence, this is also a bottom-up approach that trains each local model independently and only combines them at prediction time. Thus, on a global scale, this approach does not have a generative model, similarly to LWR/LWPR. We agree that an experimental comparison would be useful, and we would have incorporated one if a publicly available implementation existed.

reviewer_20
===========
Thank you very much for your thorough review.

We will try to improve the clarity of Section 4 by incorporating more details, potentially in an appendix/supplementary material to overcome space limitations.

Availability of implementation: We are currently cleaning up robust Python and C++ implementations, and are planning to release this code in time for NIPS, alongside the paper. This will be mentioned in the updated version.

Probabilistic nature, predictive log probability: The reviewer makes a good point about leveraging the probabilistic nature of LGR (or SSGPR). This aspect should certainly be evaluated in the future. Our paper does, however, use the less "direct" strengths of the probabilistic formulation: propagation of model flexibility through a hierarchical formulation, consistent parameter choices, and natural capacity control.

Change of model through new beta parameters: The model changes slightly in that, in addition to a global precision variable, we now have a precision per local model, allowing us to model differences in noise behaviour. This also adds some redundancy, and thus in practice we have found that LGR converges faster if one level of noise is fixed to a small value, allowing the other level to absorb all the noise. To keep maximal flexibility, we chose to keep beta_y fixed and only update the beta_fm in the presented experiments.

Convergence of beta parameters: On the SARCOS data the beta_fm converge to values between ~1000 and ~15000. Keep in mind that the final predictive variance is a sum of the variance components of all activated local models.
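
For illustration, the combination takes roughly the following form, assuming each local model exposes an activation (locality weight), a predictive mean, and a variance component; this is a simplified sketch, not our exact prediction equation:

    def predict(x, models):
        # Accumulate the contributions of all activated local models;
        # the predictive variance sums the variance components of the
        # activated models. The interfaces here are assumptions.
        mean, var = 0.0, 0.0
        for m in models:
            w = m.activation(x)        # locality weight of model m at x
            if w > 1e-6:               # only "activated" models contribute
                mean += w * m.mean(x)
                var += w ** 2 * m.variance(x)
        return mean, var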

Cost of beta updates: You are correct that the beta updates are global; however, they can be performed in O(MK), where M is the number of local models and K is the dimensionality of the local models. Thus they are not costly, and no interlacing was necessary in any of the experiments.
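
To sketch why this stays cheap: each beta_fm can be updated from statistics of its own local model alone, e.g. via a standard mean-field Gamma update of the following generic form (the Gamma(a0, b0) prior and all names are illustrative, not necessarily our exact update):

    def update_beta_fm(a0, b0, n_points, expected_sq_residuals):
        # Mean-field update for one Gamma-distributed precision beta_fm.
        # Given residual statistics maintained during the local updates,
        # each model's update is O(K), so all M models cost O(MK).
        a = a0 + 0.5 * n_points
        b = b0 + 0.5 * expected_sq_residuals
        return a / b  # expected precision E[beta_fm] = a / b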

SSGPR # of features: For the experiments on the SARCOS data (Table 2), we followed the experimental setup of [13], which used a maximum of 200 features (on the SARCOS data), with which performance on par with GPR was reached (Section 4.2.1 in [13]). We were able to reproduce these results and thus saw no reason to increase the number of features. However, we are happy to include experiments with more features.

On the KUKA data sets (Table 3), we tried doubling the number of features to 400 during the offline training phase, but instead of improving results it actually decreased the online performance, probably due to the nature of the experimental setup (the offline training data has only partial or almost no overlap with the online training data). We already mention this briefly in the paper, but can potentially include the actual additional results (depending on space constraints).

reviewer_29
===========
Many thanks for your great review.

You are right that the forgetting rate and learning rate are parameters that still require some manual tuning for optimal results (especially in terms of convergence speed). We agree that an evaluation of sensitivity to these parameters could prove useful, not only to compare with other methods but also to potentially arrive at guidelines for setting these parameters in LGR. We are hoping to include this form of analysis in an extended version of this paper.