NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6333
Title: Gradient based sample selection for online continual learning

Reviewer 1


This paper proposes an approach to optimally select samples for a small replay buffer to perform continual learning (CL) without forgetting. Like previous work (e.g. GEM/A-GEM), the problem is formulated from the perspective of constrained optimisation (minimise the loss on the current sample subject to the loss not increasing on previous ones). Unlike GEM, which assumes a clear separation and knowledge of tasks, this approach addresses the general non-stationary learning problem. The paper proposes a theoretical argument for using the variance of gradients to select samples for the buffer. One related work that could also be cited is "Adapting Auxiliary Losses Using Gradient Similarity" by Du et al., 2018 (https://arxiv.org/abs/1812.02224).

I think this is a novel idea that is well worth exploring and could be a nice publication, but I have a number of questions and concerns that I believe need to be addressed first.

On the technical side:
- What if no feasible region exists? E.g. for a binary classification problem, two [almost] identical samples with label noise would be enough for the feasible region C to be the empty set (as the gradients would be exactly opposed); a small sketch of this constraint check follows this review.
- One concern I have is with the successive approximations made. We want Ctilde to match C as closely as possible; this is then relaxed to Ctilde being as small as possible --> minimising the solid angle --> minimising the surrogate objective. Fig. 2 shows the accuracy of the last step (and in fact, how well does this work for a bigger range of angle/surrogate values?), but how does this translate to approximation error in the original objective? Which constraints are typically satisfied or violated (i.e. are there qualitative differences in the kinds of samples that are remembered and forgotten)?
- Given an impetus to minimise "size", how do we stop Ctilde from choosing a [close to] empty set (e.g. two opposing constraints from noisy data, as suggested above)?
- The iCaRL CIFAR-10 results reported here seem poorer than even the CIFAR-100 result (i.e. a much harder task) reported in the original paper -- where does this discrepancy arise?
- Can you quantify the computational improvement of GSS-greedy over the others? This is just stated in passing in the text in lines 220-222.
- A number of the reproducibility checklist options haven't been satisfied: things like error bars / uncertainty, hardware used for experiments, specification of the evaluation protocol, etc. Most of these are easy to address, and I think uncertainty over multiple runs is important (at the very least for external comparisons).
- Additional benchmarks would be nice in order to contextualise these results; currently the paper compares only against GEM and iCaRL.

On the writing side:
- The related work section is extremely brief, and the paper needs to be better positioned, first within CL, and then within constraint-based CL.
- The intro seems to be quite critical of prior-focused methods and needs more citations, and possibly toned-down language (some of it seems emotive, such as "enjoy the beauty of..." and "Prior-focused method has bad performances...").
- Figure 1 is not immediately clear; I would show the blue constraint on just the left-hand plot to indicate that it is removed on the right. (Showing on a plot something that is removed is a bit difficult to follow.)
- I'm still not clear on what iid online (as opposed to offline) refers to. Is it just training a single model on iid data for the same number of steps as the incremental setting? If so, why not start the curves at zero as well for consistency?
- While the intro motivates the approach broadly as relaxing i.i.d. assumptions within each task (lines 46-50), the only experiments performed are with a sequence of tasks, each shown i.i.d. As such, I think the motivation / claims need to be narrower, or additional non-stationary scenarios evaluated.

Other minor points:
- Avoid conversational or imprecise language, e.g. "Way smaller" on line 118 and "this exact" in line 129.
- There are also some minor grammar issues throughout (such as "bad performances" in line 52), so please double check for consistency.
- Line 151: "as shown in Suppl" doesn't make sense.
- Is "disjoint MNIST" not the same as Split MNIST? If so, I'd change it to be consistent.

POST-REBUTTAL: After looking at all of the reviews and the rebuttal, I don't think my position has really changed. Our concerns seem to be mainly that it would be nice to have more experiments on some other standard benchmarks, as well as some additional discussion and analysis of the high-level picture, particularly in terms of how the different steps and approximations behave and interact. I feel this is a nice idea, but there's more the authors could do to make it even more convincing. As such, I'm thinking of staying with my original score of 6.
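To make the feasibility concern above concrete, here is a minimal sketch (not the paper's code; array names are illustrative) of the GEM-style constraint check the review refers to: an update direction g is admissible only if <g, g_i> >= 0 for every stored gradient g_i, and two exactly opposed gradients leave no direction that strictly decreases both losses.

```python
# Hedged illustration of the GEM-style feasibility check; not the authors' implementation.
import numpy as np

def violated_constraints(g, buffer_grads, tol=0.0):
    """Indices of stored gradients g_i for which the constraint <g, g_i> >= tol fails."""
    dots = buffer_grads @ g                      # one inner product per stored sample
    return np.where(dots < tol)[0]

rng = np.random.default_rng(0)
g1 = rng.normal(size=5)                          # gradient of one noisy sample
buffer_grads = np.stack([g1, -g1])               # near-duplicate with flipped label: exactly opposed gradient
g = g1.copy()                                    # try to descend on the first sample only
print(violated_constraints(g, buffer_grads))     # prints [1]: the opposed constraint is violated
```

In this degenerate case the only directions satisfying both constraints are orthogonal to the two gradients, so neither stored sample's loss can be strictly decreased, which is exactly the situation the review asks the authors to discuss.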

Reviewer 2


**** I thank the authors for the response. My concerns have been partly addressed and thus I will increase my score. However, the submission would benefit from a round of clean-up: clearly state how each approximation step works and how well it works, clarify the relationship to existing methods, and add some existing benchmarks (if possible), as this is key for community adoption/acceptance. ****

Originality: The formulation of sample selection as a feasible-region selection seems sensible. The proposed IQP and greedy algorithms seem sound. However, it is unclear how well these actually work theoretically in high-dimensional space and when the buffer size is large. The interpretation of the proposed surrogate function as maximising the diversity of the samples in the replay buffer is interesting, but raises the question of whether there is some similarity between the proposed methods and existing coreset selection algorithms, for example the K-center algorithm in "Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance", or more recent coreset algorithms in "Bachem et al. Practical coreset constructions for machine learning" (a small sketch of this connection follows this review).

Quality: It is challenging to judge whether the proposed method actually performs well in a more realistic continual learning setting, due to the difference between the experiments considered and existing benchmarks. The number of random runs considered is also quite small, and the result tables do not include error bars (it would be better to convert these tables into figures, as it is much easier in general to understand the real difference between methods -- there is more than just one single bold number). It is not clear how well the empirical surrogate works in high dimensions, or how accurate the proposed greedy strategy is. It would also be interesting to have a more in-depth discussion/comparison with existing methods such as coverage maximisation, as briefly mentioned on line 72. The paper seems to criticise methods that combine episodic memory/experience replay with parameter-based regularisation (in the intro) -- but isn't the final algorithm considered in this paper a form of such an algorithm? [The line is pretty blurred, but the buffer can be used for objective/gradient regularisation or replay, as discussed in 3.6.] In general, I also feel there is not enough of a high-level picture of how the algorithms are derived and of how the different subsections fit together, which makes the paper hard to read. The claim in the conclusion, "efficient and constantly outperforming other selection strategies", seems very strong given the mixed performance of both IQP and greedy in many of the tasks considered.

Clarity: The writing is clear in general. However, some presentation choices in the experiments make things quite hard for me to read. For example, the tables are not in order, and the subheadings in Sections 4 and 4.4 are hard to read / look strange. In Figure 2, what is "200D log scale"? Line 129: what does "this exact in 2D with figure 3" mean?
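To illustrate the coreset connection raised above, the diversity reading of the surrogate is close in spirit to a farthest-point (K-center) greedy rule applied to per-sample gradients with a cosine metric. The sketch below is an assumed illustration of that rule, not the paper's GSS-IQP or GSS-Greedy algorithms; all names are illustrative.

```python
# Hedged sketch of a K-center-style greedy selection over gradient vectors with cosine similarity.
import numpy as np

def greedy_diverse_subset(grads, k, seed_index=0):
    """Greedily pick k rows of `grads` whose pairwise cosine similarity stays low."""
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)   # unit-normalise the gradients
    selected = [seed_index]
    while len(selected) < k:
        sims = g @ g[selected].T                  # cosine similarity to every chosen gradient
        closeness = sims.max(axis=1)              # similarity to the nearest chosen gradient
        closeness[selected] = np.inf              # never re-pick an already chosen index
        selected.append(int(closeness.argmin()))  # add the most dissimilar remaining sample
    return selected

grads = np.random.default_rng(1).normal(size=(100, 32))   # toy per-sample gradients
print(greedy_diverse_subset(grads, k=5))
```

A comparison along these lines would help position the proposed surrogate relative to the cited coreset literature.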

Reviewer 3


Originality: This paper proposes a new method for sample selection in continual learning, which is based on GEM and reformulates the problem from the perspective of constrained optimization. Compared with other related work, the experiments in this paper have been conducted in a different setting of continual learning, that is, the task boundary is unavailable and the model can only iterate on a mini-batch a few times. This setting is worth highlighting, as it is closer to the real world.

Quality: The paper is well written and easy to follow. In my understanding, the intuition is straightforward and the formulation is reasonable. In the experiments, the proposed methods have been compared with several different sample selection strategies as well as some existing methods (i.e. GEM and iCaRL) in terms of average accuracy. The evaluation would be more complete if Backward Transfer were also compared against GEM, since GEM has shown an advantage on that metric and this work is based on GEM (a small sketch of the metric follows this review).

Clarity: The problem setting and formulations are clear. However, there are still some issues I would like the authors to clarify a bit:
1. Is it possible to give a consistency analysis for the greedy method (Alg. 2)? As n increases to M, does it become equivalent to the IQP method?
2. What value of n did you use in the greedy method in your experiments? Is this value critical to the performance of the greedy method?
3. You used a batch size of 10 samples in all experiments, which is quite small; did you try other sizes? Would a larger size perform better or not?
4. The results were averaged over 3 runs, but what is the standard deviation? Does the greedy method have a larger variance than the other methods?
5. In Tab. 3, T4, the random selection has much higher accuracy than the others; do you have any insight into this?

Significance: The proposed methods show promising results compared with some state-of-the-art approaches, and the greedy one shows practical feasibility for large-scale buffers. The paradigm of continual learning used in this paper could be preferred in some real-world applications.

#############################
I generally feel no reason to change my score after reading the authors' response. The extra experimental results are quite interesting: the performance decreases as n increases. It would be great if the authors could provide some insights into this.
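For reference, Backward Transfer (BWT), as defined in the GEM paper, averages over the earlier tasks the change in accuracy between the moment each task was learned and the end of training. A small illustrative computation (not the authors' evaluation code; the toy matrix is made up) is sketched below.

```python
# Hedged sketch of the Backward Transfer metric from the GEM paper, with a toy accuracy matrix.
import numpy as np

def backward_transfer(R):
    """BWT = mean over earlier tasks i < T of R[T-1, i] - R[i, i]."""
    T = R.shape[0]
    return np.mean([R[T - 1, i] - R[i, i] for i in range(T - 1)])

# Toy 3-task accuracy matrix: R[j, i] = test accuracy on task i after training on task j.
R = np.array([[0.95, 0.10, 0.12],
              [0.80, 0.93, 0.15],
              [0.75, 0.85, 0.92]])
print(backward_transfer(R))   # negative values indicate forgetting of earlier tasks
```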