NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5260
Title: Learning to Learn By Self-Critique

Reviewer 1

This paper provides a learning-to-learn approach that takes advantage of information from the test set via a transductive learning setup. Although the earlier meta-learning literature evaluates few-shot generalisation in an implicitly transductive setting, this is the first approach that explicitly trains a network to learn from unlabelled test data. It conducts an extensive ablation study on the type of conditioning features for the learnt cost function, and shows SOTA performance on the Mini-Imagenet and Caltech-UCSD Birds 200 benchmarks.

The motivation and the algorithm of the paper are well explained. Figure 1 and Algorithm 1 are very useful for understanding the proposed algorithm.

What is not clear to me is how the information from the entire target set is used to predict individual test images. For every image, does the loss network compute features and adapt the model parameters based on the information from that image alone, or from all the images in the target set? I guess the latter setting may give better performance, since the information from other unlabelled images is useful for the image of interest, and the original MAML paper also used all the test images for transductive learning. In that case, it would be important to understand how the prediction accuracy depends on the size of the target set, and whether it generalises to a size different from the one used in meta-training.

In the experiment section, the performance of SCA on Low-End MAML is not studied for the CUB dataset. It would be nice to include these results to show the generality of the algorithm under different settings.

------- Update after feedback: Thanks for the clarification on the test batch size. It would be nice to clarify it in the final version, as I imagine the performance would depend heavily on the size of the test set.

Reviewer 2

Summary: This paper considers few-shot classification and seeks to make use of the unlabeled query data by training on it with a meta-learned critic loss. The algorithm builds on top of MAML and has two stages. In the first stage, the model is adapted via gradient descent on the labeled support set. In the second stage, the model is further adapted via a meta-learned critic loss that is a function of a featurization of the model parameters and the unlabeled query set.

Originality: The proposed approach strikes me as quite similar to One-Shot Imitation Learning by Domain-Adaptive Meta-Learning (Yu et al. 2018). In that paper, they similarly learn a critic loss, in their case to adapt an RL agent to a human demonstration in which ground truth actions are not available. This citation should be added to the paper. More distant but still quite related is Evolved Policy Gradients (Houthooft et al. 2018), which meta-learns a loss function for an RL agent.

Quality: A few clarifying questions: What is the point of Figure 2 (which shows the effect of the critic on output probabilities)? The caption is entirely descriptive of the plotted data. In no case presented does the critic actually change the top prediction! Are the network backbones the same across all comparisons in Table 2? This is very important to ensure a fair comparison (see Chen et al. 2019). In particular, SCA performs on par with LEO - are the backbones the same here? Incidentally, the idea of meta-learning a loss function is general and could be applied to meta-learning algorithms other than MAML; I don’t see a need to restrict it to gradient-based methods in line 122 and other places.

Clarity: The writing is clear for the most part (Sections 3 and 4 are a bit long-winded). Figure 1 is very helpful. It would be good to explain what you *expect* the critic loss to learn, for motivation.

Significance: low/medium - mainly due to concerns about the validity of the results, additionally due to the overlap of the idea with previously published works.

----------------- Post-rebuttal ------------------

Thank you for clarifying the comparisons in Table 2. I feel confident now that the most important comparison (with regular MAML) is correct. I also appreciate that while it’s good to contextualize the results with respect to non-MAML based methods, this is not critical to prove the point about transduction. I’m satisfied the authors did their best in this regard.

However, I would like to say that I strongly disagree with the statement, “It is fair to compare methods on the quote results on the same benchmarks.” Network backbones and training techniques improve over time, so it is unreasonable to compare directly with numbers reported in older few-shot papers that built their algorithms on top of what are *now* antiquated methods. It is important that as a community we do not waste time chasing incremental improvements that are revealed to be an illusion when an older method is ported to modern times.

I still feel that Figure 2 could be improved. Perhaps an analysis of how often the critic changes the predictions, or in what specific cases? “It doesn’t do nothing” is a pretty low bar for your method…

Thanks for agreeing to add the requested citations. I agree that your contribution is distinct. I do think that formulating transductive meta-learning more broadly, to include meta-learning approaches beyond gradient-based ones, would make the paper more impactful.
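
For concreteness, the two-stage adaptation described in the summary, together with the kind of "how often does the critic change the prediction" analysis requested for Figure 2, might be sketched as follows. This is only an illustration under stated assumptions and not the authors' implementation: the model is a toy linear softmax classifier, the meta-learned critic is replaced by a fixed stand-in (mean prediction entropy on the unlabeled query set), no outer meta-training loop is shown, and all names (`sgd_step`, `adapt`, `flip_rate`, the step counts and learning rate) are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # W is a list of per-class weight rows; x is a feature vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    z = logits(W, x)
    return max(range(len(z)), key=z.__getitem__)

def ce_grad(p, label):
    # Gradient of cross-entropy w.r.t. the logits: p - onehot(label).
    return [pc - (1.0 if c == label else 0.0) for c, pc in enumerate(p)]

def entropy_grad(p, _unused):
    # Gradient of the prediction entropy w.r.t. the logits.
    # ASSUMPTION: entropy is a hand-written stand-in for the learned critic.
    h = -sum(pc * math.log(pc) for pc in p if pc > 0.0)
    return [-pc * (math.log(pc) + h) if pc > 0.0 else 0.0 for pc in p]

def sgd_step(W, batch, dloss_dlogits, lr):
    # One gradient-descent step on the mean loss over `batch`,
    # where `batch` is a list of (x, extra) pairs.
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, extra in batch:
        dz = dloss_dlogits(softmax(logits(W, x)), extra)
        for c, g in enumerate(dz):
            for d, xd in enumerate(x):
                grad[c][d] += g * xd / len(batch)
    return [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad)]

def adapt(W, support, query_x, inner_steps=5, critic_steps=1, lr=0.5):
    # Stage 1: supervised adaptation on the labeled support set.
    for _ in range(inner_steps):
        W = sgd_step(W, support, ce_grad, lr)
    W_support = W
    # Stage 2: unsupervised adaptation on the whole unlabeled query set.
    unlabeled = [(x, None) for x in query_x]
    for _ in range(critic_steps):
        W = sgd_step(W, unlabeled, entropy_grad, lr)
    return W_support, W

def flip_rate(W_before, W_after, query_x):
    # Fraction of query points whose top-1 prediction changes in stage 2.
    changed = sum(predict(W_before, x) != predict(W_after, x) for x in query_x)
    return changed / len(query_x)

if __name__ == "__main__":
    support = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
    query_x = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    W0 = [[0.0, 0.0], [0.0, 0.0]]
    W_s, W_c = adapt(W0, support, query_x)
    print("flip rate:", flip_rate(W_s, W_c, query_x))
```

In this sketch the second stage uses all query points jointly, which is one of the two readings raised by Reviewer 1; restricting `query_x` to a single image would give the other. Reporting `flip_rate` over many tasks is one simple way to quantify how often the second stage alters the top prediction.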

Reviewer 3

Originality: To my knowledge this is the first transductive learning MAML paper. Quality: The results are compelling. Clarity: The paper is easy to understand. Significance: The result should be interesting to anyone who works on meta learning with neural networks.