Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
In general, I like the paper: it tackles a previously under-explored task and proposes a novel approach to it. The paper is well written and easy to follow; however, I suggest improving the exposition of "graph context encoding" by providing more detailed explanations or illustrations. The overall approach appears to be novel to me. The main idea is to train a generative model of pixel representations that is conditioned on text embeddings of seen and unseen classes. The authors then use this model to train a classifier for unseen classes. As an additional step, the authors demonstrate that a pseudo-labeling technique can be used to further improve performance. The experiments are conducted on two challenging datasets, and the quantitative and qualitative results appear to be rather convincing. I am mainly concerned with the choice of baselines. The authors implement only a single baseline. In theory, it may be sufficient, considering that zero-shot segmentation appears to be a novel task. However, I wonder whether other classical ZSL techniques can be trivially adapted for zero-shot segmentation and how they compare to the proposed technique.
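For illustration, the pseudo-labeling step mentioned above could look roughly like the following. This is a minimal sketch and not the authors' implementation: the function name `pseudo_label`, the `keep_fraction` parameter, and the per-class "keep the most confident pixels" rule are all assumptions made for the example.

```python
import numpy as np


def pseudo_label(probs, unseen_ids, keep_fraction=0.5, ignore_index=255):
    """Assign pseudo-labels to the most confident unseen-class pixels.

    probs: array of shape (num_pixels, num_classes) with per-pixel class
        probabilities from the current model.
    unseen_ids: iterable of class indices treated as unseen.
    keep_fraction: hypothetical hyperparameter, the fraction of pixels
        predicted as each unseen class that is kept as pseudo-labels.
    Returns an int array of shape (num_pixels,), where pixels that were not
    selected carry ignore_index.
    """
    probs = np.asarray(probs, dtype=float)
    pred = probs.argmax(axis=1)   # predicted class per pixel
    conf = probs.max(axis=1)      # confidence of that prediction
    labels = np.full(pred.shape, ignore_index, dtype=np.int64)
    for c in unseen_ids:
        idx = np.flatnonzero(pred == c)
        if idx.size == 0:
            continue
        # Keep at least one pixel per predicted unseen class.
        k = max(1, int(np.ceil(keep_fraction * idx.size)))
        top = idx[np.argsort(-conf[idx], kind="stable")[:k]]
        labels[top] = c
    return labels


example_probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
example_labels = pseudo_label(example_probs, unseen_ids=[1, 2])
```

The pixels that receive a pseudo-label would then be added to the training set for another round of classifier training, while pixels marked with `ignore_index` would be excluded from the loss.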
Originality: The problem tackled is itself original, well motivated (not artificial), and can open a new line of research that combines the best of semantic segmentation and zero-shot learning. The approach proposed is a straightforward adaptation of  to the proposed settings.

Quality: Given that the problem is new, there were no real competitors, so the authors came up with a sound baseline. The main approach described in the paper is reported to work better than this baseline on the Pascal VOC and Pascal-Context datasets. The choice of datasets is appropriate, although semantic segmentation papers tend to also report results on others such as COCO and Cityscapes.

Clarity: The paper reads well overall, and the figures (mainly Fig. 2) aid understanding. The problem is well motivated, and the paper organization is fine.

Significance: As mentioned above, other researchers may become interested in the proposed problem and try to improve on these results. In this sense the contribution of the paper is important. The contribution on zero-shot learning and self-training (ZS5Net) seems to miss that there is a subarea within ZSL, called transductive ZSL, that has studied this same problem before, with papers such as:
- Zero-Shot Recognition via Structured Prediction, Zhang & Saligrama, ECCV 2016
- Zero-Shot Classification with Discriminative Semantic Representation Learning, Ye & Guo, CVPR 2017
- Zero-Shot Recognition using Dual Visual-Semantic Mapping Paths, Li et al., CVPR 2017
- Zero-Shot Learning via Class-Conditioned Deep Generative Models, Wang et al., AAAI 2018
- Transductive Unbiased Embedding for Zero-Shot Learning, Song et al., CVPR 2018
Writing
----------
The paper is well written and well positioned with respect to previous work, overall. However, the main technical core of the paper is only about 1.5 pages long. It would have been better imho to have less detailed results tables and reuse that space for explaining the technique in greater detail.

Novelty
----------
While I agree that this paper is the first to address zero-shot semantic segmentation, there have been many papers on zero-shot image classification, and also some papers on zero-shot object class detection. Importantly, the technique proposed is a direct adaptation of , a previous technique for zero-shot image classification. The way the authors adapt it appears to be straightforward, also because the problem of semantic segmentation is (as usual) tackled as a pixel classification problem. So, instead of one class per image, you have one class per pixel, and the authors apply  essentially unchanged. The other claimed novelty points in my opinion do not add substantially to the paper: (1) the self-training strategy (lines 144-152) is a classic from the semi-supervised learning literature, and is described in 9 lines in this paper. Besides, it breaks the usual zero-shot learning setting, but that's not the main point. (2) the graph-context encoding in my opinion requires an unrealistic data setup to work: you need segmentation mask annotations for the unseen classes, but without the images themselves. I cannot imagine how this could ever happen. In fact, this corresponds to full supervision for the unseen classes too. With this in mind, the authors should then compare this module of their method to previous context modeling techniques from the fully supervised semantic segmentation literature.

Experiments
-----------------
The results appear to be good overall, with especially big gains brought by the zero-shot self-training.
There are a few methodologically subtle points though:
- lines 179-188: it's great that the authors carefully tease out which of their unseen classes do not appear in the ImageNet dataset used to pre-train their models. However, these are only two classes, making their experiments with 4, 6, 8, and 10 unseen classes less valid. Moreover, I believe the right protocol would have been to simply remove all of the unseen classes from ImageNet, and then use that ablated dataset for pretraining the models. That would lead to a fully clean protocol.
- as a baseline, the authors adapt a zero-shot classification approach  dating back to 2013. This does not seem satisfactory, given that there are many more recent zero-shot classification papers, as cited in lines 30-39. In fact, this paper itself is an adaptation of a recent zero-shot classification work  (2017). So there is a continuum of baselines possible, and 2013 really feels outdated.

Reaction to the rebuttal
==================
The rebuttal is very well written and reduces my concern about novelty a bit, as adapting  to the pixel level is indeed not that trivial. A second reply that helps upgrade my opinion is that the authors state that a recent study has shown that the DeViSE baseline  is anyway already very good. Moreover, they also add a more recent baseline . Perhaps the core question for accepting this paper is: why is the work of adapting the zero-shot classification baseline  considered 'a baseline', whereas adapting another zero-shot classification work  is considered a contribution worthy of NeurIPS? One possible answer is that they are both too simple to warrant acceptance. But another possible answer is that there is sufficient (marginal) novelty in either of them. The other replies are unconvincing: on retraining on a properly ablated ImageNet, the authors essentially say 'it's too much work' and that it would 'make training from scratch challenging'. But this is the core of the meaning of zero-shot learning!
If we say 'it's too hard to evaluate in zero-shot setting in practice', then we should not work in this field. We are trying to be scientific after all. Moreover, the answer on the realism of graph context is really not convincing: the authors just say 'it's one form of prior, it could be another one'. But the point is: THIS form of prior is unrealistic. You cannot have segmented object outlines but no image pixels.

In light of the overall novelty of the task itself, the good results (at least compared to some sort of baselines), and the rebuttal, I raised my rating to 5. It would not be so bad if this paper got in.