Paper ID: 604
Title: Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a recurrent convolutional neural network for semantic image segmentation that encodes and exploits contextual relationships. The method is essentially a combination of [13], where a very similar RCNN is used for object recognition, and [4], which inspired the multi-scale pipeline. Thus, technically, the paper is not very novel (Sec. 3.1 is much the same as in previous works - maybe that should be stated more clearly; the combination can of course be seen as a novelty). However, the paper is well executed, very easy and clear to read, largely well written, and provides a seemingly fair evaluation against the state of the art. Some recent works or interesting evaluations could be added, see below. Results are shown on the SIFT Flow and Stanford Background datasets, where the proposed technique outperforms the state of the art; all approaches use a limited amount of training data (limited by these datasets). Parameter ablation studies are also conducted to some extent. The method is very efficient.

Detailed comments:

- The related work section could be written more sharply. It is not always crystal clear what the differences are with respect to the closest related works.

- Results on PASCAL VOC could be added, as most recent segmentation works evaluate on that benchmark.

- l.276 is very vague and should be made clearer.

- Table 1: Why are only \gamma \in {0,1} considered? The paper could provide experiments/plots with varying gammas.

- page 7: The ordering of text/plots/tables should be changed to have less interleaving text/plots/tables.

- The model mentioned in l.387 should be added to the table, maybe with a footnote noting that it uses different training data, or set off by a line break.

- Some further qualitative analysis would be nice if it fits.

- These recent works seem to be missing (the first one provides even stronger results on Stanford Background):

Mohammadreza Mostajabi, Payman Yadollahpour, and Gregory Shakhnarovich. Feedforward Semantic Segmentation with Zoom-out Features. Toyota Technological Institute at Chicago.

Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional Random Fields as Recurrent Neural Networks. arXiv:1502.03240, 2015.
Q2: Please summarize your review in 1-2 sentences
The paper is well executed and provides a principled combination of two existing techniques. The results seem convincing, though some additional studies would benefit the paper (see the detailed comments above).

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a pixelwise classification system using recurrent convolutional connections.

The method employs a multiscale convolutional network similar to Farabet et al., but introduces shared weights and additive skip-connections between layers in the networks applied at each scale.

The model is evaluated on Stanford Background and SIFT Flow datasets.

This approach appears to achieve good performance for being trained from scratch, but I think several aspects could be better explored and evaluated.

* First, the effect of sharing weights between recurrent connections could be compared to the same network with different convolution kernels in each iteration; while this introduces more parameters, it might also stand to increase performance for about the same computational cost (i.e., same number of connections).

This was almost performed, but the CNN2 model has fewer total layers and a smaller receptive field (RF).

It is also possible the findings for varying gamma (enabling/disabling skip connections) might change under these conditions.

* Second, the effects of N and T could be further expanded upon.

From what I can tell, all the RCNN networks use T=3, except RCNN-large, which uses T=4.

But RCNN-large is evaluated only for N=3.

How does it perform at N=4 and N=5?

* In addition, in Table 2, RCNN has a reported performance of 83.5/35.8 PA/CA.

But in Table 1, RCNN-large has what looks like better performance at 83.4/38.9 PA/CA (last line).

Is there a reason the latter wasn't used in Table 2?

* The related work section states that [4] and similar networks rely on post-processing, e.g. superpixels, to ensure consistency among the labels of neighboring pixels.

This seems to imply the proposed model does not suffer from this problem, but this is not evaluated in the experiments or with qualitative examples.

No example predictions are shown, so it isn't really clear what the outputs look like qualitatively.

* There are also some other recent works in this area that I think could be discussed or compared with, particularly Chen et al., "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs", and Eigen & Fergus, "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture".
Q2: Please summarize your review in 1-2 sentences
This approach appears to achieve good performance for being trained from scratch, however I think some aspects could be better explored and evaluated.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors claim that the effective receptive field of each unit expands with larger T (Line 174 on Page 4). I didn't get this point: since the network's input is a still image, how can the receptive field of each unit expand with larger T? Can you clarify this?
Q2: Please summarize your review in 1-2 sentences
The paper proposes a recurrent convolutional layer for CNN to improve the performance of scene image parsing. The algorithm looks technically sound and the result looks good.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[This is a light review.]

- A specifically similar approach: Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Chang Huang, and Philip Torr, http://arxiv.org/abs/1502.03240.

- [15] should be included in the results table, maybe with a star to denote different/more training data.

Q2: Please summarize your review in 1-2 sentences
The paper adapts the recurrent neural network approach for object recognition [13] to scene labeling/semantic segmentation. The approach outperforms baselines but not the state of the art. The authors should relate their work more clearly, conceptually, quantitatively, and in terms of the difference in approach, to the semantic segmentation task, its datasets, and the corresponding approaches (see e.g. http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6 for a list of winning approaches).

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a method for scene labelling based on Recurrent Convolutional Neural Networks, where the output of a convolutional layer is used as an additional input of the same layer (this is implemented by duplicating the layer several times). The input to the network is the input image plus several downscaled versions of the input image, to better exploit contextual information.
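For concreteness, the recurrence can be sketched as follows. This is a minimal illustration only, assuming an update of the form x(t) = relu(ff + conv_rec(x(t-1))); the PyTorch code, class name, and 3x3 kernel sizes are my own assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): a recurrent convolutional layer
# (RCL) unrolled for T steps; the update rule and all names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCL(nn.Module):
    def __init__(self, channels, T=3):
        super().__init__()
        self.T = T
        # One feed-forward and one recurrent kernel; the recurrent kernel
        # is shared (tied) across all T iterations.
        self.conv_ff = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_rec = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, u):
        ff = self.conv_ff(u)       # static feed-forward drive
        x = F.relu(ff)             # t = 0: no recurrent input yet
        for _ in range(self.T):    # t = 1 .. T
            x = F.relu(ff + self.conv_rec(x))
        return x

# e.g. RCL(32)(torch.randn(1, 32, 16, 16)) keeps the input's spatial size.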

The approach is evaluated on two datasets and compared to previous methods. The accuracy improvement is good, and the computation time is impressive, as the algorithm can run entirely on the GPU.

I found the paper interesting, as it discusses very trendy issues, but I have two main concerns:

- The text is difficult to follow. For example, the use of the passive voice in the first sentence of the last paragraph of the introduction makes it ambiguous; it took me some time to understand (or guess) that the authors meant "we adopt a multi-scale version of RCNN".

- While the results are interesting, the contribution is limited, as it is only an application of RCNN (which was already applied to computer vision problems before) to image labeling.

More minor comments:

- Section 3: state explicitly that RCL stands for recurrent convolutional layer. The same problem occurs with LRN.

- Just before Eq. (3): it should be g(z_ijk), not sigma(z_ijk).

- Is it really worth discussing gamma? The results with gamma = 0 are not as good as with gamma = 1, which is the "standard" setting. This should not be surprising: if a small gamma were useful, the network could learn to use large values for w_k^rec.
Q2: Please summarize your review in 1-2 sentences
This paper presents a method for scene labelling based on Recurrent Convolutional Neural Networks. Results are interesting, but the text is difficult to follow and the contribution seems limited.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their recognition of our work and for their valuable suggestions. The missing references pointed out will be added, and the tables and text will be reorganized to make the presentation clearer. Below we address the comments in detail.

To Reviewer_1:
Thanks for your suggestion on the related work section. We will modify it to emphasize the differences with the closest related works.

The results on PASCAL VOC are not included because the state-of-the-art models on those datasets usually rely on ImageNet pre-trained models such as VGG net or GoogLeNet. But we agree that such results would be a good complement to the present experiments; if there is enough space, they will be added.

In l.276, we mean that data augmentation is not used in the model analysis but is used in the comparison with the state-of-the-art models, because those models also use data augmentation. We will clarify this point.

We did test some values of gamma between 0 and 1, and some of them led to slightly better results than gamma=1, but the difference was small. To save space, these results were not included; we will add them by reorganizing the presentation.

The model mentioned in L.387 will be added to Table 2.

Good suggestions for adding qualitative analysis. We will do that.

Thanks for pointing out the missing references.


To Reviewer_3:
We apologize for the ambiguities in the writing; we will do our best to improve the text and eliminate them.

The current work is a nontrivial application of RCNN. It is worthwhile to show that a model good at object recognition is also good at scene labeling, as the two tasks are different; that is also why many other deep learning models originally developed for other tasks have been adapted for scene labeling (e.g., [4, 15, 19]). Two advantages of RCNN for scene labeling are its simplicity and efficiency, which we believe are valued by practitioners in this field.

Thanks for pointing out the undefined abbreviations RCL and LRN and the typo just before Eq. (3). They will be corrected.

By comparing different values of gamma, we intend to emphasize the effect of the multi-path structure: when gamma=0, there is only one path from input to output.
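Schematically, this can be sketched as follows. This is an illustration only, assuming gamma scales the feed-forward input re-injected at iterations t > 0; the function name and exact form are simplifications, not our actual implementation.

# Schematic illustration of the gamma comparison; the placement of gamma
# and all names here are simplifying assumptions.
import torch.nn.functional as F

def unfold_rcl(conv_ff, conv_rec, u, T=3, gamma=1.0):
    ff = conv_ff(u)
    x = F.relu(ff)                 # t = 0
    for _ in range(T):
        # gamma = 1: ff re-enters at every step, so the unfolded network
        # contains paths of depth 1 .. T+1 from input to output.
        # gamma = 0: the input enters only at t = 0 and flows through one
        # chain of T recurrent convolutions, i.e., a single path.
        x = F.relu(gamma * ff + conv_rec(x))
    return x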


To Reviewer_4:
Thanks for acknowledging our contributions.


To Reviewer_6:
Yes, the input is a still image; the dynamics come from the recurrent iterations. The RCLs are unfolded through time for T steps. In each step, each unit receives recurrent input from its neighboring units, and this input carries information from a larger context in the layer below. In other words, with each step, a larger area (receptive field, RF) of the layer below modulates the activity of each unit in the current layer.
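As a toy calculation (the 3x3 kernel sizes below are illustrative assumptions, not necessarily the exact values in the paper): each iteration applies one more recurrent convolution to the state, so the effective RF on the layer below grows by k_rec - 1 per iteration.

# Toy effective-receptive-field calculation for a single RCL; the kernel
# sizes are assumed for illustration.
k_ff, k_rec = 3, 3
for T in range(5):
    rf = k_ff + T * (k_rec - 1)   # one extra recurrent conv per iteration
    print(f"T={T}: effective RF = {rf}x{rf}")
# Prints 3x3, 5x5, 7x7, 9x9, 11x11 for T = 0..4.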


To Reviewer_7:
Thanks for indicating the webpage and the reference. Our reply to Reviewer_1 explains why the VOC results were not included. We will modify the paper to clarify the differences between RCNN and other approaches, and will add more experimental results if space permits.


To Reviewer_8:
Thanks for the suggestion to compare shared and unshared weights; this is indeed useful for illustrating the effect of weight sharing. Our results showed that changing the shared weights to unshared weights made the score drop by about 1%, which may be due to over-fitting caused by the additional parameters. We will add these results to the experiments section.
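For reference, the unshared variant can be sketched as follows (again an illustrative sketch with assumed 3x3 kernels and made-up names, not our actual code). It keeps the same connectivity and compute as the tied version but has T times more recurrent parameters.

# Illustrative sketch of the unshared (untied) variant: one recurrent
# kernel per iteration; same connectivity and compute as the tied RCL,
# but T times more recurrent parameters, hence the over-fitting risk.
import torch.nn as nn
import torch.nn.functional as F

class UntiedRCL(nn.Module):
    def __init__(self, channels, T=3):
        super().__init__()
        self.conv_ff = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_rec = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(T)
        )

    def forward(self, u):
        ff = self.conv_ff(u)
        x = F.relu(ff)
        for conv_t in self.conv_rec:   # a distinct kernel at each iteration
            x = F.relu(ff + conv_t(x))
        return x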

Yes, the findings about gamma may change under different conditions, e.g., when the model is smaller. We'll try different settings.

More iterations lead to larger effective RFs for the units. RCNN-large contains N=4 RCLs, and the effective RFs of its top units already cover the entire image with T=3 iterations, so further increasing the number of iterations will not make their effective RFs cover more of the image. RCNN, however, contains only N=2 RCLs, so we can use more iterations. We apologize for a mistake: the N's in Table 1 should be T's; we suspect this typo caused the confusion.

Yes, RCNN-large did perform better than RCNN. But to emphasize that RCNN can achieve good results with relatively few parameters, we did not compare RCNN-large with existing methods in Table 2, since it has more parameters (though still fewer than some existing models). It was only used to show that with more parameters (but not more data) the accuracy of RCNN can approach that of a model pre-trained on ImageNet (see l.387).

Good suggestions for adding qualitative analysis. We will do that.

Thanks for pointing out the related references. We will add discussion of and comparisons with them.