Paper ID: 43
Title: Expressing an Image Stream with a Sequence of Natural Sentences
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the problem of retrieving natural sentences that describe a given sequence of images and assembling them into a sequence of sentences. The problem is clearly significant for computer vision and pattern recognition, but it is closely related to generating multiple sentences that describe given video content (e.g., [Rohrbach+ 2014]): a single video can be regarded as a sequence of images, since video content can be segmented into shots and each shot can be well represented by a key frame. However, none of the work in this line of research is cited in the paper.

This paper is well written and easy to follow. The problem is well stated and solved with a set of solid techniques.

The proposed neural network model, named CRCN, learns the relationships between a sequence of natural sentences and a sequence of images. The architecture of the proposed model is technically sound, and it also captures discourse relationships well with the help of an entity-based coherence model.

The method for generating a sequence of natural sentences is reasonable but rather primitive: each natural sentence is directly associated with training images similar to a given query image, and the retrieved sentences are simply concatenated in the order of the corresponding images.

In Section 4.1, the authors state: "We reuse the blog data of Disneyland from the dataset of [11], and newly collect the data of NYC, using the same crawling method with [11]." That dataset has not been disclosed, and the corresponding paper does not describe how it was crawled. This strongly suggests that the authors of this paper are largely the same as those of [11], and thus the submission deliberately breaks anonymity.

Equation (2) might be incorrect, since it implies that s_t is derived from o_t alone, and thus the network is not fully connected.
Q2: Please summarize your review in 1-2 sentences
The proposed neural network model is technically sound, and it describes discourse relationships well with the help of entity-based coherence models. Meanwhile, the method for generating a sequence of sentences for a given image stream is rather ad hoc. I think this paper can be accepted as a poster paper as is.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper attacks the problem of describing a sequence of images from blog posts with a sequence of consistent sentences. To this end, the paper first retrieves, for each query image, the K=5 most similar images and their associated sentences from the training set. The main contribution of the paper lies in defining a way to select the most relevant sentences for the query image sequence so as to provide a coherent description. Each sentence is first embedded as a vector, and the sequence of sentences is modeled with a bidirectional LSTM. The output of the bidirectional LSTM is fed through a ReLU and a fully connected layer and then scored with a compatibility score between image and sentence. Additionally, a local coherence model [1] is included to enforce compatibility between sentences.
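For concreteness, the scoring pipeline as we understand it looks roughly as follows; this is a minimal PyTorch-style sketch in which the names, dimensions, and the dot-product compatibility score are our own illustrative assumptions, and the coherence input is omitted.

import torch
import torch.nn as nn

# Illustrative dimensions (not taken from the paper).
d_sent, d_img, d_hid = 300, 4096, 300

blstm = nn.LSTM(d_sent, d_hid, bidirectional=True, batch_first=True)
head = nn.Sequential(nn.ReLU(), nn.Linear(2 * d_hid, d_img))

def sequence_score(sent_embs, img_feats):
    # sent_embs: (1, N, d_sent) embeddings of one candidate sentence sequence
    # img_feats: (N, d_img) CNN features of the query image stream
    o, _ = blstm(sent_embs)        # (1, N, 2*d_hid) bidirectional LSTM outputs
    s = head(o.squeeze(0))         # (N, d_img) after ReLU and fully connected layer
    return (s * img_feats).sum()   # summed per-position image-sentence compatibility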

Strengths / positive aspects:
- The paper proposes a novel and effective architecture for retrieving coherent sentences for an image sequence.
- The paper provides an extensive quantitative, qualitative, and human evaluation, showing the superiority of the approach over several baselines that do not use coherence. It also provides an ablation experiment that removes the coherence model [1].
- The authors promise to release the source code and dataset.

Weaknesses / Questions / Unclarities:
1. Line 94: The paper claims that there is "no mechanism for the coherence between sentences" in [5]. Although it is not the contribution of [5], [5] predicts an intermediate semantic representation of videos that is coherent across sentences by modeling the topic of the multi-sentence description.
2. Correctness/clarity: Figure 2b does not seem to correspond to the description in Section 3.3 / Equation (2). While Figure 2b implies that the fully connected layers are connected to all sentences, this is not the case in Eq. (2), which implies that the parameters are shared across sentences but connected only to the vector representing a single sentence (the two readings are written out after this list).
3. A better metric for automatically evaluating the generated sentences would be Meteor (http://www.cs.cmu.edu/~alavie/METEOR/) instead of BLEU, especially when there is only a single reference sentence.
4. Why are the two linear functions in Eq. (2) (W_{f2}, W_{f1}) applied one after the other? Given that the composition of two linear functions is again a linear function, the benefit is unclear. An ablation study showing the benefit of these layers would be interesting.
5. Why are the same parameters of the fully connected layers used for the BRNN output (o_t) and the local coherence model q (Equation (2))?
6. Is the paragraph vector [16] fine-tuned or kept fixed?
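For concreteness on point 2, the two readings could be written as follows (our notation, not the paper's):

  s_t = W_{f2} W_{f1} o_t for each t        (Eq. (2): weights shared across t, each s_t sees only o_t)
  vec(S) = W vec(O), O = [o_1 | ... | o_N]  (Figure 2b: every s_t connected to every o_t')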

=== post rebuttal === After reading the rebuttal, I recommend the paper for acceptance. The authors successfully addressed the issues with the formulation, evaluation, and related work.

Please make the promised changes in the final version, and also clarify the following point there: 6. Is the paragraph vector [16] fine-tuned or kept fixed?
Q2: Please summarize your review in 1-2 sentences
The paper proposes an interesting new model to retrieve coherent sentences for an image stream, which is convincingly evaluated. However, several clarifications are needed for the paper to be fully convincing.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This work studies how to generate sentences from an image stream. It designs a coherent recurrent convolutional network (CRCN), which consists of convolutional neural networks, bidirectional recurrent neural networks, and an entity-based local coherence model.

Overall it is nice work, although it can be improved in the following aspects:
* Several related works on video-to-sentence generation are missing, e.g., "Jointly modeling deep video and compositional text to bridge vision and language in a unified framework."
* While the quantitative results of the proposed method look quite good, the user study in Table 2 shows that it performs similarly to one baseline, RCN. A significance test is needed to verify whether the improvement is reliable.
Q2: Please summarize your review in 1-2 sentences
Nice algorithm for sentence generation from an image stream.

The quantitative results of the algorithm look good, but the user study shows only a weak advantage over the baselines.

Author Feedback
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for acknowledging the novelty of our problem, the appeal of our formulation, and the convincing evaluation.

1. Retrieving instead of generating (R1,6)
Many previous sentence-generation methods first build a vocabulary dictionary, from which a sentence is created by sequentially retrieving the highest-scored words. Our approach works at one level higher granularity: we build a sentence dictionary, from which we compose a passage by retrieving sentence sequences. It is by this analogy that we use the term 'generate'; nevertheless, following the reviewers' suggestion, we will replace 'generate' with 'retrieve'.

2. Nonlinearity in fully connected layers (R1,3)
The reviews remark that the simple multiplication of the last two FC layers amounts to a single linear mapping. However, this is not the case during training, thanks to dropout: we assign dropout rates of 0.5 and 0.7 to the two layers. Empirically, this improves generalization considerably over a single FC layer with dropout.
During the rebuttal period, we also tested adding a ReLU to the FC layers, but one key side effect is that it severely suppresses the output of the coherence model (i.e., g in Eq. (2) quickly approaches 0). Thus the additional ReLU makes CRCN behave like RCN. On the NYC dataset, we obtain the following R@1, R@5, R@10, and MedRank values (a sketch of the stacked layers follows the table).
Model | R@1 | R@5 | R@10 | MedRank
RCN | 3.8 | 18.3 | 30.2 | 29.0
CRCN+ReLU | 6.9 | 20.7 | 33.3 | 28.0
CRCN | 11.7 | 31.2 | 43.6 | 14.0
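For reference, the stacked layers have the following form (a minimal PyTorch-style sketch with illustrative dimensions and dropout placement, not our exact code):

import torch.nn as nn

# Two stacked FC layers: at test time their composition is a single linear map,
# but during training the independent dropout masks (rates 0.5 and 0.7)
# regularize the factored map differently than one FC layer with dropout would.
fc_stack = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(300, 300),   # W_f1
    nn.Dropout(p=0.7),
    nn.Linear(300, 300),   # W_f2
)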

3. Evidence that CRCN is better with longer passage (R1,4)
Since the reviews' comment is very reasonable, we run new sets of AMT tests to check how the performance margins between RCN and CRCN vary according to the lengths of query image streams.
Pairwise preference of CRCN over RCN
5 images -> 8-10 images
NYC: 54.0% (81/150) -> 57.0% (131/230)
Disney: 56.0% (84/150) -> 60.1% (143/238)
This result supports our argument that as passages are longer, the coherence becomes more important, and thus CRCN's output is more preferred by general turkers.
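Regarding the requested significance test, a one-sided sign test can be run directly on these counts; the following is a minimal scipy sketch (our illustration, not part of the paper):

from scipy.stats import binomtest

# e.g., NYC with 8-10 images: CRCN preferred in 131 of 230 pairwise comparisons.
# Null hypothesis: no preference between CRCN and RCN (p = 0.5).
print(binomtest(131, n=230, p=0.5, alternative='greater').pvalue)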

4. Error in Eq.(2) (R2,3)
Thank you for the correction. Eq. (2) should read
[S | g] = W_{f2} W_{f1} [O | q],
where O = [o_1 | o_2 | ... | o_N] and S = [s_1 | s_2 | ... | s_N].
We use shared parameters for O and q because we want the retrieval output to blend the content flow captured by the BRNN with the coherence term. Our empirical results show that this joint learning outperforms learning the two terms with separate parameters.
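Concretely, the same two matrices act column-wise on [O | q]; the following is a schematic NumPy sketch with illustrative dimensions (not our actual code):

import numpy as np

d, h, N = 300, 300, 5                       # illustrative dimensions
W_f1, W_f2 = np.random.randn(h, d), np.random.randn(d, h)
O = np.random.randn(d, N)                   # BRNN outputs o_1..o_N as columns
q = np.random.randn(d, 1)                   # coherence descriptor

Sg = W_f2 @ (W_f1 @ np.concatenate([O, q], axis=1))
S, g = Sg[:, :N], Sg[:, N]                  # shared W_f1, W_f2 for every column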

5. Better metrics instead of BLEU (R1,3)
Following the reviewers' suggestion, we computed the CIDEr and Meteor metrics. We observe that the tendency of CIDEr and Meteor is similar to that of BLEU. Note that on the retrieval metrics (R@K, MedRank), CRCN significantly outperforms RCN. For the Disney dataset:
Baseline | CIDEr | Meteor
CNN+LSTM | 10.0 | 4.51
CNN+RNN | 0.4 | 1.34
MLBL-B | 19.7 | 8.03
GloMatch | 2.2 | 4.31
1NN | 19.5 | 7.46
RCN | 51.3 | 8.87
CRCN | 52.7 | 8.78

6. Referring to video-sentence work (R2,3,4)
Multiple reviews suggest comparing with video-sentence work (e.g., Rohrbach+ 2014, R. Xu+ 2015), which we will cite in the final draft.
Our key novelty is that we explicitly include a coherence model, which is more critical for image streams than for videos. Unlike videos, consecutive images in a stream may show sharp changes in visual content, which cause abrupt discontinuities between the contents of consecutive sentences. Thus a coherence model is needed more strongly to make the output passages fluent.

7. Anonymity (R2)
As R2 pointed out, the dataset of [11] is not publicly available, but the authors of [11] were open to sharing it upon request.

8. Why is CRCN better than RCN in retrieval? (R1,3)
CRCN is only slightly better than RCN on language metrics but significantly better on retrieval metrics. This means that RCN retrieves fairly good solutions but, compared to CRCN, is not good at ranking the single correct solution highly.
The coherence model creates descriptors that capture various aspects of coherence, and the interaction between these descriptors and the BRNN output is jointly learned via the two FC layers. As R1 mentioned, the test text streams to be ranked are already coherent in some sense (because they are passages written by humans), but our model does not simply sum the content and coherence terms; it learns the complex interactions between them, and this modeling power helps pinpoint the correct retrieval.

9. Difference from Karpathy's [9] (R1)
R1's comments are correct. In addition, we note three more differences. First, our method includes a coherence model for smooth sentence transitions. Second, the final retrieval output of [9] consists of existing sentences in the training set, whereas our output passages do not exist in the training set. Third, by the nature of our problem, the compatibility of the sequential ordering is more important than in [9], which parses an image into multiple semantic regions and measures their fitness to the words of a sentence under relatively free ordering.

All the other comments concern details of the algorithms and experiments, which will be resolved in the final draft.