Paper ID: 1584
Title: Grammar as a Foreign Language
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors investigate parsing natural language sentences with recently proposed encoder-decoder architectures, which until now have mainly been used for machine translation. They obtained very impressive results and were able to beat the state of the art. This is mainly an application paper, and the model in the paper is not novel.

The objectives of the paper are clearly stated, with several pieces of empirical evidence supporting the authors' claims. The fact that they were able to get better results with the attention mechanism even by just training on WSJ is quite interesting, and it shows the efficiency of using attention over regular encoder-decoder architectures.

A few minor comments:
* The authors mention that they used 3 LSTM layers in the model, but it was not very clear to me whether they used 3 layers in total (if so, how many in the encoder and how many in the decoder) or whether the encoder and the decoder each had 3 layers.
* It would be interesting to see some results on CCGBank as well.
* I am curious whether a model trained with the objective of parsing a sentence can capture semantics as well. This might shed some interesting light on the syntax-semantics debate in generative semantics (the "Colorless green ideas sleep furiously" debate). You could evaluate the embeddings learned by this model.

Q2: Please summarize your review in 1-2 sentences
This is a very well-written paper with a vast amount of empirical experimentation.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper tackles (constituent) syntactic parsing by mapping this prediction problem to a sequence-to-sequence alignment problem, and then essentially applying a method recently developed in the context of neural machine translation (LSTM-encoder-decoder with an attention mechanism). The resulting parsing model achieves state-of-the-art results when used in the standard supervised set-up (PTB WSJ) and improves further when estimated in a semi-supervised / co-training regime.

What I find especially interesting in this paper is that the attention mechanism is crucial for attaining good generalization properties: without the attention mechanism, the LSTM achieves very poor results in the supervised setting. This is an interesting observation which may in principle generate future work focusing on refining the attention model (e.g., moving more in the direction of the Neural Turing Machines of Graves et al.).

It is also somewhat surprising that such a simple linearization strategy led to state-of-the-art performance. This is another direction that may be interesting to explore.
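
For readers unfamiliar with the linearization, the following is a minimal sketch (in Python, not the authors' code) of the kind of depth-first bracketing the paper describes: words are dropped, POS tags are normalized to XX, and non-terminal labels are attached to the brackets. The exact token format is an assumption based on the paper's examples.

    def linearize(tree):
        """`tree` is either a POS-tag string (pre-terminal) or a (label, children) pair."""
        if isinstance(tree, str):
            return ["XX"]                      # POS tags normalized away
        label, children = tree
        tokens = ["(" + label]
        for child in children:
            tokens += linearize(child)
        tokens.append(")" + label)             # label repeated on the closing bracket
        return tokens

    # "John has a dog ." with its gold constituency tree, leaves replaced by POS tags:
    example = ("S", [("NP", ["NNP"]),
                     ("VP", ["VBZ", ("NP", ["DT", "NN"])]),
                     "."])
    print(" ".join(linearize(example)))
    # -> (S (NP XX )NP (VP XX (NP XX XX )NP )VP XX )S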

Previous research into incremental parsing (psycho-linguistically inspired) may inform this work (e.g., predictive LR order). However, of course, it goes slightly against the declared agenda -- defining a general method ("keeping the framework task independent", l. 179, p.5).

The fact that LSTM is accurate (or at least not less accurate than LA PCFG) on long sentences is indeed surprising. Note though a related finding in Titov and Henderson (IWPT '07): they found that using recurrence (i.e. RNN-like modeling) resulted in substantial improvements specifically on longer dependencies when compared to a NN model which lacked recurrent connections (TF-NA). However, they also did not have a convincing explanation for this phenomenon.

Another relation to this previous work (as well as to the SSN work of Henderson) seems quite interesting: in the submission, the neural network essentially learns which previous hidden layers need to be connected to the current one. In this past work, alignments between hidden layers were defined by linguistic considerations (~ proximity in the partial syntactic tree).

So they essentially used a hand-crafted attention model.

I would appreciate a bit more analysis of the induced alignment model. So far only examples where the attention moves forward are considered; it would be interesting to see when jumps back are introduced (as at the very right in Figure 4). Also, it may be interesting to see the cases where the alignments were not peaked at a single input word (i.e. not a delta function).

Alignments in the NMT model of Bahdanau et al. seem a little less peaked; I am wondering why.

One difference I seem to notice between the attention-based NMT and the parsing model is that the alignments in the NMT work are computed based on hidden layers of RNNs running in *both* the forward and backward directions. I am wondering whether the authors could provide an intuition for why they decided to limit themselves to using only a forward RNN. It would seem that some kind of look-ahead should be useful.
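
For concreteness, here is a small sketch of what a bidirectional attention memory could look like, next to the forward-only memory the submission appears to use; the dot-product scorer and the shapes are illustrative assumptions, not the paper's architecture.

    import numpy as np

    # Illustrative shapes: T source positions, hidden size H.
    T, H = 6, 4
    h_fwd = np.random.randn(T, H)        # forward-RNN states (what the submission seems to use)
    h_bwd = np.random.randn(T, H)        # backward-RNN states (the extra look-ahead)
    memory = np.concatenate([h_fwd, h_bwd], axis=-1)   # (T, 2H) bidirectional attention memory

    d = np.random.randn(2 * H)           # current decoder state (illustrative)
    scores = memory @ d                  # a simple dot-product scorer (assumed)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # attention weights over source positions
    context = alpha @ memory             # context vector fed to the decoder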

Q2: Please summarize your review in 1-2 sentences
The paper shows how a recently developed neural machine translation method can be applied to syntactic parsing.

The results are very competitive and the analyses are quite interesting.


Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper casts parsing as a sequence-to-sequence problem over linearized trees, yielding an efficient, domain-agnostic model. They achieve strong results when they add attention and additionally use large automatically parsed corpora. Analysis and out-of-domain results are also presented.

Quality: The idea of a linearized-tree, sequence-to-sequence model for constituency parsing is interesting and achieves high efficiency and accuracy. The ablations and analysis are useful. The paper is well-written but is missing certain explanations and definitions useful for a general audience. Also, there are certain issues with the experimental setup that need clarification or extra experiments:

-- When adding unlabeled parsed sentences from the Web, the authors should explain how much overlap there is with the dev or test sentences, because then it is more a matter of the LSTM memorizing parser outputs during training and replicating them at test time (which is then more like replicating the output of a tri-trained parser).
-- The comparison doesn't really seem quite fair, because one needs to see how a standard parser (say BerkParser) does on tons of training data (even if it is much slower, which is still a win for this work), gold or tri-trained. The authors do train a BerkParser on more data for the out-of-domain experiments, so why not report that for Table 1 too? Especially because the BerkParser trained on more data does beat their work out-of-domain.
-- How were all the model decisions made, e.g., reversing, tag normalization, 510-dim word vectors? On the dev set? Please clarify.
-- Why is there an in-house implementation of the Berkeley Parser when it is publicly available? What are the differences, qualitatively and quantitatively?

Other suggestions or comments:
-- In Sec 3.2, explain 'the semi-supervised results' for a general audience. Similarly, explain why the non-terminal labels are added as subscripts to the parentheses.
-- More analysis is needed for the surprising result of not needing POS tags.
-- Since reversing the input sentence helps, did you try bidirectional encoding?

Clarity: The paper is mostly well-written and well-organized, though it is missing certain explanations and definitions useful for a general audience.

Originality and significance: The paper is original and has a useful impact in terms of casting parsing as an efficient, domain-agnostic sequence-to-sequence model over linearized trees. It achieves strong results and provides good analysis. Previous work such as linear-time incremental parsing and neural-network parsers is similar to this work, but this work achieves higher accuracy and efficiency.

Q2: Please summarize your review in 1-2 sentences
This paper casts parsing as a sequence-to-sequence problem over linearized trees, yielding an efficient, domain-agnostic model, and achieves strong results when attention and large automatically parsed corpora are added. However, there are certain issues with the experimental setup that need clarification or extra experiments (e.g., overlap between the unlabeled training data and the evaluation sets, comparison to BerkParser trained on more gold or tri-trained data, whether model decisions were made on dev or test, and the reason for the in-house reimplementation).

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a method to learn a syntactic constituency parser using an attention-enhanced sequence-to-sequence model. The paper claims that it obtains state-of-the-art results on WSJ and performs comparably to the Berkeley parser.

The problem of syntactic parsing is an important one in NLP, and it is a fundamental pre-processing step for more advanced tasks such as named entity extraction or relation extraction. Therefore, any improvement in syntactic parsing is likely to have a significant impact on the community. To this end, the application of LSTM-based sequence-to-sequence encoding-decoding is both timely and important.

Additional strong points of the proposed parser are its speed and domain adaptability.

However, I have several concerns that I wish the authors could clarify further, as discussed below.

The title of the paper and its relevance to the current work is somewhat unclear. I do understand that sequence-to-sequence mapping has been previously applied to machine translation, but what exactly do you mean by "grammar as a foreign language"? The relevance of the title to the proposed method is never explicitly discussed in the paper. I guess you wanted to hint that you could "learn" the grammar of a language using sequence-to-sequence models. But the evaluation tasks are all monolingual adaptation tasks and not cross-lingual parsing tasks.

There are several statements in the paper that require further clarification or analysis. For example, it is mentioned that replacing all POS tags by XX gave an F1 improvement of 1 point. Why is this? Removing POS information means you are losing many important features that have been found to be useful for syntactic parsing.

What is the significance of the constructed parser matching the Berkeley parser with 90.5 F1? Should the comparison not be done against an external test dataset rather than against the parser that generated the training data in the first place?

The degree of novelty w.r.t. machine learning or deep learning is minimal in this paper, considering that it uses a previously proposed LSTM-based sequence-to-sequence encoding method for parsing. On the other hand, the claims of the paper and the application task (parsing) are more relevant to the NLP community. The paper would be better appreciated and recognized at an NLP conference such as ACL or EMNLP than at NIPS. Therefore, it would be more suitable to submit this work to an NLP-related conference.

The problem of domain adaptation of parsers has been considered in the NLP community, and arguably the most famous benchmark for evaluating it is the SANCL-2012 "parsing the web" task (see Overview of the 2012 Shared Task on Parsing the Web by Petrov and McDonald, 2012). Unfortunately, this line of work is neither discussed nor evaluated against as a benchmark in the current paper, which makes it difficult to justify the claims about the domain adaptability of the proposed method.

The robustness of the parser is also a concern. It is mentioned that all 14 cases where the parser failed to produce a balanced tree are fragments or sentences not ending with proper punctuation. Unfortunately, the majority of text found in social media such as Twitter would fall into such cases. The paper does not discuss how this issue could be addressed (particularly given the claimed domain adaptability of the parser) in a more systematic manner than simply adding brackets to balance the tree.
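
As an illustration only, here is a hypothetical post-processing step of the kind the paper alludes to (the paper just says unbalanced outputs were fixed by adding brackets; dropping stray closing brackets and appending unlabeled closing brackets at the end is an assumption):

    def balance(tokens):
        """Drop stray closing brackets and close any brackets left open."""
        depth, fixed = 0, []
        for tok in tokens:
            if tok.startswith("("):
                depth += 1
            elif tok.startswith(")"):
                if depth == 0:
                    continue                  # stray closing bracket: drop it
                depth -= 1
            fixed.append(tok)
        fixed.extend(")" for _ in range(depth))   # close whatever is still open
        return fixed

    print(" ".join(balance("(S (NP XX )NP (VP XX".split())))
    # -> (S (NP XX )NP (VP XX ) )
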
Q2: Please summarize your review in 1-2 sentences
This paper proposes a method to learn a syntactic constituency parser using an attention-enhanced sequence-to-sequence model. The paper claims that it obtains state-of-the-art results on WSJ and performs comparably to the Berkeley parser.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper looks a bit like an application paper; perhaps more focus should be given to novelty and analysis. As it is, I feel that this would be a good workshop paper.

In particular, it is not quite clear what part of the model is novel and what is previous work. Given the previous work on attention with RNNs: is the attention part novel?

The title, 'grammar as a foreign language' seems confusing to me, since grammar is not a language, and I feel that the title doesn't describe the method well.

This LSTM uses forget gates, correct? Then Gers et al. (2000) should be cited, because forget gates were absent from the original LSTM. I also think that Pollack used RAAM networks to learn small parse trees in the early 1990s; this should be cited as well.

Q2: Please summarize your review in 1-2 sentences
Light review of Paper #1584: "Grammar as a Foreign Language"

This paper shows an interesting application of attention-based LSTMs to sentence parsing, using a dataset created by traditional parsers, and achieves state-of-the-art results on certain datasets.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their helpful and extensive comments, which we address below. We write BP for BerkeleyParser.

Reviewer 1
We want to thank this reviewer for pointing out interesting related work that we were not familiar with and for emphasizing the main points of our work.

As suggested, we will look for more interesting examples on how the attention mechanism operates. We may also publish raw attention masks and our parsed results on the development set.

Regarding the peakiness of attention: we see parsing as an almost deterministic function, whereas translation is more ambiguous. Indeed, we attain a low perplexity (~1) compared to ~5 for translation models. So the parsing model is more confident, which could explain the peakiness of the attention.

Regarding bidirectionality, the model has, in principle, access to the full context because we provide the hidden activations from the last timestep to the decoder. However, bidirectionality could help even more, and we plan to investigate it.

Reviewer 2
We are grateful to this reviewer for making a very good point regarding overlap between the extra data and the evaluation sentences. We agree that using extra unsupervised data increases the chance of such intersection. We did not investigate this in the current version of the paper. Given that we have only trained the model for 5 epochs on the 10M corpus, we doubt that intersecting sentences have a big effect. But for the final version we will re-train our model and make sure there are no duplicates between the training and evaluation sets.

Regarding the performance of BP with extra training data: it only increased BP's F1 score on the WSJ dev and test set insignificantly (by 0.1), but we will add this to Table 1. Note that even when trained on human-annotated data we achieve competitive results vs. BP. When training on more data, our model improves while BP seems to saturate.

As for the decisions about model parameters: all of them were made by measuring performance on the development set. The test set has been evaluated fewer than 5 times over months of research.

Regarding our in-house reimplementation of BP, this is necessary in order to integrate it better with our infrastructure and to be able to run large-scale distributed experiments. Its implementation follows closely the original work of Petrov. The only deviation is that it uses a feature-rich emission model (as in "Feature-Rich Log-Linear Lexical Model for PCFG-LA Grammars", Huang & Harper, IJCNLP 2011).

Minor suggestions: we did not try bidirectional encoders (see the reply to Reviewer 1). As for the reason why removing POS-tags gave an F1 improvement, it is indeed very interesting and we will discuss it more extensively in the final version.

Reviewer 3
(1) Regarding the title, we agree that it is a bit obscure. We are considering modifying the title or adding extra clarification in the text to explain what we meant -- that one could map a sentence to its grammar using the same techniques as mapping a sentence to its translation.

(2) As for the removal of POS tags, see the reply to Reviewer 2.

(3) "What is the significance that the constructed parser matches BerkeleyParser (BP) with 90.5 F1?" We are sorry for this misunderstanding, we should have formulated it more carefully. By "match" we mean that the final F1 scores of BP and our model trained on the same data are similar -- we do not compare the output of our model with the output of BP. We, indeed, use an external test dataset with gold labels (the standard WSJ test set) and for the 90.5 score we trained only on the standard WSJ train set.

(4) Regarding novelty, we feel that the data-efficiency observation with regard to the attention mechanism, and the fact that we can generate trees (it had been conjectured that recursive neural nets are necessary), are significant machine learning contributions.

(5) The reviewer mentions the SANCL-2012 task and says "this line of work is not discussed nor evaluated". This looks like an oversight: we did use the SANCL-2012 WEB task [8], as discussed in line 290 (we followed the more recent recipe in Weiss et al., ACL'15). Further, we also use QTB [9] in addition to WEB [8] and report the performance of our model (which beats all numbers from [8]). Good performance on these tasks gives us reason to claim domain adaptability and robustness. We will put more focus on this in the final version, and we encourage the reviewer to re-read this part (Page 6, Performance on other datasets).

Reviewer 5
(1) Regarding layers in the encoder / decoder: we used 3 layers in both the encoder and the decoder; we will clarify this in the final version.

(2) We did visualize the embeddings learned by this model (both word and sentence embeddings), but did not include this part due to space limitations. We may add it as an appendix in the final version.

Reviewer 6
Regarding novelty and the title, please see our response to Reviewer 3. As for the extra references, we will add them in the final version.