NeurIPS 2019
Sun Dec 8th – Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID 3418: Neural Machine Translation with Soft Prototype

### Reviewer 1

The paper addresses the problem of lacking global context on the target side when generating output from left to right. It starts by pointing out that other methods which do decoding in more than one step (e.g., by first generating the sequence right to left) are computationally inefficient. The work proposes a soft prototype, which combines multiple potential outputs on the target side and is more efficient than previous work. The motivation is very clear and the paper is, for the most part, well written and well explained. The approach is not ground-breaking; however, it is a sensible approach for the problem considered and is more efficient than previous work. Larger MT quality improvements would have made a stronger case, but the method is clearly better than previous work and the improvements are stable across different settings.

Specific questions/comments:
- Section 3.2: how does this become non-autoregressive? By not modeling any dependency between the rows of G_y (i.e., between the g_x(i))? I did not find the discussion at the end of 3.1 very helpful for understanding 3.2; 3.2 seems the simpler formulation.
- Table 1: Does the number of parameters/inference time refer to the French or the German model? The different vocabulary sizes must lead to different numbers of parameters.
- Minor: I am not sure I fully understand the choice of the term "prototype" in this work; the term makes more sense when the data is retrieved (citations 4 and 5 in the paper).
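To illustrate my reading of the non-autoregressive formulation above: a minimal sketch (hypothetical names and shapes, not the authors' code), assuming each row of the soft prototype is an independent probability-weighted mixture of target embeddings, so no row depends on any other row:

```python
import numpy as np

def soft_prototype(probs, embeddings):
    """Hypothetical sketch of a soft prototype G_y.

    probs:      (T_y, V) per-position distributions over the target
                vocabulary, produced non-autoregressively, i.e. each
                row depends only on the source, never on other rows.
    embeddings: (V, d) target embedding matrix E.
    Returns:    (T_y, d) matrix whose row i is sum_w probs[i, w] * E[w].
    """
    return probs @ embeddings

rng = np.random.default_rng(0)
T_y, V, d = 5, 10, 4
logits = rng.normal(size=(T_y, V))
# Softmax per position; rows are independent of one another.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
E = rng.normal(size=(V, d))
G_y = soft_prototype(probs, E)
assert G_y.shape == (T_y, d)
```

Under this reading, "non-autoregressive" simply means all T_y rows can be computed in parallel, since no row conditions on another.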

### Reviewer 2

- The paper is based on a motivation similar to that of previous studies on NMT with prototypes. The actual method, which uses extra encoders for additional information, also resembles a standard multi-source NMT setup. Though the originality is limited, the overall construction of the model is sound. In the formulation of the method, the authors discuss the differences between the proposed method and existing studies.
- There is not enough analysis of the results. The "Case Study" section does not distinguish the claimed improvement from random changes.
- The method is simple enough to reproduce in any existing toolkit, and readers should be able to implement it from the paper (though there are some mistakes in the formulation).
- The translation quality improvement looks quite good, though the method is tested only on syntactically similar language pairs.

### Reviewer 3

The paper proposes to equip a neural machine translation system with a soft prototype in order to provide global information when generating the target sequence. The suggested approach shares similarities with multi-pass decoding strategies such as deliberation networks, with the difference that the prototype is not a hard sequence of tokens but a soft representation. To achieve fast inference and only a small increase in model parameters over the baseline system, the authors share parameters between the encoder network and the additional network used to encode the soft prototype. Experiments are conducted in three setups on the WMT En-De and En-Fr tasks: supervised, semi-supervised, and unsupervised. The proposed technique yields gains between 0.3 and 1.0 BLEU points over the corresponding baselines, depending on the setup, and is claimed to achieve new state-of-the-art results.

Overall, this is a good paper with nice results given the fairly small increase in model parameters, and I would like to see this paper at the conference. I think the exposition of the soft prototype could be improved by putting more emphasis on the actual model the authors finally choose to use. The initial formulation is quite general and fairly different from what is eventually used in the paper, and it would be interesting to see results on different network incarnations of the soft prototype framework. E.g., one natural question is how the model would fare if the network $Net$ did not share its parameters with the encoder. Beyond that I have only a few suggestions, comments, and questions, as stated below.

The claim that the suggested approach yields state-of-the-art results is not entirely true, as other techniques surpass the best BLEU scores reported in this paper. E.g., the publication "Pay Less Attention with Lightweight and Dynamic Convolutions" by F. Wu et al. reports a BLEU score of 29.7 on the WMT En-De translation task.

L260: The analysis of the first German sentence is not entirely correct: "sie" has no corresponding word on the source side. The phrase "how to" is the one that corresponds to "wie sie". A more literal source sentence for the generated German target would have been "how [they should] talk". The word "their" corresponds to "ihrem".

Editorial comments:
L275: replace "derivation" with "derivative".

-------

Thanks to the authors for sharing their responses. Since I considered this submission to be among the top 50% of accepted submissions, I stand by my original rating: this paper should be accepted for publication at NeurIPS 2019.
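To make the parameter-sharing question concrete, a minimal sketch (hypothetical names; a toy one-layer "encoder" standing in for the real network, not the authors' implementation), contrasting the shared variant with an unshared one that doubles the encoder parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# One weight matrix standing in for all encoder parameters.
W_shared = rng.normal(size=(d, d))

def encode(h, W):
    """Toy stand-in for an encoder: a single linear layer with tanh."""
    return np.tanh(h @ W)

x = rng.normal(size=(3, d))    # source-side representations
g_y = rng.normal(size=(5, d))  # soft-prototype rows

# Shared variant (my reading of the paper): the same parameters
# encode both the source and the prototype, so the prototype branch
# adds no encoder parameters.
h_x = encode(x, W_shared)
h_proto = encode(g_y, W_shared)

# Unshared variant (the open question): a separate matrix for the
# prototype branch, at the cost of extra parameters.
W_proto = rng.normal(size=(d, d))
h_proto_unshared = encode(g_y, W_proto)
```

The comparison worth reporting would be translation quality and parameter count of `h_proto` vs. `h_proto_unshared`-style models.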