NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID 3306: Compositional De-Attention Networks

### Reviewer 1

This paper introduces Compositional De-Attention (CoDA), a novel attention mechanism that learns to add, subtract, or nullify input vectors instead of simply computing a weighted sum of them. It does so by multiplying the outputs of a tanh function and a sigmoid function, yielding an attention mechanism biased towards -1, 0, and 1 rather than the softmax's bias towards 0 and 1. The authors describe CoDA as quasi-attention since it does not use the standard softmax function. Through extensive experiments they demonstrate the power of this novel attention mechanism and achieve state-of-the-art results on several datasets.

Originality: The idea of combining a sigmoid function and a tanh function in order to capture negative, zero, and positive correlations in attention mechanisms is quite novel; none of the existing works on attention mechanisms exploit it. The authors clearly differentiate their work from previous works, and related work is cited adequately.

Quality: Experimental results show that CoDA gives an edge over traditional attention mechanisms. The authors evaluate it on several tasks and datasets, as enumerated in the Contributions section, and indirectly achieve state of the art on many of these datasets. While they evaluate on a wide variety of tasks, the evaluation feels like a work in progress, since only a few datasets per task are covered. For instance, for open-domain QA they do not evaluate on the SQuAD dataset, which is commonly used for this task; by evaluating on it, they could also have compared against the BERT (Transformer) architecture, which they do not do on this task. In Section 4, the figures show that the learned attention gravitates towards 0 and small negative values and is quite similar to what the tanh function alone outputs. This raises the question of whether the sigmoid has any impact on the final outcome; an ablation study should be done on that.
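To make the concern concrete, the tanh-times-sigmoid scoring described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of a scaled dot product for the tanh branch and a negative L1 distance for the sigmoid branch is an assumption for the sketch, and zeroing the sigmoid branch (`ablate_gate=True`) mimics the tanh-only ablation suggested above.

```python
import numpy as np

def coda_scores(q, k, ablate_gate=False):
    """Quasi-attention scores in (-1, 1): a tanh branch supplies sign and
    magnitude, and a sigmoid branch gates relevance towards 0.
    Affinity choices here (scaled dot product, negative L1 distance)
    are illustrative assumptions, not the paper's exact formulation."""
    d = q.shape[-1]
    e = q @ k.T / np.sqrt(d)                             # pairwise affinities
    t = np.tanh(e)                                       # in (-1, 1)
    if ablate_gate:
        return t                                         # tanh-only ablation
    g = -np.abs(q[:, None, :] - k[None, :, :]).sum(-1)   # negative L1 distance
    s = 1.0 / (1.0 + np.exp(-g))                         # sigmoid gate in (0, 1)
    return t * s                                         # biased towards -1, 0, 1

def coda_attend(q, k, v, ablate_gate=False):
    # Values can be added (score near 1), subtracted (near -1),
    # or nullified (near 0), rather than convexly averaged as with softmax.
    return coda_scores(q, k, ablate_gate) @ v
```

Comparing `coda_attend(q, k, v)` against `coda_attend(q, k, v, ablate_gate=True)` on held-out data would directly answer whether the sigmoid gate matters.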
Finally, visualizing the learned attention weights on examples would have been nice for interpretability.

Clarity: The paper is well organized: sections follow the usual order of NeurIPS papers, and the proposed approach is well described. However, the authors did not correctly fill in the reproducibility form, answering No to all questions. A few aesthetic comments: use \tanh, \text{sigmoid}, and \text{softmax}. Equation 2: the usual notation for the L1-norm is a subscript 1 rather than $\ell_1$. Equation 10 and Line 146: use \left( and \right). Line 117: undefined section label. Line 237: did you mean {-1, 1}?

Significance: Notwithstanding my previous comments, the results obtained by the authors are really promising. I am sure this paper will motivate other researchers to follow their path and use other activation functions in attention mechanisms. The fact that merely changing the activation function can drastically improve the results might also create a subfield researching the effect of activation functions in attention mechanisms.

Post-rebuttal response: The authors responded well to my concerns.