Reviews: Adaptively Aligned Image Captioning via Adaptive Attention Time

- The paper combines Adaptive Computation Time (ACT) with multi-head attention to build an attention mechanism called Adaptive Attention Time (AAT). Although the two techniques have been well explored individually, this is the first work to combine them for attention in image captioning.
- The paper is clearly written and explains all the hyper-parameters it uses, which should make the results easier to reproduce.
- It is not clear what AAT contributes beyond multi-head attention. The base attention model already does much better than up-down attention and recent methods like GCN-LSTM, so it's not clear where the gains are coming from. It would be good to see AAT applied to traditional single-head attention instead of multi-head attention, to show convincingly that AAT helps.
- More analysis is required to explain the improvement from the recurrent attention model to the adaptive attention model. For instance, how does the number of attention steps vary with word position in the caption? Does this number change significantly after self-critical training?
- How much does the attention change over multiple attention steps for each word position? From the qualitative results in the supplementary material, it's not clear how the attention changes from one attention step to the next.
- Is self-critical training necessary to fully utilize the potential of AAT? The gains when training with cross-entropy loss alone are minimal. Even with self-critical training, the gains in other metrics (SPICE, METEOR) are minimal.

Minor Comments:
- In Line 32, the paper says that words at early decoding steps have little access to image information. Why is this the case for traditional models? Doesn't every time step have access to the same set of visual features?
- Are the ablations in Table 1 done on the same split as Table 2?
Update [Post Rebuttal]: The authors addressed several of my concerns through experiments and answered many of my questions in the rebuttal. They showed that AAT helps even with single-head attention; that self-critical training is required to fully exploit the potential of AAT; that fixing the number of attention steps introduces redundant or even misleading information, since not all words require visual cues; and that increasing the minimum number of steps reduces performance, supporting the claim that adaptive attention time works better than recurrent attention. In light of the rebuttal, I am increasing my rating to 7.
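The per-position analysis requested in this review (how the number of attention steps varies with word position) could be computed along the following lines. This is only a sketch: the function name and the input format (one list of per-word step counts per generated caption) are assumptions for illustration, not something the paper or the rebuttal provides.

```python
from collections import defaultdict


def mean_steps_by_position(step_counts_per_caption):
    """Average the number of attention steps taken at each word position.

    step_counts_per_caption: hypothetical input, a list of captions, where
    each caption is a list of integers (attention steps used for each word).
    Returns a dict mapping word position (0-based) to the mean step count.
    """
    totals = defaultdict(lambda: [0, 0])  # position -> [sum of steps, count]
    for caption_steps in step_counts_per_caption:
        for position, steps in enumerate(caption_steps):
            totals[position][0] += steps
            totals[position][1] += 1
    return {pos: s / c for pos, (s, c) in sorted(totals.items())}
```

Running this once on captions from the cross-entropy model and once on captions from the self-critical model would give the comparison the review asks about.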

The authors properly justify that taking several attention steps in the decoder is reasonable for image captioning, and that fixing the number of attention steps, as recurrent attention modules do, does not yield the best results. Their target task is image captioning. The model architecture they use for encoding the image is a standard Faster R-CNN pre-trained on ImageNet and Visual Genome. For the decoder they use an attention-based LSTM model. They augment the attention module by outputting a confidence score (through an MLP on the hidden state) at each step and halting the recurrent attention as soon as the confidence score drops below a threshold. They use a loss similar to that of ACT (Graves, 2016) to encourage the model toward fewer steps. In the end, by allowing their model to take between 0 and 4 attention steps, they obtain an average of 2.2 steps while getting better performance than a recurrent attention baseline with a fixed 2, 4, or 8 steps. Their ablation study is helpful, as it clarifies the effect of the loss (SCST vs. CE), the number of attention steps, and the lambda factor for the ACT loss.

Thank you for answering the comments. I still believe this is a grounded and strong work, and I will keep my score at 7.
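The halting idea summarized in this review can be illustrated with a minimal sketch of ACT-style halting as defined by Graves (2016): per-step halting probabilities are accumulated until they reach 1 - eps or a step budget runs out, and the ponder cost (number of steps plus the remainder) is what the extra loss term penalizes. This follows the original ACT formulation, not necessarily the exact rule used in the paper under review. The function name, the `max_steps` and `eps` defaults, and the use of raw logits as input are assumptions, and unlike the 0 to 4 range mentioned above, this simplified version always takes at least one step.

```python
import math


def act_halting(confidence_logits, max_steps=4, eps=0.01):
    """ACT-style halting over a sequence of per-step confidence logits.

    confidence_logits: hypothetical outputs of a confidence MLP, one per
    attention step. Returns (weights, ponder_cost), where weights[n] is the
    weight given to the output of attention step n and the weights sum to 1.
    """
    if not confidence_logits:
        raise ValueError("need at least one confidence logit")
    logits = list(confidence_logits)[:max_steps]
    weights = []
    cumulative = 0.0
    for n, logit in enumerate(logits):
        h = 1.0 / (1.0 + math.exp(-logit))  # halting probability for step n
        is_last = n == len(logits) - 1
        if cumulative + h >= 1.0 - eps or is_last:
            # Final step: assign the remainder so the weights sum to 1.
            weights.append(1.0 - cumulative)
            break
        weights.append(h)
        cumulative += h
    # Ponder cost = number of steps taken + remainder (Graves, 2016).
    ponder_cost = len(weights) + weights[-1]
    return weights, ponder_cost
```

In training, a term such as `lambda_ * ponder_cost` (with `lambda_` a hypothetical name for the lambda factor mentioned in the review) would be added to the captioning loss to encourage fewer steps.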

Paper ID:	4799
Title:	Adaptively Aligned Image Captioning via Adaptive Attention Time
