NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3239
Title: Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning

Reviewer 1

- The paper presents a novel approach to obtaining attention at a given time by modeling the infinitesimal change of attention as a continuous function. The authors extend the idea proposed in "Neural Ordinary Differential Equations" and implement it in the MAC framework. They show that the proposed approach achieves comparable performance while taking one third of the steps.
- I enjoyed reading the paper and its detailed experimental analysis. The paper is clearly written, and the take-aways from the experiments are clearly outlined. I especially liked the qualitative analysis of interpretability, which discusses the chunkiness of attention maps, consistency across different seeds, and the dynamics of the interpolation between different attention steps.
- The paper lacks a discussion of the advantages of a continuous attention mechanism over discrete attention mechanisms such as Adaptive Computation Time [1]. Approaches like [1] have also shown reductions in computational complexity / attention steps while preserving performance.
- The paper also provides enough technical detail (in both the supplementary material and the main manuscript) for easy reproducibility. The provided code should also help in that respect.
- The proposed method should help in building computationally efficient algorithms for visual reasoning. Additionally, the proposed metric for measuring a model's focus drift may also be useful for reasoning about how the attention changes over time.

[1] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Update [Post Rebuttal]: Overall, I thought the paper was of high quality. I found the idea of using Neural ODEs in the MAC framework to model attention as a continuous system definitely interesting. After going through the discussion and the additional experiments in the rebuttal, I believe the authors have given further evidence of the significance of their work and the generalizability of the approach. Therefore, I am sticking to my original decision.

Reviewer 2

The authors attempt to imitate human cognition by allowing MAC attention to perform continuous reasoning. This is achieved by incorporating neural ODEs into the logit calculations at each time step. This is an original idea. I find the writing fascinating and the overall quality high. I was really excited to learn that the paper's inspiration is human cognition and continuous reasoning; taking inspiration from human behaviour is an important research goal. The issue with this work is its significance. The model and evaluation rely on a single model (MAC), and the method is not applied to other types of attention (e.g., glimpses, Transformer heads). This also limits the impact to a specific task (CLEVR), and the overall CLEVR accuracy stays the same. The paper also discusses a new evaluation metric (TLT), which measures the amount of attention shift. But this metric can only be computed for MAC-like models, which are not likely to remain state of the art.
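For concreteness, the continuous-reasoning idea the reviewer describes can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it uses a fixed-step Euler integrator and a hypothetical linear dynamics function `f` in place of a learned network, and evolves attention logits continuously between reasoning steps.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def euler_odeint(f, z0, t0, t1, n_steps=50):
    """Integrate dz/dt = f(z, t) from t0 to t1 with fixed-step Euler."""
    z, t = z0.copy(), t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        z = z + h * f(z, t)
        t += h
    return z

rng = np.random.default_rng(0)

# Hypothetical dynamics: in the paper a neural network would play this role.
W = rng.normal(scale=0.1, size=(5, 5))
f = lambda z, t: W @ z - z

z0 = rng.normal(size=5)                 # attention logits at reasoning step t=0
attn_t0 = softmax(z0)
z1 = euler_odeint(f, z0, 0.0, 1.0)      # evolve the logits to step t=1
attn_t1 = softmax(z1)                   # attention can be read out at any time t
```

Because the logits are defined for every intermediate time (e.g. `t=0.5` via `euler_odeint(f, z0, 0.0, 0.5)`), the attention trajectory between discrete reasoning steps becomes continuous and interpolable, which is what the qualitative interpolation analyses exploit.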

Reviewer 3

Post-rebuttal comments: After reading the other reviews, the authors' response, and the reviewer discussion, below are my thoughts:

(P1) The direction of the paper is definitely interesting.

(P2) However, I am not completely convinced that TLT is necessarily a good metric for interpretability.

(P3) Given that the benefits of the proposed approach are mostly in TLT, and given the above point, it is not clear that this leads to better interpretability.

On a side note, I would like to thank the authors for a strong rebuttal, including the experiments on GQA. However, concerns (P2) and (P3) still remain. Therefore, I am sticking to my score.

General comments:

(G1) The paragraph explaining the relation of TLT to Occam's Razor (L32-L41) sounds very philosophical, without supporting evidence (either through related work or studies). Further, the arguments for why TLT is a metric of interpretability are not convincing (L32-L41). No further empirical or human studies are performed to establish this clearly.

(G2) How does the system handle discontinuous attention where it makes sense? For instance, to answer the question "Are there an equal number of cubes and spheres?", a human would (i) find all the cubes (attention_cube), (ii) find all the spheres (attention_sphere), and then compare their counts. Intuitively, attention_cube and attention_sphere need to be discontinuous. From what I understand, the ODE solver discretizes this sudden shift as an interpolation between two time steps (t=1 and t=2). If this is true, isn't that in disagreement with the proposed idea of continuous attention change across steps?

(G3) A major concern is insufficient experimental validation. The proposed approach has performance similar to prior work, and most of the benefits are obtained on the TLT metric. Given the problems in (G1), the contributions do not feel sufficiently backed by empirical evidence.

(G4) Additionally, the paper does not contain experiments on any real visual question answering (VQA) datasets. Without these, it is unclear whether models with the proposed attention regularization across steps have benefits on real datasets (and therefore applications). This is another drawback of the current manuscript.

(G5) The manuscript does not discuss the additional cost of running the ODE solvers, including a run-time analysis and comparisons.

Typos: L44-45: The sentence contains a grammatical error that needs a fix.
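For reference, the TLT metric that both reviewers question measures how far the attention distribution travels over the reasoning steps. A minimal sketch, assuming TLT is computed as the cumulative distance between consecutive attention maps (the specific distance function is the paper's choice; L1 is used here for illustration):

```python
import numpy as np

def total_length_of_transition(attn_maps):
    """Sum of L1 distances between consecutive attention maps.

    attn_maps: array of shape (T, N) -- T reasoning steps over N attended
    items, each row a probability distribution.
    """
    diffs = np.diff(attn_maps, axis=0)        # (T-1, N) step-to-step changes
    return np.abs(diffs).sum(axis=1).sum()    # distance per step, summed

# A steady trajectory (one shift) vs. one that oscillates back and forth.
steady = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
oscillating = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(total_length_of_transition(steady))       # → 2.0
print(total_length_of_transition(oscillating))  # → 4.0
```

Under this reading, TLT penalizes attention that drifts back and forth rather than moving purposefully, which is why a lower TLT is argued to indicate cleaner, more interpretable reasoning; the reviewers' concern is that this link to interpretability is asserted rather than empirically established.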