Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality: I think this is very original, and an elegant way of adding the known physical process of image formation to the deep learning setup that GQN proposes. This is a natural yet original extension.

Quality: This submission is high quality. The problem is well presented and explained, the link to epipolar geometry feels very smooth, and the way it is incorporated in the system is clean. This work is self-contained and well presented.

Clarity: The paper is well written and all the concepts are very clear; I have nothing negative to report on clarity.

Significance: This is quite significant, as it proposes a tool that incorporates epipolar geometry into a deep learning architecture and is applicable beyond the scope of GQN. If the camera intrinsics and extrinsics are known in a multi-view problem such as the one presented in this paper, I do not see a reason why this technique should not be used.
Epipolar geometry describes the relationship between views of the same scene captured by a pinhole camera. The authors refer to the classic reference, Hartley & Zisserman's Multiple View Geometry book, for the details, but clearly explain the fundamental properties they leverage in the proposed scene encoder design. Namely, given two calibrated views of the same scene, for each point in one view, if a match can be found in the other view, it will be located along a line whose equation is a relatively simple function of the relative camera poses. That is, if the relative pose of two cameras is known, the search for a matching point between two frames is a 1D search problem. The authors exploit this fact to reduce the amount of data that a GQN scene encoder needs to ignore/discard when merging information from multiple views into a sensible scene representation, in two steps. After the context frames are encoded, given the query camera:

* epipolar lines are extracted and stacked;
* search is implemented as an attention mechanism.

With this architecture the encoder does not have to learn to approximate multi-view geometry, and can instead use its capacity to solve a simpler matching problem and to fuse information into a more coherent representation. The architecture is described in full detail, and even without source code I am confident the new model could be reimplemented fairly easily (by comparison, reimplementing Conv-DRAW would be much harder). The authors show that EGQN consistently outperforms vanilla GQN on a subset of the original benchmarks, and that the new architecture lets them scale the model to capture more complex scene geometries. To this end, the authors introduce three novel datasets and commit to making them available (although the website they point to for extra results is not operational). One aspect of the experimental setup is disappointing: the Maze dataset (available alongside the other datasets) is not used for comparisons.
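As an aside, the 1D-search property the paper relies on can be sketched in a few lines of numpy. All numeric values below (intrinsics, pose, 3D point) are illustrative, not taken from the paper:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F = K2^-T [t]_x R K1^-1 (see Hartley & Zisserman, ch. 9)."""
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)

# Illustrative shared intrinsics and a pure sideways translation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

F = fundamental_matrix(K, K, R, t)

# A 3D point visible in both views, projected into each image.
X = np.array([0.2, -0.1, 2.0])
x1 = K @ X
x1 /= x1[2]                      # homogeneous pixel in view 1
x2 = K @ (R @ X + t)
x2 /= x2[2]                      # homogeneous pixel in view 2

# The match in view 2 lies on the epipolar line l' = F x1,
# so the residual x2 . l' is (numerically) zero: a 2D search
# for x2 collapses to a 1D search along l'.
line = F @ x1
residual = abs(float(line @ x2))
print(residual)
```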
I would have expected this setup to be particularly interesting, and a more difficult one for EGQN, as many of the frames are collected by RL agents walking down corridors. Is the peculiar relative pose of forward motion a corner case that perhaps cannot be handled by the model (or by this specific implementation)? There is one detail of the presentation that I found confusing: the abstract at line 7 reads 'requiring only O(n) comparisons per spatial dimension instead of O(n^2)'. The statement is reiterated and somewhat clarified at line 34, but it is not discussed further in the text. While I do have an intuition of what the authors mean, I think it would be helpful to clarify this point explicitly in the discussion of the algorithm.
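For concreteness, here is my reading of that complexity claim as a counting sketch; the counts are my own interpretation of the abstract, not numbers from the paper:

```python
# For an n x n feature map, count pairwise comparisons made by attention.
# Interpretation (mine, not the authors'): dense cross-attention lets each
# of the n*n query pixels compare against all n*n key pixels, i.e. O(n^2)
# comparisons per query pixel; epipolar attention restricts each query to
# the ~n pixels on its epipolar line, i.e. O(n) comparisons per query pixel.

def full_attention_comparisons(n):
    # every query pixel attends to every key pixel in the other view
    return (n * n) * (n * n)

def epipolar_attention_comparisons(n):
    # every query pixel attends only to the ~n pixels on its epipolar line
    return (n * n) * n

for n in (8, 16, 32):
    print(n, full_attention_comparisons(n), epipolar_attention_comparisons(n))
```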
The idea is original: the paper aims to improve on the simple element-wise sum of context representations used in GQN. The key idea is that there should be consistency between the viewpoints, i.e. for a given pixel we can cast a ray from the pinhole camera; from a different viewpoint, this pixel's colour should then lie somewhere along the cast ray (though it can be occluded). The output of the model should then depend only on the relevant features, i.e. for a given pixel, only on the rays that pass through it.

The overall network and attention mechanism (Fig. 2 and Fig. 3b) are complex and include some extra, unexplained engineering. I did not properly understand the approach from the descriptions. For example, is e^k aggregated over all observations or kept separate, and is the attention mechanism looking at all of them together? Where do the z_i variables come from? They seem not to be specified. In Fig. 2 it is unclear which operations are done per observation and which across all observations. In Fig. 3b, what is d, what is its value, and why? Why does the dimension of e^k involve h' twice but w' only once, when the height and width of an image should be treated symmetrically?

Regarding the results, the EGQN results on the new datasets are much better than GQN's. This is especially visible in Fig. 6, where only EGQN gets the colours and letters on the box correct. Similarly, EGQN performs well on disco humanoid, where GQN performs poorly. The original datasets are very simple and there the results are similar; only sm7 is more challenging among the original ones, and there the performance is again much better. In Fig. 8 the ground truth is the same as the last column of the context while the predictions are different; it seems the GT is incorrect. It would be good to explain each variable in the caption of each figure, not only in the main text.

The evaluation is only one table with 2 rows. It would be good to show other measures, and for this table the presentation could be richer, e.g. violin plots with the std/median depicted too.
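For what it's worth, my understanding of the attention step being questioned, as a toy numpy sketch. The shapes, the name d for the number of samples along the line, and the fusion step are my guesses for illustration, not the paper's exact definitions:

```python
import numpy as np

# Toy sketch: for one query-pixel feature, softmax-attend over the d
# feature vectors sampled along its epipolar line in a context view,
# producing a single fused feature. Illustrative shapes only.

rng = np.random.default_rng(0)
c = 16          # channels per feature
d = 12          # samples along the epipolar line (hypothetical value)

query = rng.standard_normal(c)            # feature at the query pixel
line_feats = rng.standard_normal((d, c))  # features along the epipolar line

scores = line_feats @ query / np.sqrt(c)  # scaled dot-product scores, (d,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                  # softmax over line positions

attended = weights @ line_feats           # (c,) fused feature for this pixel
print(attended.shape, weights.sum())
```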
Surprisingly, example images are not given for all datasets, at least not in the supplement. Fig. 4: what does the +1.254 on top of three of the plots mean? The minimum y-value in Fig. 4 also looks suspicious: why is it constant across four of the datasets and then different for the other three? The method is interesting overall, but the evaluation seems not to be adequately explored, visualised, and presented, or in places even provided. There are some issues raised here that will hopefully be addressed in the rebuttal.