NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 53
Title: Unsupervised learning of object structure and dynamics from videos

Reviewer 1

To my knowledge, performing video frame prediction from a sparse keypoint representation is novel. However, in my opinion, the technical novelty of the paper is somewhat thin. The unsupervised keypoint detector proposed in [12] is reused with few modifications, and the same can be said of the VRNN model. The differences that I can see are the temporal separation loss and the keypoint sparsity loss used in training the keypoint detector. However, without enforcing temporal consistency and matching (e.g. optical flow or tracking), the keypoints can "jump" around between frames. How do you deal with this problem? Second, choosing a large k (sparsity loss) at the beginning will interfere with the temporal loss. Did you observe any issues in the training process? Finally, what is the size of the feature vector in CNN-VRNN? My assumption is that the size of the feature vector in CNN-VRNN would be smaller than a K x 3 vector of keypoints, thus the performance gains could come from an increase in information being stored in the feature vector. Other than that, I think the authors did a great job of evaluating their method through detailed experiments. I have no comments about the quality of writing and presentation, although I would prefer a stronger line stroke for the proposed method to emphasize the results in the plots (Figures 3, 4, and 7), and box plots for Figures 3 and 8. Figure 3 (bottom left) has a small text (XID: 5985036) in the plot.
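
As a minimal sketch of how the keypoint "jumping" raised above could be quantified, the snippet below measures per-keypoint frame-to-frame displacement. It is not the authors' method. It assumes keypoints are stored as a (T, K, 3) array of (x, y, intensity) with coordinates normalized to roughly [-1, 1]; the function name `keypoint_jump_stats` and the `jump_threshold` value are hypothetical.

```python
import numpy as np

def keypoint_jump_stats(keypoints, jump_threshold=0.5):
    """Summarize frame-to-frame motion of each keypoint.

    keypoints: array of shape (T, K, 3) holding (x, y, intensity) per
    keypoint and frame (assumed layout, not taken from the paper).
    jump_threshold: displacement above which a step counts as a "jump"
    (hypothetical value, in the same units as x and y).
    """
    keypoints = np.asarray(keypoints, dtype=float)
    xy = keypoints[..., :2]                               # drop intensity
    # Euclidean displacement between consecutive frames: shape (T-1, K)
    step = np.linalg.norm(np.diff(xy, axis=0), axis=-1)
    return {
        "mean_step": step.mean(axis=0),                   # per keypoint
        "max_step": step.max(axis=0),                     # per keypoint
        "jump_fraction": (step > jump_threshold).mean(axis=0),
    }
```

A keypoint that tracks an object smoothly should show a small `max_step` and a `jump_fraction` near zero, while one that swaps between objects would show occasional large steps.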

Reviewer 2

In this paper the authors propose a novel deep architecture for predicting future video frames. It builds on a variational recurrent neural network and combines several novel features, including a keypoint representation and a best-of-many prediction scheme. The authors validate it on several datasets, showing improvements over competing approaches. Overall this is a reasonably good paper; most of my comments are relatively minor.
- It is hard to see how the keypoint construction approach actually imposes spatial structure. My feeling is that this should be very sensitive to initialization, but this seems not to be the case. Can the authors explain why?
- Some papers (e.g. ref [27]) suggest that adversarial training may improve prediction. Why do the authors not compare to that?
- Line 124: can the authors provide more intuition about why this should be the case? Did the authors also try linking the two modules?

Reviewer 3

Originality: The main contribution of the paper is a structured representation for video prediction models based on extracting keypoints from images. Models that extract keypoints from images have been proposed before, and here the authors extend those ideas to video. The paper also includes experiments that empirically analyze this representation, which is often lacking in other video prediction papers, despite the fact that learning representations is one of the main motivations for video prediction.
Clarity: The paper is well organized and clearly written.
Quality and significance: The experiments are sound and properly assess some of the points made by the authors. I believe there are some issues/typos in the model formulation. The likelihood term in equation 2 should not include x_t in the condition (otherwise p(x_t|x_t, other RVs) is trivial). The same applies to equation 4. I also think that the claims that this model avoids compounding errors and that it has efficient sampling are a bit misleading and not properly supported. As currently described, they are not particular to using a keypoint representation. Instead, they result from the decoupled training of the encoder-decoder and the VRNN dynamics model, which could also be done with unstructured frame representations. I also think that there should be an ablation of the different extra losses in the main paper. The experiments show an improvement using the 'best of many' technique. In the supplementary material, figure S1 shows a slight improvement for downstream tasks using the different losses. However, this should be quantified for video prediction too. Also note that some of the extra losses (sparsity of the representation) could be adapted to unstructured representations, and I wonder how this would change the results for downstream tasks. As for the experiments and their results, I think some of the conclusions drawn by the authors are a bit unjustified.
For video prediction (figure 3), the model has a clear advantage in FVD, but not so much in terms of VGG cosine similarity, where the model's performance seems to have a lot of variance (best in terms of the closest sample but worse than the baselines in terms of the furthest sample). Note that at sampling time, without ground truth, such high variance could mean that some samples are very poor. While the authors argue that this is a sign of increased diversity, this is neither trivial nor properly supported. In practice it means that some of the samples generated by the model deviate quite a lot from the ground truth. Furthermore, the authors do not compare to other contemporary state-of-the-art methods such as SAVP [13], which obtains significantly better FVD scores than the SVG baseline. It would be interesting to include such a comparison. In general, I believe the above issues do not significantly alter the conclusions from the experiments, and therefore I am in favor of accepting the paper for its novelty and positive results.
Minor notes: Figure 7, graph 1 has wrong x/y limits; the lines go out of the plot. Line 126 typo: 'ideally, the representation should (missing verb) as few keypoints...'
------------- POST REBUTTAL UPDATE ----------------------
I read the rebuttal and the other reviews. Some of my comments, such as the one regarding compounding errors, have not been addressed. However, I still believe the paper is a good contribution and keep my rating.

Reviewer 4

Originality:
* The approach seems novel.
* Related work seems adequately cited.
Quality:
* The method seems technically sound, though the architecture details should be presented in the main paper. Some issues:
  - l. 223: it is not clear whether the RNN cells have memory/hidden states, and this affects the ability to alter the prediction by changing the keypoints. Please add an explanation of why it can work.
  - Eq. 1: the rhs should be negated for a loss.
  - l. 191: the larger diversity in the quality of the samples is a shortcoming of the proposed method. It is apparent from Fig. 3 (left) that the proposed method can perform significantly worse for some samples than other sota methods. A plot also showing mean/std dev for the variants might be insightful; please add. How should good samples be chosen at test time? Please discuss.
* Sec 5.2: a more fine-grained evaluation would be interesting here, showing whether the learned keypoints can capture all objects in the scene, or whether the model chooses to represent only a few, and which. Does the prediction error contain outliers for some objects?
* Please define or revise the term "object structure" in the title. The title is too generic; please mention keypoints in the title.
Clarity:
* The paper is very well written and easy to follow. Technical details about the network architecture should be included in the main paper.
Significance:
* Latent dynamics representations of video are highly relevant for video prediction and planning. The proposed approach is interesting, since it suggests a way to add structure or inductive bias by predefining that the learning algorithm needs to use a number of localized keypoints for describing the video content.
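
The mean/std dev summary requested above for Fig. 3 could be computed along the following lines. This is only a sketch, not the authors' evaluation code. It assumes a per-sample, per-timestep similarity score (higher is better) is already available as a (num_samples, num_timesteps) array, and the name `summarize_samples` is hypothetical.

```python
import numpy as np

def summarize_samples(scores):
    """Per-timestep statistics over sampled predictions.

    scores: array of shape (num_samples, num_timesteps), where each entry
    is a similarity to the ground truth (assumed: higher is better).
    """
    scores = np.asarray(scores, dtype=float)
    return {
        "best": scores.max(axis=0),    # closest sample at each timestep
        "worst": scores.min(axis=0),   # furthest sample at each timestep
        "mean": scores.mean(axis=0),
        "std": scores.std(axis=0),     # population std across samples
    }
```

Reporting `mean` and `std` next to `best` and `worst` would show whether a wide best-to-worst gap comes from a few outlier samples or from a broad spread across all samples.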