NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8001
Title: Are Disentangled Representations Helpful for Abstract Visual Reasoning?

Reviewer 1

Thanks to the authors for the response. I thought the paper would be really cool and instructive IF the abstract tasks made sense. Based on the authors' response and R3's review, these are pretty standard tasks, I guess. I don't know much about these RPM tasks, so I will downgrade my confidence to a 2. The example tasks are still a bit confusing to me (compared to just looking at an RPM example on Wikipedia), but I guess once you see the answers to a few, you get the gist of them. Moreover, intuitively it seems that you do need to represent the factors of variation to be good at these tasks.

My other concern is the one R2 raised: if you need ground truth to pick out a good disentangling model, and if ground truth helps a lot for directly solving the disentangling tasks, then 1) why do we need disentangling for AVR? and 2) how would an AVR practitioner without ground-truth label information benefit from these results? Despite the issues with this approach, I agree with the authors' point that for the sake of the study we can suspend some skepticism about how we get to these representations. The most important point the authors raised was that it is highly relevant to validate the motivation of the 20+ preceding disentangling papers by actually measuring how well disentangled representations do on downstream tasks. I wholeheartedly agree (and think more people should be doing this), so I will upgrade to a 7 despite the flaws in the experimental setup.

-----------------------------------------------------------

They formulate two abstract visual reasoning tasks based on dSprites and 3dshapes. They then see how well different disentangled representations do when transferred to these tasks by training a relational model on top of these representations.
They find that disentangled representations result in more sample-efficient transfer to these abstract visual reasoning tasks, whereas in high-sample regimes disentangling does not correlate much with upstream accuracy.

Strengths:
* Well-written. Background and related work are really well explained.
* Really helpful to show reconstruction error’s correlation to the upstream tasks, as it is often used as a proxy for good performance.
* A cool, instructive result that disentangling is sample-efficient.
* An unsurprising but also instructive result that any entangled representation that captures most of the data (low reconstruction error) does well in the upstream task when one has a lot of labelled data.

Weaknesses:
* Really only one real takeaway/useful experiment from the paper, which is that disentangling is sample-efficient for this strange set of upstream tasks.
* I have a lot of problems with these abstract visual reasoning tasks. They seem a bit unintuitive and overly difficult (I have a lot of trouble solving them). Having multiple rows, with multiple and different factors changing between each frame, is very confusing, and it seems like it would be hard to tell how much these models actually learn the pattern rather than just exploit some artifacts. Do we have any proof that simpler visual reasoning tasks wouldn’t do and that the formulation in the paper is the way to go?
* It seems weird that the authors didn’t just consider a task with one row, one panel missing, and the same one factor changing between panels. Is there any empirical evidence that this is too easy or uninformative? Why not a row with a few panels of the ellipse getting bigger, where for the missing frame the model chooses between a smaller ellipse, a same-size ellipse, *bigger ellipse*, a bigger ellipse at the wrong angle, a bigger ellipse that is translated, a bigger ellipse of a different color, etc., or at least some progression of difficulty starting from the easiest and working up to the tasks in the paper?
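
The simpler single-row task suggested in the last weakness could be sketched roughly as follows. This is only an illustration of the reviewer's suggestion, not the paper's task generator: the `Panel` fields, the factor values, and the `make_single_row_task` helper are all hypothetical, and panels are represented by their factors rather than by rendered images.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Panel:
    # Hypothetical factors of variation for a single ellipse panel.
    scale: int
    angle: int
    x: int
    color: str


def make_single_row_task(n_context=3, seed=0):
    """Build one row in which only `scale` grows by one step per panel.

    Returns (context, candidates, answer_index). Each distractor differs
    from the correct panel in exactly one factor.
    """
    rng = random.Random(seed)
    start = Panel(scale=1, angle=0, x=0, color="red")
    # Context panels: scale increases, every other factor stays fixed.
    context = [replace(start, scale=start.scale + i) for i in range(n_context)]
    last = context[-1]
    correct = replace(last, scale=last.scale + 1)
    distractors = [
        replace(last, scale=last.scale - 1),  # smaller ellipse
        last,                                 # same-size ellipse
        replace(correct, angle=45),           # bigger, but wrong angle
        replace(correct, x=5),                # bigger, but translated
        replace(correct, color="blue"),       # bigger, but different color
    ]
    candidates = distractors + [correct]
    rng.shuffle(candidates)
    return context, candidates, candidates.index(correct)
```

Harder variants (more rows, several factors changing at once) could then be layered on top of such a generator to obtain the progression of difficulty asked for above.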

Reviewer 2

Originality: This paper does not focus on developing a novel method. All of the disentanglement methods have been previously proposed. The WReN that solves the abstract reasoning tasks is also an existing method. Simply combining these methods does not seem novel.

Quality: I have concerns about the methodology adopted in this paper. The paper focuses on discussing the relationship between the accuracy on the abstract reasoning tasks and the disentanglement score. However, disentanglement scores can only be computed when the ground-truth factors of variation are available. If ground-truth factors are available, then we can directly use them to train WReN and achieve excellent performance, as shown in Figure 2, or we can train regressors/classifiers that predict the ground-truth factors before training WReN; either way, we do not need disentanglement learning. If ground-truth factors are not available, then we cannot compute disentanglement scores, and we are not able to use the results shown in Figures 3, 4 and 5 to select the best disentangled representation. Therefore, it looks to me that disentanglement learning is not very helpful in abstract reasoning tasks.

Clarity: This paper is well-organized and not difficult to follow.

Significance: The details are provided in Section 1. I think the contribution of this paper would be reasonable if the authors can address my concerns about the methodology.

Minor issues: It looks to me that the word "up-stream" in this paper should be changed to "down-stream".
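
The relationship discussed under Quality, between a per-model disentanglement score and the accuracy on the reasoning task, could be quantified with a rank correlation. The sketch below is a plain-Python Spearman correlation under that assumption; the reviews do not say which correlation measure the paper uses, and the function names are hypothetical. Note that the score side of the comparison still requires ground-truth factors, which is exactly the concern raised above.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, i.e. Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Given one disentanglement score and one task accuracy per trained model, `spearman(scores, accuracies)` returns a value in [-1, 1], where values near zero would indicate that the score is a poor guide for model selection.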

Reviewer 3

The paper conducts a large-scale study of the performance of disentangled representations on upstream abstract reasoning tasks. The abstract reasoning tasks use the methodology of Raven’s progressive matrices but use samples from dSprites and 3dshapes as the skin, with some modifications. Wild Relation Network is used as the upstream model, which uses representations learned by the models under comparison: beta-VAE, FactorVAE, beta-TCVAE, DIP-VAE, and variants of these which improve on them. There are many small bits of useful information in the paper, such as the fact that metrics which measure modularity as opposed to compactness perform better in the upstream task. However, the main conclusion of the paper is that disentangled representations, in general, do enable sample-efficient learning in low-sample regimes as compared to learning from scratch.

I wish the analysis could have been clearer and that more space had been dedicated to it. I don’t fully understand how gradient-boosted trees or logistic regression were used as points of comparison. The first three pages are not very information-dense and perhaps should be compressed so that we get to the good stuff faster. Similarly, small details about the dataset generation could have been moved to the appendix. However, overall the paper is well-written and my criticism on clarity is minor. The paper tackles a very important question in representation learning and provides interesting new insights about it.