NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2853
Title: Learning Disentangled Representation for Robust Person Re-identification

Reviewer 1

This paper describes an approach to person re-identification that uses a generative model to effectively disentangle feature representations into identity-related and identity-unrelated aspects. The proposed technique uses an identity-shuffling GAN (IS-GAN) that learns to reconstruct person images from paired latent representations even when the identity-related representation is shuffled and paired with the identity-unrelated representation from a different person. Experimental results are given on the main datasets in use today: CUHK03, Market-1501, and DukeMTMC. The paper is very well-written, and the technical development is concise but clear. There are a *ton* of moving parts in the proposed approach, but I feel the results would be reproducible with minimal head scratching from the written description. A few critical points:

1. The FD-GAN [27] approach is indeed very similar to the technique proposed in this submission. So much so that it is surprising (to me) that IS-GAN outperforms it so handily. More analysis of why this is the case would be useful.

2. Related to this, MGN [12] is the only approach that at times outperforms IS-GAN. A deeper analysis of this, and a discussion of the different affordances IS-GAN and MGN might have, would help interpretation of the quantitative comparison with the state of the art.

3. A grid search over six continuous parameters seems expensive. What data is this performed on? How do you set aside validation data for model evaluation? This seems to be a critical point if an independent training/validation split isn't used.

4. Back on the topic of FD-GAN, the results on CUHK03 reported in [27] are significantly higher than those reported here. These FD-GAN results are omitted from Table 1; I assume this is due to differences in experimental protocol? Some mention of this is warranted given the similarity of the approaches.
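To make the shuffling mechanism concrete for readers of this review: the core idea is that each image is encoded into an identity-related code and an identity-unrelated code, identity codes are swapped between paired inputs, and the generator must still reconstruct the originals. The sketch below is a toy illustration of that plumbing, not the authors' actual architecture; all function names are hypothetical, the "encoder" simply splits a feature vector in half, and the two inputs are assumed to share the same identity so the reconstruction check is exact.

```python
def encode(image):
    # Stand-in encoder (hypothetical): treats the first half of the
    # vector as the identity-related code and the second half as the
    # identity-unrelated code (pose, background, occlusion, ...).
    mid = len(image) // 2
    return image[:mid], image[mid:]

def decode(id_code, unrelated_code):
    # Stand-in generator (hypothetical): recombines the two codes.
    return id_code + unrelated_code

def shuffled_reconstructions(img_a, img_b):
    """Reconstruct both images after swapping their identity codes."""
    id_a, rest_a = encode(img_a)
    id_b, rest_b = encode(img_b)
    # Identity codes are exchanged; identity-unrelated codes stay put.
    return decode(id_b, rest_a), decode(id_a, rest_b)

# Two toy "images" of the same person: identical identity halves,
# different identity-unrelated halves.
person_pose1 = [1, 2, 3, 10, 11, 12]
person_pose2 = [1, 2, 3, 20, 21, 22]

rec1, rec2 = shuffled_reconstructions(person_pose1, person_pose2)
# Because both inputs share an identity, the shuffle is invisible and
# reconstruction succeeds; a reconstruction loss of this form is what
# pressures identity information into the identity-related code.
assert rec1 == person_pose1 and rec2 == person_pose2
```

In the real model the encoder, generator, and losses are learned networks rather than list slicing, but this is the contract the identity-shuffling loss enforces.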
POST REBUTTAL: The authors have addressed all of my main concerns, and my decision remains to accept. I encourage the authors to include the rebuttal's description of the differences with FD-GAN, the hyperparameter search procedure, and the interpretation of the results with respect to MGN.

Reviewer 2

(1) Originality: This work presents a new data generation method, a part-level identity-shuffling technique. However, learning disentangled representations is not new in the person re-ID field. ID-related/-unrelated feature learning is also similar to previous work, e.g., DG-Net ([1] Joint Discriminative and Generative Learning for Person Re-identification). The ID-related/-unrelated features are a different representational form of the structure and appearance features in [1]. The training strategy and most of the loss functions of the two works are similar as well.

(2) Quality: The claims of this work are supported by the ablation study in the experiments. The method achieves good performance on the widely used datasets. There is no analysis of failure cases, but this is not critical.

(3) Clarity: Well-written and easy to read.

(4) Significance: This work may inspire future work to a certain extent, but its influence is limited to the design of new generation methods. The main contribution of this paper, disentangled representation, is similar to previous works.

Reviewer 3

This paper proposes a new method to learn robust representations for person re-identification by separating features for human identity from all other features via learned generators of human images. It presumably learns GAN-based models that generate generalized human images so as to be robust to variations in pose and occlusion. The idea is very inspiring for application to other similar visual surveillance problems, such as view-point invariance or outfit invariance.

The paper is well-written and concisely focused on its main goal. However, it needs more detailed explanations for reproducibility. Specifically, the part on domain discriminators is not clear. What is the meaning of the sentence "we add convolutional and fully connected layers for the class discriminator"? How is the PatchGAN configured for them? In the subsection 'visual analysis for disentangled features', it would be helpful to show shuffling between two people with different colors and styles of dress, as a more general case, for a deeper understanding of the proposed method. I'm curious what effects would be shown for camera angles, viewpoints, or outfits, if possible.

Minor comments:
- There are some typos and grammatical errors. It would be helpful to have the paper proofread by native speakers.
- In Figure 4, the green and red box boundaries are too thin to be seen clearly; it would be better to make them a little thicker.
- line 26: focussed -> focused
- line 38: argumentation -> augmentation

POST-REBUTTAL: The authors addressed all of my concerns. I've raised my score by 1. I ask the authors to merge the material from the rebuttal into the manuscript.