Reviews: Learning Deep Bilinear Transformation for Fine-grained Image Representation

Reviewer 1

Updates: I appreciate the authors' effort to provide additional experiments. I now understand how the proposed grouping method differs from MA-CNN and why that difference is significant, as well as its effectiveness on the large-scale ImageNet dataset. On the other hand, I still feel the core methodological contribution is marginal: pairwise clustering is a well-studied approach to clustering in general, and switching to it is not a surprising direction. Overall, I would like to set my final evaluation as borderline negative.

This paper presents a method combining bilinear pooling and channel grouping for fine-grained image recognition. Overall, the idea itself is quite reasonable, as it can significantly reduce the dimension of the resulting features, and the results also seem promising. However, because the idea and effectiveness of channel grouping for fine-grained recognition were already proposed in MA-CNN [9] and are not novel, the contribution of the paper seems rather incremental: it simply performs standard bilinear pooling on top of the grouped features. Moreover, the presentation of the paper is not very clear, and some crucial points remain unclear to me, as I commented in "5. Improvements". For those reasons, my initial impression is leaning toward rejection.

Reviewer 2

This paper proposes a bilinear transformation module which combines semantic grouping of feature channels with intra-group bilinear pooling. This module does not increase the number of channels while non-trivially improving network performance. As far as I know, the proposed bilinear transformation is clearly different from existing works.

The semantic grouping layer is important in the proposed bilinear transformation module. How is the constraint described in Eq. (4) implemented as a loss? Will such a constraint be imposed and implemented in the final loss for every residual block (i.e., bottleneck block)? Note that ResNet-50 (resp. ResNet-101) has 16 (resp. 33) residual blocks. I suggest the authors make these points clear and write out explicitly the formula of the whole loss.

The experiments are basically extensive and the results are competitive. Nevertheless, I have some questions which hopefully could be addressed by the authors. In Table 5, stage 3+4 (w/ shortcut) produces an accuracy of 85.1, while in Table 6, last layer+stage V+stage IV produces the same accuracy. Exactly which layers are combined to obtain the accuracy of 85.1? What does "last layer" in Table 6 indicate?

Are the proposed models pre-trained on large-scale ImageNet? If so, I would like to know how their recognition accuracy on ImageNet compares with vanilla ResNet and other second-order pooling methods. I would also like to know the data augmentation method used. Is random cropping used in training?

The proposed bilinear transformation can be inserted into lower layers of deep networks, unlike classical bilinear modules, which can only be inserted at the end. In this respect, some related works are missing from the paper, e.g., Factorized Bilinear Models in ICCV 17 and global second-order convolutional networks in CVPR 19.

------------------------------------------

In the rebuttal, the authors have given the proposed loss function explicitly, making clear the mechanism of semantic grouping as well as its difference from the grouping method of MA-CNN. The performance on ImageNet is very impressive, showing that the proposed method is suitable not only for fine-grained classification but also for general, large-scale visual recognition. I suggest the authors add these modifications to the manuscript; some typos and grammatical errors should also be corrected (e.g., line 76, aliment-->alignment; line 221, which a vital-->which is a vital).

In summary, the paper proposes a compact bilinear pooling layer that can be inserted throughout a network, which is clearly different from previous bilinear pooling methods. The performance is also very competitive on both fine-grained classification and large-scale visual recognition.
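To make the "does not increase the number of channels" point concrete, here is a minimal NumPy sketch of intra-group bilinear pooling on a single feature vector. It is an illustration under stated assumptions, not the authors' implementation: the function names `full_bilinear` and `grouped_bilinear` are hypothetical, channels are assumed to be already ordered so that consecutive chunks form the groups (the semantic grouping step is not modeled), and summing the per-group outer products is an assumption made here so that the output stays compact.

```python
import numpy as np


def full_bilinear(x):
    """Standard bilinear feature: outer product of a C-dim vector with itself (C*C values)."""
    return np.outer(x, x).ravel()


def grouped_bilinear(x, num_groups):
    """Intra-group bilinear feature for a C-dim vector.

    Assumptions (not taken from the paper under review):
      * consecutive chunks of channels form the groups;
      * the per-group outer products are summed across groups.
    The output has (C / num_groups) ** 2 values.
    """
    c = x.shape[0]
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    groups = x.reshape(num_groups, c // num_groups)
    # Sum over groups g of outer(groups[g], groups[g]).
    pooled = np.einsum("gi,gj->ij", groups, groups)
    return pooled.ravel()


x = np.arange(16, dtype=np.float64)
print(full_bilinear(x).shape)        # (256,)
print(grouped_bilinear(x, 4).shape)  # (16,)
```

Under these assumptions, choosing the number of groups equal to the square root of the channel count keeps the output at the input dimension (16 in, 16 out), whereas the full outer product grows quadratically (256 values).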

Reviewer 3

Clarity.
- The paper is well-written and the method is well-explained.

Originality.
- The proposed extension is very well-motivated, but the paper itself is incremental, since the major contribution seems to come from limiting pairwise interactions to within each semantic group.

Significance.
- The proposed extension to bilinear transformation is well-motivated and, to a certain extent, combines the advantages of bilinear pooling and part-based learning.
- The proposed method looks like a good practical approach for achieving strong fine-grained classification results.

===========================

Thanks to the authors for the response! I have read the author response and my opinion remains the same, as I still feel the contribution is a bit incremental over existing work.

Paper ID:	2401
Title:	Learning Deep Bilinear Transformation for Fine-grained Image Representation
