Paper ID: 658 Unsupervised Learning of View-invariant Action Representations

### Reviewer 1

The authors present an unsupervised learning framework to recognize actions in videos. They propose to learn a representation that predicts 3D motion in other viewing angles, and they leverage a view-adversarial training method. They run experiments on multi-modal datasets for action recognition.

Pros:
- Unsupervised learning for video is a relevant problem.
- The suggested surrogate task that learns a view-invariant representation is interesting.
- It is interesting to see that their proposed method based on flow input outperforms other methods...

Cons:
- l25: The claim that previous methods are view specific is debatable. Some of the previous works, e.g., [31], model 3D information to be view invariant… Their performance is the same as that of the proposed method, although the authors report different numbers (see next bullet).
- In Tables 3 and 4, why are the authors re-implementing [31] and [32], since the exact numbers are available in the original papers for the same datasets? In fact, the authors report the published numbers for other methods, but not for these two. Why not do the same for [31] and [32]? At the least, the authors can share the reported numbers alongside the ones from their own re-implemented version. Discarding numbers is not correct. Similarly, for Table 4, the performance for [4] is different from previous papers. Why is it 80 instead of 85? That again communicates a questionable performance evaluation.

Because of the above weaknesses, I have strong doubts about the performance evaluation section. I have identified a few inconsistencies. Hence, I am not confident that the proposed method outperforms previous methods.

### Reviewer 2

This paper addresses the problem of view-invariant action representation within an unsupervised learning framework. In particular, the unsupervised learning task is the prediction of 3D motion from different viewpoints. The proposed model comprises four modules: "encoder", "cross-view decoder", "reconstruction decoder" and "view classifier". For training purposes, a loss function defined as the linear combination of three task-specific losses is proposed. Given an encoding, the cross-view decoder is in charge of estimating the 3D flow in a target view different from the source one. Then, the goal of the reconstruction decoder is to reconstruct the 3D flow of a target view given the encoding for the same view. Finally, the view classifier is trained in an adversarial way, i.e., the view classifier tries to classify the view of the given input encoding, while the adversarial component promotes the learning of view-invariant encodings. To deal with temporal information, a bi-directional convolutional LSTM is used. Three modalities have been considered during the experimental validation: RGB, depth and flow. The experimental results on three benchmarks demonstrate the validity of the proposed framework. Overall, in my opinion, this is a high-quality work of wide interest for progressing in the "unsupervised motion representation" problem.

* Strengths:
  + The problem of unsupervised learning of 3D motion is of interest for the computer vision community in general, and for the NIPS community in particular, as it has already been addressed in top-tier conferences and journals.
  + The combination of cross-reconstruction of 3D flow with adversarial training for view classification from the encodings is very interesting.
  + The paper is well-written and easy to read.
  + Competitive results are obtained on three standard datasets for action recognition: NTU RGB+D, MSRDailyActivity3D and Northwestern-UCLA. The experimental results support the viability of the proposal.

* Weaknesses:
  - In my humble opinion, further implementation details would be needed to be able to fully reproduce the presented results.

* Detailed comments:
  - Sec. 3.1, in Encoder, how has the value of $T=6$ been selected? How does this value influence the performance of the system?
  - In Tab. 3, column 'Cross-subject', row 'Skeleton', I guess number 79.4 should be highlighted in bold.

* Minor details:
  - Page 3, line 89: extra space is needed after "...[55]. "
  - Page 4, line 152, "...other then i..." --> "than"

[About author feedback] I have carefully read the author response and my positive "overall score" remains the same.
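
For reference, the training objective summarized in this review (a linear combination of three task-specific losses) could be written roughly as below. This is only a sketch of the summary's wording: the weights $\lambda_{1}, \lambda_{2}, \lambda_{3}$ and the loss names are notation assumed here for illustration, not taken from the paper.

$$
\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\text{xview}} + \lambda_{2}\,\mathcal{L}_{\text{recon}} + \lambda_{3}\,\mathcal{L}_{\text{cls}}
$$

Here $\mathcal{L}_{\text{xview}}$ would be the cross-view decoder's error on the 3D flow of a target view, $\mathcal{L}_{\text{recon}}$ the reconstruction decoder's error on the 3D flow of the same view, and $\mathcal{L}_{\text{cls}}$ the view classifier's loss. In the adversarial part, the view classifier would be trained to lower $\mathcal{L}_{\text{cls}}$ while the encoder is pushed to raise it, which is what encourages view-invariant encodings. How the paper implements that opposition is not stated in the summary above.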