NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 8672: Invariance and identifiability issues for word embeddings

### Reviewer 1

This paper was very clear, easy to follow, and tackled an important topic from a new perspective. It gives a lot of insight into how embeddings are trained and evaluated, opening up space for and motivating new research on this topic. Nonetheless, the paper needs more details on the experiments; I could not tell, for example, on which data the embeddings were optimized. The paper should also give a clearer motivation for the choice of how embeddings were constrained in Section 4.1. At the moment I do not know whether constrained embeddings would give better, worse, or similar results to the average unconstrained ones. Additionally, I believe a few extra experiments (showing results for other embeddings, for example) would help: they would give a more palpable notion of how large the impact of varying the embeddings can be.

*What strengths does this paper have?*

The paper is very well written. Although it is very mathematically grounded, it is easy to follow and understand; I believe even readers with a limited mathematical background could understand the paper (perhaps skipping the proofs). It explores a non-mainstream topic in a relevant task, gives interesting insights that other work can build on, and proposes two different solutions to the problems it highlights.

*What weaknesses does this paper have?*

The paper states that there is a discrepancy between the invariant regions of f and g. One of the proposed solutions is to restrict the set of solutions V is allowed to take. Nonetheless, the authors do not show why this reduced set of solutions for V would be better than a random choice from the full set of optimal solutions for f. They also do not report evaluation scores of g for this restricted choice of V; this could be presented in Table 1. A second problem is that the authors do not give a detailed description of how \Lambda is optimized for the results in Table 1.
Do they optimize this matrix on the same data D used later to obtain the scores g(D, V)? If so, I believe this would be a critical problem: this optimization should be done on a set of data D' that does not intersect with D. The paper had some extra space left, and the code does not seem hard to test with other embeddings. Why not expand Table 1 to contain results for LSA and Word2Vec?

*Detailed comments:*

In Figure 1 (a, b), the red lines do not match the others when alpha = 0. I imagine this is because you did not compute \Lambda^alpha \Sigma B for this line, but only \Lambda^alpha B. If this is true, I believe these plots would be more intuitive if they used the full option (\Lambda^alpha \Sigma B). In Figure 3, results for upper-triangular matrices (b, d) seem to usually have better performance than diagonal ones (a, c). Do you have an intuition for why this happens? What is the red vertical line in these plots? The original V results?

Typos or small comments:

- Line 116: Was the difference supposed to be between != ? Or was it supposed to be != ?
- Line 219: In addition to *being* orthogonal…
- Figures 2 and 3: these figures are leaking outside the margin.
- Figure 2: a legend could be presented visually together with the plots in some way.
- Figure 3: there is no x-label and no range for the y axis.
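To make the invariance concern above concrete, here is a minimal toy sketch (my own construction in NumPy, not the authors' code; the matrix `M`, the `embedding` parameterization, and the cosine score used as a stand-in for g are all my notation): for an SVD-based embedding V = U S^alpha, the factorization objective f is unchanged for every alpha, yet a downstream similarity score changes with alpha.

```python
# Toy illustration (my own construction): every alpha in V = U * S^alpha
# attains the same factorization objective f, but a similarity score g
# (here, cosine similarity between two rows) varies with alpha.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))  # toy co-occurrence matrix
U, s, Wt = np.linalg.svd(M, full_matrices=False)

def embedding(alpha):
    """Row embeddings V = U * S^alpha (columns of U scaled by s**alpha)."""
    return U * s**alpha

def reconstruction_error(alpha):
    """f: Frobenius error of M ~ V C, with context matrix C = S^(1-alpha) W^T."""
    V = embedding(alpha)
    C = (s ** (1 - alpha))[:, None] * Wt
    return np.linalg.norm(M - V @ C)

def cosine(u, v):
    """g stand-in: cosine similarity between two embedding rows."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

for alpha in (0.0, 0.5, 1.0):
    V = embedding(alpha)
    print(alpha, reconstruction_error(alpha), cosine(V[0], V[1]))
# The reconstruction error is ~0 for every alpha, while the cosine between
# rows 0 and 1 typically differs across alphas.
```

This is exactly why reporting g only for one arbitrary member of the invariant family (or for a constrained V without comparison) leaves the reader unable to judge the size of the effect.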