Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper makes several observations about the word2vec and GloVe algorithms, specifically in light of their connection to factorizing the PMI matrix of co-occurrences. However, it is not clear how the contributions are relevant beyond being a list of observations. Some sections explain the empirical relevance of these contributions, but the arguments are often convoluted (see contributions section). I also believe the paper's claim that much is left unknown about word2vec/GloVe (specifically, what is learned and why it is useful) is exaggerated; in fact, a lot of subsequent work has answered these questions. In particular, the success of pre-embedding, unsupervised distributional methods is not new at all (see, e.g., the overview paper by Turney et al., 2010).
Specifics:
- Some parts of the paper simply reiterate previous work, such as the beginning of Sections 4 and 5.
- How exactly was the LSQ method in Table implemented?
- Please clarify Sections 4.2 and 4.3, including the use of the terms "similarity" and "paraphrasing".
This paper provides a new view of word embeddings. It introduces the notion of global relatedness, which is constructed from the PMIs between a specific word and all other context words. It shows that global relatedness vectors can capture various kinds of semantic similarity, by considering geometric and probabilistic aspects of these vectors and their domain. Moreover, it shows that low-dimensional word embeddings built by word2vec or GloVe can be viewed as linear projections of the global relatedness vectors. The paper is well written and easy to follow. The theoretical contribution is novel and provides a new tool for understanding why word embeddings can capture various semantics of words. The originality and quality of this paper are above the threshold.
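For concreteness, the global relatedness (PMI) vector of a word can be sketched as below. This is a minimal illustration, not the paper's code: the words, contexts, and co-occurrence counts are invented toy values, and the helper names `pmi_vector` and `euclidean_distance` are hypothetical.

```python
import math

# Hypothetical toy co-occurrence counts (assumption, not data from the paper):
# outer keys are target words, inner keys are context words.
counts = {
    "cat": {"pet": 8, "fur": 6, "engine": 1},
    "dog": {"pet": 9, "fur": 5, "engine": 1},
    "car": {"pet": 1, "fur": 1, "engine": 10},
}
contexts = ["pet", "fur", "engine"]

total = sum(sum(row.values()) for row in counts.values())
word_totals = {w: sum(row.values()) for w, row in counts.items()}
context_totals = {c: sum(counts[w][c] for w in counts) for c in contexts}


def pmi_vector(word):
    """PMI(w, c) = log p(w, c) / (p(w) p(c)), one entry per context word."""
    return [
        math.log(counts[word][c] * total / (word_totals[word] * context_totals[c]))
        for c in contexts
    ]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


pmi = {w: pmi_vector(w) for w in counts}

# Words that share contexts end up with nearby PMI vectors.
d_cat_dog = euclidean_distance(pmi["cat"], pmi["dog"])
d_cat_car = euclidean_distance(pmi["cat"], pmi["car"])
```

In this toy setting, "cat" and "dog" have similar context distributions, so their PMI vectors lie closer together than those of "cat" and "car".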
This paper's view is novel and relatively solid. It provides a perspective for understanding semantic similarity in word embeddings by showing (1) via the geometry of the space, that different kinds of semantic compositionality can be captured by PMI vectors, and (2) that the linear projection between PMI vectors and word embeddings preserves the properties in (1). To me, the best part of the paper is that the authors make an effort to give a systematic and mathematically well-formed analysis of the frequently mentioned but not fully understood semantic issues in word embeddings. The paper also derives a new model with an LSQ loss in Section 5, which achieves better performance and thus supports the preceding analysis to some extent. My biggest concern is the lack of any treatment of cosine similarity. If I understand the paper correctly, in Section 4 the discrepancy between two PMI vectors is measured by their abstraction (\rho in Eq. 8 and \epsilon in Eq. 9), which is closer to Euclidean distance than to cosine distance (two vectors may have a cosine similarity of 1 yet be very far from each other from the "abstraction or Euclidean" perspective). However, cosine similarity is the most popular measure in practice (and not only for word embeddings), and since the paper claims an additivity-preserving linear transformation between PMI vectors and word embeddings, the cosine-related properties should also exist in PMI space, and the authors should give a fair discussion of them. In particular, the entire analysis in Section 4.3, which is based on the surface S, does not suggest any property related to cosine similarity, so I am somewhat skeptical whether it reflects what happens in practice. Even the results in Table 1 use cosine similarity as the measure (as shown in Appendix F). Minor issue: the writing style could be improved. Specifically: (1) Some of the explanations and statements in Sections 4 and 5 are tedious and not easy to follow (e.g., a bunch of short equations within sentences, which could be simplified). (2) Notation: it would be clearer to use a single uppercase letter to represent each matrix, rather than "PMI" and "SPMI", which look like matrix products.
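The cosine versus Euclidean point above can be illustrated with a minimal sketch. The vectors are arbitrary toy values chosen for illustration (an assumption, not PMI vectors from the paper), and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy vectors (assumption): v is u scaled by 10, so both point in the same direction.
u = [1.0, 2.0, 3.0]
v = [10.0 * a for a in u]

cos_uv = cosine_similarity(u, v)    # same direction, so cosine similarity is 1 up to rounding
dist_uv = euclidean_distance(u, v)  # but the points are 9 * ||u|| apart
```

Two vectors that are indistinguishable under cosine similarity can therefore be arbitrarily far apart in Euclidean terms, which is why an analysis stated in terms of differences between PMI vectors does not by itself say anything about cosine-based evaluation.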