NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 913
Title: Hierarchical Optimal Transport for Document Representation

Reviewer 1

Originality: The authors clearly distinguish their work from previous efforts in the related work section. TMD seems to be the most similar work, at least thematically, but it was not included as a baseline.

Quality: The results and proofs are thorough. Classification is run on a variety of datasets against reasonable baselines. The classification performance numbers are supplemented with sensitivity analysis, runtime information, and two additional tasks (t-SNE visualizations and link prediction). The proposed approaches work best on average across all datasets, but they beat all baselines on only 1 of the 8 datasets, a fact that could use more discussion in the paper.

Clarity: The work is well motivated, and the authors do a good job of anticipating reader questions. Some smaller points were unclear (see "Improvements"), but overall the paper is well written.

Significance: The good results on long texts (especially the full novels of Project Gutenberg) and the focus on computational efficiency are especially promising as research continues to push toward using longer and more complex texts.

Reviewer 2

* Originality: The paper proposes a nice idea for speeding up WMD by using a hierarchy. Although this idea was applied to several problems long ago (e.g., hierarchical softmax), it is novel for this problem.
* Quality: The paper thoroughly presents the properties of HOTT, and the experiments show that HOTT does work.
* Clarity: The paper is NOT difficult to read, and the main contributions are easy to capture. However, I did not check the math carefully. In the equation in line 95, what are the deltas?
* Significance: Although the main idea is nice, its implementation relies heavily on pruning LDA's topics to make |T| and the number of words in each topic small. I found this very tricky, and I wonder what would happen if the same heuristic were also applied to WMD (see the sketch below).
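To make that last point concrete, the comparison I have in mind would look roughly like the sketch below (my own illustration in Python using the POT library, not the authors' code; the helper name truncated_wmd and the data structures are my assumptions): truncate each document to its top-k words, renormalize the word histograms, and solve the resulting small transport problem.

    # Reviewer's sketch: apply the same truncation heuristic directly to WMD.
    import numpy as np
    import ot                                   # POT: Python Optimal Transport
    from scipy.spatial.distance import cdist

    def truncated_wmd(bow1, bow2, embeddings, k=20):
        """WMD between two documents after keeping only each document's top-k
        most frequent words. bow1/bow2: dict word -> count; embeddings: dict
        word -> vector (e.g., GloVe or word2vec)."""
        def top_k(bow):
            words = sorted(bow, key=bow.get, reverse=True)[:k]
            mass = np.array([bow[w] for w in words], dtype=float)
            return words, mass / mass.sum()

        words1, a = top_k(bow1)
        words2, b = top_k(bow2)
        cost = cdist(np.stack([embeddings[w] for w in words1]),
                     np.stack([embeddings[w] for w in words2]))
        return ot.emd2(a, b, cost)              # 1-Wasserstein (earth mover's) cost

With k = 20 this is essentially the WMD-T20 baseline; sweeping k would show how much of HOTT's gain comes from truncation alone versus from the topic hierarchy.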

Reviewer 3

Applying the idea of Word Mover's Distance to topic-distribution representations of documents is interesting. While the overall idea is good, the evaluation is not good enough. Although WMD is far less efficient than the proposed method, its performance should still be presented in the experiments; instead, WMD-T20 is taken as a baseline. It is not clear why 20 was chosen. While each topic is represented by its top 20 words, the number of topics is set to 70, so truncated WMD models of other sizes should also be compared.

According to Figure 4(a), WMD-T20 attains lower error than the proposed method with GloVe, yet shows quite different performance with word2vec. Why does this happen? If GloVe is used throughout the experiments, what results are obtained in Figure 5?

In the definition of HOTT on page 3, why do you need the delta-functions? Since \overline{d^i} is defined as a distribution, a definition like W_1( \overline{d^1}, \overline{d^2} ) looks sufficient. Otherwise, please make the formulas clearer about whether the distributions are over topics or over words.
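For reference, my reading of the page-3 definition (my own reconstruction; please correct me if this is not what is meant) is that the deltas are Dirac point masses placed at the topics:

    HOTT(d^1, d^2) = W_1\left( \sum_{k=1}^{|T|} \overline{d^1}_k \, \delta_{t_k},\; \sum_{k=1}^{|T|} \overline{d^2}_k \, \delta_{t_k} \right),

where the ground metric between two topics t_i and t_j is itself WMD(t_i, t_j) between their word distributions. If that reading is correct, then W_1(\overline{d^1}, \overline{d^2}) over the topic simplex with this ground metric denotes the same quantity, and saying so explicitly would resolve the ambiguity between distributions on topics and distributions on words.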