Paper ID: 464
Title: Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes a nice trick to extend MinHash to estimate the generalization of the Jaccard index to more than two sets. Using the trick, the paper presents efficient search algorithms for (1) given S and T, identifying U for which (S,T,U) are similar, (2) given S, identifying T and U for which (S,T,U) are similar, and (3) identifying S, T and U for which (S,T,U) are similar. There are experiments showing the value of a >2-way similarity measure.
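For concreteness, the core trick can be sketched as follows: under a random permutation of the universe, the probability that all k sets attain the same minimum element equals the k-way resemblance |S ∩ T ∩ U| / |S ∪ T ∪ U|. This is my own illustrative Python sketch, not the authors' code, and the names are mine:

```python
import random

def kway_resemblance(*sets):
    """Exact k-way resemblance: |S1 ∩ ... ∩ Sk| / |S1 ∪ ... ∪ Sk|."""
    return len(set.intersection(*sets)) / len(set.union(*sets))

def minhash_estimate(sets, num_hashes=1000, seed=0):
    """Estimate the k-way resemblance: for each random permutation of the
    universe, check whether all k sets share the same minimum element; the
    fraction of such collisions is an unbiased estimate."""
    rng = random.Random(seed)
    universe = sorted(set.union(*sets))
    hits = 0
    for _ in range(num_hashes):
        # Random priorities simulate a random permutation of the universe.
        perm = {x: rng.random() for x in universe}
        mins = [min(s, key=perm.get) for s in sets]
        hits += all(m == mins[0] for m in mins)
    return hits / num_hashes
```

On small sets the estimate converges to the exact k-way resemblance as the number of hashes grows, which is exactly the property the search algorithms build on.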

This paper is not written well and is consequently difficult to read.

As far as I know, the work is original. The algorithms are significant and may prove to be useful in many domains. If this paper is edited to remove all of the grammar errors, it will be very nice.

The anecdotes for the "Google Sets" problem are very cool. I found the search-quality experiments very hard to follow. It would help a lot if the metric discussion were promoted to come earlier in the section and discussed in a little more depth.

Example presentation issues (restricting to one per page)

Page 1: "one of the most widely used notion" => "notions"
Page 2: "may be" => "maybe"
Page 3: "any two point" => "any two points"
Page 4: "One of the peculiarity" => "peculiarities"
Page 5: "c-BP" in the Section 5 title should be "c-CP"
Page 6: "if we are interested in very high similarity cluster" => "a very" or "clusters"
Page 7: "report top 5 words" => "report the top 5 words"
Page 8: "We formulated ... and obtain" => "formulate" or "obtained"
Q2: Please summarize your review in 1-2 sentences
I think this is a nice contribution that could have practical impact. It needs a serious edit by a native English speaker.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends the min-hash mechanism to k-way Jaccard similarity and presents a theoretical analysis following the framework of [Indyk & Motwani '98; Har-Peled et al. '12]. The work is in general solid and supported mostly by the theoretical analysis. Experiments demonstrate the advantage of k-way similarity over typical pairwise methods.

Overall, I like this paper. Although the majority of the theorems and proofs are based on existing results, the authors have some interesting constructions (e.g., the f_1 and f_2 functions) and cover several aspects of the 3-way scenario, including near neighbor, close pair and best cluster. One minor concern is that although the argument for using k-way similarity is convincing, its practical value may not be demonstrated strongly enough in the experiments. For instance, the Google Sets experiment is interesting, but instead of "Sum Resemblance", perhaps the correct function is something that can simulate the conjunction (i.e., w1 is similar to w AND w2 is similar to w). Similarly, in Table 2, the results would be stronger if the pairwise setting became a similar conjunction or average setting. Nevertheless, even if the enhanced pairwise similarity approach performs comparably to the k-way similarity method, the latter is still the more principled approach.

---

Thanks for your response to the comments.
Q2: Please summarize your review in 1-2 sentences
The proposed approach is a nice extension of min-hash similarity to general k-way Jaccard similarity, with a fairly complete theoretical analysis. The experiments could be stronger to better justify the value of k-way similarity over the pairwise case.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the problem of k-way Jaccard-based similarity search. To do so, the authors show that minwise hashing efficiently addresses the problem of k-way c-NN search. The development they propose is a direct extension of the one known for pairwise similarities. The extension to other efficient similarities through the PGF transformation is also a direct extension of prior work.

The work presented here is well conducted, clear and of good quality. At the same time, the focus on the Jaccard similarity for sets limits the applicability and the impact of the method. It is not common at all to use presence/absence representations in text mining applications (such as the Google Sets and webspam problems), and the work would be more interesting if different similarities were considered. Regarding the experiments performed: for the Google Sets problem, I did not find the words obtained by the 3-way similarity really better than the ones obtained by the standard Google algorithm. For the retrieval experiments, it seems obvious that the k-way (k > 2) methods will perform better than the pairwise one, as they use more information (x_1 and x_2, sometimes x_3, vs. x_1 only). This experiment is not really convincing and does not show the utility of k-way similarity as compared to, say, an average of pairwise similarities.
Q2: Please summarize your review in 1-2 sentences
While the work is interesting, it represents a direct extension of previous studies and is limited in its scope.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Dear Reviewers, thank you all for your efforts and encouraging comments. Your constructive input will be carefully considered.


Assigned_Reviewer_5:

Q:-> The Google set experiment is interesting, but instead of "Sum Resemblance", perhaps the correct function is something that can simulate the conjunction (i.e., w1 is similar to w AND w2 is similar to w).
>> We appreciate the reviewer's concern here; it is indeed a very interesting question to find a function that simulates the notion of "AND" well.

If we are constrained to the notion of similarity being pairwise (which is the case in the current literature), there is not much we can do. The most intuitive ways are either to combine the similarities using addition ("Sum Resemblance" in the paper) or to use pairwise retrieval for each of w_1 and w_2 and take the intersection of the top results (the PI method in the paper).

Suppose we have found such an "ideal" objective, which combines the similarities to w1 and w2 in the best possible way; we must also ensure that searching with this objective is efficient. We all agree that efficiency is crucial in search: even linear-time algorithms are prohibitive at the scale of the web. It should be noted that there is no known provably sub-linear algorithm even for the very simple-looking "Sum Resemblance".

3-way resemblance is a very intuitive higher-order similarity, which seems to model the notion of "AND" better than other intuitive pairwise approaches, while at the same time admitting sub-linear search.
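As a toy illustration of ours (hand-picked sets, not from the paper): a candidate that is essentially a copy of w1 can outscore a genuinely conjunctive candidate under "Sum Resemblance", while 3-way resemblance ranks the conjunctive candidate higher.

```python
def resemblance(a, b):
    """Pairwise resemblance (Jaccard): |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def three_way_resemblance(a, b, c):
    """3-way resemblance: |A ∩ B ∩ C| / |A ∪ B ∪ C|."""
    return len(a & b & c) / len(a | b | c)

w1 = set(range(1, 11))   # {1, ..., 10}
w2 = set(range(6, 16))   # {6, ..., 15}; overlaps w1 in {6, ..., 10}

u_and = {6, 7, 8, 9, 10}                   # close to BOTH w1 and w2
u_or = set(range(1, 11)) | {20, 21, 22}    # a copy of w1 plus noise

# "Sum Resemblance" prefers the w1 copycat ...
sum_and = resemblance(u_and, w1) + resemblance(u_and, w2)  # 0.5 + 0.5 = 1.0
sum_or = resemblance(u_or, w1) + resemblance(u_or, w2)     # 10/13 + 5/18 ≈ 1.047

# ... while 3-way resemblance prefers the conjunctive candidate.
r3_and = three_way_resemblance(u_and, w1, w2)  # 5/15 ≈ 0.333
r3_or = three_way_resemblance(u_or, w1, w2)    # 5/18 ≈ 0.278
```

This is of course only a constructed example, but it shows the qualitative failure mode of additive pairwise combination that motivates the 3-way objective.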


Assigned_Reviewer_6:

Q:->I did not find the words obtained by the 3-way similarity really better than the ones obtained by the standard Google algorithm.
>> Our aim was to demonstrate the applicability of 3-way similarity for the Google Sets application over other similarity measures, not to beat Google's original algorithm, which is most likely much more complicated and uses a lot more data than 1.2M wiki documents. Interestingly, for the query "finger" and "lakes" we do retrieve better words (see line 354) than Google, which is very encouraging. Even for the query "Jaguar" and "Tiger", 3-way retrieves only cat-family members in the top 5, while Google retrieves "dog" as the 5th best candidate, which is not in the cat family.

Q:->For the retrieval experiments, it seems obvious that the k-way (k > 2) methods will perform better than the pairwise one, as they use more information (x_1 and x_2, sometimes x_3 vs x_1 only).
>> We agree with the reviewer that if we use more information than the pairwise case, we should do better. The experiments establish that 3-way and higher-order resemblance search utilizes this extra information in the right way, which we believe is the first step.

The most exciting part, which comes from the theory in the paper, is that 3-way similarity search admits sub-linear algorithms with the same guarantees as the pairwise case. Thus, it is possible to use the extra information in multiple queries, which cannot be used by vanilla pairwise retrieval, without sacrificing computation, which is very critical in the big-data world.

The biggest disadvantage of sum-of-pairwise-similarity retrieval, or any other way of combining the objectives, is that it is not known to be efficient: there is no known provably sub-linear algorithm for such combinations of similarities.


**************************************

There have been plenty of studies on pairwise similarities, but the notion of 3-way and k-way similarities is severely under-exploited. We hope our paper will be a good starting point for an exciting line of research and applications.

We expect many opportunities can be explored. For example, as Reviewer_6 pointed out, it would be useful to test k-way similarity on non-binary (0/1) datasets. We fully agree this would be a promising direction, but here we would like to explain the issue a bit more, for the sake of discussion:

1. Other than min-hash, it is difficult to find another good 3-way or k-way similarity estimation algorithm. For example, random projection is extremely popular, but after random projection all data points become (zero-mean) normal, whose distribution is determined by the covariance (i.e., pairwise similarity) only.

2. While many datasets used in academic research are not too high-dimensional (e.g., fewer than 10,000, or a million, dimensions) and often not binary, it appears that industry indeed uses extremely high-dimensional representations based on n-grams. For example, around 2000, industry papers described the use of 7-grams. These days, some search-engine applications have started to use even 11-grams (by combining all 1, 2, 3, ..., 11-grams). With such high-order n-grams, usually only absence/presence information matters (in fact, most n-grams will occur at most once).

3. On the limited number of datasets we have experimented with in the past, it appears that when the dimensionality of the data is not too high, the non-binary feature values often help (for classification/clustering), but with higher-order n-grams, often only presence/absence matters. For example, using 1-grams (character n-grams), the Webspam dataset has 254 dimensions (the size of the ASCII table minus 1). The classification accuracy using a linear SVM is about 93% with non-binary (tf-idf) features vs. 88% with binary-only features. With 3-grams (dimensionality 16 million), the classification accuracy is 99.6% whether we use binary or non-binary (tf-idf) features. We observed a similar phenomenon with other text datasets.

4. Despite the success of extremely high-dimensional 0/1 high-order n-gram representations, we agree with the reviewer that it is nevertheless an interesting research problem to look into non-binary features. The definition of 3-way minwise hashing can be extended to, e.g., sum_i min{x_i, y_i, z_i} / sum_i max{x_i, y_i, z_i}, which might be something interesting to study.
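Reading that extension as the standard weighted Jaccard form (sum of coordinate-wise minima over sum of coordinate-wise maxima), a minimal sketch, with our own illustrative function name, would be:

```python
def weighted_kway_resemblance(*vectors):
    """Weighted k-way resemblance for nonnegative feature vectors:
    sum_i min(x_i, y_i, z_i, ...) / sum_i max(x_i, y_i, z_i, ...).
    On 0/1 vectors this reduces to the ordinary k-way (set) resemblance."""
    num = sum(min(coords) for coords in zip(*vectors))
    den = sum(max(coords) for coords in zip(*vectors))
    return num / den if den else 0.0
```

For example, the binary indicator vectors of three sets give the same value as the set-based definition, while tf-idf-style weights would now also contribute.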

Thanks again for all your encouraging comments.