NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 3634 Graph-based Discriminators: Sample Complexity and Expressiveness

### Reviewer 1

This is a good paper. The idea of using multivariate functions to perform better learning (distinguishing different distributions) is quite natural, especially given that the quality of discrimination in the paper is assessed via differences of expectations. It is an interesting and well-structured paper. The statistical analysis of the graph-based discriminator classes is clear and informative and invites the follow-up questions presented (closing the gap in sample complexity, extending the expressiveness bounds to larger k). A few other points regarding this paper:

1) Line 94: You state that o(\rho) examples are sufficient for discrimination. Even though this is early in the paper, I believe it would be clearer to state this sample size in terms of the accuracy and confidence parameters \epsilon and \delta at this stage.

2) Line 229: You write 'fix all coordinates'. This phrase may be ambiguous: my understanding is that you intend to fix only a single coordinate while all others remain parameters of the function (executing a reduction to (k-1)-gVC), whereas a literal reading of the phrase suggests that after fixing all but one coordinate we end up with a function that receives one parameter instead of k-1.

3) I would be very interested in seeing future work on different types of discrimination 'loss' (equation 1 in line 31), since the difference in expectations of distinguishing functions hardly seems the best tool for differentiating distributions using a minimal number of tests.

Best of luck
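For concreteness, the discrimination loss referred to in point 3 is, as I read it, an expectation gap of the form sup_f |E_P[f] - E_Q[f]| over a class of distinguishing functions. A minimal sketch of its empirical estimate over a finite class (the threshold-function class and all names here are illustrative choices, not the paper's construction) might look like:

```python
import random

def empirical_discrepancy(sample_p, sample_q, distinguishers):
    """Estimate sup_f |E_P[f(x)] - E_Q[f(x)]| over a finite class of distinguishers,
    replacing each expectation by its sample mean."""
    def mean(f, sample):
        return sum(f(x) for x in sample) / len(sample)
    return max(abs(mean(f, sample_p) - mean(f, sample_q)) for f in distinguishers)

# Illustrative distinguishing class: threshold functions on the real line
# (a VC-dimension-1 class, i.e. the k=1 setting).
thresholds = [i / 10 for i in range(-30, 31)]
distinguishers = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in thresholds]

# Two slightly shifted Gaussians to distinguish.
random.seed(0)
sample_p = [random.gauss(0.0, 1.0) for _ in range(2000)]
sample_q = [random.gauss(0.5, 1.0) for _ in range(2000)]

gap = empirical_discrepancy(sample_p, sample_q, distinguishers)
```

For these two Gaussians the population gap over thresholds peaks at roughly 0.2 (at the midpoint threshold), which is what the empirical estimate approximates; other notions of loss, as the point above suggests, could be substituted for the expectation difference inside `empirical_discrepancy`.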

### Reviewer 2

Due to the shortness of time, I only comment on the expressivity results. The expressivity results are interesting and novel. However, I found it rather hard to sort out the "big picture" of these results, and perhaps some additional general discussion would be helpful. It seems that this part of the paper studies a kind of partial ordering between distinguishing classes, where one class is larger than another if it can distinguish more pairs of distributions (classes can also be equivalent, and because of the parameters it is not clear how to formalize this precisely). Theorem 4, for example, shows the superiority of using k=2 over k=1. The graph class is a single graph (an infinite bipartite graph in which the vertices of one half encode all finite subsets of the other half). It is shown that for any finite-VC class there are distributions distinguished by the graph but not by the class. In terms of the ordering, the theorem says that there is a class for $k=2$ (consisting of that single graph) which is not smaller than or equal to any finite-VC class for $k=1$. The term "incomparable" is used in a few places and it is not clear in what sense it is meant. Also, as the ultimate goal of the work seems to be finding possibly simple classes that can distinguish many pairs of distributions, it would be useful to comment on the implications of such incomparability results for this goal.