NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 1939 Deep Sets

### Reviewer 1

The paper presents a way to design neural networks that are invariant / equivariant to permutations, making them suitable for set-valued data. A proof is given that essentially any invariant function can be written in the proposed form rho(sum_x phi(x)), where phi is a learned feature transform and rho processes the summed representations. Furthermore, such a function is always invariant. The paper contains extensive experiments, that show the proposed architecture is applicable to a wide range of problems and yields competitive results across the board. The experiment on digit addition shows very good performance, but it should be noted that addition is directly baked into the model, so this is perhaps not very surprising. The paper is fairly well written and very well structured. Section 2.3 discusses well known related results. It took me quite a while to understand what the relation to the present work is. I think this section could be made clearer. Lemma 3 is correct, but implicitly assumes that the representation of the symmetric group acting on the input should be the same representation as the one acting on the output. A simple example of an alternative equivariant map could be one where the output is a scalar that changes sign when an odd permutation is applied and otherwise stays the same. Such a map could do pairwise comparisons at each layer, which I suspect would improve results for e.g. point cloud prediction. Lemma 3 was derived before in “Equivariance through parameter sharing” and other papers by Ravanbakhsh. The paper by Ravanbakhsh on “estimating cosmological …” appears twice in the reference list. In figure 1, why does the curve for SDM stop before the DeepSets curve? In summary, I think this is a solid paper with a strong theoretical foundation and extensive experimental validation. Minor comments: 3: “and are” -> “that are” 30: “in field” -> “in the field” 47: “not to be indifferent” -> “to be indifferent” 50-57: several typos / grammar mistakes 84: it is not immediately clear what M_0 is; it would be good to have a line about it. 156: matrices

### Reviewer 2

The paper characterises functions that are permutation invariant and equivariant (the latter under certain restrictions); it then derives form these characterisation principle for the design of deep neural networks that operate on set-like objects. A number of diverse experiments validate the approach. The problem of learning to classify or regress from sets is an important one. The diversity of applications considered in the experiments is notable, although some of them are "mere" toy problems. Nevertheless, I was still slightly turned off by the nature of some of the results. Theorem 2 indicates that any function of the power set of a countable set has an additive decomposition (by the way, it is never said that the functions $Phi$ in the decomposition can be a scalar one, but that is the case from the appendix). On a glance, this is an exciting result. However, it is obtained by constructing a surjective mapping from the power set of N to the real numbers. This construction does not show that such an additive decomposition is a reasonable one in a learning setting. For example, representing the function that computes the maximum element in a set (which is obviously permutation invariant) may be very inconvenient to code in this manner. Furthermore, the proof itself may be a bit problematic as the proposed mapping does not seem to be surjective after all; for example, 2^-1 = sum_{k=2}^{+inf} 2^-k, so the sets {1,0,0,0,...} and {0,1,1,1,1,...} would have the same code. Lemma 3 is also very restrictive. Here the authors show that a fully connected layer which multiplies all input components by the same scalar and sums to that a scaled average of the components is the only one that can be permutation equivariant. This is fine, but at the same time does not allow the neural network to do much. Most importantly, these two theoretical results do no appear to suggest genuinely new approaches to permutation-invariant data processing in neural networks. The idea of using commutative pooling operators (sum or pooling), in particular, is completely standard. The sum of digits experiment is a bit suspicious. In the text case, in particular, it would be enough to map each digit to its corresponding numerical value (a triviality) to allow the sum operator solve the problem exactly -- thus is not exactly surprising that this method works better than LSTM and GRU in this case. The image case is only slightly harder. The rebuttal is informative and answers the most important questions raised in the review well. Therefore I am slightly upgrading my already positive score. By the way, I understand the necessary condition nature of Lemma 3. My point is that this results also shows that there isn't all that much that can be done in terms of data processing if one wants to be permutation invariant.