Reviews: Unconstrained Monotonic Neural Networks

*** Update in response to author rebuttal ***
I definitely think that the UMNN technique has good potential and the authors should continue to pursue it. However, even after reading the rebuttal, I feel that it is a bit premature to publish the research at this point in time. In the rebuttal, the authors acknowledge that their method is not the first universal monotonic approximator and clarify that their language regarding the "cap on expressiveness" of alternative monotonic approximators refers to the non-asymptotic case, i.e., a finite number of neurons/hidden units. They write "we believe that the constraints on the positiveness of the weights and on the class of possible activation functions are unnecessarily restraining the hypothesis space in the non-asymptotic case". However, this is an assertion for which they have not supplied any kind of proof, and I find it highly debatable. Any method, whether it is their UMNN or the Huang approach or lattices or max/min networks, has some cap on expressiveness in the non-asymptotic case. They seem to be suggesting that, given a fixed budget of parameters (neurons or whatever), the UMNN technique will be more expressive (i.e., model monotone transformations more accurately) than alternative techniques, but it's not obvious at all to me that this is true. I am particularly skeptical given that UMNN (unlike other techniques) requires numerical integration, which itself necessarily involves some sort of imperfect approximation given a finite number of integration steps. Even setting aside numerical integration, it is not clear to me that a UMNN with a budget of P parameters will be more expressive than a lattice or max/min network allocated the same number of parameters P. Lines 148-149 in the paper are a better description of the advantages of the technique: "it enables the use of any state-of-the-art neural architecture". This is where the potentially significant advantage of the UMNN method is, in my opinion.
Sometimes certain architectures or activation functions work better on certain problems than others for somewhat mysterious reasons (local minima using one activation function which do not occur using another one, etc.) and their technique allows for the possibility of trying various different activation functions (rectified linear, sigmoidal, RBF, etc.) and other architecture variations. Unfortunately, they don't illustrate this advantage experimentally. If they had shown that sigmoidal UMNN fails on one dataset whereas ReLU does well but on a second dataset ReLU fails but sigmoidal does well, that would be a nice illustration of the flexibility advantages of their technique, but these kinds of experiments are not in the paper. I suggest the authors pursue this line of research in the future. As it is, what they have presented seems to be mostly just a new way to do monotonic modeling but one that is not obviously better; the "cap on expressiveness" language is entirely unconvincing to me. Also, their claim that NAF and B-NAF did not report results on MNIST due to memory explosion still seems to me to be speculation. It shouldn't be that hard to calculate how many parameters (and therefore how much memory) NAF or B-NAF would actually require on MNIST; you could almost surely do the memory requirement calculations without actually implementing the techniques. As I look at the number of parameters table (Table 2), I see 7.49e6 parameters for NAF for MINIBOONE (d=43), 3.46e6 for UMNN, which is a ratio of 749/346=2.16. For BSDS300, it is 3.68e7 vs. 1.56e7, a ratio of 2.36, not much bigger than 2.16 for MINIBOONE. It's not clear to me how to extrapolate the ratio for MNIST, but it's not at all obvious that the ratio would be that large; it might only be a factor of 3 or 4, which is disadvantageous but hardly an explosion. If the ratio is much higher than 3 or 4, well, the authors had an opportunity to tell me the actual ratio after I nudged them, but they declined to do so.
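The ratio arithmetic above can be reproduced in a few lines. This is only a check of the two figures quoted from Table 2 in this review; the dictionary layout and the function name are illustrative assumptions, and no MNIST figure is included because none is reported.

```python
# Parameter counts as quoted from Table 2 in this review.
# The data layout and names below are illustrative, not from the paper.
param_counts = {
    "MINIBOONE": {"NAF": 7.49e6, "UMNN": 3.46e6},
    "BSDS300": {"NAF": 3.68e7, "UMNN": 1.56e7},
}


def naf_to_umnn_ratio(counts):
    """Return how many times more parameters NAF uses than UMNN."""
    return counts["NAF"] / counts["UMNN"]


ratios = {name: naf_to_umnn_ratio(c) for name, c in param_counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: NAF/UMNN parameter ratio = {ratio:.2f}")
```

Both ratios sit between 2 and 2.5, which is the basis for doubting that the MNIST ratio would amount to an explosion.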
Again, I also don't think the experimental results are that strong. For VAE, it loses across the board to B-NAF. For density estimation, it looks like it also loses by a wider margin on GAS and MINIBOONE than the margin by which it wins on POWER, and it seems questionable to claim that it wins on MNIST when the other techniques simply do not report on MNIST.
*** Original comments ***
I have good familiarity with the monotonic modelling literature and I have not seen any prior work which models the derivative and then integrates, so this technique definitely looks original to me. One significant concern I have about the paper is that it does not do a good job of describing prior work on monotonic modeling. The authors claim that (prior to their own work) architectures which enforce monotonicity do so in ways which "lead to a cap on the expressiveness" of the hypothesis class. This is not correct regarding the Sill, 1998 "Monotonic Networks" paper. That paper shows that a 2-hidden-layer architecture which combines minimum and maximum operations is capable of universal approximation of monotonic functions, given sufficient hidden units. The authors also fail to cite the work of Gupta et al. on lattice-based monotonic models ("Monotonic Calibrated Interpolated Lookup Tables", JMLR 2016 and "Deep Lattice Networks and Partial Monotonic Functions", NeurIPS 2017). The authors claim "[o]ur primary contribution consists in a neural network architecture that enables learning arbitrary monotonic functions". It is true that they present a novel way to do so, but there are multiple established techniques already in the literature for this. So while the work is original, the significance is overstated. A stronger paper would have compared their "implement the derivative and integrate" technique to the other techniques or at least attempted to explain why their technique is preferable.
Another concern is that the experimental results are respectable but not as strong as I would like. UMNN-NAF only wins on 2 of the 6 density estimation tasks. In variational auto-encoding, it loses to FFJORD and B-NAF. It is indeed encouraging that the authors obtained results on MNIST, and it might indeed be the case that competing techniques cannot run on MNIST due to "memory explosion", but this appears to be a speculative theory on the part of the authors. The authors would be on firmer ground if they had actually attempted to implement competing techniques on MNIST and found memory explosion, but they do not appear to have done so.
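For readers unfamiliar with the min/max architecture cited in the original comments, the following is a minimal sketch of the general idea, not a reproduction of the Sill (1998) implementation. It assumes positive weights obtained by exponentiating free parameters, a maximum within each group of linear units, and a minimum across groups; the function name, argument names, and shapes are hypothetical.

```python
import numpy as np


def min_max_monotonic(x, log_w, bias):
    """Min/max network that is non-decreasing in every input (a sketch).

    x:     (batch, d) inputs
    log_w: (groups, units, d) free parameters; exp() makes weights positive
    bias:  (groups, units) unconstrained offsets
    """
    w = np.exp(log_w)  # positivity constraint on the weights
    # Each unit is an increasing hyperplane: shape (batch, groups, units).
    planes = np.einsum("gud,bd->bgu", w, x) + bias
    # Max within each group, then min across groups; both preserve monotonicity.
    return planes.max(axis=2).min(axis=1)


rng = np.random.default_rng(0)
log_w = rng.normal(size=(3, 4, 2))  # 3 groups of 4 units, 2 inputs
bias = rng.normal(size=(3, 4))

# Sweep the first input while holding the second fixed.
x = np.stack([np.linspace(-2.0, 2.0, 50), np.full(50, 0.3)], axis=1)
y = min_max_monotonic(x, log_w, bias)
print("non-decreasing in first input:", bool(np.all(np.diff(y) >= 0)))
```

The number of groups and units per group is the parameter budget here, which is the kind of quantity a same-budget comparison against UMNN would need to hold fixed.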

# Originality
The authors propose to parameterize a monotonic network by parameterizing its derivative. The derivative of a monotonic network is much easier to parameterize because it's only required to be positive, and thus can be modeled by "free-form" neural networks, without special constraints on the weights and activation functions. The monotonic network can then be computed by numerical integration of the derivative network. Backpropagation through the monotonic network can be done by another numerical integration. For all numerical integration computations, this paper proposes to use the Clenshaw-Curtis quadrature approach, which is efficiently parallelizable. The monotonic networks proposed are used for constructing an expressive autoregressive flow model, which is shown to be competitive with NAFs and B-NAFs. The approach is novel. The difference from previous work is clear. Closely related works such as NAF and B-NAFs are cited and compared. I also like the explicit mention of numerical inversion of autoregressive flow models and the experiments to demonstrate the samples. Although the approach is straightforward, this is probably the first time that samples have been reported for these autoregressive flow models.
# Quality
The submission is technically sound. There are also some discussions on the limitations of this method in section 6. The experimental evaluation part is also largely satisfying, and it would benefit from clarifying the following:
1. In section 5.2, the authors hypothesized that NAF and B-NAF do not report results on MNIST due to memory explosion. Why should memory be an issue for NAFs and B-NAFs on MNIST? Why can using UMNN solve this problem? There doesn't seem to be any inherent difference of UMNN that makes it particularly good for memory-limited cases.
2. Numerical integration is arguably slow and only provides an approximation, yet it is used for both forward and backward propagation of the network. How fast is the Clenshaw-Curtis quadrature algorithm? I noticed that the authors didn't check the "An analysis of the complexity" item on the reproducibility checklist. It would be better to explicitly discuss this somewhere in section 2 or 3. Also, accurate integration requires the derivative networks to have small Lipschitz constants. Do you have results on performance versus different Lipschitz constants and different numbers of integration steps? Both NAF and B-NAF showed some theoretical results on universal density estimators. Is UMNN also a universal density estimator? I would imagine so, but it would be nice to have some formal proof and discussion on this.
# Significance
As the results are reasonable and the approach is novel, I think this work provides an interesting and useful idea for the field.
# Clarity
The paper is very well written.
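The idea summarized under Originality, integrating a strictly positive derivative to obtain a monotonic function, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the derivative is a fixed stand-in function rather than a trained network, positivity is enforced with a softplus chosen here for convenience, the backward pass is omitted, and all function names are hypothetical. The quadrature weights use the standard closed-form Clenshaw-Curtis expression on Chebyshev extreme points.

```python
import numpy as np


def clenshaw_curtis(n):
    """Nodes and weights on [-1, 1] using n + 1 Chebyshev extreme points."""
    k = np.arange(n + 1)
    nodes = np.cos(np.pi * k / n)
    c = np.where((k == 0) | (k == n), 1.0, 2.0)
    total = np.zeros(n + 1)
    for j in range(1, n // 2 + 1):
        b = 1.0 if 2 * j == n else 2.0
        total += b / (4 * j**2 - 1) * np.cos(2 * j * k * np.pi / n)
    weights = c / n * (1.0 - total)
    return nodes, weights


def positive_derivative(t):
    """Stand-in for an unconstrained network followed by a positive map."""
    raw = np.sin(3.0 * t) + 0.5 * t  # hypothetical unconstrained output
    return np.logaddexp(0.0, raw)    # softplus keeps the value > 0


def monotonic_f(x, n_steps=50, offset=0.0):
    """F(x) = offset + integral from 0 to x of positive_derivative(t) dt."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    nodes, weights = clenshaw_curtis(n_steps)
    # Map the nodes from [-1, 1] onto [0, x] for every x at once, so all
    # derivative evaluations happen in a single batched call.
    t = 0.5 * x[:, None] * (nodes[None, :] + 1.0)
    return offset + 0.5 * x * (positive_derivative(t) @ weights)


xs = np.linspace(-2.0, 2.0, 41)
ys = monotonic_f(xs)
print("strictly increasing:", bool(np.all(np.diff(ys) > 0)))

# Effect of the number of integration steps, relative to a fine reference.
reference = monotonic_f(xs, n_steps=200)
for n in (4, 8, 16, 32):
    err = np.max(np.abs(monotonic_f(xs, n_steps=n) - reference))
    print(f"n_steps={n:3d}  max abs deviation from reference = {err:.2e}")
```

The loop at the end is the kind of accuracy-versus-integration-steps measurement requested in item 2; for a trained derivative network, the same sweep could be repeated while varying how sharply the derivative changes.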

Paper ID:	859
Title:	Unconstrained Monotonic Neural Networks
