NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2389
Title: Modeling Uncertainty by Learning a Hierarchy of Deep Neural Connections

Reviewer 1

Here are my comments for the paper:

- The abbreviations B2N, RAI, and GGT are never defined in the paper; they have just been cited from previous works (minor). A short background section on these methods could also spell out their full names.
- The distinctions and contributions of this work relative to [25] and [28] could be stated more clearly. As far as I understand, the proposed method is B2N with B-RAI in place of the RAI algorithm originally proposed in [25]. This allows the model to sample multiple generative and discriminative structures and, as a result, to create an ensemble of networks with possibly different structures and parameters. A better way to structure the paper might be a background section on B-RAI and B2N, followed by a separate section on BRAINet in which the distinctions from prior work and the contribution are stated explicitly.
- For the OOD experiments, the prior-networks loss function of [18] is used, but that method is not included as a baseline. It would be nice to have OOD results for [18] and possibly some other more recent work on OOD detection. See, for instance, "Reliable Uncertainty Estimates in Deep Neural Networks Using Noise Contrastive Priors" or "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks"; the latter, in particular, reports results on the same dataset.
- In terms of empirical evaluation, and compared to much simpler baselines that do not require structure learning (e.g. [16]), the method does not show a major improvement.

Overall, I found the idea of modeling uncertainty with an ensemble of networks with possibly different structures and parameters interesting. However, given the complexity of the model, which requires structure learning via a recursive algorithm, the performance gain is not that impressive. This weakness could have been outweighed by technical novelty, but the method is very similar to the B2N approach (with B-RAI instead of RAI), which in my opinion makes the novelty incremental.

---------------------------------

I've read the authors' feedback and I'd like to change my score; however, I'm still not convinced that the contribution is significant enough to meet the conference's acceptance standard.
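As context for the OOD-detection baselines discussed above: a standard, structure-free scoring rule treats the predictive entropy of the ensemble-averaged output as the OOD score, with higher entropy taken as evidence that an input is out-of-distribution. A minimal NumPy sketch of this generic baseline (not any specific paper's implementation; shapes and names are illustrative):

```python
import numpy as np

def predictive_entropy(member_probs):
    """member_probs: array of shape (n_members, n_inputs, n_classes)
    holding softmax outputs of each ensemble member.

    Returns the per-input entropy of the ensemble-averaged predictive
    distribution; higher values suggest out-of-distribution inputs."""
    mean_p = member_probs.mean(axis=0)  # Bayesian model average over members
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)
```

Thresholding this score (or sweeping the threshold to produce an ROC curve) gives the usual CIFAR-10-vs-SVHN style comparison.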

Reviewer 2

Summary

The authors propose a new method for Bayesian deep learning. The idea is to learn a hierarchy of generative models for the data (using [1]), which can then be converted into discriminative models (using [2]). The authors then assign a posterior value to each of the sub-networks and sample sub-networks to perform Bayesian model averaging. They conduct a series of experiments on different types of uncertainty estimation with strong results.

I have limited familiarity with probabilistic graphical models and am unable to provide feedback on the quality of the methodological contribution. For this reason, my review focuses on the empirical evaluation and clarity.

Originality

The paper provides a very non-standard approach to Bayesian inference for deep neural networks. The method is very original and interesting. Related work on Bayesian deep learning is cited appropriately.

Clarity

I believe some parts of the paper could be clarified:
1. It was unclear to me how exactly the method is applied in the experiments. From Section 3 my understanding was that the method learns a hierarchy of Bayesian neural networks from scratch. How is it combined with given neural network architectures, e.g. in Table 1? Is it applied to learn the last several layers of the network?
2. Is it correct that the proposed method proceeds as follows? First, it produces a hierarchy of architectures for some subset of a given neural network architecture (see question 1). Then, it trains the weights in this architecture by sampling sub-networks uniformly and updating the weights by SGD. Finally, it performs Bayesian model averaging with one of the two strategies described in the paper. I would recommend including a high-level outline like this in the camera-ready version.
3. In Section 3.2.2, what is c? Is it the score produced by B-RAI for the sub-network? If so, wouldn't it make more sense to re-weight sub-networks in the posterior based on the corresponding training loss value?

Quality

As mentioned above, I will mostly focus on the empirical evaluation and results. The reported results are generally very promising.
1. Section 4.2: the authors compare the predictive accuracy and likelihoods of the proposed method against Deep Ensembles and Bayes-by-Backprop on MNIST. How exactly were the Deep Ensembles trained here while keeping the model size fixed?
2. While the accuracy and NLL results appear strong, it would also be interesting to see calibration diagrams and ECE reported, as in [3] and [4].
3. In Section 4.3 the authors apply the proposed method to out-of-distribution (OOD) detection, achieving performance stronger than other baselines at identifying SVHN examples with a network trained on CIFAR-10. For the dropout baseline, have you tried applying dropout before every layer rather than just the last two layers?
4. Table D-3 reports expected calibration error for BRAINet on a range of modern architectures on image classification problems. Could you please report accuracies or negative log-likelihoods (or both) for at least a subset of those experiments? While ECE is a good metric for the quality of uncertainty, it is possible to achieve good ECE with poor predictive accuracy, so a combination of the two metrics would give a more complete picture of the results.

Minor issues
- Line 56: add whitespace before "In".
- Lines 93-94: this sentence is not clear to me.
- Line 212: "outperforms" -> "outperform".

Significance

The paper proposes a new, ingenious approach to Bayesian deep learning. The reported results seem promising. Importantly, the method is applicable to modern large-scale image classification networks. However, the clarity of the paper could be improved.

References
[1] Bayesian Structure Learning by Recursive Bootstrap; Raanan Y. Rohekar, Yaniv Gurwicz, Shami Nisimov, Guy Koren, Gal Novik
[2] Constructing Deep Neural Networks by Bayesian Network Structure Learning; Raanan Y. Rohekar, Shami Nisimov, Yaniv Gurwicz, Guy Koren, Gal Novik
[3] A Simple Baseline for Bayesian Uncertainty in Deep Learning; Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, Andrew Gordon Wilson
[4] On Calibration of Modern Neural Networks; Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger
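For concreteness on the ECE request above: the standard binned estimator (as described in [4]) is a weighted average of per-bin gaps between accuracy and confidence. A minimal NumPy sketch of this generic metric (an illustration, not the paper's evaluation code; the function name is arbitrary):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted mean of |accuracy - confidence| over
    equal-width confidence bins, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()   # empirical accuracy in bin
            conf = confidences[mask].mean()  # mean confidence in bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

The per-bin (accuracy, confidence) pairs collected in the loop are exactly what a reliability diagram plots, so both requested diagnostics come from the same binning.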

Reviewer 3

[Update after author feedback: I have read the other reviews and the author feedback and want to thank the authors, particularly for answering my clarifying questions in detail. As the other reviewers also pointed out, the paper can still be improved in terms of clarity, particularly when addressing the (deep) neural network community as its main target audience. I want to encourage the authors to improve clarity as much as possible. To me personally, the paper is promising and interesting, but I find it hard to judge how novel the proposed algorithm is (hence my confidence of 2). I am personally leaning towards acceptance, though I understand the criticism and (as the score of 7 implies) "would not be upset if the paper is rejected".]

The paper proposes to learn the structure (connectivity structure and parameter sharing) of a stochastic ensemble of neural networks. This allows for solutions ranging anywhere from large connectivity variation with full parameter sharing (MC Dropout) to no connectivity variation and no parameter sharing (a "classical" ensemble of networks). To achieve this, the structure of an ensemble of networks is expressed as a prior over parameters in a Bayesian neural network (a common interpretation of both MC Dropout and deep ensembles). Crucially, the paper proposes to learn the structure, i.e. the prior, from the input data (e.g. images in image classification) in a self-supervised fashion. The claim is that such a "prior" over the structure (now actually a conditional distribution over parameters given the input data), if tuned to the generative process of the input data, reduces unwarranted generalization on out-of-distribution data.

To learn a prior tuned to the generative process of the data, the paper applies a recently proposed Bayesian structure learning algorithm (B-RAI) that induces a distribution over discriminative network structures which are proved to mimic the generative structure (lines 115-117), thus yielding a dependency between the structure of an implicit generative model of the data and the structure of the discriminative model (lines 127-128). The paper explains and illustrates the algorithm (both training and inference) and shows how to use it for estimating predictive uncertainties. Finally, the paper evaluates the proposed algorithm by investigating calibration of uncertainty estimates (on MNIST, as well as CIFAR-10/-100 in the appendix) and out-of-distribution detection (on SVHN and CIFAR-10). The method performs well compared to MC Dropout, Deep Ensembles, and some other state-of-the-art methods for calibration (in the appendix). Combining the method with MC Dropout or Deep Ensembles yields even better performance.

I should point out that I found the paper somewhat hard to follow and might have missed something important or misunderstood small or large aspects of it; I therefore want to encourage the authors to comment on or correct my summary if necessary. I should also point out that I reviewed this paper at a previous venue and was happy to see that many of the previous reviewers' comments were taken into account, particularly w.r.t. improving clarity and the empirical comparisons.

Before stating my personal opinion, I would like to ask the authors a series of clarifying questions:
1) The paper claims that the method prevents unwarranted generalization on out-of-distribution data; under what conditions does that claim hold? Does the claim essentially rest on p(\phi|\theta) being "well behaved" for out-of-distribution data?
1a) If yes to the latter, does that mean that for out-of-distribution data p(\phi|\theta) needs to spread probability mass across \phi, i.e. across different structures, which corresponds to having high uncertainty over the structure for out-of-distribution data?
1b) If yes to 1a), what are the arguments for why this is likely to hold in practice, and can you show some empirical results that support these arguments?
2) Would it be fair to say that preventing unwarranted generalization on out-of-distribution data could in principle also be achieved with a large ensemble of networks of different (connectivity) structures, i.e. no parameter sharing (to prevent unwarranted generalization due to learned parameters) and a large variety of structures (to prevent unwarranted generalization due to a fixed connectivity structure)?
2a) If yes to 2), would the advantage of the proposed approach be that it captures an (infinite) ensemble of networks of different structures more efficiently (in computational terms), in the form of a Bayesian neural network?

Assuming that my current understanding of the paper is mostly correct (i.e. my summary is mostly correct and most of the questions above are answered positively), I am slightly in favor of accepting the paper, since it addresses two timely issues (well-calibrated predictive uncertainties, and a formulation of Bayesian neural networks in which the connectivity structure is not fixed but drawn from a learned prior) in a novel and original way. I think the current experimental section is solid and convincing enough for publication, though it could always be improved. To me the paper has reached a point where it would be interesting to see independent replications, discussion, and extensions of the findings by the wider community.
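As an aside on the MC Dropout endpoint of the spectrum raised in question 2: test-time dropout shares a single set of weights across randomly sampled connectivity masks, and averaging stochastic forward passes approximates model averaging over those sampled structures. A minimal NumPy sketch of this well-known baseline (entirely schematic, not the paper's method; sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_forward(x, W1, W2, p=0.5):
    """One stochastic forward pass: a fresh dropout mask samples a
    sub-network, while every pass shares the same weights W1, W2."""
    h = np.maximum(x @ W1, 0.0)          # ReLU hidden layer
    mask = rng.random(h.shape) > p       # sampled connectivity structure
    h = h * mask / (1.0 - p)             # inverted-dropout rescaling
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, W1, W2, T=50):
    """Average T stochastic passes: a Monte Carlo approximation of
    model averaging over sampled connectivity structures."""
    return np.mean([mc_dropout_forward(x, W1, W2) for _ in range(T)], axis=0)
```

A classical deep ensemble sits at the opposite endpoint: T independent weight sets with fixed connectivity, averaged the same way.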
I want to encourage the authors to further improve the clarity of the manuscript and the presentation of the method, with a particular focus on (Bayesian) deep learning practitioners (see some suggestions for improvement below). Since there is still quite some uncertainty in my current understanding of the paper, I also want to point out that it is quite possible that I will change my final verdict quite drastically based on the author feedback and the other reviewers' comments.

Originality: Medium to low for the BRAINet structure learning algorithm (I find it hard to judge whether the algorithm follows trivially from previous work or requires major innovation); medium to high for improving calibration and out-of-distribution detection with the method.

Quality: Medium. There is no thorough discussion of related work, and the paper does not mention (current) limitations and shortcomings of the approach, but there is a large number of empirical evaluations with interesting comparisons against state-of-the-art approaches.

Clarity: Low. Though clarity has certainly improved since the last version of the manuscript, I still had a hard time distilling the main idea, and the experimental details w.r.t. how exactly BRAINet is applied and trained remain a bit muddy (without looking at the code).

Significance: Medium. The approach is novel and original and the experiments look promising. To be fully convinced, I'd like to see a thorough discussion of the shortcomings, an analysis of how well the approach can potentially scale, and results on larger-scale tasks (I acknowledge that not all of this is necessarily within the scope of a single paper).

Minor comments:
- Tuning the prior to the generative process of the data and then using it as a prior for the discriminative model sounds like a mild version of an Empirical Bayes method, which is known to break some appealing properties of proper Bayesian methods, at least in theory. While I am personally happy with exploring Empirical Bayes methods (in fact many modern Bayesian neural network algorithms, for instance some compression methods based on sparse Bayesian learning, learn the prior and thus also fall into this category), I would not necessarily over-emphasize the Bayesian nature of the proposed approach.
- It could potentially be very interesting to compare and connect/apply BRAINet to Graph Neural Networks. While a full treatment of this is beyond the scope of this paper, at least some discussion of the similarities and differences between the approaches, and of whether they can easily be combined, would be interesting.