Reviews: NeuralFDR: Learning Discovery Thresholds from Hypothesis Features

Reviewer 1

The paper describes a new method for FDR control for p-values with additional information. For each hypothesis, there is a p-value p_i and a feature vector X_i. The method learns the optimal threshold for each hypothesis as a function of the feature vector X_i. The idea seems interesting and novel, and overall the paper is explained quite clearly. In several simulated and real data examples, the authors show that their method can use the additional information to increase the number of rejections for a given FDR control threshold.

It seems important to me that the X_i's were not used to calculate the p_i's; otherwise we get a problem of circularity. The authors mention this in an example, but there is no formal treatment of it: it is not clear what the probabilistic relationship between the X_i's and the p_i's should be. On the one hand, both the p_i's and the X_i's are likely to depend on whether the null or the alternative hypothesis is true. On the other hand, if the X_i's were already used to calculate the p_i's, then they should not improve the decision boundary.

The fact that X is multidimensional should not immediately rule out non-parametric methods, as the authors claim. For example, nearest-neighbours regression can still adapt to the 'true' intrinsic dimension of the regression function, even if the dimension of X is large.

It is not clear to me how immune to overfitting the cross-validation procedure proposed by the authors is. The setting is not a standard supervised learning problem, since the 'label' (the FDR) is unknown and is replaced by an estimator. It would be good to emphasize and elaborate on this point. Since the estimator of the FDR is noisy, this may, to my understanding, lead to a higher FDR on the test set. The authors do show in Theorem 1 a bound which grows with the number of folds in the cross-validation procedure.

The mirror estimator proposed by the authors may have little bias but large variance if t(x) is small, since very few p-values will fall between 1-t(x) and 1. This issue comes up in Storey's approach, where the lambda parameter is chosen so that the null proportion, estimated using the interval [lambda, 1], balances variance and bias.

Minor:
======
Line 136: 'this' -> 'these'
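The variance concern about the mirror estimator can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes the mirror estimate is simply the count of p-values above 1 - t(x), as described in the review, uses a constant threshold t, and draws all p-values from Uniform(0, 1) (the all-null case). The function names `mirror_fd_estimate`, `storey_pi0` and `relative_spread` are hypothetical.

```python
import numpy as np


def mirror_fd_estimate(p, t):
    """Count p-values in the mirrored region (1 - t_i, 1].

    `t` may be a scalar or an array of per-hypothesis thresholds t(x_i).
    """
    p = np.asarray(p, dtype=float)
    t = np.broadcast_to(np.asarray(t, dtype=float), p.shape)
    return int(np.sum(p > 1.0 - t))


def storey_pi0(p, lam):
    """Storey-type null proportion estimate using the interval (lam, 1]."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p > lam) / ((1.0 - lam) * p.size))


def relative_spread(t, n=10_000, reps=200, seed=0):
    """Std / mean of the mirror count over repeated all-null samples."""
    rng = np.random.default_rng(seed)
    counts = np.array(
        [mirror_fd_estimate(rng.uniform(size=n), t) for _ in range(reps)]
    )
    return float(counts.std() / counts.mean())


# A large threshold leaves many p-values in the mirrored region,
# a small threshold leaves only a handful, so the count is noisier.
rel_large = relative_spread(0.2)
rel_small = relative_spread(0.001)
print(f"relative spread, t=0.2:   {rel_large:.3f}")
print(f"relative spread, t=0.001: {rel_small:.3f}")
```

Under these assumptions the count is binomial with mean n * t, so its relative spread grows roughly like 1 / sqrt(n * t) as t shrinks. The same trade-off governs the choice of lambda in `storey_pi0`: a lambda close to 1 uses few p-values (high variance), while a small lambda admits more non-null p-values (higher bias).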

Reviewer 2

Multiple hypothesis testing is an important step of many high-throughput analyses. When a set of features is available for each of the tested hypotheses, one can use these features to weight the hypotheses and increase statistical power, a strategy that is known as independent hypothesis weighting. Specifically, this can be achieved by learning the discovery threshold on P-values as a function of the hypothesis features. In this paper, the authors propose a new method (neuralFDR), in which a neural network models this relationship. Using simulations and two real data applications, the authors show that (i) neuralFDR can increase power while controlling the false discovery rate and (ii) it can be more powerful than existing weighting methods (especially when multiple hypothesis features are considered).

The results are convincing, the paper is well written and the method has the potential to be used in high-throughput hypothesis testing problems. I have only two points that I would like to raise:

1) The authors do not discuss how computationally demanding their method is and how it compares to others. These considerations will certainly influence a potential user's choice of correction method.

2) In the application to the GTEx dataset, the authors showed that, as the number of considered features increases, the number of discoveries of neuralFDR increases while the number of discoveries of IHW (the competitor method) decreases. In the most favourable scenario (3 features), they report 37,195 vs 35,598 discoveries, which corresponds to a 4.5% increase. Given this small percent increase, is there any reason why the authors did not include other features to further increase discoveries? In eQTL mapping there are plenty of variant features or gene features to choose from.

Minor comments:

Line 215, “800 more discoveries”: the percent increase should also be reported.

Table 2: I think that both absolute and percent increases should be reported to improve readability.
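For reference, the 4.5% figure quoted in point 2) follows directly from the two reported discovery counts. A short check, using only the numbers given in the review (the variable names are illustrative):

```python
neuralfdr_discoveries = 37_195  # neuralFDR, 3 features (as quoted in the review)
ihw_discoveries = 35_598        # IHW, 3 features (as quoted in the review)

absolute_increase = neuralfdr_discoveries - ihw_discoveries
percent_increase = 100 * absolute_increase / ihw_discoveries

print(f"{absolute_increase} more discoveries ({percent_increase:.1f}% increase)")
# prints "1597 more discoveries (4.5% increase)"
```

Reporting both numbers in this form is what the minor comments on Line 215 and Table 2 ask for.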

Reviewer 3

This paper proposes a procedure to learn a parametric function mapping a set of features describing a hypothesis to a (per-hypothesis) significance threshold. In principle, if this function is learnt appropriately, the approach would allow prioritising certain hypotheses based on their feature representation, leading to a gain in statistical power. The function generating the per-hypothesis rejection regions is trained to control the False Discovery Proportion (FDP) with high probability, provided some restrictive assumptions are satisfied.

While multiple hypothesis testing is perhaps not a mainstream machine learning topic nowadays, key application domains such as computational biology or healthcare still heavily rely on multiple comparisons, making this a topic worth studying. A clear trend in this domain is the design of methods able to incorporate prior knowledge to upweight or downweight some hypotheses; the method described in this paper is one more example of this line of work.

Although the topic is relevant, I have some serious concerns about the practical applicability of the proposed approach, as well as other aspects of the paper.

***** Major Points *****

(1) In my opinion, the paper heavily relies on assumptions that are simply false for all motivating applications described in the paper and tackled in the experiments. Firstly, the method assumes that triplets (P_{i}, \mathbf{X}_{i}, H_{i}) are i.i.d. If this assumption were not satisfied, the entire method would break down, as cross-validation would no longer be appropriate to validate the decision rules learnt on a training set. Unfortunately, the independence assumption cannot be accepted in bioinformatics. For example, in genome-wide association studies, variants tend to be statistically dependent due to phenomena such as linkage disequilibrium (LD). In RNA-Seq analysis, gene expression levels are correlated according to complex co-expression networks, which are possibly tissue-specific. More generally, unobserved confounding variables such as population structure, batch effects or cryptic relatedness might introduce additional non-trivial statistical dependencies between the observed predictors. There exist straightforward extensions of the Benjamini-Hochberg (BH) procedure to guarantee FDR control under general dependence. Unless the authors extend their approach to allow for arbitrary dependencies between hypotheses, I do not believe the method is valid for the applications included in the experiments. Moreover, the assumption that all triplets (P_{i}, \mathbf{X}_{i}, H_{i}) are identically distributed is also hard to justify in practice. Another questionable assumption, though perhaps of lesser practical importance than the two previous issues, is that the alternative distribution $f_{1}(p \mid \mathbf{x})$ must be non-increasing. The authors claim this is a standard assumption, which is not necessarily true (e.g. see [1] for a recent counter-example, where the alternative distribution is modelled as a beta distribution with learnable mean and precision).

(2) The paper places heavy emphasis on the use of neural networks, to the point that this fact is present in the title and name of the method. However, neural networks play a small role in the entire approach, as essentially all that is needed is a parametric function to map the input feature space describing each hypothesis to a significance threshold in the unit interval. Most importantly, all motivating examples are extremely low-dimensional (1D to 5D), making it likely that virtually any parametric function with enough capacity will suffice. I believe the authors should include exhaustive experiments to study how different function classes perform in this task. By doing so, the authors might demonstrate to what extent neural networks are an integral part of their approach. Moreover, it might turn out that simpler, easier-to-interpret functions perform equally well, in which case it might be beneficial to use those instead.

(3) While the paper contains virtually no grammar mistakes and is easy to read in that regard, the quality of the exposition could be significantly improved. In particular, Section 3 lacks detail in the description of Algorithm 1. Throughout the paper, the notion of cross-validation is rather confusing, as in this particular case distinct folds correspond to subsets of features (i.e. hypotheses) rather than samples used to compute the corresponding P-values. I believe this should be clarified early on. Also, the variables $\gamma_{i}$ appear out of nowhere and are not appropriately discussed in the text. In my opinion, the authors should additionally adopt the notation used in Equation (8) in Supplementary Section 2 for Equation (3), in order to make it clearer that what they are truly optimising is only a smooth approximation to the true objective function. The description of the implementation and training procedure in Supplementary Section 2 is also insufficient, as it lacks a proper justification of the hyperparameter choices (e.g. network architecture, $\lambda$ values, optimisation scheme) and experimental results describing the robustness of the approach to such hyperparameters. In practice, setting these hyperparameters will be a considerable hurdle for many practitioners in computational biology and statistical genetics, who might not necessarily be familiar with standard techniques in deep learning. Therefore, I believe these issues should be described in greater detail for this work to be of use to such communities.

***** Minor Points *****

The use of cross-validation makes the results random by construction, which is an undesirable property in statistical association testing. The authors should carry out experiments to demonstrate the stability of their results across different random partitions and to what extent a potential underlying stratification of features might be problematic for the cross-validation process.

Decision rules in multiple hypothesis testing ultimately depend strongly on the number n of hypotheses being tested. However, the proposed approach does not explicitly depend on n but, rather, relies on the cross-validation set and the test set containing exactly the same number of hypotheses and on all hypotheses being i.i.d. It would be interesting if the authors were able to relax that requirement by accounting for the number of tests explicitly rather than implicitly.

The paper contains several wrong or imprecise claims that, however, do not strongly affect the overall contribution:

- “The P-value is the probability of a hypothesis being observed under the null” -> This is not a correct definition of a P-value. A P-value is the probability, under the assumption that the null hypothesis holds, of observing a value of the test statistic at least as extreme as the one actually observed.

- “The two most popular quantities that conceptualize the false positives are the false discovery proportion (FDP) and the false discovery rate (FDR)” -> The Family-Wise Error Rate (FWER) continues to be popular and has a considerably longer history than either of those quantities. Moreover, while it is true that FDR is extremely popular and is de facto replacing FWER in many domains, FDP is used less frequently than FWER.

- Definition 1 is mathematically imprecise, as it does not handle the case $D(t)=0$ properly.

- “Moreover, $\mathbf{x}$ may be multidimensional, ruling out the possibility of non-parametric modelings, such as spline-based methods or the kernel based methods, whose number of parameters grows exponentially with the dimensionality” -> Kernel-based methods can be applied in this setting via inducing variables (e.g. [2]). Additionally, as mentioned earlier, many approximators other than neural networks require a number of parameters that grows less than exponentially in the dimension of the feature space.

- Lemma 1 does not show that the “mirroring estimator” bounds the true FD from above, as Line 155 seems to claim. It merely indicates that its expectation bounds the expected FD, which is a different statement from what is discussed in Lines 143-151.

- As mentioned earlier in this review, the statement in Lines 193-194 is not completely correct. Five dimensions is definitely not a high-dimensional setting by any means. At least thousands of features should have been used to justify that claim.

The paper would benefit from proof-reading, as it contains frequent typos. Some examples are:

- Line 150: $(t(\mathbf{x}), 1)$ -> $(1-t(\mathbf{x}), 1)$

- Line 157: “approaches $0$ very fast as $p \rightarrow 0$” -> “approaches $0$ very fast as $p \rightarrow 1$”

- Line 202: “Fig. 3 (b,c)” -> “Fig. 4 (b,c)”

- Lines 253-254: The alternative distribution $f_{1}$ is missing parentheses.

- Supplementary Material, proof of Theorem 1 and Lemma 2: All terms resulting from an application of Chernoff’s bound are missing parentheses as well. While it might have been intentional, in my opinion it makes the proof a bit harder to read.

References:

[1] Li, Y., & Kellis, M. (2016). Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Research, 44(18), e144.

[2] Tran, D., Ranganath, R., & Blei, D. M. (2016). The variational Gaussian process. International Conference on Learning Representations.
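Two of Reviewer 3's remarks, the BH extensions under general dependence in major point (1) and the $D(t)=0$ case in Definition 1, can be made concrete with a short sketch. It rests on assumptions not stated in the review: the extension is taken to be the Benjamini-Yekutieli correction (the review does not name one), and the FDP is taken to be 0 when nothing is rejected, which is the usual convention but may not be the authors' choice. The function names and the example p-values are illustrative only.

```python
import numpy as np


def bh_reject(p, alpha, general_dependence=False):
    """Step-up BH procedure; returns a boolean rejection mask.

    With general_dependence=True, alpha is divided by the harmonic sum
    1 + 1/2 + ... + 1/n (Benjamini-Yekutieli correction, an assumption here).
    """
    p = np.asarray(p, dtype=float)
    n = p.size
    reject = np.zeros(n, dtype=bool)
    if n == 0:
        return reject
    if general_dependence:
        alpha = alpha / np.sum(1.0 / np.arange(1, n + 1))
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    if below.any():
        # Largest rank whose sorted p-value falls under its threshold.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject


def fdp(reject, is_null):
    """False discovery proportion, defined as 0 when there are no discoveries."""
    reject = np.asarray(reject, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    discoveries = int(reject.sum())
    false_discoveries = int(np.sum(reject & is_null))
    return false_discoveries / max(discoveries, 1)


# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bh = int(bh_reject(p_values, alpha=0.05).sum())
n_by = int(bh_reject(p_values, alpha=0.05, general_dependence=True).sum())
print(n_bh, n_by)  # prints "2 1"
```

The correction buys validity under arbitrary dependence at the cost of power, which is visible even in this small example. Writing the denominator as `max(discoveries, 1)` is one common way to make the FDP well defined when $D(t)=0$.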

Paper ID:	982