NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5389
Title: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks

Reviewer 1


Originality: The proposed method in this paper is original. It focuses on vulnerability identification in software systems.
Quality: This work is mostly technically sound. Honestly, I have no research experience in vulnerability identification, but the problem is well defined and is modeled as a machine learning problem in this paper. The whole process is clear and the motivation is clearly described. I would prefer to see this work in the applications track of NeurIPS.
Clarity: This paper is well organized.
Significance: It is clear that other researchers can use the ideas of Devign, especially its graph embedding approach, which considers both the semantics of nodes and structural features in vulnerability identification.

Reviewer 2


This paper addresses the prediction of functions with vulnerabilities. The dataset used is based on real-world applications (e.g., Linux (kernel?), QEMU). The method represents code by combining different static code analysis techniques. All of these static graphs are well known; however, I haven't seen them all combined into one graph for representing code. The graphs used are the AST, CFG, and DFG (with some simplifications; please see the comments under improvements), in addition to the code treated as a sequence of tokens. The paper is well written, generally straightforward to read, and has sufficient background information on graph embeddings of code and the various graphs employed, so that a reader outside the domain can follow the discussion. The results obtained are quite promising. In particular, I appreciated how the authors looked at the latest commits in the projects in their dataset to understand whether their trained model could deal with code "in real time".
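To make the combined representation concrete, here is a minimal sketch of how such a joint graph could be assembled. It assumes the networkx library is available, and the function name, arguments, and edge-type labels (build_joint_graph, "AST", "CFG", "DFG", "NCS") are hypothetical illustrations, not the authors' implementation.

# Sketch only: merge AST, CFG, and DFG edges plus the token sequence
# into one multigraph, storing the edge type as an attribute.
import networkx as nx

def build_joint_graph(ast_edges, cfg_edges, dfg_edges, tokens):
    g = nx.MultiDiGraph()
    for src, dst in ast_edges:
        g.add_edge(src, dst, etype="AST")   # syntactic structure
    for src, dst in cfg_edges:
        g.add_edge(src, dst, etype="CFG")   # control flow
    for src, dst in dfg_edges:
        g.add_edge(src, dst, etype="DFG")   # data dependence
    # Natural code sequence: connect consecutive tokens of the function.
    for a, b in zip(tokens, tokens[1:]):
        g.add_edge(a, b, etype="NCS")
    return g

Keeping the edge type as an attribute on a single multigraph is one way to let a graph neural network with typed edges treat each relation differently while still propagating information across all of them.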

Reviewer 3


The main contribution of this paper is a manually curated dataset of functions labeled as either vulnerable or benign. The novelty here is that no bias is introduced by either assuming that most of the data is correct (as anomaly detection works such as [19] assume) or encoding the bias of an existing static analyzer.

The evaluation results on this dataset, however, are not convincing for practical application of the resulting classifier. The training data has a similar number of vulnerable and benign graphs, while practical programs have a much lower percentage of vulnerable functions than the classifier's error rate. Thus, accuracy in the 70-80% range is not practical, and its output in practice will likely look like pure noise: if 2 out of 100 functions are vulnerable, a classifier with 70% accuracy will give on average 28-29 false positives and has a non-trivial chance of missing a vulnerability (see the worked numbers after this review). This means that the classifier needs significant changes. Furthermore, the baseline to which the paper should compare is static analysis, not other neural architectures.

Technically, the paper is almost a verbatim copy of the architecture already proposed in [19] for predicting (variable) names in programs. The edge types described in Figure 2 are also quite similar to the ones in [19]. Unfortunately, the writing of the paper makes it appear as if these architectural details were contributions of this work.

The evaluation also shows that the neural network captures program-specific features that do not transfer between programs: the combined accuracy of the classifiers is lower than most of the individual classifiers. This means that collecting more training data will not help, and may actually harm the classifiers. It also jeopardizes the paper's claims of capturing semantic features.

Minor: pg. 6, s/non-CFC/non-VFC/

UPDATE: The authors have mostly answered my concerns with the extra experiments. The accuracy metric gives little information about the recall or precision of the learned code analyzer; it would help to report precision instead of accuracy in the evaluation. In terms of differences from [19], partial control flow is also encoded there, and other papers encode control flow as well. It would help if the paper focused more on the empirical evaluation (as with the extra experiments provided) and on introducing the new task, because the graph architecture itself is hard to differentiate from prior work.
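For concreteness, Reviewer 3's base-rate argument works out roughly as follows. The 2% vulnerability rate and the 70% accuracy, assumed here to apply uniformly to both classes, are the reviewer's illustrative numbers, not results from the paper.

# Back-of-the-envelope check of the base-rate argument above.
total = 100
vulnerable = 2
benign = total - vulnerable
accuracy = 0.70          # assumed to hold for both classes

false_positives = (1 - accuracy) * benign    # 0.30 * 98 = 29.4
true_positives = accuracy * vulnerable       # 0.70 * 2  = 1.4
precision = true_positives / (true_positives + false_positives)

print(f"expected false positives: {false_positives:.1f}")   # ~29
print(f"expected true positives:  {true_positives:.1f}")    # ~1.4
print(f"precision at this base rate: {precision:.1%}")      # ~4.5%

This is why the reviewer asks for precision (and recall) rather than accuracy: at realistic base rates, a 70-80% accurate classifier flags mostly benign functions.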