NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 798
Computing Full Conformal Prediction Set with Approximate Homotopy

### Reviewer 1

The authors clearly addressed my concerns, so I raised my score from 6 to 7.

---

The paper is well written and the methodology is potentially useful in this literature. However, I have a few major concerns.

(1) The exact conformal prediction set may not be an interval but, in principle, a union of multiple intervals. This is problematic from both a statistical and a practical point of view. That is partly why the split conformal set (e.g. Lei et al., 2018) or the cross-validated conformal set (e.g. Barber et al., 2019) is preferred: they guarantee the set to be an interval, despite some efficiency loss. For instance, if the prediction set is [3, 5] U [7, 9], how can one practically interpret why 5 and 7 are in the set while 6 is not? Returning to the exact conformal prediction set: in most cases we do not need the whole set, only its left and right endpoints, say C_L and C_U, from which we can construct a confidence interval [C_L, C_U]. However, much of the effort in this paper goes into approximating the whole set, which appears to be overkill. For this reason, I do not see why we need the homotopy method to approximate the exact conformal set.

(2) How do you choose y_min and y_max? Underestimating y_max or overestimating y_min may significantly reduce the coverage. There should be a theoretically guaranteed way to select these two parameters.

(3) On the other hand, if they can be chosen then, as mentioned in point (1), we only need to find C_L = \inf\{y > y_min : \pi(y) > \alpha\} and C_U = \sup\{y < y_max : \pi(y) > \alpha\}, which seems to be a much easier problem than approximating the whole conformal set.

Minor points:

(1) Equation (15): the right parenthesis in the denominator is missing.
(2) Proposition 5 in the Supplementary Material: in equation (30), G_y(\beta(y, \theta(y))) --> G_y(\beta(y), \theta(y)); in line 386, D_y(\theta) --> D_y(\theta(y)) and \theta - \hat{\theta}(y) --> \theta(y) - \hat{\theta}(y).

References:

Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.

Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019.
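Reviewer 1's point (3) can be made concrete with a short sketch: assuming y_min and y_max are given, the endpoints C_L and C_U can be approximated by scanning a grid of candidate labels and keeping those whose full-conformal p-value \pi(y) exceeds \alpha. This is an illustration of the reviewer's suggestion, not the paper's homotopy algorithm; the helper names (`conformal_pvalue`, `endpoint_interval`, `ridge_fit_predict`) are hypothetical.

```python
import numpy as np

def conformal_pvalue(X, Y, x_new, y_cand, fit_predict):
    """Full conformal p-value pi(y_cand): refit on the dataset augmented
    with (x_new, y_cand) and rank the candidate's absolute residual
    among all n+1 residuals."""
    X_aug = np.vstack([X, x_new])
    Y_aug = np.append(Y, y_cand)
    preds = fit_predict(X_aug, Y_aug)       # refit on augmented data
    scores = np.abs(Y_aug - preds)          # conformity scores
    return np.mean(scores >= scores[-1])    # proportion at least as extreme

def endpoint_interval(X, Y, x_new, alpha, y_min, y_max, fit_predict, n_grid=400):
    """Endpoint-only construction: approximate
    C_L = inf{y in [y_min, y_max] : pi(y) > alpha} and
    C_U = sup{y in [y_min, y_max] : pi(y) > alpha}
    by scanning a grid of candidate labels."""
    grid = np.linspace(y_min, y_max, n_grid)
    accepted = [y for y in grid
                if conformal_pvalue(X, Y, x_new, y, fit_predict) > alpha]
    return (min(accepted), max(accepted)) if accepted else None

def ridge_fit_predict(X, Y, lam=1.0):
    # Closed-form ridge regression: beta = (X'X + lam*I)^{-1} X'Y
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return X @ beta

# Toy data: linear signal with moderate noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=50)
x_new = rng.normal(size=(1, 3))
interval = endpoint_interval(X, Y, x_new, alpha=0.1,
                             y_min=-10.0, y_max=10.0,
                             fit_predict=ridge_fit_predict)
```

As the reviewer notes, this only recovers [C_L, C_U]; if the exact conformal set were a union of disjoint intervals, the endpoint construction would conservatively cover the gaps, whereas the paper's approach tracks the whole set.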

### Reviewer 2

The paper addresses the efficient computation of high-confidence prediction sets for regression in the conformal prediction framework. The underlying regression model is the empirical-risk minimizer over linear functions, potentially with a convex regularization on the parameters. This research topic is quite significant, since computation of these prediction sets is notoriously expensive. The approach taken by the authors is interesting and promising; however, the paper is lacking at several points.

First, the paper is rather hard to read. Sections 1-2 are well written and give an easy introduction to the topic. From Section 3 on (and especially in Section 4), however, the flow is broken; the paper digresses into side remarks and fails to concisely explain its approach. The main algorithm for computing the prediction sets is never fully stated. At times the notation is unclear (such as using G for both the gradient and the duality gap, or using \times as a multiplication operation in (20)).

Second, it is hard to separate the contribution of this paper from prior work, especially the works on Ridge regression and Lasso regression cited in the paper. While the warm-start approximate solutions are apparently novel work in this direction, it is not properly explained how this compares to the homotopy approach described in that prior work. This problem gets even more pronounced in the experimental section, where the proposed method is not properly compared against those methods. For the Ridge regression, Fig. 1 is supposed to give a comparison; however, the figure is very hard to understand, operates on different scales and axes, and gives no indication of computation time, which is supposed to be the main advantage of the novel method. For the Lasso, a comparison to the prior homotopy work is completely missing from the experimental section, even though it is arguably the closest related work to this paper (although a comparison to oracle and split methods is provided).

A third (minor) shortcoming is that there are many typos and questionable style choices in almost every paragraph: for example, the singular is very regularly used where the plural would be appropriate; some headings end with a period while others do not; and statements/lemmas are numbered differently in the appendix than in the main text.

All in all, while the research topic and contributions look promising, the paper does not feel mature enough for publication, and it is unclear how it would actually compare empirically against state-of-the-art methods.

### Reviewer 3

1. Originality: The technique proposed in the paper to update an approximate solution of certain optimization problems for streaming data is inspired by previous work on homotopy continuation, but it is an interesting and new contribution. Combined with its use for efficiently constructing conformal sets, the contribution is novel. The paper clearly indicates prior work and distinguishes itself from existing methods.

2. Quality: The submission is technically sound. The theoretical claims are supported with proofs, and the work represents a relatively complete body of work.

3. Clarity: The paper is clear and well written for the most part. Please see the improvements section for further comments.

4. Significance: The paper proposes an efficient method for an emerging problem of interest: constructing distribution-free prediction intervals for machine learning problems. The numerical performance of the method is promising compared to an existing method and is likely to draw interest from researchers working on similar problems. The technique used to build the method is also of independent interest and may have other useful applications.