NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7779 Calculating Optimistic Likelihoods Using (Geodesically) Convex Optimization

### Reviewer 1

The paper is nicely written and easy to follow. For the modeling proposed in Section 2, it is easy to work with the regularized problem formulation, that is, min_Sigma + log det (Sigma) + lambda d^2(Sigma, \hat{Sigma}) for some lambda ambiguity radius. Consequently, the convexity proof can be made simpler by computing the Hessian of the regularized function and showing it to be positive definite. Also, the iteration complexity result should follow from [42, 41] of the manuscript. Is my understanding correct? If so, any particular reason for choosing the constraint version (6) instead of the regularized version? If my understanding is not correct, what are the differences of the iteration complexity results with that of [42, 41] apart from those in Lines 191 - 193. This is crucial as in the absence of it, Section 2 needs to be toned down. Theorem 3.5 is interesting. Is the regularized version of (12) not geodesically convex? ======= After rebuttal ===== I appreciate the author rebuttal. It clarifies some confusions.

### Reviewer 2

This approach to density matching is new to me. The paper is rather well motivated and written but some parts need clarification. - The main part of the paper consists in two parts: FR and KL ambiguity sets: * These parts seem (are?) somehow disconnected in terms of algorithmics. Could Pb. (12) be solved by (non-convex) gradient descent? * Given some accuracy could you compare the wallclock time between both approaches? * Could you precise the statement of line 59 on the statistical differences between FR and KL? * More generally, could you give (maybe as a table) a synthetic comparison between FR and KL approaches. - Could you provide more intuition about Algorithm 1: * why you need some kind of "ergodic geodesic average" in you algorithm. (See e.g. line 170-171) * Theorem 2.5 could be more emphasized, notably on how it relates to the complexity Theorem 2.7. - One of the drawback of the FR distance is that there is no closed form in the case where both mean and covariance are subject to ambiguity (footnote 1). However, the extent of the limitation is not clear: * How does this impact the KL part (section 3)? * In the application of Section 4 both classes means and covariances are different (225-226), so what approach to adopt? - Section 4/5 lack clarity compared to the rest of the paper. * 5.1 does not really bring much intuition about the behavior of the algorithm, it could be postponed to the supplementary * Section 4 and 5.2 could be merged, but 4 needs to be clarified (notably in relation with the question above). * The experimental setup could be more described (datasets, stopping criteria for FQDA and KQDA, etc.) Minor comments: * 109: By "solvable", do you mean feasible? More generally, could you provide a clear proof of Lemma 2.2 and clarify the paragraph 111-115 that I did not really get. * There is no condition present to ensure that the ambiguity sets do not intersect. Could this pose a problem?

### Reviewer 3

computationally intractable" -- What is the argument or reference to support the proposed logical implication? Section 2, 3rd line: the Fisher information matrix itself may be rank deficient, hence only positive semidefinite rather than definite. Is the statement here that, once restricted to the tangent space, it is always positive definite? Eq. (6): perhaps recall what \hat \Sigma is here. Footnote 2: "the circle ... has positive sectional curvature": This isn't correct. The circle is a one-dimensional manifold, and as such it has zero (intrinsic) curvature. The footnote can be adapted if we make it to be a statement about the unit sphere S^2 in R^3. More globally, it is a bit tricky to make this statement about geodesic convexity of subsets of a compact manifold, when Definition 2.3 is specifically crafted for Hadamard manifolds (as indicated by the comment that precedes it). Depending on how exactly one adapts Def. 2.3, certains subsets of the sphere may or may not be deemed geodesically convex (and that may or may not be useful). Line 173: "closed form ... highly efficient": There are still a number of non-trivial computations involved, namely, matrix exponentials and logarithms. I expect a number of readers may find that this isn't "highly efficient", since such operations cost O(n^3) flops. In fact, you say as much on line 212. Line 181: There -> The Lines 190++, "An important difference is that [41, Theorem 9] requires the objective function to be Lipschitz continuous. Unfortunately, it seems difficult to establish whether L() satisfies this condition. We managed to circumvent the Lipschitz condition by proving that the Riemannian gradient of L() is bounded uniformly" -- I would expect that if the gradient is bounded (which is necessarily the case by compactness of the domain and continuity of the gradient), then the cost function is Lipschitz. Is that not the case here? What happens if we take X, Y two points in the set, and (since it is geodesically convex), we consider the unique geodesic connecting them, still in that set: let me call it c(t), with c(0) = X and c(1) = Y. Then, t -> f(c(t)) is a smooth function from R to R, hence we can apply the mean value theorem and claim there exists some t in (0, 1) such that f(Y) = f(X) + (f o c)'(t) = f(X) + . The gradient norm is bounded by G in the whole set, and c'(t) has norm equal to dist(X, Y) for all t since c is a geodesic, hence, by Cauchy-Schwarz: f(Y) - f(X) <= G dist(X, Y). Repeat with the roles of X and Y swapped, and we have found that f is Lipschitz continuous. Did I miss something?