Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper brings a powerful dynamical systems perspective to bear on recurrent models. The universal approximation theory and results on incrementally stable models offer theoretical support for why Bai et al. 2018 and others can outperform recurrent models with feedforward/TCN architectures, and the approximation results nicely generalize those of Miller and Hardt 2019. In particular, the approximation theorem avoids relying on a state-space representation, and Theorem 4.1, showing that recurrent models can have approximately finite memory, goes through without the strong exponential contractivity assumption. The paper is clearly written, and I checked most of the proofs for correctness. Beyond the results themselves, the paper offers a nice bridge between recurrent models and powerful results, tools, and definitions in dynamical systems and control that will hopefully inspire more interplay between the two areas in the future.

Since the results in the paper are quite general, I would have appreciated seeing them instantiated with a few specific parameterizations. For instance, the paper derives the Miller/Hardt result as a special case of a more powerful result. Are there nice examples where models fail the more restrictive stability condition but satisfy incremental stability and exhibit approximately finite memory? Finally, is the exponential dependence on depth in Theorem 3.1 inevitable, or an artifact of the construction via Hanin and Sellke 2018?

Typos: Equation 17: f(\xi), u should be f(\xi, u), and similarly later in the proof.

After rebuttal: Thanks to the authors for their response. I remain a fan of this paper. The specific example showing a separation between the contractivity condition and the conditions studied in the paper is interesting, and the camera-ready version might benefit from including this and other examples, or at least linking to them in the appendix for the interested reader.
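As a toy illustration of the "approximately finite memory" behavior discussed above, here is a minimal sketch I made up for illustration (a scalar tanh recurrent model; with |w| < 1 it actually satisfies the stronger Miller/Hardt contractivity condition, not merely the paper's weaker incremental-stability conditions): two trajectories driven by the same inputs from different initial states converge, so the outputs forget the initial state.

```python
import numpy as np

# Toy scalar recurrent model: xi_{t+1} = tanh(w * xi_t + u_t), y_t = xi_t.
# With |w| < 1 the transition map is contractive in xi (the Miller/Hardt
# condition); the paper's incremental-stability conditions are strictly weaker.
def run(xi0, inputs, w=0.5):
    xi, ys = xi0, []
    for u in inputs:
        xi = np.tanh(w * xi + u)
        ys.append(xi)
    return np.array(ys)

rng = np.random.default_rng(0)
u = rng.standard_normal(50)
ya = run(1.0, u)   # trajectory from initial state +1
yb = run(-1.0, u)  # same inputs, different initial state
gap = np.abs(ya - yb)
# The gap decays geometrically: the i/o map forgets its initial condition,
# i.e., it has approximately finite memory.
print(gap[0], gap[-1])
```

A separating example (incrementally stable but not contractive) would of course be more interesting, which is exactly what the rebuttal supplied.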
Overall I thought the paper was quite clear: the definitions were precise and given in a timely manner, and the theorems followed in a reasonable order with plenty of motivating/explanatory remarks. I am less sure about how deep or novel the results are, however.

On Theorem 3.1: I think this is a relatively trivial application of [Hanin and Sellke, 2018], and so does not contribute much to the body of ML knowledge. The result of Hanin and Sellke says that "we can well-approximate functions of m variables with a net of width m+1 and some depth," and Theorem 3.1 says that "if you have a function of an infinite number of variables (an i/o map) that only uses approximately local context of m variables, then you can well-approximate it by a (convolutional) net of width m+1 and some depth," which follows in a pretty straightforward manner.

Some other remarks:

Definition of mF*(0): for this to be well-defined, it should be "<= epsilon", not "< epsilon", in Definition 2.1 (equation 1).

I suggest sharing the same numbering counter for propositions, theorems, definitions, etc. Otherwise there is a remark, theorem, assumption, proposition, and definition all with the label 3.1, which is slightly confusing.

On the definition of time-invariance: there is an annoying condition here on "boundary effects" at t=0. In particular it requires that F(0, x_1, x_2, ...) = (0, F(x_1, x_2, ...)) (also: does the case k>1 follow from k=1?), but recurrent systems (as defined at the beginning of Section 4) only satisfy this if they ignore initial sequences of zeros, i.e., with the extra conditions that f(\xi, 0) = 0 and g(\xi) = \xi, where \xi \in R^n is the initial state. Therefore the time-invariance definition, or the conditions on recurrent models, need to be adjusted.
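For concreteness, the shift property appealed to in the time-invariance remark can be written as follows (my paraphrase; the paper's indexing convention may differ):

```latex
% Time invariance as a shift property of the i/o map F:
% prepending one zero input shifts the output by one step,
F(0, x_1, x_2, \dots) = \bigl(0,\; F(x_1, x_2, \dots)\bigr),
% and the k-zero version, which the remark asks whether the k=1 case implies:
F(\underbrace{0, \dots, 0}_{k},\, x_1, x_2, \dots)
  = \bigl(\underbrace{0, \dots, 0}_{k},\; F(x_1, x_2, \dots)\bigr).
```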
Summary of main ideas: This paper consists of three theorems. Theorem 3.1 states that a ReLU TCN can $\epsilon$-approximate (in the $L^\infty$ sense) any causal and time-invariant i/o map F with (Assumption 3.1) approximately finite memory and (Assumption 3.2) an additional continuity condition. Theorem 4.1 states that a state-space model with (Assumption 4.1) sufficiently smooth transition/output maps, (Assumption 4.2) compact state spaces, and (Assumption 4.3) ``uniform asymptotic incremental stability'' generates a causal and time-invariant i/o map F. Theorem 4.2 (and Corollary 4.1) provides detailed estimates of the memory and continuity in terms of the norm $R$ of the inputs and the Lipschitz constants $L_f$ (of the transition map) and $L_g$ (of the observation map), under some assumptions including the ``Demidovich criterion,'' which contains the conditions assumed in [Miller and Hardt, 2019] as a special case.

Originality: High. The theorems are new and consistent.

Quality: High. The most technical part (Theorem 3.1) seems to be imported from [Hanin and Sellke, 2018], but the overall message goes beyond the previous work.

Clarity: High. The writing is good.

Significance: Medium. Since this study is heavily motivated by [Miller and Hardt, 2019], I would raise my score if the authors could answer the learnability question: what class of i/o maps can a TCN learn during gradient descent training?

After rebuttal: I appreciate the authors' reply. The authors are right that the learnability arguments need explicit parameterizations, and that this is surely out of scope for this study. I will raise my score because I am also moved by R1's strong phrase "I expect to see follow-up work exploiting these definitions."
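For readers less familiar with the architecture at issue, here is a minimal NumPy sketch of a causal ("temporal") ReLU convolutional network of the kind Theorem 3.1 concerns. The widths echo the m+1 bound of Hanin and Sellke, but everything here is illustrative, not the construction used in the paper:

```python
import numpy as np

# Minimal causal convolutional layer: output at time t depends only on
# inputs x[t-k+1 .. t] (left zero-padding enforces causality).
def causal_conv(x, kernel):
    """x: (T, d_in), kernel: (k, d_in, d_out) -> output (T, d_out)."""
    k, d_in, d_out = kernel.shape
    xp = np.concatenate([np.zeros((k - 1, d_in)), x], axis=0)
    return np.stack([np.einsum('kij,ki->j', kernel, xp[t:t + k])
                     for t in range(len(x))])

def tcn(x, kernels):
    for K in kernels[:-1]:
        x = np.maximum(causal_conv(x, K), 0.0)  # ReLU between layers
    return causal_conv(x, kernels[-1])          # linear readout

rng = np.random.default_rng(0)
T, m = 20, 3
# Hidden widths of m+1, echoing the Hanin--Sellke width bound.
kernels = [rng.standard_normal((m + 1, 1, m + 1)),
           rng.standard_normal((m + 1, m + 1, m + 1)),
           rng.standard_normal((m + 1, m + 1, 1))]
x = rng.standard_normal((T, 1))
y = tcn(x, kernels)
# Causality check: perturbing a future input leaves earlier outputs unchanged.
x2 = x.copy(); x2[10] += 1.0
y2 = tcn(x2, kernels)
print(np.allclose(y[:10], y2[:10]))  # True
```

Stacking k-wide causal layers grows the receptive field linearly with depth, which is why an i/o map with approximately finite memory of m variables is within reach of such a network.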