NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, Vancouver Convention Center
Paper ID: 7064 Shallow RNN: Accurate Time-series Classification on Resource Constrained Devices

### Reviewer 1

**Originality:** The authors propose a novel and general architecture that, to the best of my knowledge, has not been described before. Thus the idea of the "shallow" two-layer RNN architecture, as well as the accompanying theoretical analysis and experimental results, are all novel.

**Quality:** The claims appear correct, although only for Claims 1 and 2 do I have some confidence in not having missed important issues. The experiments are comprehensive and instill confidence in the proposed architecture and theoretical guarantees. The code they provide appears about average for this type of research prototype.

**Clarity:** Most of the paper is clear and easy to follow. There are, however, a few typos and sentences that could be improved with some additional proofreading (see below for some of the typos I spotted).

**Significance:** The simplicity of the method, combined with the well-motivated use case of embedded devices with constrained resources, means that I see this paper as a useful contribution from which many are likely to benefit, and thus worthy of NeurIPS.

**Questions and comments:**
- When running over 5 random seeds, what kind of variance is observed? It would be worth mentioning this, at least in the supplementary material, to get a sense of the statistical relevance of the results.
- 46: "ensuring a small model size": I believe the model size would not be smaller than that of a standard RNN; if so, the claim appears a bit misleading.
- Claim 1 appears correct as stated, but the formulation is a bit convoluted, in the sense that one typically would be given T and w and can decide on a k, whereas in the current formulation it appears as if you are given a T and q and can pick an arbitrary k based on that, which is not really the case.
- Line 199: from this sentence it is not very clear how SRNN is combined with MI-RNN; it would be good to give a little more detail, given that all reported results for this model are based on a shallow extension of MI-RNN.
- In the same vein, the empirical analysis would be a little stronger if the results of SRNN without MI-RNN were reported too.
- In the MI-RNN paper [10] they benchmark against GesturePod-6, whereas the current paper benchmarks against GesturePod-5. Are these different? If so, in what way?
- "latency budget of 120ms": it is not clear to me where this exact limit comes from; is it somehow a limit of the device itself?

**Minor:**
- 37: standard -> a standard
- 81: main contributions -> our main contributions
- 90: receptive(sliding) -> a space is missing
- 135: it looks like s starts at 0 where all other indices start at 1; including line 171, where s starts at 0
- 137: fairly small constant -> a fairly small constant
- 138: that is -> which is
- 139: tiny-devices -> tiny devices
- 152: I would find it slightly more readable if the first index of v^{(2)} were 1 instead of T/k, if you need an index at all at this point
- 154: should be v^{(1)}, not v^{(2)}
- 159: tru RNN -> a true RNN
- 159: principal -> principle
- 172: for integer -> for some integer
- 240: it's -> its
- 267: ablation study -> an ablation study
- 318: "steps of pionts threby" appears garbled
- 314: ully -> fully
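To illustrate my point about the formulation of Claim 1: given T and w, the choice of k balances first-layer work against second-layer work. The toy cost model below is my own illustration of that tradeoff (the constants `c1`, `c2` and the exact form of the cost are my assumptions, not the paper's):

```python
import math

def srnn_cost(T, k, c1=1.0, c2=1.0):
    """Toy inference cost for a length-T sequence split into bricks of
    size k: each first-layer brick costs ~ k*c1 (bricks run in parallel,
    so only one counts toward latency), and the second layer runs
    sequentially over all T/k brick summaries at ~ c2 per step.
    c1, c2 are hypothetical per-step constants."""
    return k * c1 + (T / k) * c2

T = 1024
candidates = [k for k in range(1, T + 1) if T % k == 0]
costs = {k: srnn_cost(T, k) for k in candidates}
best_k = min(costs, key=costs.get)
# balancing the two terms gives k ~ sqrt(T), i.e. O(sqrt(T)) latency
print(best_k, costs[best_k])
```

Under this toy model the optimum lands at k = sqrt(T) with total cost 2*sqrt(T), which matches the O(sqrt(T)) flavor of the claim; the point is that k is the free design parameter, chosen from T and w, rather than something derived from q.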

### Reviewer 2

The authors propose shallow RNNs (SRNNs), an efficient architecture for time-series classification. Shallow RNNs can be parallelized: the time sequences are broken down into subsequences that can be processed independently of each other by copies of the same RNN, and their outputs are then passed to a similarly structured second layer. Multi-layer SRNN extends this to more than two layers. The paper includes both a runtime analysis (Claims 1 and 2) and an analysis of the approximation accuracy of the shallow RNN compared to a traditional RNN.

The idea is straightforward, but the paper scores very low on clarity. The authors opt for symbol definitions instead of clear descriptions, especially in the claims. The claims are a central contribution of the paper but unnecessarily hard to parse, and their implications are not described by the authors. That is why I scored the significance as low.

Here are specific points that are unclear from the paper:
- l. 133-140: Shouldn't the amortized inference cost for each time step be C1, i.e. O(1)? Why would you rerun the RNN on each sliding window?
- l. 165: The heavy use of notation distracts from getting an understanding of what window size w and partition size k you usually use. Is k usually larger than w, or the other way around? This makes it hard to understand how the SRNN architecture interacts with streaming. When the data is coming in in streams, are the streams partitioned and the partitions distributed, or are the streams distributed?
- Claim 1: You already defined $X^s$; defining it here again just distracts from the claim. Also, q is the ratio between w and k (hence it depends on k); it is weird that your statement relates k to q, which itself depends on k. Please explain.
- Claim 2: The choice of k in Claim 2 seems incompatible with Claim 1. In Claim 1, k = O(sqrt(T)); in Claim 2, k = O(1).
- Claim 3: What is M? What is $\nabla^M_h$?
- Claims 3 and 4: Are those bounds tight enough to be useful?
Given a specific problem, can we compute how much accuracy we expect to lose by using a specific SRNN? Can we use these bounds together with the runtime analysis of Claims 1 and 2 to draw a tradeoff between accuracy and inference cost, as in Figure 2?

To me, the strengths of this paper are the proposed model and its implementation on small chips (video in the supplement), as well as the empirical study.

I would have been curious for a discussion of how the proposed architecture relates to convolutional networks. It seems to me that by setting w small, k small, and L large, you almost have a convolutional network where the filter is a small RNN instead of a typical filter. In the introduction, it is mentioned that CNNs are considered impractical. I am curious: could it be that in the regimes for which the accuracy of SRNN is acceptable (Claims 3 and 4), they are actually also impractical, with complexity similar to CNNs?
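For concreteness, here is my reading of the two-layer architecture as a minimal NumPy sketch. All names, dimensions, and the plain tanh cell are my own simplifications; the paper's actual models build on MI-RNN cells, and the first-layer runs would be dispatched in parallel rather than in a Python loop:

```python
import numpy as np

def rnn_scan(W, U, b, x, h0):
    """Run a simple tanh RNN over a sequence x of shape (steps, d_in)
    and return the final hidden state."""
    h = h0
    for x_t in x:
        h = np.tanh(W @ h + U @ x_t + b)
    return h

def srnn_forward(params1, params2, x, k):
    """Two-layer shallow RNN sketch: split x (length T) into bricks of
    size k, run the *same* first-layer RNN on each brick (these runs are
    independent, hence parallelizable), then run a second RNN
    sequentially over the T/k brick summaries."""
    W1, U1, b1 = params1
    W2, U2, b2 = params2
    d1, d2 = W1.shape[0], W2.shape[0]
    T = x.shape[0]
    assert T % k == 0, "brick size k must divide the sequence length"
    bricks = x.reshape(T // k, k, -1)
    # first layer: shared weights, independent bricks
    v = np.stack([rnn_scan(W1, U1, b1, brick, np.zeros(d1))
                  for brick in bricks])           # shape (T/k, d1)
    # second layer: one sequential pass over the brick outputs
    return rnn_scan(W2, U2, b2, v, np.zeros(d2))  # shape (d2,)
```

Written this way, the relation to CNNs I raise above is visible: the first layer acts like a stride-k "filter" whose response is a small RNN's final state.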

### Reviewer 3

Overall this is a well-written paper with proper motivation, clear design, and detailed theoretical and empirical analysis. The authors attempt to improve the inference efficiency of RNN models under limited computational resources while keeping the length of the receptive window. This is achieved with a 2-layer RNN whose first layer processes small bricks of the entire time series in parallel, while the second layer gathers the outputs from all bricks. The authors also extend SRNN to the streaming setting with similar inference complexity.

One concern about the bound in Claim 1 in the streaming setting: in line 137, w is required to be a fairly small constant independent of T; in line 166, w = k * q (w is a multiple of k, and thus k needs to be a small constant); in line 173, the bound becomes O(\sqrt{qT} * C_1) iff k = \sqrt{T/q}, which is not o(1). I was therefore expecting analysis of practical applications with large T and small w.

In SRNN, will the O(T/k) extra memory cost be an issue during inference? The extension to multi-layer SRNN in Section 3.2 provides at best O(log T) inference complexity; the bound here is too idealized, but it would be great to see empirically how SRNN performs as more shallow layers are added. The empirical improvements over LSTM and MI-RNN on multiple tasks are impressive.

====

Thanks for your responses. I have read the rebuttal and the other reviewers' comments. I am glad to see the experimental comparisons to CNNs and the refinement of your claims in the rebuttal, and I think including them in the manuscript or supplementary material would further clarify and strengthen this paper. Overall, this is a relatively simple yet effective solution for edge computing, which will keep becoming more important.