Paper ID: 521
Title: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The grid LSTM would be worth looking at as well: (http://arxiv.org/abs/1507.01526)
Q2: Please summarize your review in 1-2 sentences
This is a well composed and neat application paper, applying a novel variant of the Long Short Term Memory network to a precipitation forecasting problem.

The authors do an excellent job of motivating the problem and use the LSTM to outperform the state-of-the-art approach from the application field.

Their version of the LSTM incorporates convolutions instead of standard fully connected layers, which seems novel and sensible.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A convolutional LSTM feels like one of those obvious-in-hindsight ideas, but to the best of my knowledge no one has published one yet and I commend the authors for finding an interesting and appropriate problem to apply them to.

I am concerned about the bouncing MNIST comparison to [21].

At first glance it appears the results in this paper are significantly worse than in [21]; [21] reports cross entropy of the unconditional future predictor model as 3520 (I'm assuming they average over the 10 predicted frames), which is quite a bit lower than the best result in this paper.

There is an explanation on line 321 which indicates that this is because [21] does not split into train and test sets, but I'm not convinced this is a proper criticism.

The data set described in [21] is generated online, and it is very unlikely that any of the train and test sequences are identical.

Due to the (clearly intentional) similarity of the bouncing MNIST experiment to the one conducted in [21] I think it would be more sensible to follow their setup more closely. As it stands it is not clear how to compare these two experiments because of the differences in how data is generated; I can't tell if the higher cross entropy in this paper is an artifact of the restricted training set or is a result of poor execution.

Criticisms of MNIST aside, I would have preferred to see more focus on the nowcasting problem in the experiments section.

Spending less time on MNIST videos and more time explaining Figure 6 (which I do not understand at all) would make the paper much stronger. More illustrations of phenomena that the ConvLSTM captures where other nowcasting methods fail (e.g. line 407) would be welcome.

It would be interesting to consider also reconstructing the reverse input sequence jointly with future prediction as is done in [21].

Line 76 states that [18] does not consider convolutional state to state transitions, however the rCNN model from that paper does use 1x1 recurrent convolutions.
Q2: Please summarize your review in 1-2 sentences
This paper introduces the convolutional LSTM and applies it to an interesting problem.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends the recently proposed seq2seq encoder/decoder approach: the sequence of words is replaced by a sequence of images, and convolutions are used between several layers of a stacked LSTM and between the internal layers. This is a quite interesting application for the seq2seq framework, but I wonder whether this can be used for longer sequences, i.e. 20-100 images. The consecutive application of a convolution every time step could potentially lead to degenerated images (e.g. with a convolution that applies an average operator). On the other hand, I can imagine operators to detect movements of group of pixels which seems to be needed for the selected applications. The experimental results seem to indicate that the approaches works well in practice.
Q2: Please summarize your review in 1-2 sentences
Prediction of a sequence of images given a sequence of images using convolution LSTM.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes an extension to the fully connected long short-term memory models for the purpose of prediction of doppler radar echo for a region over a short period of time given the recent doppler radar echo images.

The proposed framework extends LSTMs to handle the temporal evolution of spatial fields.

For the experiments, the paper considers two set ups, an artificial one based on the movements of handwritten digits, and a data set of Doppler radar images for unspecified region and time.

Quality: The proposed extension ConvLSTM appears to be mathematically sound and appropriate for the modeling of regularly timed spatio-temporal data.

The paper provides only a basic formulation of the framework; there are few details on the training of the model -- it is assumed that the reader is familiar with how FC-LSTMs are trained.

How does the training of ConvLSTM differ?

There are too few details provided about the precipitation data to judge whether the set up is appropriate.

It appears to be somewhat artificial to me in the following aspects: (1) storms can be fast moving (e.g., Midwest of the United States), so including the data only for the subregion of prediction may ignore the crucial information about the storm if it falls outside of the prediction area; (2) only days with precipitation are being used as a part of the data -- this appears unreasonable for a real application, (3) other covariates (e.g., time of day) are not taken into an account.

Additionally, to demonstrate an impact for the application, it would be appropriate to compare the predictive ability not just to other machine learning methods, but the standard methods in the application area, numerical weather prediction models (https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction).

Clarity: The paper tries to cover quite a bit of ground, unfortunately, at the expense of clarity.

The details on the proposed model are fairly bare.

I would suggest to focus either on the application (and in this case, skip the dancing digits while elaborating more on the primary application) or on the methodology (and provide more details on the property of the new model at the expense of the details on the nowcasting application).

Originality: There are original contribution to the modeling framework and in applying it to the short-term precipitation forecasting.

Personally, I am glad to read that deep learning techniques are reaching precipitation modeling research area.

Significance: I do not have the expertise to judge whether the incremental leap between FC-LSTM and ConvLSTM is significant.

For the nowcasting application, while the proposed framework can be a useful intermediate step, due to the reservations mentioned under Quality section, I doubt it is close to solving the problem.

Minor: - Please check equation (3); I think some of the Hadamard products should actually be convolution operators. - Page 1, The acronym for numerical weather forecasting is NWF.

Did you mean numerical weather prediction? - Footnote on page 1, "plan"->"plane"?
Q2: Please summarize your review in 1-2 sentences
The paper proposes an extension to a long short-term memory model to handle spatio-temporal data and attempts to apply this new framework for a short-term prediction of rainfall.

However, the set up is not convincing from the stand-point of application, and the paper is light on detail on the conceptual contribution; in the opinion of this reviewer, the paper would benefit from focusing on either the methodology or the application.

Author Feedback
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their constructive comments and suggestions. We will revise our paper accordingly.

For R1:
Thanks for your positive review.

Q1:...MNIST...focus...
Sorry for the confusion caused by our improper criticism of [21]. A limited number of training/testing split instead of online training is used in the paper to help understand the performance of different models in an offline setting where not so much data is available. We will emphasize this difference and remove some of the MNIST experiments to leave more space for the background and experiment analysis of the nowcasting problem in the revised paper.
In the online setting, after seeing 256000 sequences, the loss of ConvLSTM reaches 3400 while that of FC-LSTM is above 4000. It takes 1280000 cases for FC-LSTM to reach 3400.

Q2:Fig.6
Fig.6 shows two prediction examples generated by ConvLSTM and ROVER. Both models use the 5 observed radar echo maps to predict the future 15 echo maps. The top row shows the 5 input frames. The second row shows the ground truth frames sampled with an interval of 3 (resulting in 5 frames). The third and the last rows separately show the prediction results of ConvLSTM and ROVER (also sampled). In both examples, there is sudden agglomeration of rainfall in the top-left region. ConvLSTM has predicted some of these outside rainfalls while ROVER just blanks out them.

For R2:
Thanks for your review.

Q1:1D vector
We will revise our presentation to be more rigorous. Here we are trying to point out that FC-LSTM has not encoded any spatial information in its structure. For FC-LSTM, at a certain timestamp, every pair of input/state & state/state elements are connected and the state elements can be randomly permuted without affecting the overall topology, making the whole structure contain no spatial information. However, for spatiotemporal data studied in the paper, strong correlations exist between spatially and temporally nearby elements, which makes such structure not so appropriate.

Q2:Contribution
1) Our paper is the first to propose a variant of LSTM with convolutional recurrence. The difference between ConvLSTM and rCNN is that rCNN is based on vanilla RNN instead of LSTM and it constrains the state-to-state kernel size to 1X1 while we have shown that ConvLSTM can perform better with larger kernels (Section 4.1). For convolutional recurrence, there is a huge difference between 1X1 kernel and larger kernels. After unfolding along the temporal direction, convolutional recurrence becomes a deep convolutional neural network with shared kernels. For 1X1 kernel, the receptive field of states will not grow as time advances. While for larger kernels, later states (lying in deeper layers) will have wider receptive fields.
2) Our paper has built the first end-to-end trainable model for precipitation nowcasting. The model is different from [21] in that our building block is ConvLSTM and our task is spatiotemporal sequence forecasting instead of representation learning. Also, it is different from [18] in that [18] only considers one-step prediction while our target is multi-step and our model is end-to-end trainable.

For R3:
Thanks for your detailed review.

Q1:Training
We use BPTT + RMSProp (Line 262-264).

Q2:...subregion...
Our target application is precipitation nowcasting for the future 1.5 hours over a local city (Section 2.1 & 4.2). For this problem, the operational system SWIRLS which we compare uses only the past radar echo maps of the same region to nowcast the future. We agree that including information from the outside area will be useful, but it is currently difficult for us to obtain them. Also, since there are not so many rainy days in our data and our modeling target is precipitation, we select the rainy days.

Q3:Other covariates
This is a good future direction. In fact, the operational ROVER algorithm also has not included this type of information. Since our model is end-to-end trainable, additional features can be added as new channels to the input and we can train the network and get predictions using the same algorithm.

Q4:Comparison
In the paper, we have compared our model with the ROVER algorithm used in HKO. For NWP models, due to time required to gather observations, data assimilation and attaining physical balance among numerical processes, the first 1-2 hours of model forecasts may not be available or suitable for direct comparison with nowcasting products. We will explore this idea in the future.

Q5:Focus
We will remove some of the MNIST experiments to leave more space for the background and experiment analysis of the nowcasting problem and our ConvLSTM model in the revised paper.

Q6:Minor
1) In the first two lines of (2), operators before c_{t-1} should be Hadamard product and (3) is correct
2) We mean numerical weather prediction
3) Should be plan

For R4, R5, R6:
Thanks for your positive reviews. We will experiment on longer sequences in the future.