Paper ID: 1690
Title: Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors set out to solve a key problem of the highly promising MDLSTM-RNN architecture: the difficulty of parallelising its computations. A novel sequential flow of information is proposed, which arguably makes parallelisation much easier.

Major issues ===

i) MDRNNs are not treated fairly. Fig 1 a) and Fig 2 left are dangerously misleading: they give the impression that it takes four MDRNNs to scan a whole 2d plane. But this is not true: the scanning can continue from the center point to the lower right corner, and the prediction can be made there. Consequently, a single sweep of an MDRNN sees *all* the pixels, a theoretical benefit of MDRNNs over Pyramid RNNs, for which this property is not shown.

ii) The authors make clear that their computational flow, that of a Pyramid, introduces some independencies which make parallelization easier. What they do not do is elaborate on the consequences of this from a statistical view--i.e. which assumptions are then made.

E.g. if we look at Fig 1 c), the pixels which are neighbours along the columns are treated as independent. This seems like a big restriction to me and I think it should be discussed in the text--e.g. is it possible to recover this dependency? (see next point)

iii) The C-LSTM layer is not justified; the authors do not give reasons for including it. I can imagine that it helps to make up for the dependencies that the Pyramid layer lacks--but the authors should make clear why this is done.

iv) It is not explained very well how the segmentation is actually done. The paper gives the impression that a single pass over a volume forms a single prediction--the output of the softmax. But what if many pixels are to be segmented? Do the authors propose to sweep over the volume once per segmentation label? Further, why are the authors choosing the MSE as a training criterion, as opposed to the negative log-likelihood of the categorical distribution induced by the softmax?

v) The experiments are very domain specific, while the method is not. I don't see why no results on more standard benchmarks are presented, e.g. Pascal VOC. This would add a lot of value to the paper, as it eases the reader's effort to evaluate the method for applications in different contexts.

Minor issues ===

i) I think that Lucas Theis' work on generative modelling with spatial LSTMs [2] might (!) be of interest for the related work section.

ii) The LSTM explanation is poor. The x symbol is overloaded, it is not clear what dimensionality the different quantities have, \text{} is not used in the subscripts, and the reader has to guess what x is. It should be easy to improve this to take cognitive burden off the reader.

iii) Introduction, "neighbouring" -> "preceding".

iv) The correct cite for dropout on the non-recurrent connections of RNNs is [1], who did that earlier than Google.

[1] Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014.

[2] Theis, L., and M. Bethge. "Generative Image Modeling Using Spatial LSTMs." arXiv preprint, 2015. http://arxiv.org/abs/1506.03478
Q2: Please summarize your review in 1-2 sentences
While the paper addresses an important issue with MD-LSTM, it falls short in some aspects which I do not believe can be fixed in a camera-ready version. Especially: i) some of the architectural decisions are not justified, ii) others are not explained clearly enough, and iii) the experimental validation makes it hard for the reader to place the method in the map of alternatives.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Paper is about applying LSTM-type models to biomedical data. But beyond this, the contribution of the paper is unclear. On one hand it argues that the pyramid-configuration is good because it makes it GPU compatible. But to make this case, the following are needed: (a) showing speedups on GPU vs CPU for this model; (b) demonstration that changing the topology does not significantly impair performance, relative to existing multiD-LSTMs. The paper has neither of these. Alternatively, the paper could be viewed as applying a novel NN approach to bio-medical data and obtaining good results. The issue here is (a) no baseline comparisons are made to existing (and much simpler) NN approaches, like 2D/3D convnets, or indeed the standard (slow) multi-D LSTMs; (b) the results obtained are good, but not comprehensively better than the existing approaches.

Thus I don't think the paper really says anything useful to a machine learning audience. It might be of interest to a biomedical audience (although a critical observer there would also surely want to see what a simple convnet would do).

The pyramid reformulation itself is quite interesting, but without experiments to show that it is an efficient replacement for the existing multi-D LSTMs, it is unclear if it is a good idea or not.

The paper is clearly written. The related work seems fine (given my limited knowledge of the biomedical literature).

The experiments on the various datasets show good numbers. Although it isn't clear, I presume many of the existing approaches are not using deep nets, thus probably represent rather soft targets to beat. This is why the lack of a simple deep convnet baseline is an issue: it may also beat out these previously published methods.
Q2: Please summarize your review in 1-2 sentences
Paper proposes a new topology for multi-D LSTMs, where they are arranged in a pyramid structure making the model GPU compatible. It is applied to a variety of biomedical data, showing good results. Overall point of paper is unclear: is about GPUifying multi-D LSTMs, or about beating existing non-NN approaches? Experiments don't compellingly make either case.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides a parallel implementation of recurrent neural networks, which requires a non-trivial adaptation of existing training algorithms.

Experiments demonstrate the ideas in practice on two medical image analysis problems and show promising results.

The idea appears sound and the experimental results show significant promise. The paper is well presented and the key ideas come across clearly. This is not a topic I am particularly familiar with, so I cannot be certain of the originality, but from the authors' summary of the literature it appears to be original. The contribution is potentially significant, because it enables the training of deep networks on problems where context and recurrence are important, such as the image analysis problems the authors demonstrate on.
Q2: Please summarize your review in 1-2 sentences
A decent and timely contribution. This is a non-trivial solution to a timely parallelisation problem enhancing deep learning on problems that require context of unknown scale.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I doubt the LSTM can (as mentioned in the abstract) capture "the entire spatio-temporal context of each pixel in a few sweeps through all pixels". The memory is, in practice, never quite large enough!

I applaud the authors for working on a medical dataset. For some comparison to other recent segmentation methods it would be nice though to also compare on more standard benchmark datasets.
Q2: Please summarize your review in 1-2 sentences
This paper introduces a novel multidimensional pyramid LSTM model and applies it to a very interesting task in medical image segmentation.

The model is also parallelizable on GPUs which is a major improvement over previous methods.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all reviewers for their valuable feedback. Below we respond to general and author-specific concerns:


General:

We would like to clarify our results:
- We apply the exact same architecture on both datasets. This is significant since the competing algorithms are very specialized.
- On the EM dataset, we outperform a state-of-the-art deep convolutional neural network (NN) in Rand Error, the most important error measure for this dataset (see 'IDSIA' in Table 1; these authors are experts in NNs and won several competitions).
We were not able to apply the SCI post-processing, which is tuned to further improve Rand Error, on our output (the SCI authors did not reply). For completeness, we still report the SCI post-processing on the output of the other methods ('IDSIA' and 'DIVE').
- On the brain dataset, other teams applied various pre-processing and post-processing steps to adapt their methods specifically for MR image segmentation. In contrast, our approach is not optimized for the specific domain, and only simple pre-processing is applied.
The organizers of mrbrain13 noted that especially the performance on cerebrospinal fluid was impressive.
- We chose biomedical images since they are varied and challenging, and come with segmented volumetric ground truth. 3D volumetric image segmentation is a challenging task in the biomedical field. The datasets used for our evaluation are large, and the images are very noisy and complex.


Reviewer_1:
Major:
We appreciate the detailed feedback. We think there is some confusion between segmentation and classification that causes several of these concerns; we try to explain this below.

i) Yes, your example works if it were a single classification. But we want pixel-wise classification (segmentation) using the cell outputs at every pixel position. Then, at the middle pixel shown in Fig 1a, a single sweep only gets information from part of the image, and one has to combine six sweeps in all directions.


ii) We discuss the effects of our topology at several places in the paper. Every pixel in this system receives information from all other pixels, and the convolutions model neighbouring pixels over all axes (the arrows in Fig 1c show the convolutions; we will make this clearer).


iii) The C-LSTM layer is not an extra layer; it is the workhorse of PyraMiD-LSTM. It performs the convolution computations as shown in Fig 1c, and its outputs are combined as in Fig 3 (the C-LSTM is shown in the drawing).


iv) Please see the answer to i). On the training criterion: a good question, we chose MSE after getting better results with it; we will note this in the paper.
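To illustrate the distinction the reviewer raises, the two training criteria can be sketched as follows (a minimal NumPy illustration with hypothetical shapes and labels, not the paper's actual implementation):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the class axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical per-pixel class scores: 4 pixels, 3 segmentation labels
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3],
                   [-1.0, 2.0, 0.0],
                   [0.0, 0.0, 3.0]])
targets = np.eye(3)[[0, 2, 1, 2]]  # one-hot ground-truth labels

probs = softmax(logits)

# mean squared error on the softmax outputs (the criterion the authors chose)
mse = np.mean((probs - targets) ** 2)

# negative log-likelihood of the categorical distribution (the reviewer's suggestion)
nll = -np.mean(np.sum(targets * np.log(probs), axis=-1))
```

Both criteria are minimized when the softmax output matches the one-hot target; they differ in their gradients, so it is plausible that one trains better than the other on a given dataset.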

v) Please note that the two biomedical datasets are very varied. We agree that more experiments in other domains would be valuable, but it is already an achievement to run deep LSTMs at high speed on volumes of these scales. There are also not many volumetric segmentation datasets (Pascal VOC is 2D, not 3D).

Minor:
We will improve the citations and make the LSTM explanation clearer. A good point: the domains are only defined for the C-LSTM, not the normal LSTM. We will fix this.

Reviewer_2:
We agree with the comments.

Reviewer_3:
We believe the method is in essence simple, but LSTM equations are known to be complicated. We tried to keep the presentation simple by focusing on clear formulas, but we will try to make it clearer still.

Getting state-of-the-art results on these datasets is not really a negative: the competing algorithms are state-of-the-art random forest models (on mrbrain13) and deep neural networks (on membrane segmentation); these are not run-of-the-mill algorithms.
These datasets pose large, comprehensive and complicated tasks; we thus would not call these results preliminary.

We will fix the CSF acronyms.

Reviewer_4:
Good point; the idea of LSTM is to propagate information very far and thus learn to combine information selectively. The last layer has 64 values per pixel, so we hope this allows for enough memory (this was the limit of our current hardware).
We are planning to try video-data as future work.

Reviewer_5:
Please see the general comments above about the comparison with NN approaches and the choice of biomedical datasets.

- GPU vs CPU: Because of our model, we are able to parallelize the convolution operations (with cuDNN); this would not be possible in MD-LSTM. This is much faster, as is well known for convolutions. We will make this argument clearer in the paper.
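To make the parallelism argument concrete (a schematic NumPy sketch with made-up shapes and a simplified, C-LSTM-like update, not the actual implementation): in a pyramid sweep, each recurrent step updates an entire plane at once with a convolution, so only depth-many sequential steps remain, whereas an MD-LSTM scan must visit pixels one at a time along a diagonal.

```python
import numpy as np

def conv3(plane, kernel):
    # 3x3 'same' correlation via zero padding, written as nine shifted adds
    p = np.pad(plane, 1)
    out = np.zeros_like(plane)
    h, w = plane.shape
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + h, dj:dj + w]
    return out

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 16, 16))   # depth x height x width (hypothetical)
kernel = rng.standard_normal((3, 3)) * 0.1

# pyramid-style sweep: one recurrence step per *plane*; each step is a whole-plane
# convolution, so all 16x16 pixels update in parallel (on a GPU this maps to a
# batched cuDNN convolution call, instead of a per-pixel sequential recurrence)
h = np.zeros((16, 16))
for plane in volume:                         # only depth-many sequential steps
    h = np.tanh(plane + conv3(h, kernel))    # simplified gate-free recurrent update
```

The key point is the loop count: 8 sequential convolution steps here, versus 8*16*16 sequential cell updates for a naive per-pixel scan of the same volume.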

- PyraMiD-LSTM vs MD-LSTM: This would indeed be a great comparison, but prohibitive in computation time. The experiments on these volumetric datasets are simply not possible with current MD-LSTM implementations, but doable with PyraMiD-LSTM.

- The EM images are evaluated against a state-of-the-art convolutional NN; please see the general remarks.
The competing algorithms in brain analysis are state-of-the-art computer vision algorithms from big research groups.

Reviewer_6:
We agree the method can be of broad interest.