
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This work addresses the question of how to improve the invariance properties of Convolutional Neural Networks. It introduces the so-called spatial transformer, a layer that performs an adaptive warping of incoming feature maps, thus generalizing the recent attention mechanisms for images. The resulting model requires no extra supervision and is trained end-to-end using backpropagation, leading to state-of-the-art results on several classification tasks.
The paper is clearly written and its main contribution, the spatial transformer layer, is valuable for its novelty, simplicity and effectiveness. The related work section covers most relevant literature, except perhaps recent works that combine deformable parts models with CNNs (see for example "Deformable Part Models are Convolutional Neural Networks" and "End-to-End Integration of a Convolution Network, Deformable Parts Model and Non-Maximum Suppression", both at CVPR 2015), since they also incorporate an inference over deformation or registration parameters, as in the spatial transformer case. The numerical experiments cover some relatively challenging datasets (although not large-scale tasks such as ImageNet or general object recognition) which highlight the gain produced by the registration layers.
Here are some detailed comments:
> relationship with a general bilinear model. The spatial transformer applies a linear transform A(x, theta(x)) to the input feature map x, which is itself indexed by a low-dimensional vector of parameters theta(x) that are adapted to the input. This is a particular instance of a general bilinear map, in which a given family of transformations (warpings or affine maps) is considered. Looking at the experiments, it seems as if the transformer is asking for more capacity, since the richer class (thin plate spline) systematically improves over simpler models of transformations. One could then attempt to make the bilinear model a bit more expressive by replacing the class of warping operators by a learnt subfamily of linear operators. Another related comment is whether it would be interesting to consider 3D warpings across feature maps as well. That could perhaps allow the network to model relationships across features with extra capacity.
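To make the bilinear suggestion concrete, a learnt subfamily of linear operators could be parameterized as A(theta) = sum_j theta_j B_j, with the basis operators B_j trained jointly with the network. The following sketch is purely illustrative (the names, shapes, and random basis are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 4  # flattened feature-map size, dimension of theta
# A learnt family of linear operators: A(theta) = sum_j theta_j * B_j.
# In a trained model the basis B_j would be learned, not random.
B = rng.normal(size=(k, d, d)) / np.sqrt(d)

def apply_bilinear(x, theta):
    """y = A(theta) @ x, with A linear in theta: a general bilinear map in (x, theta)."""
    A = np.einsum("j,jab->ab", theta, B)
    return A @ x

x = rng.normal(size=d)
theta = rng.normal(size=k)
# Bilinearity: the output is linear in theta for fixed x (and vice versa).
assert np.allclose(apply_bilinear(x, 2 * theta), 2 * apply_bilinear(x, theta))
```

Regularizing B (e.g. imposing low rank, as the rebuttal also hints) would interpolate between this fully learnt family and the fixed warping classes used in the paper.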
> relationship with data augmentation. In order to increase the invariance/stability of the network to large displacements or transformations out of the scope of the pooling layers, one typically resorts to data augmentation. How does the spatial transformer relate to this method? can they work together? One could argue that data augmentation uses prior information, whereas the spatial transformer does not assume classes to be invariant to known transformations.
> Another comment is whether one could use the learnt registration parameters to construct adversarial examples/data augmentation. In the model, there is a relationship between the so-called localisation parameters theta(x) and x. If, say, x' = T_beta(x), where T_beta is assumed to belong to the family currently implemented by the spatial transformer, then it is reasonable to expect theta(x') = theta(x) + beta, assuming also a group structure in the space of transformations. In any case, the localisation net should satisfy T_{theta(x)} T_beta = T_{theta(T_beta(x))} for all x and beta. Could this property help regularize the localisation net?
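For a concrete toy instance of this consistency condition, take the commutative group of 2-D translations: the condition reduces to theta(T_beta(x)) = theta(x) + beta, which can be turned into a penalty. Everything in this sketch (the penalty form, the argmax-based toy localiser) is an illustrative assumption, not from the paper:

```python
import numpy as np

def translate(img, beta):
    """Apply an integer 2-D translation T_beta (a simple group of transformations)."""
    return np.roll(img, shift=(int(beta[0]), int(beta[1])), axis=(0, 1))

def consistency_penalty(loc_net, img, beta):
    """Penalize violations of theta(T_beta(x)) == theta(x) + beta,
    the group-structure condition suggested above."""
    theta_x = loc_net(img)
    theta_xb = loc_net(translate(img, beta))
    return np.sum((theta_xb - (theta_x + beta)) ** 2)

# A toy localisation net that literally reads off the position of the brightest pixel;
# it satisfies the condition exactly, so the penalty is zero.
def toy_loc_net(img):
    idx = np.unravel_index(np.argmax(img), img.shape)
    return np.array(idx, dtype=float)

img = np.zeros((16, 16)); img[4, 5] = 1.0
assert consistency_penalty(toy_loc_net, img, np.array([2.0, 3.0])) == 0.0
```

A learnt localisation net would not satisfy this exactly, and the penalty could be added to the training loss as a regularizer.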
> The paper presents arguments in favor of the spatial transformers together with experiments that show its efficiency. However, the authors do not mention any scenario where the method could fail, or even hurt the performance of vanilla CNNs. When does it fail, or what are its limitations? For instance, aliasing can be potentially dangerous, as pointed out briefly, but also bilinear interpolation is known to be a bad kernel if one needs to preserve high frequency content, which might be important in, say, texture classification or complex object recognition tasks.
Q2: Please summarize your review in 1-2 sentences
This paper introduces a trainable layer that transforms a feature map following a parametrized diffeomorphism. When used within convolutional networks it improves state-of-the-art results on several benchmarks. This work is appealing since it increases the amount of invariance CNNs can learn with little extra computational and learning cost. Despite some remarks relating to its scalability and learning complexity, I recommend this work for publication.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes resampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformations, whose parameters are trained end-to-end with the rest of the model. The resulting resampling grid is then used to create a new representation of the underlying signal through bilinear or nearest-neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over the previous state-of-the-art on a number of tasks.
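The grid-generation and bilinear-sampling mechanism described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under my own naming and an affine parameterization, not the authors' code:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a regular H x W output grid through a 2x3 affine matrix theta.
    Coordinates are normalized to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3) homogeneous coords
    return grid @ theta.T  # (H, W, 2): source (x, y) for each target pixel

def bilinear_sample(img, grid):
    """Sample a single-channel img at the (possibly fractional) source coordinates."""
    H, W = img.shape
    # back from [-1, 1] to pixel indices
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x0 + 1]
            + wy * (1 - wx) * img[y0 + 1, x0] + wy * wx * img[y0 + 1, x0 + 1])

# The identity transform reproduces the input.
theta_id = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
img = np.random.rand(8, 8)
out = bilinear_sample(img, affine_grid(theta_id, 8, 8))
```

Both steps are differentiable in theta and in the input, which is what allows the localisation network to be trained by backpropagation.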
Strength and weaknesses:
+ Interesting approach to jointly solve for certain variations in data.
+ Wide set of experiments that show the performance of the proposed method.
The mathematical derivation is sound, and the paper is well written. This work investigates the important problem of encoding invariances in CNN architectures and is therefore of relevance.
Some questions to the authors:
- It is curious that the latent representation that is learned is so similar to the undistorted data. Do you have any intuitions about this? Is this due to the selected class of transformations? Did you observe other local solutions that resulted in representations that do not have a clear semantic explanation like the ones presented in the paper?
- Another perspective on this is that it extends the idea of a spatial convolution in CNNs. Something similar (though not learnable in the resampling) has been presented in "Spectral Networks and Locally Connected Networks on Graphs" (Bruna et al., ICLR 2014) and "Sparse Convolutional Networks using the Permutohedral Lattice" (Kiefel et al., arXiv:1503.04949). I would also like to see related work like "Modeling Local and Global Deformations in Deep Learning: Epitomic Convolution, Multiple Instance Learning, and Sliding Window Detection" (Papandreou et al., CVPR 2015) discussed, since they also try to tackle the problem of deformations in the input data.
- This work seems to only use the newly defined layer directly on the distorted inputs. What happens when you apply it in later layers of the network?
- Do you release the code for training and evaluation?
Q2: Please summarize your review in 1-2 sentences
This paper extends common convolutional neural network architectures with a spatial transformation layer. This layer resamples the image/patch according to a parameterized operation such as cropping, scaling, rotation, etc. The presented work is applied in a wide range of experiments to MNIST, Street View House Numbers, and the CUB-200 birds dataset.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a novel module for CNNs that allows for end-to-end learning of parametric transformations of the inputs (the image or output feature channels from a previous layer). The module has one mini neural network that regresses on the parameters of a parametric transformation (e.g. affine); then there is a module that applies the transformation to a regular grid, and a third that more or less "reads off" the values at the transformed positions and maps them to a regular grid, hence un-deforming the image or previous layer. Gradients for backpropagation in a few cases are derived. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
Quality:
The paper has quality, both in terms of presentation and content. All seems correct and pretty.
Clarity:
Extremely clear.
Originality:
It is novel as far as I can tell. The critical new trick seems to be transforming a sampling grid parametrically, which makes it possible to learn end-to-end with backpropagation, unlike previous work using reinforcement learning.
Significance:
It is not completely clear that it will be significant. Will it work on complex vision datasets having multiple objects? That is the question. The birds experiment is the one giving the most hope, but bird parts are mostly a torso and a head. Only more experiments will allow us to accurately gauge significance. In any case, people should be interested.
Q2: Please summarize your review in 1-2 sentences
The idea is interesting and backed up with many experiments; the paper is well written, clear, and good overall.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
Thanks to the reviewers for the insightful comments and questions; we will endeavour to incorporate them in the final version.
R1, R3: "recent works...[deformable parts models, epitomic CNNs]" Thanks for the references; these papers are good to mention as they implement other types of spatial models in a CNN framework, so references will be added.
R1: "attempt to make the bilinear model a bit more expressive by..." Yes indeed: we hint at line 199 that this general bilinear form can be completely learned and can be shaped by imposing different regularisation (e.g. low rank). In fact, one could attempt to learn a model which regresses the grid directly, without any structure imposed on the transformation; however, in practice we find this suffers from noisy gradients. Any other differentiable transformation function T_theta could be used instead; this could be useful if one has some prior on the expected transformations to be incorporated into the spatial transformer network.
R1: "consider 3D warpings across feature maps" We agree that this is an interesting and natural extension that should be looked into.
R1: "relationship with data augmentation" A spatial transformer offers a way to achieve equivariance/invariance which is complementary to data augmentation. Instead of modelling invariance with pooling layers, and potentially modelling multiple transformed versions of the data with the parameters of the network, a spatial transformer explicitly disentangles the data from its transformation, achieving data invariance and transformation equivariance. But data augmentation is still relevant to STNs, as it extends the training set, helping the STN learn a better transformation model.
R1: "Could the ground-truth transformation (from data augmentation) help regularize the localisation net?" Incorporating additional constraints and/or losses on the localisation net (such as the one suggested) would certainly be an interesting direction to explore.
R1: "When does it fail or what are its limitations?" We haven't found any scenarios where it hurts performance, though training stability can be an issue. Preliminary experiments on ImageNet have shown that it is scalable but doesn't affect accuracy, as the identity transform (or some small perturbation of the identity) is learned (possibly due to ImageNet requiring the use of background context). The bilinear kernel could indeed be a bottleneck, so other kernels should be explored.
R3: "latent representation that is learned is so similar to the undistorted data" This is often found to be the case; we comment on this on line 343. It is probably due to the fact that the undistorted pose is the mean pose of the distorted MNIST datasets. The mean pose is the pose requiring the least extreme transformation across all data samples, so it is the easiest to learn. However, we have observed that, due to the random ordering of presented training data, a random bias towards a non-mean pose in the training data can sometimes bias the localisation network towards a pose that is not the same as the undistorted data.
R3: "What happens when you apply the STN in the later layers of the network?" We make use of transformers at later layers of the network in our SVHN experiments (the ST-CNN Multi model, see Table 2 right (a)). This allows us to use much smaller localisation networks, as the features, extracted deep in the network, can encode the object pose. Though not presented, we performed similar experiments on MNIST (with the transformer after the first convolutional layer) and this gave very similar results.
R5: "Table 1: could show error rates on MNIST without distortions too" Thanks, we will look at adding this to the final version.
R5: "fully connected transformer layers... will blow up number of parameters" Since there are only 32 units in the FC layers of the localisation nets for the ST-CNN Multi model, the increase in the number of parameters is not huge. The base CNN has 28M params, the ST-CNN Single has 29M params, and the ST-CNN Multi has 31M params. The extra params for ST-CNN Multi are 300k from ST1, 1.1M from ST2, 1.5M from ST3, and 700k from ST4. We will be more explicit about the number of parameters.
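For reference, the quoted figures can be tallied directly (a trivial sanity check on the numbers above; the observation about rounding is mine):

```python
# Parameter budget of the ST-CNN Multi model, using the figures from the rebuttal.
base_cnn = 28e6                               # base CNN parameters
extra_per_st = [0.3e6, 1.1e6, 1.5e6, 0.7e6]   # ST1..ST4 localisation-net parameters
total = base_cnn + sum(extra_per_st)
print(f"ST-CNN Multi total: {total / 1e6:.1f}M parameters")
```

The sum comes to roughly 31.6M, close to the 31M quoted, i.e. the four localisation networks add about 13% on top of the base CNN.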
