
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a new deep architecture, which
combines the hierarchy of deep learning with the time-series modelling known
from HMMs or recurrent neural networks. The proposed training algorithm
builds the network layer by layer using supervised (pre)training with a
next-letter prediction objective. The experiments demonstrate that after
training very large networks for about 10 days, the network's performance on
a Wikipedia dataset published by Hinton et al. improves over previous
work. The authors then proceed to analyze and discuss details of how the
network approaches its task. For example, long-term dependencies are
modelled in higher layers, and the correspondence between opening and closing
parentheses is modelled as a “pseudo-stable attractor-like state”.
The paper is of good quality. It is well-written, the experiments are
set up well, and the figures support the findings well. My main issue with
the paper is the experiments: the comparison is done on one dataset only,
which does not seem to have much prior work on it. Direct comparison is made
only with other RNN-related models, not with perhaps more common hierarchical
HMMs etc., which are able to solve similar tasks (Tab. 1 lists other
approaches, but on a different corpus, which makes them incomparable).
The contribution of the paper lies more in the analysis of /how/ the
network operates (which is analyzed quite well) than in what it achieves,
but this is not what the paper sets out to do. Q2: Please
summarize your review in 1-2 sentences
A new deep architecture is proposed, combining the
hierarchical features of deep learning with time-series modelling. The
paper is well-written and the model is analyzed well, but the results are
somewhat inconclusive. Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
REPLY TO REBUTTAL: I mostly agree with the one-task
comment. It would still be worth trying truncated backprop with,
e.g., 30 steps. It is less likely to learn to balance parentheses (though it
would be good to verify that), but it may still achieve competitive
entropies at a fraction of the cost.
The usefulness of deep
recurrent neural networks is investigated. The paper reaches
interesting conclusions: when holding the number of parameters
constant, deeper recurrent neural networks outperform standard RNNs.
Other interesting findings include an analysis that shows the deeper
networks to contain more "long-term information", and the result that
plain BPTT can train character-level RNNs to balance parentheses,
where previously this was thought to be possible only with HF.
However, the paper claims to introduce the deep RNN, but such models have
been used before; for example, in
http://www.cs.toronto.edu/~graves/icassp_2013.pdf, and likely much
earlier than that. So the paper should not claim to have introduced this
architecture. It is OK not to introduce a new architecture and to focus
on an analysis.
The analysis, while meaningful, would be even
more interesting if it were shown on several problems, perhaps
something on speech. Otherwise the findings may be an artifact of this
particular problem.
Finally, the HF experiments were likely too expensive,
since the truncated-backprop approach (introduced in the 1990s)
was successfully used by Mikolov (see his PhD thesis) to train RNNs
to be excellent language models. Thus it is likely that truncated
BPTT would do well here too, and it would be nice to obtain a
confirmation of this. Q2: Please summarize your
review in 1-2 sentences
This work presents an interesting analysis of deep
RNNs. The three main results are: 1) deep RNNs outperform standard RNNs when
the number of parameters is fixed on character-level language modelling;
2) the deeper layers exhibit more long-range structure; and 3) truncated
BPTT can train character-level RNNs to balance parentheses.
The
results are interesting, but given that it is only an analysis, the paper
would be a lot stronger if it had a similar analysis on some speech task
(for example, to show that deeper RNNs do better, and to show that their
deeper layers carry more long-range information). Submitted
by Assigned_Reviewer_7
Q1: Comments to author(s).
I like the point about the arbitrariness of older data
benefiting from potentially many more layers of processing compared to
newer data. And building a deep RNN seems like a reasonable way to do
better on shorter-term structure, in addition to other possible benefits.
Another obvious thing to do would be to put extra layers of nonlinearities
between each time step. Have you tried this?
How important was the
gradient normalization, and why does it "avoid the problem of
bifurcations"? There is a recent paper in ICML 2013 by researchers from U
of Montreal that looks at gradient truncation and optimization in RNNs
that may be relevant here. Also, will normalizing the gradient in
this way potentially mess up the "averaging" behavior of SGD?
In
terms of previous results on these kinds of Wikipedia compression tasks,
there is also some work by Mikolov et al. that you may want to compare to.
Of the various experiments designed to examine the different roles
played by each layer in terms of timescale and perhaps "abstraction", the
one I find most persuasive is the text generation one (as shown in Table
2). However, as pointed out by the authors themselves in the paragraph on
line 241, it may be problematic to interpret the effect of this kind of
"brain surgery" on RNNs due to the complex interdependencies that may have
developed between the outputs of the various layers.
For me,
the biggest missing piece of the empirical puzzle in this paper is the
question of whether the higher layers are actually *better* at processing
more abstract and long-term properties of the text than if the units were
moved to the first layer, i.e., are they benefiting in a non-trivial way from
the extra levels of processing that precede them? That they happen to
take on these seemingly more abstract and longer-term roles after training
is suggestive but incomplete evidence that this is the case.
I notice
that the regular RNN seems to have fewer parameters in these experiments,
since 2*767^2 * 5 > 2119^2 (I'm counting both recurrent and inter-layer
weight matrices, hence the multiplication by 2), so the comparison might
be a bit unfair. A more convincing result would be if the deeper RNNs did
better than a standard RNN with the same number of parameters, or, even
better, the same number of *units*: say, a 2-layer DRNN versus an RNN with
the same number of units, where the difference in the number of parameters
wouldn't favor the RNN so much that the comparison would be rendered
unfair. This would strengthen the paper's claims a lot in my opinion.
Also, instead of Figure 2, it would be better to have seen how
well various depths of DRNN did on the benchmarks when trained from
scratch, possibly with wider layers than their deeper counterparts to make
up for the difference in the numbers of parameters and/or units.
Minor: You should define DRNN-AO and -1O in the text
somewhere and not just in Figure 1. Q2: Please summarize
your review in 1-2 sentences
This paper looks at a hybrid of deep and recurrent
neural networks called DRNNs, which are like deep networks but with
recurrent connections at each layer that operate through time. The authors
show how such an architecture can work very well for text
prediction/compression compared to existing approaches.
A large
part of this paper is devoted to a series of experiments designed to argue
that the higher-level layers are processing/representing more abstract and
long-term structures in the data. These experiments are pretty convincing,
but I have a few reservations, as elaborated on in my full review,
and would like to see a couple more experiments.
I think that it
is worthwhile to look at these kinds of deep temporal networks and to gain
insight into how they function after training. The paper is also easy to
read, and seems quite intellectually honest and thorough about its own
potential problems and shortcomings, which is something I especially
appreciate.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
GENERAL REMARKS:
ONE TASK ONLY: Concerning the
remarks on using only one dataset: the main reason for this is the page
limit, and the fact that this is a difficult real-world large-scale
dataset in which we know a temporal hierarchy definitely exists. We chose
to verify our claims on this task properly, rather than spending the
available space on presenting more superficial findings on several tasks.
In the past we have used (relatively small) DRNNs on the TIMIT speech
corpus, which confirmed the advantage of multiple layers (though we didn't
reach the state of the art). In fact, the paper by Graves mentioned by
reviewer 6 is on TIMIT, and they reach extremely good performance by
stacking LSTMs, in virtually the same way as we do with RNNs. Even though
this result is confounded by the fact that LSTMs are somewhat different
from RNNs, deep architectures greatly improve performance here too, which
strengthens our claim. We propose to mention this point in the
discussion.
MODEL SELECTION: There seems to be some confusion
on this issue: the number of nodes per layer for each model was carefully
selected such that all models had a total of (as close as possible to) 4.9
million trainable parameters, in order to allow comparison with other
work in the literature. The DRNNs are defined by 5 recurrent weight
matrices, 4 inter-layer weight matrices, a variable set of output weights
(1 or 5 matrices for the DRNN-1O and -AO respectively) and a set of input
weights (this is Z_1). Including bias terms we obtain:
DRNN-1O: input weights: 727*97; inter-layer: 4*727*728; recurrent: 5*727^2; output: 728*96; total: 4900072
DRNN-AO: input weights: 706*97; inter-layer: 4*706*707; recurrent: 5*706^2; output: 5*707*96; total: 4896590
RNN: input weights: 2119*97; recurrent: 2119^2; output: 2120*96; total: 4899224
Concerning model selection: the choice of 5 layers was based on
earlier experiments (performed in a somewhat different way, such that they
don’t fit in the paper as it is now), but 4 or 6 layers would probably
give similar results. Both these points can be made clearer by
changing the text in minor ways.
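As a quick sanity check, the DRNN-AO and RNN totals above can be reproduced with a few lines of arithmetic. This is only a sketch; the layer sizes and the bias convention (input dimension 97 including a bias column, output dimension 96) are read off the counts in this rebuttal:

```python
# Parameter-count check (a sketch; sizes and bias conventions taken from
# the counts listed above).
def drnn_ao_params(n=706, vin=97, vout=96, layers=5):
    input_w = n * vin                        # input weight matrix
    interlayer = (layers - 1) * n * (n + 1)  # 4 inter-layer matrices (with bias)
    recurrent = layers * n * n               # 5 recurrent matrices
    output = layers * (n + 1) * vout         # one output matrix per layer (AO)
    return input_w + interlayer + recurrent + output

def rnn_params(n=2119, vin=97, vout=96):
    # input weights + one recurrent matrix + output weights (with bias)
    return n * vin + n * n + (n + 1) * vout

print(drnn_ao_params())  # -> 4896590, matching the total above
print(rnn_params())      # -> 4899224, matching the total above
```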
REVIEWER 4
The
meta-parameters we optimised (crudely, given the length of a single
experiment) were those defining the training algorithm (initial learning
rate vs. number of training examples) and the initial scaling of the weight
matrices. Fortunately, because this is such a large-scale
problem, there is little difference between test and training error, and we
can safely rely on training errors for meta-parameter optimization.
REVIEWER 6
 Indeed, the concept of DRNNs is not
entirely novel (though we do in fact refer to the paper suggested by the
reviewer). We will change the text in order to make this clearer.
 Note that we have essentially used truncated BPTT (TBPTT), by
running many short (250-character) sequences in parallel. Commonly,
TBPTT is used with even shorter sequences (on the order of 10), but for
character-level modelling this would mean that training could only take
into account word-length sequences, and we would not be able to obtain the
long-term dependencies shown in the paper. The reference given by the
reviewer is on word-based language modelling, where short sequences carry
far more meaning.
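The chunked training scheme described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: a tiny character-level RNN on a toy corpus, where gradients are backpropagated only within each chunk of `trunc` characters while the hidden state itself is carried across chunk boundaries.

```python
import numpy as np

# Truncated-BPTT sketch (illustrative; all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
text = "abcd" * 200                       # toy corpus with a trivial pattern
ix = {c: i for i, c in enumerate(sorted(set(text)))}
V, H, trunc, lr = len(ix), 16, 8, 0.1

Wxh = rng.normal(0, 0.1, (H, V))          # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))          # recurrent weights
Why = rng.normal(0, 0.1, (V, H))          # hidden-to-output weights
mems = [np.zeros_like(W) for W in (Wxh, Whh, Why)]  # Adagrad accumulators

def chunk_loss_and_grads(inputs, targets, h):
    """Forward + backward over one chunk; BPTT stops at the chunk boundary."""
    xs, hs, ps, loss = {}, {-1: h}, {}, 0.0
    for t, (ci, ct) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(V); xs[t][ci] = 1.0
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
        y = Why @ hs[t]
        e = np.exp(y - y.max()); ps[t] = e / e.sum()   # softmax
        loss -= np.log(ps[t][ct])                      # cross-entropy
    grads = [np.zeros_like(W) for W in (Wxh, Whh, Why)]
    dhnext = np.zeros(H)              # zero: no gradient from the next chunk
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0
        grads[2] += np.outer(dy, hs[t])
        draw = (1.0 - hs[t] ** 2) * (Why.T @ dy + dhnext)
        grads[0] += np.outer(draw, xs[t])
        grads[1] += np.outer(draw, hs[t - 1])
        dhnext = Whh.T @ draw
    return loss, grads, hs[len(inputs) - 1]

ids = [ix[c] for c in text]
losses = []
for epoch in range(5):
    h, total = np.zeros(H), 0.0       # hidden state carried across chunks
    for s in range(0, len(ids) - trunc - 1, trunc):
        loss, grads, h = chunk_loss_and_grads(ids[s:s + trunc],
                                              ids[s + 1:s + trunc + 1], h)
        for W, dW, mem in zip((Wxh, Whh, Why), grads, mems):
            np.clip(dW, -5, 5, out=dW)          # guard against exploding grads
            mem += dW * dW
            W -= lr * dW / np.sqrt(mem + 1e-8)  # Adagrad step
        total += loss
    losses.append(total)
print(f"epoch losses: {losses}")
```

Because the state is carried forward but `dhnext` is reset at every chunk boundary, information can still propagate over arbitrarily long spans even though the gradient cannot, which is exactly the trade-off discussed above.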
REVIEWER 7
 Adding extra nonlinear
layers between time steps is indeed an idea we considered, but in the end
didn’t implement. For one, it would be slow in execution (as there is less
low-level parallelism to take advantage of), and such an architecture
would likely suffer much more from vanishing-gradient problems, as it would
amount to a network N times as deep as a common RNN unfolded in
time, N being the number of layers between each pair of time steps.
 RNNs
can exhibit extremely large gradients very suddenly (associated with being
at or near a bifurcation). Simply using such a gradient would lead to a very
large jump in parameter space, with unpredictable and usually
catastrophic results. We have also tried truncation, but this seemed to
impede performance in the end. The bifurcations themselves are not the
main problem; the size of the gradient is. We will rewrite the
sentence in question to make this clearer. Indeed, the average of the
normalized gradient is not the same as the normalized average gradient,
so there will be some difference between the two, but the
same point can be made about truncated gradients.
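The distinction drawn above can be sketched as follows (our reading of the rebuttal, not the authors' exact code): rescaling the whole gradient vector to a maximum norm bounds the step size while preserving its direction, whereas elementwise truncation changes the direction as well.

```python
import numpy as np

# Gradient normalization sketch: cap the norm of the full gradient so a
# sudden huge gradient (e.g. near a bifurcation) yields a bounded step.
def normalize_gradient(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)  # same direction, bounded length
    return grad                          # small gradients pass through unchanged

# Elementwise truncation, for contrast: clips each component independently,
# which can rotate the gradient direction.
def truncate_gradient(grad, cap=1.0):
    return np.clip(grad, -cap, cap)

g = np.array([3.0, 4.0])                      # norm 5
print(normalize_gradient(g, max_norm=1.0))    # -> [0.6 0.8] (direction kept)
print(truncate_gradient(g, cap=1.0))          # -> [1. 1.] (direction changed)
```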
 Indeed, the
experiments so far are only suggestive, not a full confirmation of
increasing abstraction. We are currently contemplating better experiments
to gauge the level of abstraction of each layer. One potential way is to
sample optimal text sequences for individual node activations at different
layers, and to see whether these show higher levels of abstraction higher up
in the DRNN, but this is non-trivial due to the discrete nature of the input.
Due to page restrictions (and time limitations) we will not be able to
include this in the paper.
 The DRNN-1O / -AO are defined in the
text at the start of section 2.2.