
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides a theoretically grounded adaptive
step size for policy gradient methods, in which a lower bound on the
expected policy improvement is maximized. An approximate version of the
step size can be computed using only the estimated gradient, so the
computational complexity is the same as that of ordinary policy gradient
methods. A mild assumption, that the standard deviation of the Gaussian
policy is fixed, is needed to derive the lower bound. Experimental results
show that the proposed method works well when the fixed standard deviation
is large enough.
Quality,
clarity and originality: The motivation of the paper is clearly
described and its organization is reasonable. The paper provides
potentially useful theoretical analyses such as Theorem 4.3 and Corollary
4.4, although I could not check the derivations of the lemmas and theorems
in detail. The originality is definitely high.
Significance:
Newton's method would be a standard approach for an adaptive step size,
but the second-order derivative of the expected return with a Gaussian
policy is not guaranteed to be positive semidefinite. Thus, an adaptive
step size based on maximizing a lower bound would be a good alternative
to ordinary policy gradient methods. However, as the author(s) review in
the introduction, there are several extensions of policy gradients for
stable policy updates, such as EM policy search. Thus, it would be much
more convincing if there were more experimental comparisons with these
methods.
Small comments and questions: * A conclusion section
should be added. * Is it really possible to converge to the optimal
policy with only one update for \sigma=5 (in Table
1)? Q2: Please summarize your review in 1-2 sentences
The paper provides potentially useful theoretical
analyses and a good alternative to policy gradient methods. It would be
much more convincing if there were more experimental comparisons with
state-of-the-art methods such as EM policy search. Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
This paper investigates how to automatically set the
step size of policy gradient methods, in particular focusing on the
Gaussian policy case. The paper provides a lower bound on the difference
in performance when taking a gradient step, and optimizes this lower bound
with respect to the step size.
Overall, I like the approach of
thoroughly studying the interplay between step size and policy
performance. While the paper could be improved with clearer, more direct
results, I think this is a fine, albeit incremental result for NIPS.
Quality. The paper contains a hefty amount of theoretical work,
but sometimes lacks focus, e.g. Theorem 3.3 seems extraneous to
the main result. I think the paper would be improved by a clear
through line and less "beating around the bush".
The
experimental results are somewhat disappointing. While they constitute a
meaningful comparison, they do not strongly confirm the superiority of the
approach.
Clarity. The paper is sometimes hard to follow because some of the
steps are relegated to the supplemental material or ignored altogether.
For example, the authors argue repeatedly using stationary points without
providing formal details.
Originality. This is original work.
Significance. I appreciate the idea of improving policy gradient
methods by refining the step size, and doubly so by obtaining a
theoretical grasp of the problem. But I think the results presented here
are marginally significant.
Suggestion. The algorithmic aspect of
this work (Section 5) comes in late, and seems mostly an afterthought. I
think the paper could be much improved by emphasizing the algorithm, and
subsequently deriving the quantities needed in the
bound. Q2: Please summarize your review in 1-2
sentences
Overall, I like the approach of thoroughly studying
the interplay between step size and policy performance. While the paper
could be improved with clearer, more direct results, I think this is a
fine, albeit incremental result for NIPS.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
Overview ======== The paper proposes a lower
bound on the expected performance gain for policy gradient methods. This
lower bound can then be used to optimize the step size. The bound is given
in general form, specialized for Gaussian policies, and finally a version
for REINFORCE and G(PO)MDP/PGT is given, which is also evaluated
empirically on a simple LQG problem.
Quality ======= The theoretical
part is very nice with a lot of material. The experiment nicely
illustrates the approach. It would be highly interesting to evaluate how
well the adaptive stepsize performs in motor control and robotic
applications (but that is clearly out of the scope of this paper).
Clarity ======= The paper reads well. The paper itself
nicely manages to convey a notion of where the theorems and lemmas come
from/how they are proven. I like the structure of the paper that goes from
more theoretical to more and more applied. However, towards the end
(Sect. 6) I got the feeling that it gets a bit cramped...
Originality
=========== To my knowledge, nobody has looked into adaptive
step sizes for policy gradient methods from a theoretical point of view
yet.
Significance ============ Even though tighter bounds
would be nice (as mentioned in Sect. 6), the results are interesting from
a more theoretical point of view and even show promise in experiments.
Minor Comments: =============== Table 1: maybe you could
add another line with the optimal step size (determined by line search as
discussed in Sect. 1).
l 75: "do not need to tune a free
parameter" is a bit misleading. It is true that this type of algorithm
does not have a step size, but e.g. the exploration magnitude still needs
tuning...
Table 1: maybe you could mark the "winners"
Table 1: 1 step for alpha* with sigma=5 is hard to believe / a very
lucky coincidence
Table 2: How do you count the iterations here?
The number of update steps or (number of update steps)*(number of
trajectories)?
Table 2: maybe use e notation instead of
>100,000
Table 2: especially here a comparison to the
heuristics would have been interesting
Supplementary Material Fig.
1&2: maybe you could use a logarithmic iteration axis (so that we
can better see what happens at the beginning with alpha=1e5/t). Axis
labels would also be nice.
Public Review: ==============
The author response addresses most of the above points. The notes on
spelling and grammar mistakes as well as minor formatting/LaTeX
suggestions were removed. Q2: Please summarize your
review in 1-2 sentences
The paper tackles a very interesting unsolved problem
from a theoretical point of view and manages to get results that have
great potential for practical applications. Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
This paper proposes a new policy-gradient
reinforcement learning method in which the step size is set adaptively by
maximizing a lower bound on the expected performance gain. The authors
first consider the case where the policy gradient is known and establish a
lower bound on the difference in performance between the original policy
and the modified policy resulting from an update that follows the policy
gradient. Since the above bound is a fourth-order polynomial, the authors
then consider the special case in which the stochastic policy is
represented as a Gaussian distribution with a fixed standard deviation and
a mean that is a linear combination of the state features. In this case,
they derive a lower bound that is quadratic in the step size and has a
single maximum for positive step sizes. Next, they remove the assumption
that the policy gradient is known and establish a high-probability lower
bound given an epsilon-optimal estimate of the policy gradient. Finally,
they show how two existing techniques, REINFORCE and G(PO)MDP/PGT, can
provide such policy gradient estimates, leading to an adaptive step size
policy gradient method that does not require a model of the environment
and can learn only from sample trajectories. The proposed method is
evaluated in a simple simulation experiment using a 1-dimensional toy
problem.
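To illustrate the mechanism described above: once the lower bound on the
performance gain is quadratic in the step size, the maximizing step size
has a closed form. A minimal sketch of this step (the coefficient names
a and b are placeholders, not the paper's exact expressions, which are
derived from the policy gradient and the problem constants):

```python
def adaptive_step_size(a, b):
    """Maximize the quadratic lower bound B(alpha) = a*alpha - b*alpha**2.

    a, b > 0 are placeholder coefficients assumed to be computed from the
    (estimated) policy gradient; the paper derives their exact form.
    Returns the maximizing step size and the guaranteed improvement.
    """
    assert a > 0 and b > 0
    alpha_star = a / (2.0 * b)           # unique maximum for alpha > 0
    guaranteed_gain = a ** 2 / (4.0 * b) # B(alpha_star), always >= 0
    return alpha_star, guaranteed_gain

# Example: a = 2, b = 4 gives alpha* = 0.25 with guaranteed gain 0.25.
```

Since B(alpha_star) is non-negative, every update with this step size
yields a monotonically improving policy, which is the robustness property
the reviewers discuss below.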
Overall, this is a high-quality, well-written paper. The
authors make a good argument that most research on policy-gradient methods
has focused on finding better gradient estimators, and thus there remains
room for improvement in automating the step size in such algorithms. The
theoretical results in the paper make a substantial contribution by
providing a principled framework for the automatic selection of such step
sizes by maximizing lower bounds on the expected performance gain. The
paper contributes not only a specific algorithm but also more general
results that could be used to derive algorithms with different policy
representations and gradient estimators.
The main weakness of the paper is the
empirical evaluation, which considers only a toy problem involving the
optimization of one policy parameter. Results obtained using the true
gradient show that manual choices of the step size can lead to slow
learning (too small) or divergence (too big), whereas the adaptive method
always converges within the threshold. For large values of the standard
deviation of the Gaussian policy, the adaptive method performs better than
any of the fixed step sizes tested. However, for small standard deviations
it performs worse, which the authors suggest reflects an inherent
trade-off between the determinism of the policy and the tightness of the
lower bound.
The authors also present results for the case where the
true gradient is not known and the REINFORCE and PGT estimators are used
instead. However, it is not clear what these results demonstrate, other
than that, unsurprisingly, increasing the number of trajectories used to
estimate the gradient improves the gradient estimate and thus learning
performance. More worryingly, however, the results also show that even
with many trajectories, the errors are quite large, leading to loose
bounds that prevent strong performance. This casts doubt on whether the
results have practical implications or are only of theoretical interest.
Furthermore, the empirical results compare only to fixed-step-size
variants and not to other baseline methods. In particular, the EM-based
policy-gradient-like methods of Kober & Peters and Vlassis et al.
mentioned in the introduction seem like particularly relevant baselines
since they do not need to tune a free parameter.
Q2: Please summarize your review
in 1-2 sentences
Well written paper with a substantial theoretical
contribution. The empirical evaluation is preliminary and yields mixed
results for the proposed method.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their insightful comments
and valuable suggestions to improve our paper. Below, we first provide
clarifications for general doubts that are common to several reviews, and
then we answer each reviewer's specific questions.
We are aware that
our paper provides many contributions (theoretical, algorithmic, and
empirical) that needed to be condensed to meet the constraints of the
conference format. In fact, we tried to focus the paper as much as
possible by cutting other contributions that we had included in the first
drafts of the paper (e.g., the case where the Gaussian standard deviation
is not fixed and the derivation for the Gibbs policy model).
Furthermore, to save some space, we decided to merge the empirical
results and conclusions sections. We agree that this final section is
quite compressed, but, in our opinion, the alternative would have been to
drop the experimental section. We thought that showing some empirical
evaluation of the proposed approach is a valuable contribution even in a
mainly theoretical paper.
We agree with the reviewers that adding
comparisons with other related approaches would improve the
significance of the empirical evaluation. In particular, we will add
to the final version of the paper a comparison with EM policy search.
As for a comparison with natural policy gradient (NPG), the
problem is that NPG itself relies on the choice of the step size. What
we could do is extend our approach to NPG, but this would be another
contribution that cannot fit in this eight-page paper and could be
considered for future work.
The surprising one-step convergence
occurring with sigma equal to 5 is not a mistake. It is worth
noting that we stop the algorithm when the accuracy is below the 0.01
threshold w.r.t. the optimal solution. Clearly, even with suitable
fixed step sizes it is possible to achieve such behavior (e.g., when
sigma=5, taking a fixed step size in the interval [3.28e6,3.39e6]
attains 1-step convergence), but small perturbations may lead to
divergence. Our adaptive approach does not require hand-tuning the step
size, thus leading to much more robust policy gradient methods.
Assigned_Reviewer_5
We think that the main issues raised
by the reviewer have been clarified by the arguments reported above.
Assigned_Reviewer_7
The motivation for the theoretical
results introduced in Section 3 is twofold: on the one hand, we introduce
general results that can provide the basis for future work (e.g., taking
Theorem 3.3 and specializing it for other policy models); on the other
hand, they introduce concepts that simplify the presentation of the
following sections.
The goal of the empirical evaluation was not
to prove the superiority of our proposal, but to verify that an automatic
step-size selection can add robustness to policy gradient methods. In
fact, if the problem is well known or many simulations can be performed in
a short time, the step-size parameter can be finely tuned by hand. Our
proposal provides an automatic way to set the step size, obtaining fair
results under different conditions.
The organization of the paper
reflects our view of what the main contributions of the paper
are. The paper starts with a general theoretical contribution that
is later refined for a specific policy family, and then we propose two
algorithms for determining the step size when the policy gradient is
estimated through samples. Finally, we thought that adding some empirical
results, even on a toy example, was genuinely relevant. Since the
algorithms are derived straightforwardly from the theoretical results, we
thought that the current paper organization was quite smooth.
Nevertheless, we will consider the reviewer's suggestion for the final
version.
Assigned_Reviewer_8
We would like to thank the reviewer
for providing several detailed comments to improve our paper.
The
numbers in Table 2 are the number of updates of policy parameters (like in
Table 1).
We think that the other issues raised by the reviewer
have been clarified by the arguments reported at the beginning.
Assigned_Reviewer_9
The contributions of this paper
are mainly theoretical, but we think that they can form a strong basis for
producing practical algorithms. The results in Table 2 are meant to
empirically evaluate the bounds derived in Section 5. As noted by the
reviewer, the results show that, even with a quite large number of
trajectories, the bounds on the gradient are still not small enough to get
really close to the optimal policy while producing a sequence of
monotonically improving policies. The empirical results are meant to
evaluate both the strengths and weaknesses of the proposed approaches, and
the latter help suggest future research directions (in this case, the
identification of tighter bounds will clearly improve the practical
implications of the proposed approach).
We think that the other
issues raised by the reviewer have been clarified by the arguments
reported at the beginning.
 