Paper ID: | 3001 |
---|---|

Title: | Limiting Extrapolation in Linear Approximate Value Iteration |

Originality The work is in some ways similar to Yang and Wang but takes different approaches and derives different insights. Originality is good. Quality I find the quality very good; the work is thorough and the experimental results are appropriate to their purpose, which is to illustrate the theory rather than to provide state-of-the-art applied results. l.119 "The best approximator..." - There are many definitions of a "best approximator." Please give some insight to the reader on why this particular definition of "the best approximator" was chosen. Throughout: "Lem." and "Thm." are used inconsistently with inconsistent spacing. Just write them out. l.150 - "minimizing ... and lower" - "minimizing ... and lowering" l.298 "represent well action-value functions" - "represent action-value functions well" The paper is very well written. I have a few specific comments. Significance As indicated above, I think the ideas will be influential in future work on limited-data RL problems. *** I have read the author response and I appreciate their clarifications.

After rebuttal: Thanks the authors for addressing my concerns. I read the authors' feedback and other reviews. This work has some contribution on the theoretical side but I believe its empirical contribution is limited. The experiment is simple and the algorithm might not work very well in general. I'll keep my original score. ----------------------------------- This paper studies the finite horizon value iteration problem with linear approximation given a generative model. This paper is clearly written and easy to follow. The author proposes an algorithm and then prove a sample complexity bound for the algorithm. Finally, some experiments on the toy models are shown to support the theoretical result. The AVI problem has been studied for a long time. There are mainly two types of guarantees needed for the convergence. One is the bounded inherent Bellman error and the other is the contraction of Bellman update. The main idea of this paper is to balance the generalization and compactness in LAVI. The algorithm will run backward. At timestep t, the agent uses the generative model to draw the backup values and timestep t+1. Then it solves an optimization problem to obtain the interpolation coefficient \theta of each anchor point in timestep t. The sample complexity will be low if the amplification factor and the number of anchor points can be controlled. The result will hold even under infinite inherent Bellman error. Major: 1. The idea of trying to balance the compactness and generalization is new to me. 2. The sample complexity is polynomial on the number of anchor points K, the length of horizon H, and the amplification factor \bar{C}. However, the limitation is that \bar{C} seems usually to be exponential. Although Prop 1 shows a specific case that \bar{C} is small, the number of anchor point seems not controllable. Usually, the smaller the amplification factor, the larger the number of anchor points. My major concern is that the final sample complexity might still be exponential. 3. This paper does not consider the selection of anchor points. I agree that finding the anchor points are generally hard. Suppose a set of (probably good) anchor points are given in the last timestep H, is it possible to design an algorithm to automatically find the anchor points from timestep from H-1 downto 1? I’m also interested that whether the result of this work can be extended to the infinite horizon case since a closely related work Yang and Wang [15] considers discounted infinite horizon MDP. 4. In the theoretical part, the result holds for the continuous state space. However, the authors only investigate the discrete state space in experiment. Does the algorithm also work empirically in the continuous state space? Minor: 1. Line 178: in not unexpected --> is not unexpected 2. Line 252: Fig. 5 --> Fig. 2

## Summary This paper consider linear-combination-of-features approximation to the Q-value function in a finite horizon reinforcement learning problem. While the linear approximation requires few parameters and scales efficiently with the state and action space dimension, when Q-valuate iteration is applied to it, the combination of least-squares projection and Bellman operator applications may be expansive leading to divergence. The paper proposes to use the linear approximation to the Q-function at a finite set of anchor states and linearly interpolate to other state-action pairs. This balances the compactness of a linear approximation with reduction in error amplification after Bellman updates due to the interpolation. The authors prove a bound on the error of the proposed Q-function interpolator with respect to the best linear linear-combination-of-features approximation to the Q-value function. ## Recommendation The paper is well written and easy to follow. The idea of trading off linear approximation and extrapolation in linear-combination-of-features function approximation is interesting. The theoretical intuition about error scaling and control of the error amplification due to Bellman iterates is a good and novel contribution. A weakness of the proposed method is that its effectiveness still depends critically on the availability of good features \phi. The paper could be strengthened if ideas of how interpolation can be applied to other value function approximators were provided. The experimental evaluation is insufficient as only simple problems are considered and not nearly enough experimentation is provided to show empirically the error scaling, the effect of the anchor point selection, the advantage over averagers, or the dependence on the choice of \phi. More results should have been provided on the dependence between the anchor number and location optimization and the choice of features \phi. The theoretical derivation of the error bounds is illuminating in terms of the structure of error scaling but the bounds themselves do not seem practically useful. ## Major Comments + The introduction and preliminaries sections are very well written, giving the right context and a formal problem definition. + The specification of the necessary number of samples to minimize the error between the interpolant and the best linear Q-function approximation in Lemma 2 is nice. The greedy approach to adding anchor points to meet a specified amplification target is a nice contribution. - The greedy heuristic to construct a good set of anchor points before the learning process is a good addition to the proposed algorithm but is not explored in sufficient detail. A discussion on how complex the optimization problem max_{s,a} ||\theta^{\phi(s,a)}||_1 is should be added. Examples of the numbers, positions of anchor points, and the resulting extrapolation coefficient for common choices of \phi would have been nice to see. - The experiments are similar and somewhat simple. It would have been interesting to evaluate LAVIER on a real scale problem in addition to the toy problems vs LS-AVI. For example, it would have been great to see how LAVIER performs on a reinforcement learning problem where a good choice of features is not known but polynomial, RBF, or kitchen sink features are used. A comparison versus averagers and/or OPPQ-Learning [15] would have been interesting as well. It would be have been good to provide more variations on the Chain MDP evaluation, in terms of different feature choices and different planning horizon choices. ## Minor Comments: - Typos: "Whenever the ineherent Bellman error is unbounded," - It would be cleaner to explicitly write the dependence on s_i and a_i in eq. (1) - The meaning of \delta should be introduced more clearly in Lemma 2 - In Fig. 3 middle, what value of C is achieved by the shown anchor points? - The notation is sloppy at places, switching from arguments in parentheses to subscripts for functions or from having two to one subscript (e.g., Proof of Proposition 3, around eq. (2))