NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6047
Title: Neural Temporal-Difference Learning Converges to Global Optima

Reviewer 1

This paper studies the non-asymptotic convergence rate of the TD algorithm with overparameterized neural network approximation. With such nonlinear function approximation, the TD algorithm may not even converge in general. The paper exploits the fact that the overparameterized neural network induces an implicit local linearization, which guarantees convergence of the TD algorithm to the global optimum. The paper also generalizes the analysis to Q-learning and policy gradient (in the appendix). This is the first study to connect the recent breakthrough analyses of overparameterized neural networks to reinforcement learning, and to leverage this property to establish convergence rates for reinforcement learning algorithms. The contribution is solid. The presentation of the paper is clear and easy to follow.

I have the following question for the authors to clarify. The neural TD algorithm adopts the average of the parameter matrix W as the output. This allows a constant stepsize, but requires the number of iterations T to be preset. Is there a specific reason to adopt such an algorithm? What happens with the simple semi-SGD output that does not take the average? That variant may require a diminishing stepsize to converge, or a constant stepsize to converge only to a neighborhood of the global minimum. But in terms of the analysis, would the current approach still be applicable? I sketch the two output schemes I have in mind below.
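
To make the question concrete, here is a minimal sketch of the two output schemes (my own illustration, not the authors' code; the names q, grad_q, samples, and the discount gamma are assumed, and the paper's projection step is omitted for brevity):

import numpy as np

def neural_td(q, grad_q, samples, W0, eta, T, gamma=0.9, average=True):
    """Run T semi-gradient TD steps with constant stepsize eta.

    q(s, W) is the network's value estimate at state s, grad_q(s, W)
    its gradient in W, and samples yields (s, r, s_next) transitions.
    If average=True, return the iterate average (the output analyzed
    in the paper); otherwise return the last iterate (the plain
    semi-SGD output asked about above).
    """
    W, W_sum = W0.copy(), np.zeros_like(W0)
    for _ in range(T):
        s, r, s_next = next(samples)
        # Semi-gradient TD error: no gradient flows through the target.
        delta = q(s, W) - (r + gamma * q(s_next, W))
        W = W - eta * delta * grad_q(s, W)
        W_sum += W
    return W_sum / T if average else W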

Reviewer 2

Originality
=========
The paper relies on recent results on the implicit local linearization effect of overparameterized neural networks in the context of supervised learning, and on recent non-asymptotic analyses of linear TD and linear Q-learning. Perhaps the main insight is the relationship between the explicit linearization of linear TD and the implicit linearization of overparameterized neural TD. Related work is properly referenced.

Quality
=========
The paper seems to be technically sound (although I have only skimmed the proofs). The convergence analyses of the three algorithms, namely neural TD, neural Q-learning, and neural soft Q-learning, constitute a complete piece of work. The authors explain the assumptions underlying the analysis, and discuss their generality and how they relate to the previous literature. My only confusion is that the authors talk about 2-layer neural networks, which made me think of a deep architecture with 2 hidden layers. However, Eq. (3.2) seems to define a single hidden layer with a linear output (see my reading of the architecture at the end of this review).

Clarity
=========
The paper is clearly written and well organised.

Significance
=========
The fact that the authors have been able to use the same approach to study the non-asymptotic convergence of three different Bellman operators is promising. In addition, it is worth remarking that ideas similar to those presented here have been used to study a control algorithm in another submission, titled "Neural Proximal Policy Optimization Attains Optimal Policy (5585)," potentially with overlapping authors. However, that submission is sufficiently different from this one, which highlights the usefulness of the ideas presented here. Although the authors assume a simple neural network architecture, I imagine that similar linearization effects can be explored in more complex architectures.
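
For concreteness, the architecture I understand Eq. (3.2) to define has the following single-hidden-layer form (my own reading, with assumed notation; the paper's symbols may differ):

f(x; W) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} b_r \, \sigma(w_r^\top x), \qquad \sigma(u) = \max\{u, 0\},

where m is the width and, if I understand the setup correctly, the output weights b_r are fixed at initialization so that only the hidden-layer weights W = (w_1, \ldots, w_m) are trained. Under this reading, "2-layer" counts the hidden layer plus the linear output layer, i.e., a single hidden layer.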

Reviewer 3

The paper represents a solid effort to analyze the convergence rate. While I am convinced by the results, I cannot verify the details of the proof. I have some questions regarding the theory. For instance, I had difficulty finding the definition of "one-point monotonicity" in the paper, which should probably be included to make the paper self-contained (I state the definition I expected below). Assumptions 4.3 and 5.2 incorporate constants c1, c3, and \mu; however, at least in the main paper, there is no clear explanation of how these constants affect the convergence rate. The paper involves a two-layer neural network, but if all these assumptions apply, it still seems to behave like a linearly parameterized model, and the rate matches that of one. The implications of Theorems 4.4 and 4.6 do not make clear how the width of the neural network affects the rate; this dependence seems to be hidden in the big-O notation, so it is hard for me to assess the importance of overparameterization, i.e., the number of nodes in this two-layer neural network, relative to the number of iterations.
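
For reference, the definition I expected to see is the following (my own statement of the standard notion; the paper's sign convention may differ): an operator g is one-point monotone with respect to a point \theta^\ast if

\langle g(\theta), \theta - \theta^\ast \rangle \ge 0 \quad \text{for all } \theta,

often strengthened to \langle g(\theta), \theta - \theta^\ast \rangle \ge c \|\theta - \theta^\ast\|_2^2 for some c > 0, which is what typically drives a convergence rate. Making this explicit, together with the roles of c1, c3, and \mu, would help the reader.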