Paper ID: | 7549 |
---|---|

Title: | Non-normal Recurrent Neural Network (nnRNN): learning long time dependencies while improving expressivity with transient dynamics |

The authors introduce a new method to enhance the expressiveness of orthogonal RNN via the Schur decomposition, based on previous work by Ganguli et al. Orthogonal/ unitary RNN surpass gated RNN (LSTM, GRU) in tasks that involve long-term dependencies, but are limited in computational expressiveness due to the unitary constraint. Using the Schur decomposition for non-normal matrices, the authors partly overcome this limitation, with the off-diagonal elements of the Schur matrix enabling information propagation and interactions between the different modes. Strengths: In my mind this is a strong paper that makes important contributions to the field, further advancing a promising concept for solving the vanishing/exploding grad problem in RNN. As the authors stated, their work is strongly rooted in previous work by Ganguli et al., but through further simulations and analyses they demonstrate how these ideas could boost unitary RNN. Weaknesses: Given that the authors argue convincingly that the Schur decomposition should give much larger expressiveness, it is somewhat surprising that the performance gap between nnRNN and LSTM (0.12) is still twice as large as the one between nnRNN and expRNN (0.06). This seems to indicate that nnRNN are not really that much further ahead than the best-performing orthogonal RNN when compared to LSTM. How do the authors explain this, is this just due to the gating mechanism that nnRNN lack? Minor: - Isnâ€™t the result in Fig. 1b completely expected, given that an orthogonal RNN lacks the off-diagonal elements (which could boost hidden states) in the Schur form?

Originally: The ideas presented in the paper are original. Quality: The theoretical and experimental analysis seems to be properly conducted. However, as the non-normal approach is compared to the performance of the expRNN and LSTM, comments should be added, how the asymptotic (in number of units or parameters) runtimes of the different approaches compare. Especially the runtime of optimizing P (line 187-189) should be clarified. Clarity: The paper is clearly written. Significance: Solving the proplem of vanishing and exploiding gradients is of large importance and at the same time a very difficult endeavor. Extending the orthogonal RNN approach and fixing its problems is one way to tackle this problem. The paper contributes novel theoretical ideas towards fixing this problems. The experimental improvement seems rather small and thus the immediate significance is probably limited. However, this work might be a step towards solving the vanishing and exploiding gradient problem. _____________________ After reading the author feedback and reviewer discussion I increase my score to 6. I think overall the paper has strong contributions and valuable insights, though tackling a hard problem. However, I ask the authors to state the asymptotic runtime of the proposed algorithms in the final version of the manuscript. I think it would be very bad to hide the O(n^3) runtime from the reader.

This paper proposes nnRNN, which is the non-normal matrix relaxation to improve the expressivity of RNNs with unit eigenvalue constraints. Previous studies use the unitary or orthogonal matrix constraints to prevent the exploding and vanishing gradient problems in RNNs, but these constraints deteriorate the performance in empirical tasks due to the strong constraints. This paper relaxes the unitary matrix (which is a normal matrix that has eigenvalues on the unit circle) constraints as allowing the non-normal matrices that have the eigenvalues on a unit circle. In addition, this paper theoretically and empirically shows how the non-normal relaxation affects the dynamics of RNNs. This paper is well written, and the motivation of non-normality is explained thoroughly. The main results are original and interesting since there are few studies that focus on whether the recurrent matrix is a normal matrix or non-normal matrix. However, I tend to vote to reject due to the following issues. 1. There is no guarantee that nnRNN solves the exploding gradient problem. The URNN paper [ASB16] shows that URNN can solve the exploding gradient problem due to the spectral norm (the largest singular value) of one rather than due to eigenvalues on a unit circle. The singular values of non-normal matrices can be larger than one even when they have eigenvalues of one. Since the spectral norm is larger than or equal to spectral radius (the largest absolute value of eigenvalues), the gradient vanishing might be prevented by nnRNN. However, nnRNN might not prevent gradient from exploding. As a simple example, we consider the case of a linear RNN h_{t+1}=Vh_{t} where V=[1,2; 0,1]. In this case, V is a non-normal matrix and its eigenvalues are 1 while its spectral norm is larger than 1. The Jacobian of dh_{T}/dh_{0} can be \prod_{i=0}^{T-1} dh_{i+1}/dh_{i}=V^{T}= [1,2T;0,1], and the norm of Jacobian increases according to T. Therefore, there is no guarantee that non-normal matrices can solve the exploding gradient problem even if they can alleviate exploding gradient. In fact, Fig. 1 shows that multiplying the non-normal matrix increases the norm of the hidden vector. This implies that the chain rule can make the gradient explode by using non-normal matrices. 2. The experimental condition might be unfair. In experiments, the eigenvalues of nnRNN are optimized as gamma while the eigenvalues of other methods are constrained to be one. Due to this difference, the contribution of non-normal matrix relaxation is not clear; the improvements in nnRNN might be caused by the relaxation of eigenvalues. -------------- I have read the author feedback. My questions are mostly solved, and interesting and important discussions will be provided. Therefore, I raise the overall score.