Paper ID: 336
Title: Tensorizing Neural Networks
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
My main issue is with the presentation of the crux of the algorithm in Sections 4 and 5. It has a messy presentation and needs to be much more coherent if this paper is to be accepted.

Eq. 5 has no place in Section 4, it really should be in Section 5.

The sentence on line 214 beginning with "Another concern..." is meaningless.

In Section 5, it would be much clearer to take a top-down approach where we start with dL/dW and show how each of the derivatives computed in Eqs. 7-9 is relevant in computing this.

Lines 286-293 are a total repeat of previous statements and should be deleted.

The idea of using tensor tricks for speeding up neural networks is not new: the paper at http://arxiv.org/abs/1209.4120 should be included in the citations.
Q2: Please summarize your review in 1-2 sentences
This paper addresses the computational bottleneck present in CNNs, especially with regard to the fully-connected layers, which add a large number of weights, slowing down both the forward pass and the learning process. The Tensor Train formalism is introduced and is shown to significantly reduce forward-pass and learning runtimes while forming a compact representation for the weights in fully-connected layers.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Fully connected dense layers (W x + b) can be a nasty issue for neural networks since they require storing and working with M x N parameters, and this is often a bottleneck for producing larger models. However, past work has shown that there is redundancy in this representation. This motivates decomposing the matrix W into smaller components, but there are several difficulties with doing this in practice.

In this paper, the authors utilize a general tensor decomposition form known as "Tensor Train" (Oseledets, 2011). They then describe and implement a standard dense layer utilizing this decomposition. The decomposition has several nice properties: 1) there are fast algorithms for running standard linear algebraic operations, and these can be used to implement forward/backward propagation; 2) the method is numerically stable even when decomposed into many parts. They then use this method to decompose a dense layer into a much smaller number of parameters.
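To make point 1) concrete: a minimal NumPy sketch (not the authors' implementation; the core shapes, TT-ranks and the helper name tt_matvec are illustrative) of applying a TT-format matrix to a vector core by core, without ever materializing the full weight matrix:

import numpy as np

def tt_matvec(cores, x):
    # cores[k] has shape (r_{k-1}, m_k, n_k, r_k) with r_0 = r_d = 1;
    # the represented matrix W has shape (prod m_k, prod n_k).
    z = x.reshape(1, -1)                  # (r_0, n_1 * ... * n_d)
    for core in cores:
        r_prev, m_k, n_k, r_next = core.shape
        z = z.reshape(r_prev, n_k, -1)    # split off the current input mode
        # contract over the previous rank and the current input mode;
        # the produced output mode m_k is appended at the end of the buffer
        z = np.einsum('abc,adbe->ecd', z, core)
        z = z.reshape(r_next, -1)
    return z.reshape(-1)                  # length prod m_k

# toy check against the explicitly materialized matrix (small made-up sizes)
G1 = np.random.randn(1, 3, 4, 2)          # row modes m = (3, 5), column modes n = (4, 6), TT-rank 2
G2 = np.random.randn(2, 5, 6, 1)
W = np.einsum('aije,eklb->ikjl', G1, G2).reshape(3 * 5, 4 * 6)
x = np.random.randn(4 * 6)
assert np.allclose(tt_matvec([G1, G2], x), W @ x)

The cost of the loop scales with the core sizes and TT-ranks rather than with the full M x N size of the matrix, which is what makes the forward and backward passes cheap.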

The main result of the work shows that a dense layer in VGG can be replaced with a decomposed layer with a very large reduction in parameters and only a small reduction in accuracy, with no reduction in speed.

This is a very large improvement over a baseline method that just learns a low-rank representation. This seems like an impressive and useful result.

The paper itself could be much improved in terms of clarity. Some issues:

- Generally the figure and table captions are very hard to read and leave several important details unexplained.

- Figure 1 in particular is very hard to read and decipher.

- Having everything in the paper start with TT- becomes confusing. Naming could be simplified.

- There are several typos and grammatical issues, be sure to read over closely.

Questions:

- How does Tensor Train compare here to other tensor decomposition formats?
Q2: Please summarize your review in 1-2 sentences
The presentation and clarity of this paper could be improved, but the results seem impressive and the method brings interesting ideas from tensor decomposition to deep learning.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a compact parametrization of the parameters in the fully connected layer of a neural network. The explosion in the number of parameters in a NN poses significant challenges in terms of computational and storage needs. Recent research has tried to address this challenge by considering low-rank representations of the parameters (among other techniques), and this paper extends that line of research. However, instead of a pipelined approach as in previous work, where the low-rank approximation is done on the learned parameters, this paper considers a tensor-based representation (tensor train) of the parameters and estimates this tensor along with the rest of the model. This is nice. Experimental results on multiple real-world datasets demonstrate that significant parameter compression is possible while suffering minimal loss in accuracy. I think the paper has some interesting ideas which should be helpful to the community.

While the comparison with the hashing approach [2] is nice, I expected the authors to also perform comparisons with previous (sequential) low-rank approximation techniques. I hope the authors will consider such comparisons in the final version of the paper.

There are a few typos: Sec 6.1: "the both"; citation [16]: 1968 => 1986; etc.
Q2: Please summarize your review in 1-2 sentences
This paper presents a method for compact parametrization of the fully connected layer of a neural network. Experimental results demonstrate that significant compression is possible without much loss in accuracy.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Various ways to decrease the model size without sacrificing accuracy have been popular in the acoustic modelling field, and you should certainly include those works in the background review, also pointing out (and perhaps comparing) relations to:

- Yongqiang Wang, Jinyu Li, and Yifan Gong, "Small-footprint high-performance deep neural network-based speech recognition using split-VQ", in ICASSP, April 2015 (something similar to your Hashnet reference, but published earlier)
- Jian Xue, Jinyu Li, and Yifan Gong, "Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition", in Interspeech, 2013
- Reference [8] incorrectly assigns credit: people at Google published a paper on quantising deep nets back in 2012 or so, and in general, in the 1990s quantisation in NN training (Nelson Morgan's work) was so obvious that no one ever cared to publish a paper about such basic ideas.

The first sentence of the Related Work section does not make much sense to me - what do you mean by the 99% memory restriction?

It looks like there is a mistake in the timings in the 2nd column of Table 3?
Q2: Please summarize your review in 1-2 sentences
(This is the light review).

The paper addresses the interesting problem of efficient factorisation of weight tensors using the non-trivial (compared to rank-based approaches) Tensor Train framework. I think this might be an important contribution, as training large models without requiring extra computational resources in the big data era is certainly a challenge worth pursuing.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
First of all, we would like to thank the reviewers for their time and feedback.
We appreciate the reviewers' evaluation of our ideas as interesting and promising.
We also thank the reviewers for their comments on our naming, typos and clarity of presentation. We will improve the paper based on the feedback and have the draft proofread by a native speaker.

R2 "While the comparisons with hashing approach [2] is nice, I expected the authors to also perform comparisons with previous (sequential) low­rank approximation techniques. I hope the authors will consider such comparisons in the final version of the paper."

We show that our approach outperforms the traditional matrix low-rank compression scheme on the MNIST dataset (Fig. 1, "matrix rank" plot) and on the ImageNet dataset (Tbl. 2, the last three rows). We will try to make this comparison clearer, as the matrix low-rank method is indeed an important baseline.
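For reference, a minimal NumPy sketch of the kind of matrix low-rank (truncated SVD) baseline we mean, with made-up sizes rather than our exact experimental setup:

import numpy as np

# truncated-SVD baseline: replace a trained M x N dense layer by two thin factors
M, N, r = 1024, 4096, 64                 # made-up sizes and rank
W = np.random.randn(M, N)                # stands in for a trained weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * s[:r]                   # (M, r), singular values folded in
V_r = Vt[:r, :]                          # (r, N)
x = np.random.randn(N)
y = U_r @ (V_r @ x)                      # two thin matmuls; r*(M+N) parameters instead of M*N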

We thank R1 and R4 for the very relevant citation suggestions; we will add the three papers to the related work section. We have already compared our approach with the low-rank scheme (SVD) on computer vision datasets. We see experiments on speech datasets as an important future step, since it is much more efficient to compress fully-connected layers when there are no other types of layers (as is often the case in speech recognition).
The Gaussian Processes paper (Gilboa, 2012) exploits the properties of a Kronecker product of small matrices (which is essentially a TT-matrix with all TT-ranks equal to 1) to perform an efficient matrix-by-vector product. While their approach is closely related to our work, increasing the TT-ranks of a matrix to some r > 1 greatly increases the flexibility of the format.
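To spell out this connection (a sketch in standard TT-matrix notation; the symbols $G_k$ and $r_k$ here are illustrative): a TT-matrix represents $W$ through cores $G_k[i_k, j_k]$, each an $r_{k-1} \times r_k$ matrix, via
$$W[(i_1, \dots, i_d), (j_1, \dots, j_d)] = G_1[i_1, j_1] \, G_2[i_2, j_2] \cdots G_d[i_d, j_d].$$
When every $r_k = 1$, each $G_k[i_k, j_k]$ is a scalar, the product factorizes over the modes, and $W = G_1 \otimes G_2 \otimes \cdots \otimes G_d$ is exactly a Kronecker product of small matrices; allowing $r_k > 1$ couples the factors and interpolates between this rigid structure and an unconstrained matrix.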


R3 "Looks like there is a mistake in the 2nd column of Table 3 in timings?."
We reproduced the experiments in Table 3 and there were no mistakes in the numbers. The surprisingly high runtime for the TT-layer in the 100-images example is caused by the somewhat inefficient CPU-based implementation used in the TT-toolbox.


R6 "How does Tensor train compare here to other tensor decomposition formats?"
Compared to the Tucker format (Tucker, 1966), the TT-format does not suffer from the curse of dimensionality. There are no stable algorithms for the canonical format (Carroll & Chang, 1970), and the set of all canonical rank-r tensors is not closed, which can lead to poor optimization convergence (Lebedev, 2015). The Hierarchical Tucker format (Hackbusch & Kühn, 2009; Grasedyck, 2010) is an alternative stable tensor factorization which is similar to the TT-format, but the latter has simpler algorithms for basic operations. We will add an overview to the related work section.
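For concreteness (standard parameter counts rather than figures from our paper): for a $d$-dimensional tensor with all mode sizes $n$ and all ranks equal to $r$, the TT-format stores $O(d n r^2)$ parameters, the Tucker format stores $O(d n r + r^d)$ because of its $d$-dimensional core, and the canonical format stores $O(d n r)$ but lacks the closedness property mentioned above; the absence of any $r^d$ term is what "does not suffer from the curse of dimensionality" refers to.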
Despite the mentioned pros of the TT-format, exploring other tensor formats is an interesting future research direction.