NeurIPS 2019
Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
Paper ID 5078: Using Statistics to Automate Stochastic Optimization

Reviewer 1

This paper studies how to test the stationarity of stochastic gradient descent with momentum using advanced test statistics that take time correlations into account. Extensive experiments demonstrate the advantage of the proposed method over existing approaches.

Originality: The paper extends a recent paper by Yaida. It does not seem especially original to me, but the authors do combine Yaida's condition with more advanced test statistics in a new way. Overall the extension is quite natural, so the conceptual novelty is not high; to be fair, the paper still presents some solid new results.

Quality: I like the extensive experiments in this paper, and their quality seems reasonably good. On the theoretical side, I like the intuition that time correlations must be taken into account. However, there seems to be a gap between the theory's assumptions and practice. The authors assume in theory that the gradient estimate is Markov, but in the simulations they run many epochs. If I understand correctly, the data are reused across epochs, so there are clearly time correlations between the reused data at different steps, and the Markov assumption seems too strong. If each data point were used only once and the gradient were sampled in an online manner, the Markov assumption would be reasonable; but the authors are clearly not considering this case, since the online, single-pass setting does not make sense in the overparameterized regime. Overall, I do not think the proposed method has a solid theoretical foundation: although the Markov assumption is intuitively better than the i.i.d. assumption, it is not precise when data is reused over multiple epochs. Another possibility is to fix the data set and consider only the randomness in sampling. For that case the Markov assumption seems fine, but how would one prove time homogeneity?

Clarity: The paper is well written and quite clear.

Significance: The proposed test statistics, or some variants, may be useful for practical problems, although many other heuristics can also be used to determine whether the SGD iterates have reached a plateau. In addition, the authors' approach requires setting another hyperparameter, delta, which may introduce extra difficulty in applying it to practical problems. Overall I consider the significance moderate.

===============================================

AFTER REBUTTAL: Thanks for the further explanations. The authors have addressed all my concerns, and I have increased my score.
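To illustrate why the correlation structure matters for the test being discussed, here is a minimal sketch of my own (not the authors' implementation): a t-statistic for testing a mean of zero on a correlated sequence, comparing a naive i.i.d. variance estimate with a non-overlapping batch-means estimate that remains approximately valid under short-range time correlation. The AR(1) process, batch count, and function names are illustrative only.

```python
import numpy as np

def t_stat_iid(z):
    """t-statistic for H0: E[z] = 0, assuming i.i.d. samples."""
    n = len(z)
    return np.sqrt(n) * z.mean() / z.std(ddof=1)

def t_stat_batch_means(z, n_batches=20):
    """t-statistic built from non-overlapping batch means, which stays
    approximately valid when z_t has short-range time correlation."""
    b = len(z) // n_batches
    means = z[: b * n_batches].reshape(n_batches, b).mean(axis=1)
    return np.sqrt(n_batches) * means.mean() / means.std(ddof=1)

# Example: a mean-zero AR(1) sequence with strong positive correlation.
rng = np.random.default_rng(0)
z = np.zeros(10_000)
for t in range(1, len(z)):
    z[t] = 0.9 * z[t - 1] + rng.standard_normal()

# Under correlation, the i.i.d. statistic is over-dispersed (it rejects
# too often); the batch-means statistic is better calibrated.
print(t_stat_iid(z), t_stat_batch_means(z))
```

The point is that any stationarity test applied to SGD statistics must use a correlation-aware variance estimate of this kind, or its nominal confidence level is meaningless.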

Reviewer 2

The paper improves upon the main work of Yaida et al. by introducing a much more rigorous test of whether the chosen statistics indeed converge to zero. I think the authors' introduction of the t-test, and especially the improved estimators that take into account the temporal Markov-chain correlation, is indeed great. The main text is very easy to follow and quite clear, with a clear idea and aim the reader can track.

Several points of discussion:

1. A more important point is that the authors should have plotted results with the learning rate adaptation of equation 9 for the comparison. From the variance tests and Fig. 4 one can conclude that the newer test is better, but it is also important to measure how much improvement in actual optimization this translates to.

2. There is an empirical analysis of the choice of the decay constant (Fig. 3); it would be interesting to see similar plots for the confidence parameter (gamma in the paper) used in constructing the t-test interval.

3. Regarding the comparison with ADAM, it is claimed that it is "hand tuned". Given that SGM has a decreasing learning rate schedule, there is no reason not to have also run ADAM with such a cut-and-decay schedule and optimized it as well.

4. Any discussion of the higher-order comparison rules from Yaida et al. would be beneficial.

5. It would be quite interesting to see whether ADAM itself can benefit from SASA, and how the statistics of interest behave under ADAM in the first place.

Figure 4 is really confusing: there is no legend, and the caption does not explain what the different curves represent. I was expecting two lines for the two tests in eqs. 9 and 10, but there are clearly five (and one panel has a single red curve).
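To clarify point 3, the cut-and-decay control loop I have in mind can be sketched as follows. This is my own illustrative sketch, not the paper's equation 9: the `step` and `stationary` callables, the drop factor `zeta`, and the checking interval are all placeholders for whatever optimizer step and stationarity test (e.g., a correlation-aware t-test) one plugs in.

```python
def run_with_adaptive_lr(step, stationary, lr=1.0, zeta=0.1,
                         check_every=100, n_steps=1000):
    """Drop the step size by a factor `zeta` whenever a stationarity
    test on recently collected statistics fires.

    step(lr)        -- performs one optimizer step, returns a test statistic
    stationary(hist)-- placeholder for a stationarity test on the history
    """
    history = []
    for t in range(1, n_steps + 1):
        stat = step(lr)
        history.append(stat)
        if t % check_every == 0 and stationary(history):
            lr *= zeta       # cut-and-decay: shrink the learning rate
            history = []     # restart statistics collection at the new lr
    return lr
```

Running the same loop with ADAM as the inner `step` would make the comparison with the tuned SGM schedule fair.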