NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:95
Title:You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle

Reviewer 1

Summary: This paper proposes a method for speeding up adversarial training by reducing the number of full forward- and backward passes during the inner loop where the adversarial examples are computed. To achieve that, the inner loop that calculates the adversarial perturbation is split into an inner and outer loop. The gradients for all but the first layer are only infrequently calculated in the outer loop, while the gradients of the first layer w.r.t to the adversarial perturbation are iteratively calculated in the inner loop. This method is motivated by casting adversarial training as time-differential game and analyzing the game using Pontryagin's Maximum Principle (PMP). This analysis indicates that only the first layer is coupled with the adversary update. The authors relate their work to "Free-m" [1], another recent work that aims to speed up adversarial training and mention that their method is a generalization with a minor change of the work in [1]. Analysis of deep learning training as control problem was previously been done in "Maximum Principle Based Algorithms for Deep Learning" [2] Originality: - The main contribution of this paper is to speed up adversarial training where the method used is motivated by casting adversarial training in the light of control theory. There has been research on the control theory perspective on training neural networks, but not related to adversarial training. Quality: - In the Algorithm following Line 138 and following Line 143 the gradients are not normalized and also not projected back on the epsilon-ball as typically done in PGD [3]. Is there any reason for that? - A non-mathematical intuition about why the adversarial update is only coupled with the first layer is not given. This would be especially interesting, as this claim is counterintuitive compared to previous methods for generating adversarials that need to differentiate through the whole network. - The experimental section lacks evaluation of stronger versions of the PGD attack (restarts, higher number of iterations) and also evaluation with non-gradient based methods. - In Line 155-159 the authors write "we take full advantage of every forward and backward propagation, […], the intermediate perturbations are not wasted like PGD-r […] which potentially drives YOPO to converge faster in terms of the number of epochs". This claim has no support in the experimental section, e.g. a plot loss over epochs. Clarity: - Theorem 1 is very hard to follow and unclear to a reader not familiar with the Maximum Principle. ○ Why is the Hamiltonian defined the way it is? - Multiple variables are not described and unclear ○ Is p in the Hamiltonian the same as the slack variable p? ○ Notation of p_i_t (used in Theorem 1) is not introduced. - Algorithm listings after Line 138 and Line 143 are missing captions - Line 157: "allows us to perform multiple updates per iteration" What iteration is meant here? Formula 144 suggests that the weights are only updated once after the m-n loop has finished. - If the coupling is only between the first layer and adversarial perturbation, why does p needs to be updated in the outer loop? - What epsilon is used during training? Significance: - Given that adversarial robustness is only slowly moving towards bigger datasets like ImageNet, the significance of a method to speed up adversarial training is high, assuming that the robustness claims are true and hold in the setting of larger datasets. - Only one other method [1] has addressed the topic of speeding up adversarial training, thus this paper is a very welcome contribution to the field. According to Table 1, the method proposed in this work further improves the training time over [1]. [1] Shafahi et al. - Adversarial Training for Free!, [2] Li et al. - Maximum Principle Based Algorithms for Deep Learning, [3] Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks,

Reviewer 2

From what I can tell, the observation that the adversarial perturbation is coupled with only the first layer and the exploitation of this to create the YOLO method is novel and an interesting contribution to the literature that could potentially inspire a lot of follow-up work. Theorem 1 requires that f is twice continuously differentiable, but in the experiments the authors use ResNets, which have ReLU activation functions. How do the discontinuities in ReLU affect the theory? A more extensive and detailed experiments section in the main text would be good. Some additional comments/questions: -- It'd be grammatically more correct to call it "You Propagate Only Once" :) -- Shouldn't Theorem 2 be a Lemma? -- In the description of PGD, it should be eta^r, not eta^m (last line). -- Figures 3a and b are hard to see, they should be made bigger. -- Consider using a log scale for the y-axis in Figure 3a. -- It's generally better to indicate if the percentages given in tables are test accuracy or error, even if it appears obvious. -- There are a number of typos in the manuscript, for instance: "are the network parameters (line 28); "exploits"(caption of Figure 1); "min-max" (l. 66); "needs" (l. 67); "adversarial" (l. 87); "Contributions" (l. 88); rewrite the sentence starting on line 109, for instance "... is a game where one player tries to maximize, the other to minimize, a payoff functional"; "could significantly" (l. 159); "on the Hamiltonian" (l. 204); "Small CNN" (caption of Figure 3a). -- There's an unresolved reference on line 29 in the supplementary material.

Reviewer 3

This paper proposes a new algorithm for computing PGD attack, and applies the proposed algorithm to adversarial training. The experiments show that the computational cost is reduced by 2/3, and achieves the similar performance. The key insight of this paper comes from a dynamical system view of back propagation algorithm as a recurrent network. This insight, however, actually dates back to 1980's. For example, Le Cun 1988 has discussed this view in Section 3.1 Moreover, the proposed algorithm is not the only way to approximate the backpropogation algorithm. There have been some truncation heuristics in existing literature, e.g., Williams and Zipser (1992), Gradient-Based Learning Algorithms for Recurrent Networks and Their Computation Complexity. Williams and Peng (1990), An Efficien Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories, These truncation heuristics can reduce the computational cost of computing adversarial attacks. The authors did not provide any experiments to justify their method is better than these heuristics. The application to adversarial training is somehow a novel selling point, since there are not many existing works discussing the importance of truncation in adversarial training. The authors claim that the training time is reduced by 2/3~4/5. However, this highly depends on the implementation and hidden factors, e.g., library used. All experiments are based on one single realization. I did not see any details on multiple realizations and standard errors. Therefore, more experiments are needed.