NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4657 A Little Is Enough: Circumventing Defenses For Distributed Learning

Reviewer 1

Originality: To play the devil's advocate, the key message of this paper is that "outside their working hypothesis, mainstream defense mechanisms do not work"; is that not somehow a tautology?

Clarity: The paper is fairly well written; my concern is more with what is missing than with the clarity of what is already there. Specifically, what is missing is a thorough analysis of what is really wrong with the aforementioned hypothesis.

Significance: The main point of this paper is to show that the prerequisite hypothesis of previously published defenses might not always hold. In particular, the empirical variance of the correct workers might be too high for robust-statistics methods to work. The paper would gain significance by elaborating more on that point. I was surprised to see this question mentioned in the abstract but not elaborated much in the main paper.

Reviewer 2

In general, I like the question this paper asks, i.e., whether or not it is necessary to impose a large deviation from the model parameters in order to attack distributed learning. Most of the research on Byzantine-tolerant distributed learning, including Krum, Bulyan, and Trimmed Mean, uses some statistically "robust aggregation" instead of a simple mean at the PS to mitigate the effects of adversaries. By the nature of robust statistics, all of those methods take a positive answer to the above question for granted, which serves as a cornerstone of their correctness. Thus, the fact that this paper gives a negative answer is inspiring and may force researchers to rethink whether robust aggregation is enough for Byzantine-tolerant machine learning.

However, the authors seem unaware of DRACO (listed below), which is very different from the baselines considered in this paper. L. Chen et al., Draco: Byzantine-resilient distributed training via redundant gradients, ICML 2018. The key property of DRACO is that it ensures black-box convergence, i.e., it does not assume anything about the attack Byzantine workers use in order to achieve convergence. Thus, the "common assumption" is not made by DRACO, and it is NOT reasonable to claim that "all existing defenses for distributed learning [5, 10, 27, 28] work under the assumption". While this paper's idea is creative, it does not seem to be fully developed. The proposed attack is shown empirically to break quite a few existing defense methods. Yet, does it break DRACO? If not, does that mean DRACO is "the right solution" to the proposed attack? More discussion is needed. In addition, theoretically, when would the proposed attack break a given defense (depending on the parameters chosen, the datasets, etc.)? Is there a simple rule for deciding how to break a given defense?

The experiments are also oversimplified. Only three models on small datasets with a fixed distributed-system setup are far from enough to validate an attack empirically.

The main idea of this paper is clear, but the writing itself does not seem to be polished. A lot of typos exist, and some sentences are hard to understand. The use of math notation is messy, too. See below for a few examples.
1) Line 4, "are (a) omniscient (know the data of all other participants), and (b) introduce ...": "are (a)" should be "(a) are".
2) Line 10, "high enough even for simple models such as MNIST": What model is MNIST? To the best of my knowledge, MNIST is a dataset, NOT a model (such issues exist in the experimental section as well).
3) Line 45, "Likewise, all existing defenses for distributed learning [5, 10, 27, 28] work under the assumption that changes which are upper-bounded by an order of the variance of the correct workers cannot satisfy a malicious objective": This is hard to understand. Rephrasing is needed.
4) Algorithm 3, Line 2: It is not clear what objective the optimization problem maximizes. My guess is that the authors want to maximize $z$, which should be written as $\max z \ \text{s.t.}\ \Phi(z) \leq \frac{n-m-s}{n-m}$.
5) Line 326, "state of the art" -> "state-of-the-art".

=================================================
I thank the authors for their explanation.
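To make point 4 above concrete: if $\Phi$ is the standard normal CDF, then maximizing $z$ subject to $\Phi(z) \leq \frac{n-m-s}{n-m}$ reduces to evaluating the inverse CDF at that ratio. A minimal sketch of my reading, using only the Python standard library; the values of n, m, s and the per-parameter statistics below are illustrative placeholders, not numbers taken from the paper, and the helper name is mine:

```python
from statistics import NormalDist

def max_z(n: int, m: int, s: int) -> float:
    """Largest z with Phi(z) <= (n - m - s) / (n - m), i.e. the standard
    normal inverse CDF evaluated at that ratio (which must lie in (0, 1))."""
    ratio = (n - m - s) / (n - m)
    assert 0.0 < ratio < 1.0, "ratio must be a valid CDF value"
    return NormalDist().inv_cdf(ratio)

# Illustrative values only: n = 50 workers, m = 12 corrupted, s as in Alg. 3.
z = max_z(50, 12, 14)

# The attack then shifts each parameter by at most z standard deviations
# from the (estimated) mean of the correct workers' updates.
mu, sigma = 0.1, 0.02          # hypothetical per-parameter estimates
malicious_param = mu - z * sigma
```

Writing the objective this way would remove the ambiguity the current Algorithm 3 has.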

Reviewer 3

The paper presents a new attack on distributed learning systems. The threat model and targeted defenses are clearly explained, and the attack is explained nicely, with easy-to-follow intuition along with a more precise technical explanation. The properties of the proposed attack are strong: it not only circumvents previously proposed defenses (and shows that some previous intuition is incorrect), but is also non-omniscient. The paper is a solid contribution.

Questions and areas for improvement:
- The evaluation does not say whether the m=12 malicious workers are omniscient or not. If they are non-omniscient, it would be great to reiterate this. If they are omniscient, that seems like a serious issue: in that case, there would be no substantiation of the claim of a non-omniscient attack.
- Line 80 says "in this type of attack, the attacker does not gain any future benefit from the intervention." This may not be true without additional context. For example, consider a scenario where a company attacks the distributed training of a competitor: in this case, there is a clear benefit to performing the attack.
- It would be interesting to see more discussion about where such attacks/defenses may be applicable in the real world, along with further elaboration of the threat model. In current practical uses of distributed training, are all nodes trusted? What future systems might be susceptible to such attacks?
- Lines 215--221 discuss the situation when the attacker is not omniscient. In particular, the paper says that when "the attacker controls a representative portion of the workers, it is sufficient to have only the workers' data in order to estimate the distribution's mean and standard deviation ..." What is a representative portion?
- It would be great to briefly summarize the threat model in the abstract (e.g., "a percentage of workers are corrupted and collude to attack the system, but are non-omniscient and don't have access to other nodes' data").
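On the "representative portion" question: a tiny simulation may make the intuition I believe the authors have in mind explicit. If the corrupted workers' gradients are drawn i.i.d. from the same distribution as the correct workers', then the empirical mean and standard deviation computed from the m corrupted workers alone already approximate the population statistics. All numbers below are hypothetical, chosen only for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical per-parameter gradient values: one scalar per worker, drawn
# i.i.d. from the same distribution for correct and corrupted workers.
n, m = 50, 12
all_grads = [random.gauss(0.1, 0.02) for _ in range(n)]
corrupted_view = all_grads[:m]   # the attacker sees only its own m workers

# Non-omniscient estimates from the corrupted workers alone...
mu_hat = statistics.fmean(corrupted_view)
sigma_hat = statistics.stdev(corrupted_view)

# ...approximate the statistics an omniscient attacker would compute.
mu_all = statistics.fmean(all_grads)
sigma_all = statistics.stdev(all_grads)
```

It would strengthen the paper to state explicitly under what conditions (i.i.d. data? how many corrupted workers?) this estimate is accurate enough for the attack to succeed.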
Nits:
- Line 164: "(n-3)/4" should presumably be "n/4"?
- Figure 1 is unreadable in grayscale (e.g. when printed); consider using different symbols rather than just colors (and perhaps making it colorblind-friendly as well).
- Line 253: "the attack will go unnoticed no matter which defense the server decides to choose": this only applies to the known, particular defenses examined in this paper.