NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1823
Title: Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control

Reviewer 1

1- The messages exchanged are simply vectors of floats. What these messages mean or represent, and how agents make sense of them, currently looks like black magic. There is a lack of explainability, and an attitude of "but somehow it works, so...", with no analysis of the messages' meaning.
2- The way the "confidence level" of an agent is computed is somewhat naive.
3- Equation 1 is unclear: which variance exactly is computed? The variance between agents' messages, or the variance of the message coming from one agent? (A sketch contrasting the two readings is given after these comments.)
4- The environment chosen (StarCraft) seems adequate, and the results are convincing.
5- I found the authors' interpretations of the strategies the agents develop while communicating interesting and welcome.
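To make point 3 concrete, here is a minimal sketch of the two possible readings (hypothetical variable names, not the authors' code):

import numpy as np

# Hypothetical setup: n_agents agents, each broadcasting a d-dimensional message.
n_agents, d = 4, 8
messages = np.random.randn(n_agents, d)   # messages[i] is the message sent by agent i

# Reading A: variance of one agent's message across its own d entries.
var_within_agent = messages[0].var()

# Reading B: variance across agents, computed separately for each message dimension.
var_across_agents = messages.var(axis=0)  # one value per dimension, shape (d,)

The two quantities would behave very differently as a control signal, hence the request for clarification.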

Reviewer 2

The paper contributes to the overall class of MARL algorithms as another simple communication method that improves performance with reduced communication costs.
- I am a bit worried about the method's narrow application. It was only evaluated on a collection of similar StarCraft II environments, and it only works on cooperative environments.
- Line 111: the Q-function targets should be optimized over s_{t+1}, not s_{t}. I think this is just a typo and does not affect the results. (The standard target is written out after these comments.)
- I do find it odd that MADDPG (Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments) was not referenced in this paper. It is very related and has a form of implicit communication.
- The change to the learning loss is simple.
- There is little discussion of the learning hyperparameters introduced and the messaging thresholds. How are they chosen? How sensitive is the method to these values? It is not explicitly stated what values were used for the experiments; I assume the same values, based on the figures.

After going over the author response, I appreciate the extra analysis put into comparing the method to MADDPG to make sure it is state of the art. It is good to compare these methods across previous benchmarks to show improvement. While the additional hyperparameter analysis is helpful, it mostly covers what is normally done; some discussion of the effects of specific settings might shed more light on how the method works. I have updated my score.
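For reference, the standard one-step Q-learning target that the comment on line 111 refers to (generic form; the paper's exact notation may differ) is

y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-),

where \theta^- denotes the target-network parameters; the maximization is over actions at the next state s_{t+1}, not at s_t.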

Reviewer 3

The paper is well written and easy to read. I very much enjoyed reading it.

1. Line 151: an individual agent can access the global observation and global history only through the conditioned messages. Is that right? If so, please make it explicit for better clarity.
2. Line 154: The fact that the combiner just performs element-wise addition can also be motivated as each agent trying to pass a message that represents the value of each action from that agent's point of view. This could also motivate the variance-based control loss: when there is not much variance in a message, that agent does not have any preference over which action to choose, and hence its message can be safely ignored. (A toy sketch of this reading is given after these comments.)
3. It is not clear whether the communication protocol is used during training or only at test time. I assume that you are using the same communication protocol during training as well; please clarify this.

I did not verify the correctness of the proof.

Originality: The paper proposes a novel variance-based loss to reduce communication overhead in a MARL setting.
Quality: The work is good enough to be accepted at NeurIPS.
Clarity: The paper is well written. I have given a few comments above to improve the clarity of the presentation.
Significance: Definitely a significant contribution to MARL.

Minor comments:
1. Line 232: derivation -> deviation.
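As an illustration of the reading in comment 2 (the reviewer's interpretation, not the authors' actual architecture; all names and the threshold are hypothetical), each message can be treated as a per-action value estimate, the combiner sums messages element-wise with the local action values, and a message with little variance across actions carries no action preference and can be dropped:

import numpy as np

def combine(local_q, incoming_messages, var_threshold=0.1):
    # Element-wise sum of the local action values and every incoming message
    # whose variance across actions exceeds the threshold.
    combined = local_q.copy()
    for m in incoming_messages:
        if m.var() > var_threshold:   # the sender expresses a real action preference
            combined = combined + m   # element-wise addition over action values
    return combined

n_actions = 5
local_q = np.random.randn(n_actions)
msgs = [np.random.randn(n_actions), np.full(n_actions, 0.7)]  # second message is constant
action = int(np.argmax(combine(local_q, msgs)))               # the constant message is skipped

A constant (zero-variance) message would shift every action value equally, so skipping it cannot change the argmax, which matches the intuition that low-variance messages can be safely ignored.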