Reviews: Anti-efficient encoding in emergent communication

**Update after author response** The response addresses the main concerns from my review, and I still recommend acceptance.

===

This paper provides a focused study of the distribution of message lengths in an emergent communication task. A Lewis-type signaling game is constructed in which referents are drawn from a power-law distribution. RNN "speaker" and "listener" models communicate via a discrete channel (with variable vocabulary size and maximum length) and are trained to maximize success at the signaling game using a vanilla policy gradient algorithm. It is observed that more frequent referents are associated with *longer* messages from the speaker agent. This is in contrast to natural language (exemplified by corpus data from English and Arabic and two simple computational models). Further studies show that this "unnatural" length distribution can be corrected by adding a simple penalty, and suggest that it is driven by initial conditions for both the speaker and listener models.

STRENGTHS
- thorough empirical study relevant to ongoing work on emergent communication
- nice literature review and well-motivated reference models

WEAKNESSES
- some experimental & analysis details are unclear or omitted
- corpus baselines (and the status of "referents" in natural language generally) are not well aligned with the main task in the paper

While I have a few questions about the experimental setup and some of the linguistic claims, I think this is a thorough and well-executed empirical paper and a useful contribution to the emergent communication literature. I think it should be accepted.

REFERENT DISTRIBUTIONS

This is my main substantive complaint: line 43 says "[referents are] randomly drawn from a power-law distribution (so that referent frequency is extremely skewed, as in natural language)". This conflates the distribution of *words* with the distribution of *referents*. Zipf's Law of Abbreviation (ZLA) says nothing about the distribution of referents. A claim that referents in natural language are also distributed as a power law is non-trivial and requires citation, given that most words (especially the frequent ones!) have no referential function on their own. For the same reason, I'm a little uncomfortable with the models of natural language built from corpus word frequency data, since the processes determining word frequencies in these corpora are fundamentally different from those producing the agent behavior in this paper. I don't think any of this changes the bottom line, and you should definitely keep all the current experiments, but I would like to see this paper be a little more careful about the difference between words, referents, and their associated frequencies in both the motivation section and in comparisons to real-world language data.

EXPERIMENTS AND ANALYSIS

The central claim in this paper is that message length is anticorrelated with frequency in the base model, but this is only backed up visually (i.e. "the line goes down in Figure 1"). We really need to see a correlation coefficient and a hypothesis test somewhere.

There are a couple of cases where numbers get averaged across training runs, and I'm unclear about what's being averaged. In Fig 2, are we looking at the average length of all rank-i messages? Or the average rank of all length-l messages? Similarly, in Fig 3, are we looking at (average pairwise distance) averaged across training runs? Or (average distance across training runs) averaged pairwise?

There are a couple of empirical claims for which no quantitative statement (or even appendix pointer) is provided: "The higher D is, the more accurate the agents become" on line 179; "the patterns are general" on line 256.

MESSAGE DISTRIBUTIONS

Line 31: "There is an inverse (non-linear) correlation between word frequency and length". The precise nature of this non-linearity is one of the most interesting underlying computational-linguistic issues here, and I think the paper would benefit from exploring it a bit more. Of the citations provided for this line, one argues for a power-law distribution of word frequencies, and one argues for a Gamma distribution. The "optimal code" model in this paper has an exponential word frequency distribution, and I think monkey typing probably does as well. What about the learned agents? In addition to plotting things on an absolute message length / frequency scale, it would be extremely interesting to normalize them in some way and try to make a claim about the functional form of the length distribution (esp. as it appears in the penalized vs unpenalized experiment) with an appropriate statistical test (K-S etc.).

As a related presentation issue, I think all of the figures in this paper would be clearer on log or log-log scales.

MISCELLANEOUS
- 15: s/strand/stray?
- 21: Nitpick: I find this use of "AIs" needlessly imprecise. Perhaps "automated agents" or "computational agents" instead?
- 57: cite Lewis 69 "Convention" for the signaling game.
- Eq 1 is Williams 92, right? What's been added from Schulman 15?
- 157: "xdistributions"
- What does the distribution of lengths look like w/r/t a uniform distribution over referents?
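To make the request for a correlation coefficient and a hypothesis test concrete, here is a minimal sketch of one option: a Spearman rank correlation between frequency rank and message length, with a permutation test for significance. This is only an illustration of the kind of statistic I have in mind, not the authors' analysis. The function names and the toy data are hypothetical, and the numbers are not taken from the paper.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles with |rho| >= the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data, NOT results from the paper.
# Rank 1 is the most frequent referent; lengths are in symbols.
frequency_rank = list(range(1, 11))
message_length = [10, 10, 9, 9, 8, 7, 7, 5, 4, 2]

rho = spearman(frequency_rank, message_length)
p = permutation_p_value(frequency_rank, message_length)
print(f"Spearman rho = {rho:.3f}, permutation p = {p:.4f}")
```

A rank-based statistic seems preferable to Pearson here because the relationship between frequency and length is explicitly described as non-linear. Reporting it per training run, rather than only on run-averaged curves, would also help with the averaging ambiguity raised above.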

After author response, I am keeping my score the same. The authors have promised to add the most important detail I felt was lacking: whether a cost to communication leads to reduced communication success.

-----------

This paper investigates emergent communication between two agents in a very simple communication game where the referents are power-law distributed. The authors show that naively training the agents with reinforcement learning using policy gradient methods leads to communication protocols where common referents are associated with longer messages, which contradicts Zipf's Law of Abbreviation. This is an important observation which serves to delineate artificial agent communication from human communication.

The authors then provide an elegant and simple explanation for why this might be the case, grounded in the representational capacity of the listener agent, and Figure 3 seems to provide good evidence for their explanation. It is interesting to speculate whether this is caused by a peculiarity in LSTM dynamics, and whether encoders with alternative architectures (such as hierarchical tree-based encoders) distinguish different features.

Further, the authors show that a simple length penalty eliminates the anti-efficient coding behaviour, and results in communication which exhibits a Zipfian distribution. This shows that the conventional view that communication is costless may not be entirely accurate. The authors do not state whether the length penalty affects communication success; this would serve as an interesting comparison.

Finally, the authors investigate the symbol statistics of the resulting communication protocol. This is (for me) the weakest section of the paper, as I do not feel it contributes much to the main thrust of the argument. However, the other two sections by themselves justify acceptance. The discussion points raised by the authors are all worthy of further thought.

Overall, I feel that this paper raises an interesting and important observation. Do we care solely about agents communicating with each other to achieve task success in situations where success is not possible without communication, or do we study emergent communication as a tool to study the evolution of human language? If the latter, then the exact way we define tasks and train our agents can provide important clues about what pressures shaped human language, and this paper proposes and investigates one such pressure: an inherent cost to producing sounds. The paper is clearly written, and deserves a wider audience at NeurIPS.
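On the comparison I asked for above (whether the length penalty affects communication success), a rough sketch of what could be reported follows. It assumes a reward of the form "task success minus a per-symbol cost" and a log of (success, message length) pairs per training condition. The penalty form, the coefficient `alpha`, the function names, and the episode data are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def penalized_reward(success, message_length, alpha):
    """Task reward (1 for success, 0 otherwise) minus a per-symbol cost.

    The linear form and the coefficient alpha are assumptions for illustration.
    """
    return (1.0 if success else 0.0) - alpha * message_length


def summarize(episodes):
    """Return (accuracy, mean message length) for a list of (success, length) pairs."""
    accuracy = mean(1.0 if success else 0.0 for success, _ in episodes)
    mean_length = mean(length for _, length in episodes)
    return accuracy, mean_length


# Hypothetical evaluation logs keyed by penalty coefficient, NOT results from the paper.
conditions = {
    0.0: [(True, 30), (True, 29), (True, 30), (False, 30)],
    0.01: [(True, 3), (True, 5), (False, 4), (True, 4)],
}

for alpha, episodes in conditions.items():
    accuracy, mean_length = summarize(episodes)
    print(f"alpha={alpha}: accuracy={accuracy:.2f}, mean length={mean_length:.2f}")
```

Reporting accuracy next to mean message length for each penalty setting would show directly whether shorter messages come at a cost in task success.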

Paper ID:	3386
Title:	Anti-efficient encoding in emergent communication
