NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1283 Eliciting Categorical Data for Optimal Aggregation

### Reviewer 1

#### Summary

The authors present a Bayesian framework to model the collection of categorical data through crowdsourcing platforms. The framework includes: the incentive structure (necessary for crowdsourcing workers to report their true beliefs), the aggregation mechanism (to synthesize workers’ answers into data), and the interface design (to account for the real-world constraint that one can only ask certain types of questions to crowdsourcing workers). The authors show that their framework properly encompasses both the (stylized) case where workers are honest but heterogeneous, and that of strategic agents seeking to optimize their own reward. Furthermore, the authors show how their framework allows questions of interface design to be properly posed, and provide both theoretical and experimental results on the optimality of interface design given the assumptions on the workers.

#### Qualitative Assessment

The setting considered is extremely relevant to any application that relies on categorical data collected from crowdsourcing platforms. The authors simultaneously consider the three main challenges posed by this data collection mechanism: incentive structure, aggregation mechanism, and interface design. This is particularly true for industrial applications that require reliable labeled data to be periodically produced. One can find many such applications in e-commerce, digital advertising, and generally in any consumer-facing application where customer preferences drift over time. In such industrial applications, two constraints are given: a data collection budget, and a design budget for the interface. The current paper presents a theoretical framework in which one can explore the interplay between the incentive structure and aggregation mechanism (which drive the budget) and the interface design. I was a little surprised to see that synthetic experiments were used more heavily than real experiments. This surprised me for two reasons: the cost of conducting a real-life experiment for categorical data aggregation is relatively low, and synthetic experiments do not allow one to observe how well the model of the labellers’ heterogeneity fits the true labeller population. Other than that, I believe this paper provides an interesting framework to explore the tradeoffs practitioners face when collecting categorical data using crowdsourcing platforms.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

This paper proposes a Bayesian framework to model the elicitation and aggregation of categorical data, as well as how to design the interface through which workers enter inputs. The framework optimizes the joint likelihood over both the aggregation method and the interface design. The experimental results demonstrate the benefit of this framework, significantly improving over the baseline interface and majority voting.

#### Qualitative Assessment

This paper considers crowdsourcing of categorical labels as a joint optimization problem: eliciting truthful reports from agents, aggregating agents’ reports into a single prediction, and designing the interface to maximize the probability of correctly predicting the ground truth. The paper is very well written and explained. The experiments on synthetic data and real-world data from Amazon Mechanical Turk support the theoretical analysis and demonstrate the effectiveness of optimizing the interface design and aggregation method as a whole. I find the interface design really interesting, as I have not seen much active work in this direction. I would like to see how the model performs in more experiments on real-world data and under different circumstances (different types of noise). It would also be interesting to incorporate workers’ skills/abilities into the framework.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

The authors introduce a model of crowd-sourced categorical data to guide multiple-choice interface design and aggregation of collected data. The approach is evaluated on synthetic data and in a real-world experiment.

#### Qualitative Assessment

A question about the mathematical setup: the equation for an agent's posterior distribution over theta assumes the samples x_i are independent. The equation that follows for the posterior predictive distribution gives a conditional probability p(x | x_1, ..., x_n). However, doesn't this also give a conditional probability p(x_i | x_1, ..., x_{i-1}) that should be taken into account in p(theta | x_1, ..., x_n)? Also, I wish Figure 5 showed all four combinations of {Baseline, ProbBased} and {Majority Voting, Heuristic Aggregation} instead of only two. This paper is well written, and I found it quite clear even though it is not my field.
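For concreteness, the conditional-i.i.d. setup I am assuming the paper's equations use (written in my own notation) is:

```latex
p(\theta \mid x_1, \ldots, x_n) \;\propto\; p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta),
\qquad
p(x \mid x_1, \ldots, x_n) \;=\; \int p(x \mid \theta)\, p(\theta \mid x_1, \ldots, x_n)\, d\theta .
```

Here the $x_i$ are independent only given $\theta$; my question is whether the marginal dependence implied by the predictive factors $p(x_i \mid x_1, \ldots, x_{i-1})$ is consistent with the posterior formula as written.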

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 4

#### Summary

The paper describes a special setting for crowdsourced labelling with multiclass labels. Unlike previous studies, where an agent (worker) chooses the most likely class label, in the present study an agent is asked a number of questions about his/her posterior probability distribution over class labels. The problem setting assumes that the agent is incentivized to provide true answers. Unlike previous papers, the goal of the current study is to optimize the interface of the questionnaire. Some theoretical analysis is provided for simple cases of the general setting, and experiments compare the proposed heuristic approach with simple baselines.

#### Qualitative Assessment

**Novelty:** The idea of optimizing a labelling interface that elicits confidence is interesting and novel. The problem setting would be a good contribution to the literature on crowdsourcing. However, I am not sure the paper is ready for publication, for the following reasons: 1) the theoretical part does not look solid, 2) the proposed algorithm (HA) does not look well grounded, 3) the experimental results are not significant. These points are supported below.

**Theoretical part (Section 4):** Though Theorem 2 is interesting, it is very simple. Lemmas 3 and 4 are reasonable; however, they cover only very special cases. Specifically, Lemma 3 considers only one agent, and Lemma 4 assumes that all agents have the same amount of information (they observed exactly $n$ samples). Both lemmas consider only the binary case, while it would be interesting if they could be generalized to the multiclass setting. In particular, the following statement seems to be provable in a similar way: **Statement.** In the general setting of $k$ classes, among the shadow partitions (and among all $k$-cell partitions), the optimal one is the partition $D_r = \{p \mid r \in \argmax_x p(x) - p^*(x)\}$, where $p^*(x) = 1/k$ is the uniform distribution over classes. By the way, the proofs are not perfectly compact and can sometimes be simplified. For example, the first two displayed equations in Section F are redundant and never used later, so their appearance obscures the idea of the proof.

**Proposed algorithm (Section 5):** No approach is proposed to solve the optimization problem (1) in the general case. There is only a heuristic aggregation (HA) method (line 265), which is applicable only when agents have equal accuracy (an equal number of observations $n$). Moreover, HA seems to be biased, and I would suggest that some MC-like method would better estimate the posterior probability $P(\theta \mid D^1, \ldots, D^m)$ (see lines 188-189).
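To make the Statement above concrete, here is a minimal sketch of the cell-assignment rule it describes (`cell_of` and `p_star` are my own illustrative names, not the paper's):

```python
def cell_of(p, p_star, tol=1e-12):
    """Return the indices r maximizing p(x) - p_star(x); belief p lies in
    cell D_r for each maximizing r (a tie puts p on a cell boundary)."""
    diffs = [pi - qi for pi, qi in zip(p, p_star)]
    best = max(diffs)
    return [r for r, d in enumerate(diffs) if abs(d - best) < tol]

# With the uniform reference p*(x) = 1/k, subtracting a constant does not
# change the argmax, so the rule assigns p to the cell of its modal class:
k = 3
p_star = [1.0 / k] * k
print(cell_of([0.5, 0.3, 0.2], p_star))  # -> [0]
print(cell_of([0.5, 0.5, 0.0], p_star))  # -> [0, 1] (boundary)
```

With a non-uniform $p^*$ the same rule would tilt the cells away from a priori likely classes, which is the multiclass analogue of Lemma 4's prior-dependent threshold.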
**Experiments:** I do not appreciate Section 5.2, because only the simplest baseline is considered (majority voting), and this baseline does not take into account priors on classes. A more appropriate baseline would be one of the consensus models used in crowdsourcing. In Section 5.3, the budgets of the two treatments are different, and the probabilities of bonus payment are also different. Are the values fitted in a way that guarantees equality of the expected costs of the two treatments? What were the actual costs in the experiment? If the costs differ in expectation, this makes the results less interpretable. When choosing the threshold and the payment scheme in Section 5.3, it is assumed that the principal knows the proportion of examples in each category; in practice, however, it is usually unknown, and the proportions of categories in golden-set samples are biased towards easy categories. How should one choose the threshold in such a practical case? What would happen if the assumed proportions for the categories were wrong?

**Other comments:** Lines 162-164 tell nothing new.

**Questions:** 49: “If the principal can elicit agents’ entire belief distributions, our framework can achieve optimal aggregation, in the sense that the principal can make predictions as if she has observed the private information of all agents”. Can you reformulate this sentence so that one could understand it before reading the theoretical part? 148: “This assumption can be satisfied by, for example, allowing the principal to ask for an expert opinion outside of the m agents”. An outside expert opinion sounds like a correct answer, which is not noisy (otherwise, what is the difference between the expert and an agent?). Then the distribution $p(x \mid \theta^*)$ of the expert opinion should be deterministic: $p(x \mid \theta^*) = \mathbb{1}(x = \theta^*)$. 252: Does simple majority voting take into account priors? Figure 2: What are the values of $\epsilon$ and $n$ in the reported experiment? 292: I did not understand the sentence.
312: “Performing HA for Baseline and Maj for ProbBased both led to higher aggregation errors”. What are the quantitative results? It would be interesting to examine them.

**Misspellings:** 23: “Moreover, disregard of whether full or partial information about agents’ beliefs is elicited, aggregating the information into a single belief or answer is often done …” I believe there should be “disregarding whether”. 163: ”a agent” -> “an agent”. 165: “what are all the scores which are truthful for it”. I believe “scoring rules” is meant instead of “scores”. 194: “The expectation is taken over which partition D^i…” I believe there should be “which cell”, not “partition”. 236: “if the number of agent is one” -> “if the number of agents is one”. In the supplementary file: 387: ”a agent” -> “an agent”. 389: “can can”. Figure 3: “theta=0” -> “$\theta=0$”.

------

**UPDATE on Author Rebuttal:** After reading the rebuttal, I still have some additional arguments on some points.

- A: While Theorem 2 might seem simple, it resolves an open question in previous work, as it was considered impossible to obtain the global PPD from the PPDs of all agents.

  Though Theorem 2 is reasonable, I would avoid claiming that an open question was resolved. In fact, I don’t think Theorem 2 was previously conjectured in the literature, since it resolves a rather particular and quite simple question (unlike the general questions considered in [1]). Moreover, I would rather name it “Statement”, not “Theorem”, since the proof is even easier than finding the right statement.

- A: We also share the reviewer's conjecture on the multi-signal case (indeed, Fig 2 provides some evidence), though given the difficulty of proving Lemma 4, we do not think such a result would be easy to establish without new techniques.

  I claim that I can generalize Lemma 3 to the case of multiple classes (as I stated). I believe it is a proved statement (in my mind), not a conjecture.
  I also believe the paper would be more thorough if this statement were included.

- A: It is true that we do not give a fast algorithm to solve the optimization problem. For small n (as in our experiments), it is feasible to just search over the discretized parameter spaces. Ideally, we could give fast optimization algorithms by MC-like methods. In practice, however, we felt that a heuristic grounded in intuition but supported by the theory would likely be more valuable for practitioners; fortunately, such a heuristic performs very well empirically. Moreover, these heuristics give qualitative insights about how to place thresholds and aggregate.

  I have two different issues here, which are not connected: 1. No approach is proposed to solve the optimization problem (1) in the general case of workers *with different accuracies*. 2. In the case of equal accuracies, HA seems to be based on *biased* estimates of the posterior probability $P(\theta \mid D^1, \ldots, D^m)$; therefore, some sampling methods would be more appropriate.

- A: We assumed the prior is known. In practice, there are often historical data that provide a good estimate of the prior. It's a very interesting and important future direction to study the robustness of our approach to incorrect priors.

  It is not clear how one can use historical data of workers’ labels to set the prior, since these labels are usually biased.
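On the majority-voting baseline and priors (question 252 above): a toy numerical sketch, with a made-up noise model and made-up numbers rather than the paper's, showing how a plurality vote can disagree with a prior-aware MAP estimate under a skewed class prior:

```python
from collections import Counter
from math import log

def majority_vote(labels):
    """Plurality winner; ignores class priors entirely."""
    return Counter(labels).most_common(1)[0][0]

def map_label(labels, prior, accuracy, k):
    """MAP class assuming each worker reports the true class with
    probability `accuracy` and a uniform error otherwise (toy model)."""
    wrong = (1 - accuracy) / (k - 1)
    scores = {}
    for c in range(k):
        s = log(prior[c])
        for l in labels:
            s += log(accuracy if l == c else wrong)
        scores[c] = s
    return max(scores, key=scores.get)

labels = [1, 1, 0]            # two workers say class 1, one says class 0
prior = [0.9, 0.1]            # but class 0 is a priori much more likely
print(majority_vote(labels))              # -> 1
print(map_label(labels, prior, 0.6, 2))   # -> 0
```

With weak workers (accuracy 0.6) and a 9:1 prior, the prior-aware estimate overrides the two-against-one vote; a consensus-model baseline would capture exactly this effect.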

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 5

#### Summary

This paper explores the problems of elicitation and aggregation for categorical data. The authors find a way to elicit truthful responses from agents in the categorical setting, despite this having been considered impossible by a previous publication, which held that uncovering a global distribution of agents' beliefs could not be done; the authors here find an indirect but robust way to uncover it. The authors construct: 1) a principled Bayesian model of the agents' beliefs that sets the ground for truthful elicitation; 2) a principled way to construct interfaces (belief-distribution partitioning over categorical choices) that maximize the accuracy of predicting the true answer, directed by their newly introduced aggregation method; 3) a principled way to aggregate agents' responses on arbitrary interfaces. Their experiments show consistent improvements over literature methods, even on data collected by other interfaces not designed to their standards.

#### Qualitative Assessment

Adequate coverage of the literature. Principled mathematical formulation of the joint problem of elicitation and aggregation. A few grammatical errors and misprints, nothing serious; the paper just needs another round of proofreading. The curves in Figure 3 are indiscernible on non-colored prints. In the aggregation part of Section 3, "Our Mechanism", the assumption that the distribution p(n), from which the agents' background samples are drawn, is known to the principal needs justification; a counterargument might be that an agent's collective background comes from a biased source that is not known to the principal. (The authors said they would add a discussion.)

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 6

#### Summary

This paper takes a principled approach to designing multiple-choice survey questions for belief aggregation. The authors assume survey respondents are Bayesian and observe discrete signals about a world state, and the authors seek maximally informative question phrasings about the world state. In tandem, the authors propose a Bayesian aggregation technique for responses to these questions.

#### Qualitative Assessment

After reading the authors' rebuttal and working through their proof of Lemma 1 more carefully, I now view the paper more positively and have edited my review. I don't know the aggregation literature extremely well, but the authors' approach to the elicitation problem (finding optimal partitions of the probability simplex to use as queries) strikes me as novel. The paper also seems reasonably well suited to a subset of the NIPS community because of the connections to learning through crowdsourcing and aggregation, even if the application itself seems fairly niche. The elicitation technique is also related at a high level to active learning, since the authors seek maximally informative framings of their questions. I haven't seen active learning or an information-maximization criterion of that sort applied to question-asking in crowdsourcing before, though there is certainly theoretical precedent in the linguistics literature (e.g., information-theoretic approaches to pragmatics/speech act theory).

Some limitations: The authors act under the premise that the principal is restricted to asking multiple-choice questions. Does restricting to multiple-choice questions actually make sense in practice for crowdsourcing? The authors should compare accuracy results to a crowdsourcing mechanism that elicits real-valued belief judgements. The statement of Lemma 4 is confusing: doesn't setting T=n/2 correspond to setting p=1/2, not p equal to the prior? I also don't see where the form of p* is derived in the proof of Lemma 4. The authors neglect to mention that they do not achieve substantially higher accuracy in their human crowdsourcing experiment compared to the baseline; they mainly achieve higher sample efficiency. The paper should explicitly describe an algorithm for how optimal aggregation is performed in the binary case. This is the main case the authors focus on, and optimal aggregation seems like it could be described compactly in the main text for this case.

Other comments: The proof of Lemma 1 would be clearer if lines 418 and 419 were unpacked with a construction. It is simple enough to write out the one line giving the actual mapping between the ratios and the q values, and I would encourage the authors to do so, given their emphasis on this result.
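To be concrete about the kind of explicit binary-case description I have in mind, here is a rough grid-approximation sketch (my own illustrative code and names, not the authors' algorithm): each agent observes n Bernoulli(theta) samples and reports which threshold cell their count of positive samples falls in; the principal then multiplies the cell likelihoods over a grid of theta values under a uniform prior.

```python
from math import comb

def cell_likelihood(theta, n, cell):
    """P(count in cell | theta) for n Bernoulli(theta) draws,
    where cell = (lo, hi) is an inclusive range of counts."""
    lo, hi = cell
    return sum(comb(n, c) * theta**c * (1 - theta)**(n - c)
               for c in range(lo, hi + 1))

def posterior_on_grid(reports, n, grid):
    """Posterior over a theta grid, assuming a uniform prior and
    agents whose samples are independent given theta."""
    post = []
    for theta in grid:
        p = 1.0
        for cell in reports:
            p *= cell_likelihood(theta, n, cell)
        post.append(p)
    z = sum(post)
    return [p / z for p in post]

grid = [i / 100 for i in range(1, 100)]
# three agents, n = 10 samples each; two report counts in 6..10, one in 0..5
reports = [(6, 10), (6, 10), (0, 5)]
post = posterior_on_grid(reports, 10, grid)
print(grid[max(range(len(post)), key=post.__getitem__)])  # posterior mode
```

In this toy example the posterior mode lands above 1/2, reflecting the two-against-one reports; something of this shape, stated precisely, would make the binary case self-contained in the main text.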

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)