This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation
information: DOI 10.1109/TPDS.2015.2457924, IEEE Transactions on Parallel and Distributed Systems
2 RELATED WORK
As an emerging distributed computing model, crowdsourcing has become an important and active research field in recent years [19], [20], [21]. Crowdsourcing arises in many forms, such as citizen science, peer production/co-creation, the wisdom of crowds, collective intelligence, and so on [22]. In the past few years, a large number of crowdsourcing platforms have been set up and used in many fields [23], [24], [25], [26]. Amazon Mechanical Turk (AMT) is one of the most prominent crowdsourcing platforms today. It is a crowdsourcing Internet marketplace that enables workers and crowdsourcers to coordinate the use of human intelligence to perform tasks that computers are currently unable to do. Most of these crowdsourcing systems rely on offline or manual worker quality control and evaluation, or simply ignore quality control altogether. At present, an increasing number of studies focus on the quality control issue [27], [28]. E. Kamar [29] presents a model that enables the system to balance the expected benefit against the cost of hiring a worker. D. Vakharia [22] studies the issue of quality assurance and control as an important part of their research. The multiple crowdsourcing platforms discussed in [22] use different methods to implement worker quality control. These methods more or less require human intervention, which places a burden on crowdsourcers. Worker quality control has become a bottleneck that affects the development of crowdsourcing systems [30]. The core issue of worker quality control is worker quality evaluation [31], [32], and online worker quality evaluation is attracting more and more attention.
J. M. Rzeszotarski [33] distinguishes the quality of different workers by analysing their behaviours. However, this method requires the crowdsourcing system to provide the workers' behaviour logs. The studies of R. Snow, J. Whitehill, V. C. Raykar, and X. Liu [34], [35], [36], [37] are mainly based on the EM algorithm [38], [39], [40], [41], calculating the accuracy of each worker and mining the potential quality of the worker from the answer matrix. These studies all focus on the determination of a single label. M. Joglekar [14] studies worker quality evaluation based on the frequency of disagreement among workers' results. It also uses confidence intervals to evaluate worker accuracy, which improves the evaluation accuracy. However, this study only applies to Boolean problems, and there are constraints on the quality of the worker to be evaluated (>0.5). A. Ramesh [42] mainly studies the dynamic control of worker behaviours during evaluation; its research on worker quality evaluation is slightly weak, and its evaluation model is simple. P. Welinder [43] studies the evaluation of workers using an EM-based algorithm. P. G. Ipeirotis [44] analyses workers' preferences through worker quality evaluation. The studies above mostly focus on the traditional architecture and do not take the big data environment into consideration; thus their practicability and extensibility are insufficient.
In recent years, with the continuous development of cloud computing and the explosive growth of data sizes, data-driven approaches have become the focus of attention for enterprises. The crowdsourcing model also faces the challenge of big data [45]. Unlike the existing research, ours is the first paper that considers worker quality evaluation in a big data environment. Above all, this paper proposes a general crowdsourcing worker evaluation algorithm, which we implement on the Hadoop platform using the MapReduce programming model.
$$T_{ij} = \sum_{u=1}^{N} X_{iju}$$
1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
D.DANG ET AL.: A CROWDSOURCING WORKER QUALITY EVALUATION ALGORITHM ON MAPREDUCE FOR BIG DATA APPLICATIONS
$Q_{ij}$ is the expectation of the Bernoulli trial, that is, the probability that the responses of $w_i$ and $w_j$ will agree with each other:

$$Q_{ij} = \frac{T_{ij}}{N}$$
There is a wide range of tasks on a crowdsourcing platform. Different tasks have different M values and their own characteristics. Newly released tasks may have completely different modes compared with historical tasks. Moreover, newly released tasks have no pre-developed answers, and the order of the options is unpredictable. Therefore, it is difficult to predict which option is more likely to be the correct answer and which options a worker prefers. To provide a general solution for crowdsourcing worker quality evaluation, we assume that, for each problem, every worker has the same probability of selecting each wrong option. According to the idea of the M-1 algorithm and the definitions above, we can obtain the following equation.
$$Q_{ij} = A_i A_j + \left(\frac{1}{M-1}(1-A_i)\right)\left(\frac{1}{M-1}(1-A_j)\right)(M-1) \qquad (1)$$
Herein, $A_i$ represents the probability that $w_i$ chooses the correct option, while $\frac{1}{M-1}(1-A_i)$ represents the probability that $w_i$ chooses a particular wrong option. Accordingly, $\left(\frac{1}{M-1}(1-A_i)\right)\left(\frac{1}{M-1}(1-A_j)\right)(M-1)$ represents the probability that $w_i$ and $w_j$ choose the same wrong option.
For three workers, equation (1) yields one equation per pair, e.g. $Q_{13} = A_1 A_3 + \frac{(1-A_1)(1-A_3)}{M-1}$ and analogously for $Q_{12}$ and $Q_{23}$, giving three equations in the three unknowns $A_1$, $A_2$, $A_3$.
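As a sanity check on equation (1), the following Python sketch (the function names and Monte Carlo setup are ours, not from the paper) compares the closed-form agreement probability with a simulated agreement rate under the equal-wrong-option assumption:

```python
import random

def agreement_prob(a_i, a_j, m):
    # Eq. (1): workers agree when both pick the correct option,
    # or both pick the same one of the M-1 wrong options
    return a_i * a_j + ((1 - a_i) / (m - 1)) * ((1 - a_j) / (m - 1)) * (m - 1)

def simulated_agreement(a_i, a_j, m, n, rng):
    # option 0 is correct; wrong options 1..m-1 are chosen uniformly,
    # matching the paper's equal-wrong-option assumption
    agree = 0
    for _ in range(n):
        s_i = 0 if rng.random() < a_i else rng.randrange(1, m)
        s_j = 0 if rng.random() < a_j else rng.randrange(1, m)
        agree += (s_i == s_j)
    return agree / n

rng = random.Random(42)
q = agreement_prob(0.8, 0.6, 4)
q_hat = simulated_agreement(0.8, 0.6, 4, 200_000, rng)
```

With enough trials the empirical rate converges to the closed form, which supports the Bernoulli-trial reading of $Q_{ij}$.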
7    for each problem u
8        if S_iu == S_ju
9            set X_iju to 1
10       else
11           set X_iju to 0
12       end if
13   end for
14   Q_ij = A_i A_j + ((1/(M-1))(1-A_i)) ((1/(M-1))(1-A_j)) (M-1)
15 end for
16 calculate the accuracy rate of each worker using the three equations obtained in row 14
17 feedback the results to the crowdsourcer
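Row 16 solves the three pairwise agreement equations for the accuracies. One closed-form way to do this is sketched below; this derivation is our own (the paper does not spell out the solution), and it assumes each accuracy exceeds the random-guess level so the square root is well defined:

```python
import math

def solve_accuracies(q12, q13, q23, m):
    # Substitute u_i = (1+c)*A_i - c with c = 1/(M-1); then
    # Q_ij = A_i*A_j + c*(1-A_i)*(1-A_j) becomes u_i*u_j = (1+c)*(Q_ij - c) + c^2,
    # a product system solvable pairwise like a three-worker voting scheme.
    c = 1.0 / (m - 1)
    r = lambda q: (1 + c) * (q - c) + c * c
    r12, r13, r23 = r(q12), r(q13), r(q23)
    u1 = math.sqrt(r12 * r13 / r23)
    u2, u3 = r12 / u1, r13 / u1
    return tuple((u + c) / (1 + c) for u in (u1, u2, u3))
```

Feeding in agreement probabilities generated from known accuracies recovers those accuracies, e.g. (0.9, 0.7, 0.6) with M = 4.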
(Figure: starting from the circle of k workers, the grouping proceeds over Step 1, Step 2, ..., Step k.)
Input: TaskId;
  N problems P_1, P_2, ..., P_N;
  K workers W_1, W_2, ..., W_K
Output: accuracy rate of each worker
1 K workers are arranged in a circle
2 Define i = 0
3 while i < K
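A plausible sketch of the grouping over the circle of workers (the triple-of-adjacent-workers grouping is our assumption, chosen so that each worker obtains three accuracy estimates per task):

```python
def circle_groups(worker_ids):
    # each worker is grouped with its two successors on the circle,
    # so every worker (for K >= 3) appears in exactly three overlapping triples
    k = len(worker_ids)
    return [(worker_ids[i],
             worker_ids[(i + 1) % k],
             worker_ids[(i + 2) % k]) for i in range(k)]
```

For example, `circle_groups([1, 2, 3, 4])` yields the triples (1,2,3), (2,3,4), (3,4,1) and (4,1,2).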
The M-X algorithm focuses on multiple-choice problems. For multiple-choice problems, the answers of different workers to the same problem tend to differ widely, so we can hardly use the M-1 algorithm to evaluate worker quality directly. Therefore, based on the M-1 algorithm, we propose a multiple-choice problem-oriented worker quality evaluation algorithm, named the M-X algorithm. The idea of the M-X algorithm is as follows. First, we divide a problem into M sub-problems according to its M value, where the M value represents the number of options of the multiple-choice problem. Each option O_j (j = 1, 2, ..., M) is treated as a single-choice problem with two options, representing whether the option is chosen or not. Thus, every multiple-choice problem with M options is converted into M single-choice problems. Second, we treat each option dimension as a sub-task consisting of N single-choice problems; in this way, each task is divided into M sub-tasks. Then we use Algorithm 2 to calculate the workers' accuracy on each option dimension respectively.
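The conversion step can be sketched as follows (the `pid`/`selected` encoding is hypothetical, not the paper's data format):

```python
def expand_multiple_choice(pid, selected, m):
    # one multiple-choice problem with M options becomes M single-choice
    # (chosen / not chosen) sub-problems; option j maps to sub-problem
    # "<pid>-<j>" with response 1 iff the worker selected option j
    return {f"{pid}-{j}": int(j in selected) for j in range(1, m + 1)}
```

A worker who chose options 1 and 3 of a 4-option problem thus contributes responses 1, 0, 1, 0 to the four option dimensions.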
4 IMPLEMENTATION ON MAPREDUCE
To cope with the challenges brought by big data and to improve the efficiency of the crowdsourcing worker quality evaluation algorithm at scale, we use the MapReduce parallel computing framework to implement a general algorithm, called the MRM-X algorithm, based on the algorithm proposed in section 3. MapReduce is a parallel programming model and computing framework for processing massive data, which solves scalability, fault tolerance and other issues at the system level. By accepting a user-written Map function and Reduce function, it can automatically execute in parallel on scalable large-scale clusters; thus, it can process and analyse large-scale data sets [46], [47].
In actual crowdsourcing platforms, one task may contain different problem types, including single choice and multiple-choice. According to the idea of the M-X algorithm, we first need to convert multiple-choice problems into single-choice problems, and then calculate the workers' accuracy according to the multi-worker evaluation scheme of the M-1 algorithm. Considering the characteristics of the MapReduce programming model, we design three MapReduce tasks in this section. Task one is mainly responsible for data pre-processing, including problem type conversion and classifying workers. Task two is mainly responsible for using the M-1 algorithm to calculate the accuracy of the workers. Task three calculates the average accuracy of the workers. Fig. 2 illustrates the process of the MRM-X algorithm.
The original data format is <Wid, Tid, Pid, Ptype, Sid>, which represents the worker id, task id, problem id, problem type and worker's response, respectively.
1) Task One
As described above, task one first pre-processes the initial data by Ptype to obtain a data set that can be processed by the multi-worker evaluation scheme of the M-1 algorithm. It then groups the workers who are involved in the same task.
Map-1 processes the initial data and pre-processes problems according to Ptype. If a problem is a multiple-choice problem, we translate its M options into M single-choice problems, which are numbered sequentially. After this operation, we combine the id of each single-choice problem with the original Pid as a new Pid. Moreover, we set the worker's response Sid to one if the worker selects the option; otherwise, we set it to zero. If it is a single-choice problem, we skip this step. Then, the algorithm regards <Tid+Pid> as a key.
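Map-1 can be sketched as a generator over input records; field encodings such as the "|"-separated Sid for multiple-choice responses and the per-task M lookup are our assumptions, not the paper's:

```python
def map1(record, m_values):
    # record: (Wid, Tid, Pid, Ptype, Sid); m_values: assumed M value per task id
    wid, tid, pid, ptype, sid = record
    if ptype == "multiple":
        chosen = {int(x) for x in sid.split("|")}
        for j in range(1, m_values[tid] + 1):
            # new Pid = original Pid + sub-problem number; Sid becomes 0/1
            yield (f"{tid}+{pid}-{j}", (wid, 1 if j in chosen else 0))
    else:
        # single-choice problems pass through unchanged, keyed by <Tid+Pid>
        yield (f"{tid}+{pid}", (wid, sid))
```

Emitting <Tid+Pid> as the key lets the shuffle phase collect all workers' responses to the same (sub-)problem on one Reducer.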
(Fig. 2 depicts the Mapper/Reducer pipeline over HDFS: the stages exchange records of the forms <Wi+Wj+Wk, Tid+Pid+Ptype+Si+Sj+Sk>, <Wi+Wj+Wk+Tid, Pid+Ptype+Si+Sj+Sk>, and <Wid+Tid, avgAid>.)
Fig. 2. The flow chart of the crowdsourcing worker quality evaluation for the MRM-X algorithm.
Map-3 takes <Wid+Tid> as the key to shuffle, assigning the same worker's three accuracies for one task to the same Reducer.
Reduce-3 receives the output of Map-3 as input and calculates the average accuracy of each worker. The output is in the form <Wid+Tid, avgAid>, and avgAid is the final result.
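The averaging step of Reduce-3 can be sketched minimally as follows (the record layout is assumed):

```python
from collections import defaultdict

def reduce3(records):
    # records: ((Wid, Tid), accuracy) pairs grouped by Map-3's shuffle key;
    # emit <Wid+Tid, avgAid>, the mean of that worker's per-sub-task accuracies
    sums = defaultdict(lambda: [0.0, 0])
    for key, acc in records:
        sums[key][0] += acc
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```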
5 EXPERIMENTAL RESULTS
In experiment one of this section, we recruit 10 workers to participate in the same task and then use the proposed worker quality evaluation algorithm to calculate each worker's accuracy, preliminarily verifying the effectiveness of our algorithm. To verify the accuracy and effectiveness of the algorithm in a wide variety of big data scenarios more thoroughly, we further conduct a series of simulation experiments in this section to analyse and evaluate the performance of the worker quality evaluation algorithm. The experiments are conducted on the Hadoop platform with simulation data. For one task, we first randomly generate the answers to the problems according to the problem types (Boolean, single choice, multiple-choice and so on) in the task. Then we randomly generate workers with different levels; a worker's level is mainly determined by the accuracy, which lies between 0 and 1. Finally, we generate each worker's responses to each problem according to the worker accuracy that we generated.
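The simulation data can be generated along these lines; this is a sketch under our own assumptions (the paper does not give its generator), taking option 0 as the correct answer w.l.o.g.:

```python
import random

def generate_responses(n_problems, accuracies, m, seed=0):
    # each worker answers correctly with probability equal to his accuracy,
    # otherwise picks one of the m-1 wrong options uniformly at random
    rng = random.Random(seed)
    responses = {}
    for w, acc in enumerate(accuracies):
        responses[w] = [0 if rng.random() < acc else rng.randrange(1, m)
                        for _ in range(n_problems)]
    return responses
```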
The scale of the data set is [10000*100*(20|50|100|200|500)], which means that 10,000 workers participate in 100 tasks. When each task includes a different number of problems, such as [20, 50, 100, 200 and 500], we run the algorithm to observe the results.
TABLE 1
THE DISTRIBUTION OF THE PROBLEMS

Problem Type      M Value   Number
Boolean           2         20
Single choice     3         20
Single choice     4         20
Single choice     5         20
Multiple-choice   4         20
Fig. 4. The normal Q-Q chart for the expectation of worker accuracy.
Fig. 5. The normal Q-Q chart for the deviation of worker accuracy.
(Figure: accuracy (0 to 0.9) of workers 1-10 under the real accuracy, our algorithm, and the vote-based algorithm.)
Fig. 3. The comparison between the worker accuracy values based on different methods and the workers' real accuracy.
TABLE 2
SINGLE SAMPLE KOLMOGOROV-SMIRNOV TEST (K=1)

Problems per task               20       50       100      200      500
N                               100      100      100      100      100
Normal parameters(a,b):
  average                       .6431    .6818    .6626    .6694    .6661
  standard deviation            .11847   .08880   .06160   .03810   .02888
Most extreme differences:
  absolute value                .051     .072     .064     .046     .051
  positive                      .036     .052     .064     .046     .048
  negative                      -.051    -.072    -.060    -.044    -.051
Kolmogorov-Smirnov Z            .506     .716     .644     .460     .514
Asymp. sig. (2-tailed)          .960     .684     .802     .984     .954
TABLE 3
SINGLE SAMPLE KOLMOGOROV-SMIRNOV TEST (K=10)

Worker                 1       2       3       4       5       6       7       8       9       10
N                      100     100     100     100     100     100     100     100     100     100
Real accuracy          .8867   .6716   .8395   .8741   .7652   .7714   .6753   .6069   .9131   .6714
Normal parameters(a,b):
  average              .8972   .6894   .8553   .8885   .7780   .7850   .6865   .6171   .9233   .6912
  standard deviation   .04073  .05488  .05411  .04854  .04619  .06413  .07314  .07366  .03844  .05881
Most extreme differences:
  absolute value       .020    .038    .026    .029    .038    .020    .061    .034    .020    .032
  positive             .011    .027    .026    .029    .024    .014    .044    .029    .020    .020
  negative             -.020   -.038   -.017   -.015   -.038   -.020   -.061   -.034   -.013   -.032
Kolmogorov-Smirnov Z   .623    1.188   .823    .911    1.209   .634    1.942   1.091   .621    1.016
Asymp. sig. (2-tailed) .832    .119    .507    .378    .108    .817    .201    .185    .835    .254
TABLE 4
SINGLE SAMPLE KOLMOGOROV-SMIRNOV TEST

Worker                 1       2       3       4       5       6       7       8       9       10
N                      10      10      10      10      10      10      10      10      10      10
Normal parameters(a,b):
  average              .0136   .0155   .0133   .0142   .0141   .0132   .0162   .0113   .0089   .0172
  standard deviation   .00334  .00562  .00492  .00517  .00614  .00515  .00607  .00652  .00547  .00495
Most extreme differences:
  absolute value       .166    .199    .217    .169    .201    .173    .170    .184    .128    .160
  positive             .166    .199    .156    .169    .110    .121    .170    .184    .124    .160
  negative             -.153   -.126   -.217   -.129   -.201   -.173   -.162   -.141   -.128   -.154
Kolmogorov-Smirnov Z   .526    .628    .688    .535    .636    .548    .536    .581    .405    .506
Asymp. sig. (2-tailed) .945    .825    .732    .937    .813    .925    .936    .888    .997    .960
TABLE 5
APPROXIMATE MATRIX OF EUCLIDEAN DISTANCE BETWEEN DIFFERENT WORKERS' ACCURACY (K=10)

Euclidean
Distance    1      2      3      4      5      6      7      8      9      10
1         .000   6.915  2.470  1.844  4.235  4.304  7.179  9.243  1.925  6.873
2         6.915  .000   5.757  6.724  3.589  4.024  2.789  3.725  7.703  2.536
3         2.470  5.757  .000   2.505  3.316  3.508  6.041  8.082  2.975  5.768
4         1.844  6.724  2.505  .000   4.113  4.145  6.952  9.028  2.208  6.661
5         4.235  3.589  3.316  4.113  .000   2.524  3.978  5.755  4.966  3.614
6         4.304  4.024  3.508  4.145  2.524  .000   4.376  6.132  4.961  4.022
7         7.179  2.789  6.041  6.952  3.978  4.376  .000   3.857  7.935  3.027
8         9.243  3.725  8.082  9.028  5.755  6.132  3.857  .000   10.022 3.739
9         1.925  7.703  2.975  2.208  4.966  4.961  7.935  10.022 .000   7.670
10        6.873  2.536  5.768  6.661  3.614  4.022  3.027  3.739  7.670  .000
In this experiment, we still use the 10 workers randomly selected in experiment 4. From Table 5, we can observe that the larger the difference between two workers' real accuracies is, the larger their Euclidean distance is (e.g., worker 8 and worker 9); the smaller the difference between two workers' real accuracies is, the smaller their Euclidean distance is (e.g., worker 1 and worker 4), which also shows that a worker's quality remains similar across multiple tasks. Therefore, the algorithm can distinguish and reflect different workers' quality.
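The entries of Table 5 are plain Euclidean distances between two workers' per-task accuracy vectors, e.g.:

```python
import math

def euclidean(u, v):
    # distance between two workers' accuracy vectors over the same tasks
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For instance, accuracy vectors [0.9, 0.8] and [0.6, 0.4] lie at distance 0.5.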
Experiment 6: This experiment focuses on the effect of MapReduce on the performance of the algorithm. The algorithm involves three variables: the number of workers, the number of tasks and the number of problems in each task. The scale of the problem is N^3. The proposed worker quality evaluation algorithm can be divided into three steps:
(1) Sort and group the workers who are involved in each task. The time complexity of existing sorting algorithms can reach O(NlogN), so the time complexity of this step is O(N^2 logN).
(2) Calculate the worker accuracy. For N tasks, each task
(Figures: execution time in ms versus the number of threads and the number of nodes, for data scales 10000*100*100 and 20000*100*100.)
Even accounting for the communication time in a distributed computing environment, the execution performance of the algorithm is significantly improved; moreover, the larger the dataset is, the more obvious the acceleration becomes.
From Experiments 6.1 and 6.2, we can see that for computation-intensive tasks, the algorithm on a single machine cannot solve the performance problem, whereas distributed computing can, and the MapReduce parallel framework is a good choice. Moreover, MapReduce only distributes the computing tasks over the cluster and does not change the process of the algorithm; therefore, it improves performance without compromising the accuracy of the algorithm. The MapReduce cluster also has horizontal scalability: the computing performance of MapReduce grows approximately linearly with the number of nodes. With the expansion of the data size, the algorithm shows sustained effectiveness.
6 CONCLUSION
In this paper, we first proposed a general worker quality evaluation algorithm, which applies to crowdsourcing tasks without pre-developed answers. Then, to satisfy the demand of parallel evaluation for a multitude of workers in a big data environment, we implemented the proposed algorithm on the Hadoop platform using the MapReduce programming model. The experimental results show that the algorithm is accurate and achieves high efficiency and performance in a big data environment.
In our future studies, we will further consider other factors that affect worker quality, such as answer time and task difficulty. These factors will help realize a comprehensive evaluation of worker quality that adapts worker quality evaluation to different situations of the crowdsourcing mode in a big data environment.
ACKNOWLEDGMENT
D. Dang is the corresponding author of this paper. This paper is supported by the National Natural Science Foundation of China under Grant No. 60940032, No. 61073034, and No. 61370064; the Program for New Century Excellent Talents in University of the Ministry of Education of China under Grant No. NCET-10-0239; and the Science Foundation of the Ministry of Education of China and China Mobile Communications Corporation under Grant No. MCM20130371.
REFERENCES
[1] D.C. Brabham, "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases," Convergence: The International Journal of Research Into New Media Technologies, vol. 14, no. 1, pp. 75-90, 2008.
[2] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, et al., "Quality Control in Crowdsourcing Systems: Issues and Directions," IEEE Internet Computing, vol. 17, no. 2, pp. 76-81, 2013.
[3] A. Doan, R. Ramakrishnan, and A.Y. Halevy, "Crowdsourcing Systems on the World-Wide Web," Communications of the ACM, vol. 54, no. 4, pp. 86-96, 2011.
[4] P. Clough, M. Sanderson, J. Tang, et al., "Examining the Limits of Crowdsourcing for Relevance Assessment," IEEE Internet Computing, vol. 17, no. 4, pp. 32-38, 2013.
[5] B. Carpenter, "Multilevel Bayesian Models of Categorical Data Annotation," unpublished, 2008.
[6] A. Brew, D. Greene, and P. Cunningham, "Using Crowdsourcing and Active Learning to Track Sentiment in Online Media," Proc. 6th Conf. on Prestigious Applications of Intelligent Systems, 2010.
[7] J. Howe, "The Rise of Crowdsourcing," Wired Magazine, vol. 14, no. 14, pp. 176-183, 2006.
[8] V.C. Raykar, S. Yu, L.H. Zhao, et al., "Learning From Crowds," Journal of Machine Learning Research, vol. 11, pp. 1297-1322, 2010.
[9] J. Manyika, M. Chui, B. Brown, et al., "Big Data: The next fron-