In the era of big data, data can often be obtained abundantly and cheaply, but providing labels for large-scale data remains a challenge because labeling is expensive and time-consuming. For example, to train a model that annotates images automatically, learning algorithms need many images with known annotations as training data. Although large numbers of images are available on the internet, specialists must be hired to annotate them before they can serve as training data.
Crowdsourcing has become an effective and efficient paradigm for labeling large-scale data: users (known as taskmasters) post "micro-tasks" on the internet, which voluntary workers complete in exchange for small monetary payments. Once the tasks are posted, thousands of workers can access them over the internet, and the taskmaster can collect labels for these tasks in a short period of time.
Not all voluntary workers in the crowd are reliable; some may provide wrong labels for the tasks. To improve quality and reliability, the common wisdom is to add redundancy to the labels: each task is presented to multiple workers, and the ground-truth label is then inferred from the multiple collected labels by intelligent algorithms. Wei Wang and Zhi-Hua Zhou of Nanjing University have presented a theoretical analysis of label quality in crowdsourcing and derived an upper bound on the error rate of the inferred labels. They also analyzed workers based on their completed tasks and provided criteria for evaluating worker quality. These theoretical results can help eliminate low-quality workers from the crowd, improve label quality, and reduce labeling cost.
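The redundancy idea described above can be illustrated with a minimal sketch: collect multiple labels per task, aggregate them by majority vote, and score each worker by agreement with the aggregate. Note that majority voting and agreement-rate scoring are standard baselines chosen here for illustration, not the specific algorithms or criteria analyzed in the paper; the task and worker names are made up.

```python
from collections import Counter, defaultdict

def majority_vote(labels_per_task):
    """Aggregate redundant labels: pick the most frequent label per task."""
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in labels_per_task.items()}

def worker_quality(worker_labels, aggregated):
    """Estimate each worker's quality as their agreement rate with the aggregate."""
    quality = {}
    for worker, answers in worker_labels.items():
        agree = sum(aggregated[task] == label for task, label in answers.items())
        quality[worker] = agree / len(answers)
    return quality

# Hypothetical example: three workers label two image tasks; "w3" errs on img1.
worker_labels = {
    "w1": {"img1": "cat", "img2": "dog"},
    "w2": {"img1": "cat", "img2": "dog"},
    "w3": {"img1": "dog", "img2": "dog"},
}

# Regroup the labels by task for aggregation.
labels_per_task = defaultdict(list)
for worker, answers in worker_labels.items():
    for task, label in answers.items():
        labels_per_task[task].append(label)

agg = majority_vote(labels_per_task)          # {'img1': 'cat', 'img2': 'dog'}
quality = worker_quality(worker_labels, agg)  # w3 scores lower than w1 and w2
```

A taskmaster could then drop workers whose estimated quality falls below a threshold, in the spirit of the elimination of low-quality workers mentioned above.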
This research was published in the November 2015 issue of SCIENCE CHINA Information Sciences.
See the article:
WANG Wei, ZHOU Zhi-Hua*. Crowdsourcing label quality: a theoretical analysis. SCIENCE CHINA Information Sciences, 2015, 58(11): 112103(12)
Science China Press