Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

by Victor S. Sheng, Foster Provost, Panagiotis G. Ipeirotis
Citations: 252 (12 self)

BibTeX

@MISC{Sheng_getanother,
    author = {Victor S. Sheng and Foster Provost and Panagiotis G. Ipeirotis},
    title = {Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers},
    year = {}
}

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
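
Result (i) can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that repeated labels are combined by simple majority vote, that labelers are independent, and that every labeler is correct with the same probability p. The abstract specifies none of these, and the function name `majority_vote_quality` is hypothetical. Under those assumptions, the probability that the majority label is correct is a binomial tail sum.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority of n labels is correct.

    Assumptions (illustrative, not from the abstract): labels are
    independent, each is correct with probability p, and n is odd
    so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of labels to avoid ties")
    # Sum the binomial probabilities of every outcome in which more
    # than half of the n labels are correct.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.8):
        row = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {row}")
```

In this simplified model, extra labels help only when p is above 0.5. At p = 0.5 the majority is right half the time regardless of n, and below 0.5 more labels make the majority worse, which is one way to read "but not always".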

Keyphrases

data mining using multiple, improving data quality, noisy labelers, traditional setting, considerable advantage, robust technique, different notion, everything multiple time, low-cost labeling, certain label-quality cost regime, label quality, present repeated-labeling strategy, unlabeled data, model quality, multiple label, simple strategy, supervised induction, selective acquisition, repeated acquisition, several main result, amazon mechanical turk, small task, data quality, training label, bottom line, data point, unlabeled part, low cost, data item, less-than-expert labeling
