Dirt Cheap Web-Scale Parallel Text from the Common Crawl

by Jason R. Smith, Herve Saint-Amand, Chris Callison-Burch, Magdalena Plamada, Adam Lopez
Citations: 15 (4 self)
BibTeX

@MISC{Smith_dirtcheap,
    author = {Jason R. Smith and Herve Saint-Amand and Chris Callison-Burch and Magdalena Plamada and Adam Lopez},
    title = {Dirt Cheap Web-Scale Parallel Text from the Common Crawl},
    year = {}
}


Abstract

Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.
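The abstract's core idea — starting from nothing more than two-letter language codes — can be illustrated with a small sketch of STRAND-style candidate pairing: URLs that differ only in a language code are likely translations of the same page. This is an illustrative assumption about the approach, not the authors' released code; the language set, regex, and function names below are invented for the example.

```python
import re
from collections import defaultdict

# A small illustrative subset of two-letter language codes (assumption).
LANG_CODES = {"en", "fr", "de", "es", "ar"}

# Match a two-letter code delimited by common URL separators.
LANG_RE = re.compile(r"(?<=[/._=-])([a-z]{2})(?=[/._=-])")

def candidate_pairs(urls):
    """Group URLs whose only difference is a two-letter language code.

    Replacing each detected code with a placeholder yields a bucket key;
    URLs sharing a key in more than one language are candidate pairs.
    """
    buckets = defaultdict(list)
    for url in urls:
        for m in LANG_RE.finditer(url):
            if m.group(1) in LANG_CODES:
                key = url[:m.start()] + "*" + url[m.end():]
                buckets[key].append((m.group(1), url))
    for hits in buckets.values():
        if len({code for code, _ in hits}) > 1:
            yield sorted(hits)
```

For example, `http://example.com/en/page.html` and `http://example.com/fr/page.html` share the key `http://example.com/*/page.html` and would be emitted as a candidate pair; the full pipeline then verifies such candidates by comparing document structure.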

Keyphrases

common crawl, dirt cheap web-scale parallel text, parallel text, curated datasets, web-scale parallel text, modern machine translation system, STRAND algorithm, common two-letter language code, language pair, news domain, open-source extension, minimal cleaning, elastic cloud, rich new data resource, large-scale experiment, open domain test, translation performance, entire web, public web crawl, comprehensive source
