Open Domain Event Extraction from Twitter
Cited by 52 (3 self)
Tweets are the most up-to-date and inclusive stream of information and commentary on current events, but they are also fragmented and noisy, motivating the need for systems that can extract, aggregate and categorize important events. Previous work on extracting structured representations of events has focused largely on newswire text; Twitter’s unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal, the first open-domain event-extraction and categorization system for Twitter. We demonstrate that accurately extracting an open-domain calendar of significant events from Twitter is indeed feasible. In addition, we present a novel approach for discovering important event categories and classifying extracted events based on latent variable models. By leveraging large volumes of unlabeled data, our approach achieves a 14% increase in maximum F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at
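The abstract above reports results as "maximum F1". As a rough illustration of that metric only, the sketch below ranks predictions by confidence and takes the best F1 over all cutoffs. The function name, the input layout, and the toy scores are assumptions made for this example; the paper's actual evaluation protocol is not described in this listing.

```python
def max_f1(scored_predictions, num_gold):
    """Return the best F1 over all confidence cutoffs.

    scored_predictions: list of (confidence, is_correct) pairs (hypothetical layout).
    num_gold: total number of gold-standard items, used as the recall denominator.
    """
    if num_gold <= 0:
        raise ValueError("num_gold must be positive")
    # Rank predictions from most to least confident (ties keep input order).
    ranked = sorted(scored_predictions, key=lambda pair: pair[0], reverse=True)
    best = 0.0
    true_positives = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            true_positives += 1
        precision = true_positives / k      # correct among the top k
        recall = true_positives / num_gold  # correct among all gold items
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data, invented for illustration only.
example = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(round(max_f1(example, num_gold=4), 4))
```

Sweeping the cutoff this way compares systems at their best operating point rather than at one fixed threshold.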
Short message communications: users, topics, and in-language processing
- In Proceedings of the 2nd ACM Symposium on Computing for Development, 2012
Cited by 3 (0 self)
This paper investigates three dimensions of cross-domain analysis for humanitarian information processing: citizen reporting vs organizational reporting; Twitter vs SMS; and English vs non-English communications. Short messages sent during the response to the recent earthquake in Haiti and floods in Pakistan are analyzed. It is clear that SMS and Twitter were used very differently at the time, by different groups of people. SMS was primarily used by individuals on the ground while Twitter was primarily used by the international community. Turning to semi-automated strategies that employ natural language processing, it is found that English-optimal strategies do not carry over to Urdu or Kreyol, especially with regards to subword variation. Looking at machine-learning models that attempt to combine both Twitter and SMS, it is found that the cross-domain prediction accuracy is very poor, but some loss in accuracy can be overcome by learning prior distributions over the sources. It is concluded that there is only limited utility in treating SMS and Twitter as equivalent information sources – perhaps much less than the relatively large number of recent Twitter-focused papers would indicate.
The Regression Model of Machine Translation
- 2011
Cited by 2 (0 self)
for fulfillment of the requirements for the degree of
Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text
Cited by 2 (0 self)
We present a new approach to named-entity recognition that jointly learns to identify named-entities in parallel text. The system generates seed candidates through local, cross-language edit likelihood and then bootstraps to make broad predictions across both languages, optimizing combined contextual, word-shape and alignment models. It is completely unsupervised, with no manually labeled items, no external resources, only using parallel text that does not need to be easily alignable. The results are strong, with F > 0.85 for purely unsupervised named-entity recognition across languages, compared to just F = 0.35 on the same data for supervised cross-domain named-entity recognition within a language. A combination of unsupervised and supervised methods increases the accuracy to F = 0.88. We conclude that we have found a viable new strategy for unsupervised named-entity recognition across low-resource languages and for domain-adaptation within high-resource languages.
Pivot-based Triangulation for Low-Resource Languages
Cited by 2 (0 self)
This paper conducts a comprehensive study on the use of triangulation for four very low-resource languages: Mawukakan and Maninkakan, Haitian Kreyol and Malagasy. To the best of our knowledge, ours is the first effective translation system for the first two of these languages. We improve translation quality by adding data using pivot languages and experimentally compare previously proposed triangulation design options. Furthermore, since the low-resource language pair and pivot language pair data typically come from very different domains, we use insights from domain adaptation to tune the weighted mixture of direct and pivot-based phrase pairs to improve translation quality.
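To make "triangulation" and "weighted mixture of direct and pivot-based phrase pairs" concrete, here is a minimal sketch of one common textbook formulation: marginalize over pivot phrases to get a source-to-target table, then linearly interpolate it with the direct table. This is not necessarily the design the paper settles on. The function names, the nested-dict table layout, the mixture weight, and all tokens and probabilities are assumptions invented for the example.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Build p(tgt | src) = sum over pivots of p(pivot | src) * p(tgt | pivot)."""
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in table.items()}


def interpolate(direct, pivot, weight):
    """Linear mixture: weight * direct + (1 - weight) * pivot-based."""
    mixed = {}
    for src in set(direct) | set(pivot):
        d = direct.get(src, {})
        p = pivot.get(src, {})
        mixed[src] = {
            tgt: weight * d.get(tgt, 0.0) + (1.0 - weight) * p.get(tgt, 0.0)
            for tgt in set(d) | set(p)
        }
    return mixed


# Placeholder phrases and probabilities, invented for illustration only.
src_to_pivot = {"src_a": {"piv_x": 0.9, "piv_y": 0.1}}
pivot_to_tgt = {
    "piv_x": {"tgt_1": 0.8, "tgt_2": 0.2},
    "piv_y": {"tgt_3": 0.7, "tgt_1": 0.3},
}
direct = {"src_a": {"tgt_1": 1.0}}

pivot_table = triangulate(src_to_pivot, pivot_to_tgt)
mixed_table = interpolate(direct, pivot_table, weight=0.5)
print(pivot_table)
print(mixed_table)
```

In a setup like this, the mixture weight would typically be tuned on held-out data, which is one place where domain mismatch between the direct and pivot corpora could be accounted for.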