Results 1 - 10 of 26
Improved part-of-speech tagging for online conversational text with word clusters
In Proceedings of NAACL, 2013
Cited by 95 (7 self)
We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at:
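The cluster features described above can be sketched as follows. This is an illustrative, hypothetical fragment, not the authors' code: the function name and cluster bit strings are invented. It shows how hierarchical (Brown-style) word-cluster IDs yield prefix features at several granularities, so rare spellings share features with frequent words:

```python
# Hypothetical sketch of hierarchical word clusters as tagger features.
# Cluster IDs are bit strings; prefixes of varying length give features
# at multiple granularities, so rare words share features with frequent ones.

def cluster_features(word, clusters, prefix_lengths=(2, 4, 6, 8)):
    """Return cluster-prefix features for one token (all names illustrative)."""
    bits = clusters.get(word.lower())
    if bits is None:
        return ["cluster=OOV"]
    return ["cluster_p%d=%s" % (n, bits[:n])
            for n in prefix_lengths if len(bits) >= n]

clusters = {"tmrw": "11010110", "tomorrow": "11010111"}
print(cluster_features("tmrw", clusters))
# → ['cluster_p2=11', 'cluster_p4=1101', 'cluster_p6=110101', 'cluster_p8=11010110']
```

Because "tmrw" and "tomorrow" share a long cluster prefix, a tagger sees similar features for both even if one is rare in labeled data.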
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
Cited by 34 (5 self)
Part-of-speech information is a pre-requisite in many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre. Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
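As a much-simplified illustration of tagger combination (the paper's actual method, vote-constrained bootstrapping across different tagsets, is considerably more involved), token-level majority voting over taggers that share one tagset might look like:

```python
from collections import Counter

# Simplified illustration only: majority-vote token tags from several taggers
# over the same tokens, breaking ties by tagger priority (input order).
# The paper instead handles taggers with *different* tagsets.

def combine_tags(tag_sequences):
    """tag_sequences: one tag list per tagger, all over the same tokens."""
    combined = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        best = max(counts.values())
        # first tagger's choice wins among tied top tags
        combined.append(next(t for t in position_tags if counts[t] == best))
    return combined

print(combine_tags([["NN", "VB"], ["NN", "NN"], ["JJ", "VB"]]))
# → ['NN', 'VB']
```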
Automatically Constructing a Normalisation Dictionary for Microblogs
Cited by 33 (4 self)
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance.
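A minimal sketch of this pipeline, assuming `difflib`'s sequence-matcher ratio as the string-similarity measure (the paper's measure, threshold, and context-based pair-generation step may differ; all names and data here are illustrative):

```python
import difflib

# Illustrative sketch (not the paper's exact method): candidate (variant, word)
# pairs, e.g. harvested from shared contexts, are ranked by string similarity;
# pairs above a threshold populate a lookup dictionary that then normalises
# text by one-for-one substitution.

def build_dictionary(candidate_pairs, threshold=0.6):
    sim = lambda v, w: difflib.SequenceMatcher(None, v, w).ratio()
    ranked = sorted(candidate_pairs, key=lambda p: sim(*p), reverse=True)
    dictionary = {}
    for variant, word in ranked:
        if variant not in dictionary and sim(variant, word) >= threshold:
            dictionary[variant] = word   # keep only the best-ranked pairing
    return dictionary

def normalise(tokens, dictionary):
    return [dictionary.get(t, t) for t in tokens]

d = build_dictionary([("tmrw", "tomorrow"), ("tmrw", "termite")])
print(normalise(["c", "u", "tmrw"], d))
# → ['c', 'u', 'tomorrow']
```

The appeal of the approach is that, once the dictionary is built, normalisation at run time is a plain dictionary lookup.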
Separating Fact from Fear: Tracking Flu Infections on Twitter
Cited by 30 (6 self)
Twitter has been shown to be a fast and reliable method for disease surveillance of common illnesses like influenza. However, previous work has relied on simple content analysis, which conflates flu tweets that report infection with those that express concerned awareness of the flu. By discriminating these categories, as well as tweets about the authors versus about others, we demonstrate significant improvements on influenza surveillance using Twitter.
From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0
Cited by 23 (3 self)
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.
A broad-coverage normalization system for social media language, 2012
Cited by 18 (1 self)
Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word and message level using four SMS and Twitter data sets. Results show that our system achieves over 90% word coverage across all data sets (a 10% absolute increase compared to the state of the art); the broad word coverage also successfully translates into message-level performance gains, yielding a 6% absolute increase compared to the best prior approach.
LeadLine: Interactive visual analysis of text data through event identification and exploration
In Proceedings of the 2012 IEEE Conference on Visual Analytics Science and Technology (VAST ’12), 2012
Cited by 12 (4 self)
Text data such as online news and microblogs bear valuable insights regarding important events and responses to such events. Events are inherently temporal, evolving over time. Existing visual text analysis systems have provided temporal views of changes based on topical themes extracted from text data, but few have associated topical themes with the events that cause the changes. In this paper, we propose an interactive visual analytics system, LeadLine, to automatically identify meaningful events in news and social media data and support exploration of the events. To characterize events, LeadLine integrates topic modeling, event detection, and named entity recognition techniques to automatically extract information regarding the investigative 4 Ws: who, what, when, and where for each event. To further support analysis of the text corpora through events, LeadLine allows users to interactively examine meaningful events using the 4 Ws to develop an understanding of how and why. Through representing large-scale text corpora in the form of meaningful events, LeadLine provides a concise summary of the ...
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation
Cited by 6 (1 self)
Social media texts are written in an informal style, which hinders other natural language processing (NLP) applications such as machine translation. Text normalization is thus important for processing of social media text. Previous work mostly focused on normalizing words by replacing an informal word with its formal form. In this paper, to further improve other downstream NLP applications, we argue that other normalization operations should also be performed, e.g., missing word recovery and punctuation correction. A novel beam-search decoder is proposed to effectively integrate various normalization operations. Empirical results show that our system obtains statistically significant improvements over two strong baselines in both normalization and translation tasks, for both Chinese and English.
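A toy beam-search normalizer, illustrating only the shape of such a decoder: keep-versus-substitute operations scored with hand-set costs and a fixed beam width. The paper additionally integrates operations such as missing-word recovery and punctuation correction, and uses learned scores; the substitution table and costs below are invented for illustration:

```python
import heapq

# Toy beam-search decoder for text normalization (illustrative only).
# Each token position expands every surviving hypothesis with alternative
# operations (keep as-is, or substitute via a lookup table); only the
# `beam` highest-scoring partial outputs survive each step.

SUBS = {"tmrw": [("tomorrow", -0.1)], "u": [("you", -0.1)]}  # invented costs

def beam_normalise(tokens, beam=3):
    hyps = [(0.0, [])]                       # (log-score, output-so-far)
    for tok in tokens:
        nxt = []
        for score, out in hyps:
            nxt.append((score - 0.2, out + [tok]))           # keep as-is
            for sub, cost in SUBS.get(tok, []):
                nxt.append((score + cost, out + [sub]))      # substitute
        hyps = heapq.nlargest(beam, nxt, key=lambda h: h[0])
    return max(hyps, key=lambda h: h[0])[1]

print(beam_normalise(["c", "u", "tmrw"]))
# → ['c', 'you', 'tomorrow']
```

Other operations (inserting a recovered word, fixing punctuation) would slot in as additional expansions inside the same loop, which is the integration point the abstract describes.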
A Framework for (Under)specifying Dependency Syntax without Overloading Annotators
Cited by 5 (3 self)
We introduce a framework for lightweight dependency syntax annotation. Our formalism builds upon the typical representation for unlabeled dependencies, permitting a simple notation and annotation workflow. Moreover, the formalism encourages annotators to underspecify parts of the syntax if doing so would streamline the annotation process. We demonstrate the efficacy of this annotation on three languages and develop algorithms to evaluate and compare underspecified annotations.
A Participant-based Approach for Event Summarization Using Twitter Streams
Cited by 4 (0 self)
Twitter offers an unprecedented advantage on live reporting of the events happening around the world. However, summarizing the Twitter event has been a challenging task that was not fully explored in the past. In this paper, we propose a participant-based event summarization approach that “zooms in” the Twitter event streams to the participant level, detects the important sub-events associated with each participant using a novel mixture model that combines the “burstiness” and “cohesiveness” properties of the event tweets, and generates the event summaries progressively. We evaluate the proposed approach on different event types. Results show that the participant-based approach can effectively capture the sub-events that have otherwise been shadowed by the long-tail of other dominant sub-events, yielding summaries with considerably better coverage than the state-of-the-art.