Results 1 - 10 of 210
Learning Information Extraction Rules for Semi-structured and Free Text
- Machine Learning, 1999
Cited by 437 (10 self)
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semistructured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories. Keywords: natural language processing, information extraction, rule learning 1. Information extraction As more and more text becomes available on-line, there is a growing need for systems that extract information automatically from text data. An information extraction (IE) sys...
Recommendation as Classification: Using Social and Content-Based Information in Recommendation
- In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998
Cited by 342 (8 self)
Recommendation systems make suggestions about artifacts to a user. For instance, they may predict whether a user would be interested in seeing a particular movie. Social recommendation methods collect ratings of artifacts from many individuals and use nearest-neighbor techniques to make recommendations to a user concerning new artifacts. However, these methods do not use the significant amount of other information that is often available about the nature of each artifact, such as cast lists or movie reviews. This paper presents an inductive learning approach to recommendation that is able to use both ratings information and other forms of information about each artifact in predicting user preferences. We show that our method outperforms an existing social-filtering method in the domain of movie recommendations on a dataset of more than 45,000 movie ratings collected from a community of over 250 users. Introduction Recommendations are a part of everyday life. We usually...
Content-Based Book Recommending Using Learning for Text Categorization
- In Proceedings of the Fifth ACM Conference on Digital Libraries, 1999
Cited by 334 (8 self)
Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use collaborative filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.
Data mining methods for detection of new malicious executables
- In Proceedings of the IEEE Symposium on Security and Privacy, 2001
Cited by 155 (3 self)
A serious security threat today is malicious executables, especially new, unseen malicious executables. Many of these new malicious executables are undetectable by current anti-virus systems because they do not contain signatures for these new instances of malicious programs. These new malicious executables are created every day, and thus pose a serious security threat. We present a framework that detects new, previously unseen malicious executables. Comparing our detection methods with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.
Just how mad are you? Finding strong and weak opinion clauses
- In Proceedings of AAAI, 2004
Cited by 131 (2 self)
identification and extraction of opinions and emotions in text.
Distributed search over the hidden web: Hierarchical database sampling and selection
- In VLDB, 2002
Cited by 122 (13 self)
Many valuable text databases on the web have non-crawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from “uncooperative” databases by using “focused query probes,” which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. Our content summaries are the first to include absolute document frequency estimates for the database words. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to compensate for potentially incomplete content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases. Our experiments indicate that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies. Also, our hierarchical database selection algorithm exhibits significantly higher precision than its flat counterparts.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
, 2005
Cited by 110 (15 self)
We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
Text Classification Using WordNet Hypernyms
- Use of WordNet in Natural Language Processing Systems: Proceedings of the Conference, pages 38–44. Association for Computational Linguistics, 1998
Cited by 101 (1 self)
This paper describes experiments in Machine Learning for text classification using a new representation of text based on WordNet hypernyms. Six binary classification tasks of varying difficulty are defined, and the Ripper system is used to produce discrimination rules for each task using the new hypernym density representation. Rules are also produced with the commonly used bag-of-words representation, incorporating no knowledge from WordNet. Experiments show
Recognizing Contextual Polarity: An Exploration of Features for Phrase-Level Sentiment Analysis
- Computational Linguistics, 2009
Cited by 95 (2 self)
Many approaches to automatic sentiment analysis begin with a large lexicon of words marked with their prior polarity (also called semantic orientation). However, the contextual polarity of the phrase in which a particular instance of a word appears may be quite different from the word’s prior polarity. Positive words are used in phrases expressing negative sentiments, or vice versa. Also, quite often words that are positive or negative out of context are neutral in context, meaning they are not even being used to express a sentiment. The goal of this work is to automatically distinguish between prior and contextual polarity, with a focus on understanding which features are important for this task. Because an important aspect of the problem is identifying when polar terms are being used in neutral contexts, features for distinguishing between neutral and polar instances are evaluated, as well as features for distinguishing between positive and negative contextual polarity. The evaluation includes assessing the performance of features across multiple machine learning algorithms. For all learning algorithms except one, the combination of all features together gives the best performance. Another facet of the evaluation considers how the presence of neutral instances affects the performance of features for distinguishing between positive and negative polarity. These experiments show that the presence of neutral instances greatly degrades the performance of these features, and that perhaps the best way to improve performance across all polarity classes is to improve the system’s ability to identify when an instance is neutral.