Results 1 - 10
of
235
Text Classification from Labeled and Unlabeled Documents using EM
- Machine Learning
, 1999
"... . This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large qua ..."
Abstract
-
Cited by 632 (16 self)
- Add to MetaCart
. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
Inductive Learning Algorithms and Representations for Text Categorization
, 1998
"... Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categori ..."
Abstract
-
Cited by 419 (9 self)
- Add to MetaCart
Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, realtime classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate. 1.1 Keywords Text categorization, classification, support vector machines, machine learning, information management.
Combating web spam with trustrank
- In VLDB
, 2004
"... Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages fr ..."
Abstract
-
Cited by 220 (2 self)
- Add to MetaCart
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites. 1
Bursty and Hierarchical Structure in Streams
, 2002
"... A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade aw ..."
Abstract
-
Cited by 196 (2 self)
- Add to MetaCart
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.
Support vector machines for spam categorization
- IEEE TRANSACTIONS ON NEURAL NETWORKS
, 1999
"... We study the use of support vector machines (SVM’s) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features ..."
Abstract
-
Cited by 178 (2 self)
- Add to MetaCart
We study the use of support vector machines (SVM’s) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features were constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM’s performed best when using binary features. For both data sets, boosting trees and SVM’s had acceptable test performance in terms of accuracy and speed. However, SVM’s had significantly less training time.
Attention-Sensitive Alerting
, 1998
"... We introduce utility-directed procedures for mediating the flow of potentially distracting alerts and communications to computer users. We present models and inference procedures that balance the context-sensitive costs of deferring alerts with the cost of interruption. We describe the challen ..."
Abstract
-
Cited by 165 (22 self)
- Add to MetaCart
We introduce utility-directed procedures for mediating the flow of potentially distracting alerts and communications to computer users. We present models and inference procedures that balance the context-sensitive costs of deferring alerts with the cost of interruption. We describe the challenge of reasoning about such costs under uncertainty via an analysis of user activity and the content of notifications. After introducing principles of attention-sensitive alerting, we focus on the problem of guiding alerts about email messages. We dwell on the problem of inferring the expected criticality of email and discuss work on the Priorities system, centering on prioritizing email by criticality and modulating the communication of notifications to users about the presence and nature of incoming email. 1 Introduction Multitasking computer systems provide great value to users by hosting numerous processes and applications simultaneously. However, the ongoing execution of mu...
Detecting spam web pages through content analysis
- In Proceedings of the World Wide Web conference
, 2006
"... In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatica ..."
Abstract
-
Cited by 110 (3 self)
- Add to MetaCart
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
- In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
, 2000
"... The growing problem of unsolicited bulk e-mail, also known as "spare", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naiv ..."
Abstract
-
Cited by 74 (2 self)
- Add to MetaCart
The growing problem of unsolicited bulk e-mail, also known as "spare", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted " form contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader. Keywords filtering/routing; text categorization; machine learning and IR; evaluation (general); test collections I.
Boosting trees for anti-spam email filtering
- In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG
, 2001
"... This paper describes a set of comparative experiments for the problem of automatically ltering unwanted electronic mail messages. Several variants of the AdaBoost algorithm with con dence{ rated predictions (Schapire & Singer 99) have been applied, which di er in the complexity of the base learners ..."
Abstract
-
Cited by 73 (0 self)
- Add to MetaCart
This paper describes a set of comparative experiments for the problem of automatically ltering unwanted electronic mail messages. Several variants of the AdaBoost algorithm with con dence{ rated predictions (Schapire & Singer 99) have been applied, which di er in the complexity of the base learners considered. Two main conclusions can be drawn from our experiments: a) The boosting{based methods clearly outperform the baseline learning algorithms (Naive Bayes and Induction of Decision Trees) on the PU1 corpus, achieving very high levels of the F1 measure � b) Increasing the complexity of the base learners allows to obtain better \high{precision " classi ers, which isavery important issue when misclassication costs are considered. 1
Adversarial Classification
- IN KDD
, 2004
"... Essentially all data mining algorithms assume that the datagenerating process is independent of the data miner's activities. However, in many domains, including spam detection, intrusion detection, fraud detection, surveillance and counter-terrorism, this is far from the case: the data is actively m ..."
Abstract
-
Cited by 71 (0 self)
- Add to MetaCart
Essentially all data mining algorithms assume that the datagenerating process is independent of the data miner's activities. However, in many domains, including spam detection, intrusion detection, fraud detection, surveillance and counter-terrorism, this is far from the case: the data is actively manipulated by an adversary seeking to make the classifier produce false negatives. In these domains, the performance of a classifier can degrade rapidly after it is deployed, as the adversary learns to defeat it. Currently the only solution to this is repeated, manual, ad hoc reconstruction of the classifier. In this paper we develop a formal framework and algorithms for this problem. We view classification as a game between the classifier and the adversary, and produce a classifier that is optimal given the adversary's optimal strategy. Experiments in a spam detection domain show that this approach can greatly outperform a classifier learned in the standard way, and (within the parameters of the problem) automatically adapt the classifier to the adversary's evolving manipulations.

