A natural law of succession (1995)

by E S Ristad
Results 1 - 10 of 39

Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty

by Evgeniy Gabrilovich, Susan Dumais, Eric Horvitz - In WWW2004 , 2004
Abstract - Cited by 96 (7 self)
We present a principled methodology for filtering news stories by formal measures of information novelty, and show how the techniques can be used to custom-tailor newsfeeds based on information that a user has already reviewed. We review methods for analyzing novelty and then describe Newsjunkie, a system that personalizes news for users by identifying the novelty of stories in the context of stories they have already reviewed. Newsjunkie employs novelty-analysis algorithms that represent articles as words and named entities. The algorithms analyze inter- and intra- document dynamics by considering how information evolves over time from article to article, as well as within individual articles. We review the results of a user study undertaken to gauge the value of the approach over legacy time-based review of newsfeeds, and also to compare the performance of alternate distance metrics that are used to estimate the dissimilarity between candidate new articles and sets of previously reviewed articles.

Data mining for hypertext: A tutorial survey

by Soumen Chakrabarti , 2000
Abstract - Cited by 94 (0 self)
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

Citation Context

... does not occur, and so Pr(c|d) will also be zero even if a test document d contains even one such term. To avoid this in practice, the ML estimate is often replaced by the Laplace-corrected estimate [47, 61]:

θ_{c,t} = (1 + Σ_{d ∈ D_c} n(d, t)) / (|T| + Σ_s Σ_{d ∈ D_c} n(d, s))   (3)

Since in the binary case there are two outcomes instead of |T| outcomes, the corresponding correction is φ_{c,t} = (1 + |{d ∈ D_c : t ∈ d}|) / (2 + |D_c| ...
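The Laplace-corrected estimate quoted in this context can be sketched as a short function. This is an illustrative sketch, not code from the cited survey; the names `docs_in_class` and `vocabulary` are assumptions introduced here:

```python
from collections import Counter

def laplace_term_estimates(docs_in_class, vocabulary):
    """Laplace-corrected per-term probabilities for one class.

    docs_in_class: list of token lists (the documents D_c).
    vocabulary: the full term set T.
    Implements theta_{c,t} = (1 + n(c, t)) / (|T| + sum_s n(c, s)),
    so every vocabulary term gets nonzero probability.
    """
    counts = Counter()
    for doc in docs_in_class:
        counts.update(doc)
    total = sum(counts.values())
    return {t: (1 + counts[t]) / (len(vocabulary) + total) for t in vocabulary}
```

Because every term contributes a pseudo-count of 1, an unseen term no longer forces Pr(c|d) to zero, which is exactly the failure mode the excerpt describes.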

Using taxonomy, discriminants, and signatures for navigating in text databases

by Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan - In Proceedings of the 23rd VLDB Conference , 1997
Abstract - Cited by 88 (5 self)
We explore how to organize a text database hierarchically to aid better searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora, such as internet directories, digital libraries, and patent databases, enjoy. In our system, the user navigates through the query response not as a flat unstructured list, but embedded in the familiar taxonomy, and annotated with document signatures computed dynamically with respect to where the user is located at any time. We show how to update such databases with new documents with high speed and accuracy. We use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multi-level classifier. At each node, this classifier can ignore the large number of noise words in a document. Thus the classifier has a small model size and is very fast. However, owing to the use of context-sensitive features, the classifier is very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.

Citation Context

...tic one: to pick a model appropriate for the task at hand. 3.2 Rare events and laws of succession The average English speaker uses about 20,000 of the 1,000,000 or more terms in an English dictionary [27]. In that sense, many terms that occur in documents are "rare events." This means that with reasonably small sample sets, we will see zero occurrences of many, many terms, and will still be required t...

Classification of text documents

by Y. H. Li, A. K. Jain - The Computer Journal , 1998
Abstract - Cited by 84 (0 self)
Abstract not found

Citation Context

... probabilities, P(w_i | c_j), will not be reasonable; if a word never appears in the given training data, its relative frequency estimate will be zero. Instead, we applied the Laplace law of succession [14] to estimate P(w_i | c_j). The estimate of the probability P(w_i | c_j) is given as:

P(w_i | c_j) = (n_ij + 1) / (n_j + k_j)   (2)

where n_j is the total number of words in class c_j, n_ij is the number of occur...

Building Probabilistic Models for Natural Language

by Stanley F. Chen , 1996
Abstract - Cited by 73 (1 self)
Abstract not found

Citation Context

...ysis described in Section 2.5 lend themselves well to improving algorithms that estimate p(O_f | O) directly. There are several existing Bayesian smoothing methods (Nadas, 1984; MacKay and Peto, 1995; Ristad, 1995), but none perform particularly well or are in wide use. 5.1.2 Bayesian Grammar Induction In grammar induction, we have a very different situation from that found in smoothing. In smoothing, we have ...

Bayesian approaches to failure prediction for disk drives

by Greg Hamerly, Charles Elkan - In Proc. 18th ICML , 2001
Abstract - Cited by 47 (2 self)
Hard disk drive failures are rare but are often costly. The ability to predict failures is important to consumers, drive manufacturers, and computer system manufacturers alike. In this paper we investigate the abilities of two Bayesian methods to predict disk drive failures based on measurements of drive internal conditions. We first view the problem from an anomaly detection stance. We introduce a mixture model of naive Bayes submodels (i.e. clusters) that is trained using expectation-maximization. The second method is a naive Bayes classifier, a supervised learning approach. Both methods are tested on real-world data concerning 1936 drives. The predictive accuracy of both algorithms is far higher than the accuracy of thresholding methods used in the disk drive industry today.

Citation Context

...ut is when trained multiple times under slightly varying initial conditions, as created for example by cross-validation. We address the problem of zero counts by adding artificial counts to all bins (Ristad, 1995). This has the effect of increasing low probabilities, and decreasing high probabilities. A popular method of zero count smoothing is Lidstone's law of succession (Lidstone, 1920): P(x|k) = (n_x + ...
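Lidstone's law of succession, cut off in the excerpt above, adds a fractional pseudo-count λ to every bin; λ = 1 recovers Laplace's law. A minimal sketch under that assumption (the names `samples`, `outcomes`, and `lam` are illustrative, not from the cited paper):

```python
from collections import Counter

def lidstone_estimate(samples, outcomes, lam=0.5):
    """Lidstone's law of succession: P(x) = (n_x + lam) / (N + lam * k).

    samples: observed draws; outcomes: the set of k possible bins.
    lam = 1.0 reduces to Laplace's law; smaller lam trusts the data more.
    """
    counts = Counter(samples)
    n = len(samples)
    k = len(outcomes)
    return {x: (counts[x] + lam) / (n + lam * k) for x in outcomes}
```

With no data at all the estimate degenerates to the uniform distribution over the k bins, which is why adding artificial counts "increases low probabilities and decreases high probabilities" as the excerpt says.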

Athena: Mining-based interactive management of text databases

by Rakesh Agrawal, Roberto Bayardo, Ramakrishnan Srikant - International Conference on Extending Database Technology , 2000
Abstract - Cited by 43 (3 self)
Abstract. We describe Athena: a system for creating, exploiting, and maintaining a hierarchy of textual documents through interactive mining-based operations. Requirements of any such system include speed and minimal end-user effort. Athena satisfies these requirements through linear-time classification and clustering engines which are applied interactively to speed the development of accurate models. Naive Bayes classifiers are recognized to be among the best for classifying text. We show that our specialization of the Naive Bayes classifier is considerably more accurate (7 to 29% absolute increase in accuracy) than a standard implementation. Our enhancements include using Lidstone's law of succession instead of Laplace's law, under-weighting long documents, and over-weighting author and subject. We also present a new interactive clustering algorithm, C-Evolve, for topic discovery. C-Evolve first finds highly accurate cluster digests (partial clusters), gets user feedback to merge and correct these digests, and then uses the classification algorithm to complete the partitioning of the data. By allowing this interactivity in the clustering process, C-Evolve achieves considerably higher clustering accuracy (10 to 20% absolute increase in our experiments) than the popular K-Means and agglomerative clustering methods.

Citation Context

... (1) where |V| is the size of the vocabulary (i.e., the number of distinct words in the dataset). The above formula is the result of assuming that all possible words are a priori equally likely (see [Ris95] for details). Following [MN98], we use the multinomial form of the Naive Bayes classifier, where each document is treated as a bag of words rather than a set of words, to yield better accuracy. 2.1 Enh...
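The multinomial bag-of-words scoring that this context describes can be sketched in a few lines. This is an illustrative sketch, not Athena's implementation; `class_term_probs` is assumed to hold already-smoothed P(term | class) values:

```python
import math
from collections import Counter

def nb_log_score(doc_tokens, class_term_probs, class_prior):
    """Multinomial naive Bayes log-score of one class for one document.

    Each document is a bag of words: a token repeated m times contributes
    m * log P(token | class), unlike a set-of-words model where it counts once.
    class_term_probs: term -> smoothed P(term | class); class_prior: P(class).
    """
    score = math.log(class_prior)
    for term, count in Counter(doc_tokens).items():
        if term in class_term_probs:  # skip out-of-vocabulary terms
            score += count * math.log(class_term_probs[term])
    return score
```

The predicted class is simply the one with the highest log-score; working in log space avoids underflow when documents are long.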

Efficient Bayesian Parameter Estimation in Large Discrete Domains

by Nir Friedman, Yoram Singer - Advances in Neural Information Processing Systems , 1999
Abstract - Cited by 38 (1 self)
In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assumption that the observed outcomes constitute only a small subset of the possible outcomes. We show how to efficiently perform exact inference with this form of hierarchical prior, compare our method to standard approaches, and demonstrate its merits. 1 Introduction One of the most important problems in statistical inference is multinomial estimation: given a past history of observations of independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in mor...

Citation Context

...t. Our method is based on an efficient inference algorithm built on a hierarchical prior. Among the numerous techniques that have been used for multinomial estimation, the one proposed by Ristad [13] is the closest to ours. Though the methodology used by Ristad is substantially different from ours, his method can be seen as a special case of sparse multinomials with α set to 1 and specific for...

Cross-training: Learning probabilistic mappings between topics

by Sunita Sarawagi, Soumen Chakrabarti, Shantanu Godbole - In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining , 2003
Abstract - Cited by 23 (2 self)
Classification is a well-established operation in text mining. Given a set of labels A and a set DA of training documents tagged with these labels, a classifier learns to assign labels to unlabeled test documents. Suppose we also had available a different set of labels B, together with a set of documents DB marked with labels from B. If A and B have some semantic overlap, can the availability of DB help us build a better classifier for A, and vice versa? We answer this question in the affirmative by proposing cross-training: a new approach to semi-supervised learning in the presence of multiple label sets. We give distributional and discriminative algorithms for cross-training and show, through extensive experiments, that cross-training can discover and exploit probabilistic relations between two taxonomies for more accurate classification.

A Study Of n-Gram And Decision Tree Letter Language Modeling Methods

by Gerasimos Potamianos , Frederick Jelinek - SPEECH COMMUNICATION , 1998
Abstract - Cited by 19 (2 self)
The goal of this paper is to investigate various language model smoothing techniques and decision tree based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision tree based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs the best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams.

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University