A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....(NG20), mini 20 newsgroups (mini20), and the CLASSIC datasets for empirical performance analysis on text data. The NG20 dataset is a collection of 20,000 messages, collected from 20 different usenet newsgroups, 1,000 messages from each. We preprocessed the raw dataset using the Bow toolkit [23], including chopping off headers and removing stop words as well as words that occur in less than three documents. In the resulting dataset, each document is represented by a 43,586-dimensional sparse vector and there are a total of 19,949 documents (with empty documents removed, still around 1,000 ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....nodes. The list of categories and urls we used is available at www.cs.utexas.edu/users/manyam/dmoz.txt. While indexing we skipped text between html tags, pruned words occurring in less than five documents, used a stop list but did not use stemming. The resulting vocabulary had 14,538 words. Bow [22] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow to index BdB (www.sleepycat.com) flat file databases where we stored the text documents for efficient retrieval and storage. We implemented Agglomerative and Divisive ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.
....rows, lists the number of positive and negative examples of each query. The second part gives the predictive accuracy of Toogle for two learning methods: the Support Vector Machine (SVM) and the Naive Bayes Classifier (NBC). We used the SVMLight [1] implementation for the SVM, and the Bow toolkit [3] for NBC. The predictive accuracy is defined as the fraction of examples whose label (positive or negative) was correctly predicted by Toogle when trained on the examples of the first result page. The third part proposes an evaluation of the browsing gain for the end user of Toogle, in terms of ....
McCallum A. K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996. http://www.cs.cmu.edu/~mccallum/bow.
....1460 abstracts from information retrieval papers and CRANFIELD consists of 1400 abstracts from aerodynamic systems. After removing stop words and numeric characters we selected the top 2000 words by mutual information as part of our preprocessing. We will refer to this data set as CLASSIC3. Bow [16] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow with our co-clustering and 1D clustering procedures, and used MATLAB for spy plots of matrices. 5.2 Evaluation Measures Validating clustering results is a non-trivial ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.
....this task in four different ways. The first is a language modeling variant of a K-nearest-neighbors (KNN) algorithm. The second, which we call the class models approach, is a variant of centroid-based clustering [12]. In addition to the above two approaches, we also used the Rainbow toolkit [16] for categorization using naive Bayes [15] and maximum entropy [18] as baselines. These have been shown to have good performance. The 5 Nearest Neighbors (5NN) approach is as follows. The entities in a training set are initially assigned to their classes on the basis of human judgments. An ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....third data set consisted of acoustic features from recorded music. Finally, I examine the effect of adding set information to the joint probabilistic model described by Cohn and Hofmann [3]. 3.1 WebKB data The first set of experiments began with a subset of the WebKB data set [4]. Using Rainbow [9], I tokenized 1000 randomly selected documents, stripping out HTML and digits, and kept the 1000 terms with highest class-dependent information gain (the reduced vocabulary greatly decreased processing times). The result was 1000 documents with 1000 features, where feature f_ij represented ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....retrieval papers and CRANFIELD consists of 1400 abstracts from aerodynamic systems. After removing stop words and numeric characters we selected the top 2000 words by mutual information as part of our preprocessing. We will refer to this data set as CLASSIC3. 5.2 Implementation Details Bow [15] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow to implement co-clustering and document clustering and used MATLAB to give spy plots of the matrices. The data sets used in [18] and [8] differ in their preprocessing ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....documents), LISA (6004 text collections), and CISI (1460 abstracts from the Institute of Scientific Information). Each of these text collections is broken into a number of separate files with about one hundred abstracts or text collections for each file. An available public domain tool, Rainbow [25], was employed for text extraction. The text was tokenized using common tokenization options: the words from the SMART stop list (524 common words) [4], such as "the" and "of", are neglected before tokenization; the Porter stemming algorithm [12] was applied for all words before they are counted. ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....Figure 6: XML Snippet for the OHSU Medical Abstracts DataSet. The Data Mapping Generator for the text visualization system is responsible for text extraction, pruning and dimension reduction. [12] An available public domain tool, rainbow [24], was employed for text extraction. [Figure 5: Text Visualization within XmdvTool. (a) Without annotation. (b) With annotation (along with a URI display window showing the complete text for the selected document).] The text was tokenized by the most used tokenization options; the words from the ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....documents. The list of categories and urls we used is available at www.cs.utexas.edu/users/manyam/dmoz.txt. While indexing we skipped text between html tags, pruned words occurring in less than five documents, used a stop list but did not use stemming. The resulting vocabulary had 14,538 words. Bow [23] is a library of C code useful for writing text analysis, language modeling and information retrieval programs. We extended Bow to index BdB (www.sleepycat.com) flat file databases where we stored the text documents for efficient retrieval and storage. We implemented Agglomerative and Divisive ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.
....module takes over. It uses text classification techniques. Given a statistical language model that indicates how strongly each word is associated with domain classifications (e.g. misconceptions), it computes the most likely classification of each sentence using either a naive Bayesian approach [8] or LSA. Besides this statistical language model, the SLU requires three other large knowledge sources: a meaning representation language definition, a grammar and a lexicon. The meaning representation language definition specifies the domain predicates and types that can occur in the logical ....
McCallum, A.K., Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. 1996, CMU: Pittsburgh, PA.
....articles in 20 newsgroups. We focus on three datasets, each with two newsgroups: 1/2: alt.atheism and comp.graphics; 10/11: rec.sport.baseball and rec.sport.hockey; 18/19: talk.politics.mideast and talk.politics.misc. (The newsgroup dataset together with the bow toolkit for processing is available online [19].) Word-document matrix X is first constructed. 2000 words are selected according to the mutual information between words and documents, $I(w) = \sum_d p(w,d) \log_2 [p(w,d) / (p(w) p(d))]$, where w represents a word and d represents a document. Words are stemmed using [19]. Standard tf.idf scheme for ....
....processing is available online [19].) Word-document matrix X is first constructed. 2000 words are selected according to the mutual information between words and documents, $I(w) = \sum_d p(w,d) \log_2 [p(w,d) / (p(w) p(d))]$, where w represents a word and d represents a document. Words are stemmed using [19]. Standard tf.idf scheme for term weighting is used and standard cosine similarity between two documents $d_1, d_2$, $\mathrm{sim}(d_1, d_2) = d_1 \cdot d_2 / (|d_1| |d_2|)$, is used. When each document, column of X, is normalized to 1 using L2 norm, document-document similarities are calculated as $W = X^T X$. W ....
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
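The context quoted in this entry normalizes each document column of X to unit L2 norm and then reads document-document similarities off a matrix product. The following is a minimal sketch of that step in plain Python, assuming the product is W = X^T X (the transpose appears to have been lost in extraction). The toy matrix and the helper names `l2_normalize_columns`, `gram_matrix`, and `cosine` are illustrative assumptions and are not taken from the cited paper or from the Bow toolkit.

```python
import math


def l2_normalize_columns(X):
    """Return a copy of X (rows = words, columns = documents) with unit-length columns."""
    n_rows, n_cols = len(X), len(X[0])
    norms = [math.sqrt(sum(X[i][j] ** 2 for i in range(n_rows))) for j in range(n_cols)]
    return [[X[i][j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
            for i in range(n_rows)]


def gram_matrix(X):
    """Compute W = X^T X, so W[a][b] is the dot product of document columns a and b."""
    n_rows, n_cols = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n_rows)) for b in range(n_cols)]
            for a in range(n_cols)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors, computed directly for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical 3-word x 3-document weight matrix (illustrative values only).
X = [[2.0, 1.0, 0.0],
     [0.0, 1.0, 3.0],
     [1.0, 0.0, 1.0]]

# After column normalization, each entry of W is a cosine similarity.
W = gram_matrix(l2_normalize_columns(X))
```

Because every column has unit length after normalization, each entry W[a][b] equals the cosine similarity of documents a and b, so the whole similarity matrix comes from one product.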
....newsgroups: rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, talk.politics.guns, misc.forsale, talk.politics.mideast and talk.religion.misc. To extract the necessary statistics from the data set we gratefully made use of the Rainbow program, part of the Bow Toolkit by McCallum [11]. To measure predictive performance we used a test set of fifty documents per class. It turned out that the variance in predictive performance on different smaller test sets was negligible, therefore we decided to use just one test set. As preprocessing we used stemming, a stop list and removed ....
A.K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. fetched May 24 2000.
....as vectors in a high dimensional term space. The difference is that SVM picks a hyperplane embedded midway in the thickest possible slab passing between positive and negative examples and containing no training point. NB takes time essentially linear in the number n of training documents [13, 14], whereas SVMs take time proportional to $n^a$, where a is typically between 1.8 and 2.1. Thanks to some clever implementations [17, 9], SVMs have been trained on several thousand instances despite their near-quadratic complexity. However, to achieve this, they hold the entire training data in main ....
....is formulated. All tokens are turned to lowercase and standard SMART stopwords (ftp://ftp.cs.cornell.edu/pub/smart/) are removed, but no stemming is performed. No feature selection is used prior to running any of our classification algorithms: the naive Bayes classifier in the Rainbow library [13] (with Laplace and Lidstone's methods evaluated for parameter smoothing), SVMlight and SIMPL. Alternatively, one may preprocess the collection through a common feature selector and then submit them to each classifier, which adds a fixed time to each classifier. 4.1 Accuracy The hill climbing ....
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Software available from http://www.cs.cmu.edu/~mccallum/bow/, 1998.
....sequential IB algorithm in the comparisons below. 6.2 The evaluation method Unfortunately there is no clear standard about what should be referred to as a file header in this corpus. In particular, the results reported in [15, 16] stripped off the header including the subject line (as instructed in [9]). On the other hand, the results reported in [1, 5] do make use of the subject line, which in many cases contains useful information. To make our results comparable with [5] we decided to use the subject line in this paper. Specifically we united the 5 comp categories, the 3 religion ....
McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....keywords or to other documents, we use the TF-IDF Rocchio method [71], a popular weighting and classification scheme for text documents. This method assigns a similarity score based on a term's frequency in the document and its inverse frequency over all documents. We use the Bow toolkit [55] in our implementation. 7 Multimodal Queries Since text and shape queries can provide orthogonal notions of similarity corresponding to function and form, our search engine allows them to be combined. We support this feature in two ways. First, text keywords and 2D/3D sketches may be entered in ....
....via TCP to a matching server (running on a Dell Precision 530 PC with two 1.5GHz Pentium III processors and 1 GB of memory). There, a Perl job control script forks a separate process for each incoming query. Text queries are stemmed and passed directly to the Bow toolkit classifier rainbow [55]. For 2D sketches and uploaded 3D model files a shape signature is computed and compared against an in-memory index by a separate shape matching process. All match results (model ids, scores, statistics) are returned to the web server, which constructs a web page with results and returns it to ....
A.K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....removal and stemming, a feature vector was obtained for each document without applying any other feature selection. The feature vector recorded the words that appear in the document and their frequencies. The stopword list and the stemming algorithm were taken directly from the BOW library [8]. 4.2 Search Engine for Training Document Pool Given the keyword-based category profiles, positive and negative training documents are selected from the training document pool for the construction of the personalized classifiers. In our work, a small-scale search engine is implemented to search ....
A. K. McCallum. BOW: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
....the problem even more difficult for our users. To address this problem, we have built an automatic text classifier that can suggest terms in a hierarchy that are appropriate for classifying a talk based on its title and abstract. The classifier package used was from the Bag Of Words (BOW) toolkit [23] by Andrew McCallum at CMU. This library provides support for a wide variety of text classification and retrieval algorithms. We used the Naive Bayes algorithm, which is widely used in the classification literature, fairly effective, and quick to learn the 285 classes in our test collection. We ....
McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996. site: http://www.cs.cmu.edu/~mccallum/bow.
....by himself/herself. This is the only information given by the user to the system to produce the report. In the other steps of the method, the processes are executed automatically. 5.b. Classification of Paragraphs In order to implement the classification phase of the ARG Tool, we have used Rainbow [15], which is a program that performs statistical text classification using the naive Bayes algorithm. To prepare the compiled articles we first split them into paragraphs and enumerate them according to their article number. For example, if the first article is stored in file 1.txt, its third paragraph ....
McCallum, Andrew Kachites, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996. http://www.cs.cmu.edu/~mccallum/bow.
....482 leaf nodes and a total of 144,859 sample URLs. Out of these we could successfully fetch about 120,000 URLs. All nodes have equal degree in a regular graph. For the classifier we used the public domain BOW toolkit and the Rainbow naive Bayes (NB) classifier created by McCallum and others [23]. Bow and Rainbow are very fast C implementations which let us classify pages in real time as they were being crawled. Rainbow's naive Bayes learner is given a set of training documents, each labeled with one of a finite set of classes/topics. A document d is modeled as a multiset or bag of ....
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Software available from http://www.cs.cmu.edu/~mccallum/bow/, 1998.
....HTML parser and cleaner for future work. We intend to make our crawler and HTML parser code available in the public domain for research use. For both the baseline and apprentice classifier we used the public domain BOW toolkit and the Rainbow naive Bayes classifier created by McCallum and others [20]. Bow and Rainbow are very fast C implementations which let us classify pages in real time as they were being crawled. 3.1 Design of the topic taxonomy We downloaded from the Open Directory (http://dmoz.org/) an RDF file with over 271954 topics arranged in a tree hierarchy with depth at least ....
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Software available from http://www.cs.cmu.edu/~mccallum/bow/, 1998.
....of categories and urls is available at www.cs.utexas.edu/users/manyam/dmoz.txt. While indexing we skipped text between html tags, pruned words occurring in less than five documents, used a stop list, but did not use stemming. The resulting vocabulary had 14538 words. 6.2 Implementation Details Bow [22] is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. We extended Bow to implement Distributional Clustering, Divisive Clustering and indexing BdB files. We wrote a Perl wrapper around Bow to achieve Hierarchical Classification. ....
....evaluate the tradeoff between soft and hard clustering. Also, for our word clustering algorithm we intend to study other measures of distributional similarity [19]. Acknowledgements. We are grateful to Andrew McCallum for some helpful discussions and for making the Bow software library [22] publicly available. For this research, ISD was supported by an NSF CAREER Grant (No. ACI 0093404) while Mallela was supported by the University of Texas Austin MCD Fellowship. ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. www.cs.cmu.edu/~mccallum/bow, 1996.
....some of the independent variables) if that leads to better prediction accuracy. Over the years a variety of different classification algorithms have been developed by the machine learning community. Examples of such algorithms are decision tree based [1, 26, 25], rule based [4, 5], probabilistic [18], neural networks [23, 34], genetic [12], instance based [8, 35], and support vector machines [31, 32]. Depending on the characteristics of the data sets being classified certain algorithms tend to perform better than others. In recent years, algorithms based on the support vector machines and the ....
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A. K. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
No context found.
McCallum, A. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
No context found.
McCallum, A. K. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
No context found.
McCallum, A.K. 1996. Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. http://www-2.cs.cmu.edu/~mccallum/bow/.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification, and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A.K. 1996. Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. http://www-2.cs.cmu.edu/~mccallum/bow/.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A.: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow (1996)
No context found.
A.K. McCallum, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996, http://www.cs.cmu.edu/~mccallum/bow.
No context found.
A. McCallum, "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, A.: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow (1996)
No context found.
A. McCallum. Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. In http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Software available from http://www.cs.cmu.edu/~mccallum/bow/, 1998.
No context found.
A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
No context found.
McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
CiteSeer.IST - Copyright Penn State and NEC