S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
....been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [18, 5]. For more details, see Section 4. To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns ....
....text data, such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [18, 5]. For more details, see Section 4. To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we are ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
....been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [18, 5]. For more details, see Section 4. To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns according ....
....text data, such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [18, 5]. For more details, see Section 4. To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we are ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
....that face t turns up when coin c is tossed. π(c) is the prior probability of class c, typically, the fraction of documents in the training or testing set that are from class c. n(d,t) is the number of times term t occurred in document d. The details of the estimation of θ have been described in [33]. An alternative binary model truncates n(d,t), the number of times term t occurs in document d, to a {0, 1} value. Classification over the entire taxonomy can then be posed as a shortest path problem on the taxonomy. A sketch of the TAPER hierarchical feature selection and classification engine ....
R. Agrawal, S. Chakrabarti, B. Dom, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In VLDB 1998.
....engine. The problem is somewhat less extreme for classification tasks, where we can in some cases arrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal ....
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Matthias Jarke, Michael Carey, Klaus R. Dittrich, Fred Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, Proceedings of the 23rd VLDB Conference, pages 446--455, 1997.
....of the internet. Several methods, from simple probabilistic Naive Bayes to the complex SVMs, have been used for text categorization [22, 17]. An inherent problem of text data is its high dimensionality. To counter high dimensionality, various methods of feature selection have been proposed in [30, 18, 5]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
....been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [17, 4, 10]. For some more details, see Section 4.1. To counter high dimensionality, various methods of feature selection have been proposed in [32, 17, 4]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster ....
....such as the topic hierarchies of Yahoo (www.yahoo.com) and the Open Directory Project (www.dmoz.org), hierarchical classification has been studied in [17, 4, 10]. For some more details, see Section 4.1. To counter high dimensionality, various methods of feature selection have been proposed in [32, 17, 4]. Distributional clustering of words was first proposed by Pereira, Tishby and Lee in [25], where they used soft distributional clustering to cluster nouns according to their conditional verb distributions. Note that since our main goal is to reduce the number of features and the model size, we ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
....paradigm for structuring the information and making its search and access more efficient. Library catalogs, file systems, botanical and animal classifications and many internet directories, such as Yahoo, are examples of taxonomies. The taxonomies allow searching by concept rather than by keyword [CDAR97], which is much more efficient when no particular keyword is known or a keyword has different interpretations. A concept can be specified as a path in the taxonomy. Definition 1.1 (Classification) Classification is the process of finding a set of models (or functions) that describe and distinguish ....
....element of the document, such as a single word, phrase, author, subject, etc. Unstructured terms, such as words and phrases, are also called text terms, while structured terms are also called non-text terms. Main categories of methods used for text classification are Naive Bayes [Lew98, CDAR97], Nearest Neighbor [MLW92], neural networks [NGL97], Support Vector Machines [Joa98], regression [YC94], decision trees [ADW98], TF-IDF style classifiers [SM83, pages 78-81], [BS95, Roc71], and associative classifiers [LHM98, WZL99]. To relax the independence assumption behind Naive Bayes ....
[Article contains additional citation context not shown here]
S. Chakrabarti, B. E. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proc. 23rd Int. Conf. Very Large Data Bases (VLDB'97), pages 446--455, Athens, GR, 1997.
....the keywords and their occurrence within the page. 2.2 Classifier The two main phases for the classifier are training and classifying. Training produces an evaluation model for classification. To construct the evaluation model, a feature selecting process is needed. We use Fisher's discriminant [1] to calculate scores of all the attributes from the HTML Parser and then carry out feature selection based on the scores. We choose a Naive Bayesian-based classifier [2] to perform the text-based classification task. This classifier uses Bayes' rule to carry out classification from the text ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan, Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, Athens, 1997.
....Library and Yahoo. 2 Related work With few exceptions, most classification systems assume that all classes are at a flat level and each document is labeled by one class [5, 6, 9]. Recently, hierarchically structured classes were examined in [3, 4, 7], where classes are organized into a hierarchy of increasing specificity and a document is labeled by one class in the hierarchy. Though a document belonging to a child class is automatically considered as belonging to a parent class, a document is not allowed to belong to two classes not on a ....
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan, Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases, VLDB 1997.
....engine. The problem is somewhat less extreme for classification tasks, where we can in some cases arrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal ....
Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Matthias Jarke, Michael Carey, Klaus R. Dittrich, Fred Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, Proceedings of the 23rd VLDB Conference, pages 446--455, 1997.
....(see Yahoo, US Patent databases, CNN and other major Internet news directories [13, 9, 2]). Querying with respect to a concept hierarchy is significantly more efficient and reliable than searching for specific keywords since the views of the data collected are refined as we go down the hierarchy [1]. Second, when text classification is our chosen approach, an efficient algorithm should be used. A number of algorithms have been proposed and their performances are compared in the literature [14]. We use the TFIDF text classifier [11, 4] and proceed with the following steps for hierarchical ....
....similar to hierarchical classification based on a predefined concept hierarchy. Many of them are combined with feature subset selection, which finds the best subset of features that improves classification accuracy and reduces measurement cost, storage, and computational overhead. One example, TAPER [1], makes use of a concept hierarchy and classifies text using statistical pattern recognition techniques. It finds feature subsets by Fisher's discriminant. Similarly, Mladenic and Grobelnik proposed a document categorization method based on a concept hierarchy [8]. They used the naive Bayesian ....
S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proceedings of the 23rd VLDB Conference, 1997.
....that face t turns up when coin c is tossed. π(c) is the prior probability of class c, typically, the fraction of documents in the training or testing set that are from class c. n(d,t) is the number of times term t occurred in document d. The details of the estimation of θ have been described in [33]. An alternative binary model truncates n(d,t), the number of times term t occurs in document d, to a {0, 1} value. Classification over the entire taxonomy can then be posed as a shortest path problem on the taxonomy. A sketch of the TAPER hierarchical feature selection and classification engine is ....
R. Agrawal, S. Chakrabarti, B. Dom, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In VLDB 1998, 1998.
....If the vocabulary in N is quite different from M, the classification accuracy will certainly be affected. However, this problem is orthogonal to the idea of using the similarity information in N. Our model flattens the catalog hierarchy and treats it as a set of categories. Past studies [6], [3] have shown that exploiting the hierarchical structure can lead to better classification results than using the flattened structure. Our enhancements for using the information in N can be easily incorporated in a Naive Bayes hierarchical classifier, such as one used in [3]. Incorporation of these ....
....Past studies [6], [3] have shown that exploiting the hierarchical structure can lead to better classification results than using the flattened structure. Our enhancements for using the information in N can be easily incorporated in a Naive Bayes hierarchical classifier, such as one used in [3]. Incorporation of these enhancements into other classification schemes, such as the SVM classifier used in [6], requires further work. Another issue related to hierarchies is that the hierarchy in M may be more detailed than N or vice versa. If M is more detailed than N, our technique can still ....
[Article contains additional citation context not shown here]
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proc. of the 23rd Int'l Conf. on Very Large Databases, pages 446--455, 1997.
....component is used for hierarchy reorganization, document routing, and identification of misfiled documents. We decided to base our classifier on the Naive Bayes model [Goo65] for the following reasons: Naive Bayes classifiers are very competitive with other techniques for text classification [CDAR97], [LR94], [Lan95], [PB97], [MN98] (footnote 2). They stabilize quickly [Koh96], which supports automated hierarchy reorganization with a limited number of examples. They are fast. They can be constructed quickly with a single pass over the documents, making them suitable for online model creation; they ....
....stabilize quickly [Koh96], which supports automated hierarchy reorganization with a limited number of examples. They are fast. They can be constructed quickly with a single pass over the documents, making them suitable for online model creation; they also quickly classify incoming documents [CDAR97]. They are simple to update in the presence of document additions or deletions, making them easy to maintain. Footnote 2: We also experimented with using the SPRINT decision tree classifier [SAM96] but found it had low accuracy in this domain due to the small number of examples per class and the ....
[Article contains additional citation context not shown here]
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proc. of the 23rd Int'l Conf. on Very Large Databases, pages 446--455, 1997.
....for hierarchy reorganization, document routing, Inbox visualization, and identification of misfiled documents. We decided to base our classifier on the Naive Bayes model [Goo65] for the following reasons: • Naive Bayes classifiers are very competitive with other techniques for text classification [CDAR97], [LR94], [Lan95], [PB97], [MN98] (footnote 2). • They stabilize quickly [Koh96], which is useful for hierarchy reorganization. • They are fast. They can be constructed quickly with a single pass over the documents, making them suitable for online model creation; they also quickly classify incoming ....
....[Lan95], [PB97], [MN98] (footnote 2). • They stabilize quickly [Koh96], which is useful for hierarchy reorganization. • They are fast. They can be constructed quickly with a single pass over the documents, making them suitable for online model creation; they also quickly classify incoming documents [CDAR97]. • They are simple to update in the presence of document additions or deletions, making them easy to maintain. The basic Naive Bayes classifier estimates the posterior probability of class $C_i$ given a document $d$ via Bayes' rule: $\Pr(C_i \mid d) = \frac{\Pr(d \mid C_i) \times \Pr(C_i)}{\Pr(d)}$. We ignore $\Pr(d)$ ....
[Article contains additional citation context not shown here]
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proc. of the 23rd Int'l Conf. on Very Large Databases, pages 446--455, 1997.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
No context found.
Chakrabarti, S., Dom, B., Agrawal, R. and Raghavan, P. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, 1997, 446-455.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In M. Jarke, M.J. Carey, K.R. Dittrich, F.H. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB'97, Proc. of 23rd Int. Conf. on Very Large Data Bases, pages 446--455. Morgan Kaufmann, 1997.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In M. Jarke, M. Carey, K. Dittrich, F. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB'97, Proc. of 23rd Int. Conf. on Very Large Data Bases, pages 446--455. Morgan Kaufmann, 1997.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In M. Jarke, M. Carey, K. Dittrich, F. Lochovsky, P. Loucopoulos, and M. A. Jeusfeld, editors, VLDB'97, Proc. of 23rd Int. Conf. on Very Large Data Bases, pages 446--455. Morgan Kaufmann, 1997.
No context found.
Chakrabarti, S., Dom, B., Agrawal, R., Raghavan, P. Using Taxonomy, Discriminants, and Signatures for Navigating in Text Databases. VLDB: 446-455, 1997.
No context found.
S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 446-455, Athens, Greece, August 1997.