K. Nigam, A. McCallum, S. Thrun and T. Mitchell: "Text Classification from Labeled and Unlabelled Documents using EM," Machine Learning, Vol. 39, No. 2/3, pp. 103-134, 2000.
....a topic-specific search engine is to extract values for specific fields from the web pages, storing the values in a database so that structured queries can be performed over the extracted information. To aid in this endeavor, much research has been done in the area of text classification (e.g. [9,10,11]). Text classification is used in topic-specific search engines in at least two areas. First, it is used during crawling to classify web pages as to whether they are relevant to the given topic. Second, it can be used to classify relevant web pages or content extracted from web pages into a ....
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabelled documents using EM. Machine Learning, 1999
....documents in addition to graph structure of web pages can be used to augment learning relations. Soumen Chakrabarti [7] 4.2 Semi-supervised learning. Semi-supervised learning is a goal-directed activity, which can be precisely evaluated, whereas unsupervised learning is open to interpretation [47]. On the other hand, supervised learning needs a large training data set, which must be obtained through human effort [47]. In real life, most often one has a relatively small collection of labeled training data, but a larger pool of unlabeled data. In the web context our training data is a small ....
....4.2 Semi-supervised learning. Semi-supervised learning is a goal-directed activity, which can be precisely evaluated, whereas unsupervised learning is open to interpretation [47]. On the other hand, supervised learning needs a large training data set, which must be obtained through human effort [47]. In real life, most often one has a relatively small collection of labeled training data, but a larger pool of unlabeled data. In the web context our training data is a small set of labeled documents. The label is the document class, and our goal is to guess the label of an unseen document. In this ....
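The setting described in this context (a few labeled documents plus a larger unlabeled pool) is the one the cited paper addresses with EM. The following is only a minimal sketch of that general idea, assuming a multinomial naive Bayes model with Laplace smoothing and a fixed number of iterations; the function names `train`, `posterior`, and `em_naive_bayes` and the toy documents are hypothetical and are not taken from the citing or the cited paper.

```python
import math
from collections import Counter

def train(docs, weights, classes, vocab):
    """M-step: Laplace-smoothed priors and word probabilities from soft labels."""
    prior, cond = {}, {}
    for c in classes:
        mass = sum(w[c] for w in weights)
        prior[c] = (1.0 + mass) / (len(classes) + len(docs))
        counts = Counter()
        for tokens, w in zip(docs, weights):
            for t in tokens:
                counts[t] += w[c]  # each occurrence weighted by P(c | d)
        total = sum(counts[t] for t in vocab)
        cond[c] = {t: (1.0 + counts[t]) / (len(vocab) + total) for t in vocab}
    return prior, cond

def posterior(tokens, prior, cond):
    """E-step for one document: P(c | d), computed in log space."""
    logp = {c: math.log(prior[c])
               + sum(math.log(cond[c][t]) for t in tokens if t in cond[c])
            for c in prior}
    top = max(logp.values())
    norm = sum(math.exp(v - top) for v in logp.values())
    return {c: math.exp(v - top) / norm for c, v in logp.items()}

def em_naive_bayes(labeled, unlabeled, classes, iterations=10):
    """Start from the labeled documents, then alternate E- and M-steps."""
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    lab_docs = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    prior, cond = train(lab_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(d, prior, cond) for d in unlabeled]
        prior, cond = train(lab_docs + unlabeled, hard + soft, classes, vocab)
    return prior, cond

# Toy data (invented for illustration only).
labeled = [(["crawler", "search", "web"], "relevant"), (["recipe", "cooking"], "other")]
unlabeled = [["web", "search", "engine"], ["cooking", "kitchen", "recipe"], ["engine", "crawler"]]
prior, cond = em_naive_bayes(labeled, unlabeled, ["relevant", "other"])
guess = posterior(["engine", "search"], prior, cond)
```

Here the word "engine" never appears in a labeled document, so any evidence it contributes to the final guess comes from the unlabeled pool.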
[Article contains additional citation context not shown here]
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, Text classification from labeled and unlabelled documents using EM, Machine Learning Journal, 1999.
.... other. To solve this problem we use a modification of the Naive Bayes Classifier for each layer. This classifier architecture provides reasonable performance, high speed, meets the requirement of our system that a likelihood estimate be provided for each classification, and is well studied [17, 18, 12]. Assume that we have a document $d_i$ represented by the vector corresponding to the reduced TFIDF representation relative to the vocabulary $V$. Documents from class $c_j$, defined to correspond to layer $j$, are assumed to have a prior probability of being found on the web which we denote $P(c_j)$ ....
.... words in the documents of class $c_j$: $$P(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D_j} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{d_i \in D_j} \sum_{s=1}^{|V|} N(w_s, d_i)\, P(c_j \mid d_i)} \qquad (4)$$ where $N(w_t, d_i)$ is the number of occurrences of $w_t$ in the document $d_i$ and $|V|$ is the number of phrases in the vocabulary $V$ [17, 18, 12]. The parameters $P(c_j)$ can be calculated by estimating the number of elements in each of the layers of the merged context graph. While useful when the layers do not contain excessive numbers of nodes, as previously stated practical limitations sometimes prevent the storage of all documents in ....
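The smoothed estimate in equation (4) can be illustrated with a short sketch. It assumes documents are given as token lists and that $P(c_j \mid d_i)$ is supplied as one weight per document; the function name `word_probabilities` and the toy data are hypothetical and do not come from the citing paper.

```python
from collections import Counter

def word_probabilities(docs, posteriors, vocabulary):
    """Estimate P(w_t | c_j) for one class c_j with add-one smoothing.

    docs       -- documents in D_j, each a list of tokens
    posteriors -- P(c_j | d_i) for each document, in the same order
    vocabulary -- the phrases w_1 .. w_|V|
    """
    weighted = Counter()
    for tokens, p in zip(docs, posteriors):
        for word, n in Counter(tokens).items():
            if word in vocabulary:
                weighted[word] += n * p  # N(w_t, d_i) * P(c_j | d_i)
    # Denominator: |V| plus the weighted count of all vocabulary words.
    denom = len(vocabulary) + sum(weighted.values())
    return {w: (1.0 + weighted[w]) / denom for w in vocabulary}

# Toy data (invented for illustration only).
docs = [["web", "page", "web"], ["page", "link"]]
probs = word_probabilities(docs, [1.0, 0.5], ["web", "page", "link", "graph"])
```

Because of the added 1 in the numerator and $|V|$ in the denominator, a phrase that never occurs in the class (here "graph") still receives a nonzero probability, and the estimates sum to one over the vocabulary.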
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabelled documents using EM." To appear in Machine Learning, 1999.
....the category other. To solve this problem we use a modification of the Naive Bayes Classifier for each layer. This classifier architecture provides reasonable performance, high speed, meets the requirement of our system that a likelihood estimate be provided for each classification, and is well studied [17, 18, 12]. Assume that we have a document $d_i$ represented by the vector corresponding to the reduced TFIDF representation relative to the vocabulary $V$. Documents from class $c_j$, defined to correspond to layer $j$, are assumed to have a prior probability of being found on the web which we denote $P(c_j)$. ....
.... for all the words in the documents of class $c_j$: $$P(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D_j} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{d_i \in D_j} \sum_{s=1}^{|V|} N(w_s, d_i)\, P(c_j \mid d_i)} \qquad (4)$$ where $N(w_t, d_i)$ is the number of occurrences of $w_t$ in the document $d_i$ and $|V|$ is the number of phrases in the vocabulary $V$ [17, 18, 12]. The parameters $P(c_j)$ can be calculated by estimating the number of elements in each of the layers of the merged context graph. While useful when the layers do not contain excessive numbers of nodes, as previously stated practical limitations sometimes prevent the storage of all documents in ....
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabelled documents using EM." To appear in Machine Learning, 1999.
.... of identifying characteristics are described in Winkler (1993b). EM methods and ideas for dealing with one major type of nonhomogeneity similar to Winkler (1988, 1989, 1993b) have recently been applied to the general problem of text classification in machine learning and data mining by Nigam et al. (1999). The methods of Winkler are more general because they allow for dependencies of fields and convex constraints on probabilities (either class or marginal) that predispose estimates to subregions of the parameter based on prior knowledge from similar matching situations. 2.1 String Comparators In ....
....use of the EM probabilities is needed because the EM may not exactly divide the set of pairs into two classes that correspond exactly to matches and nonmatches. The difficulty of having EM-determined classes that correspond to true matching classes has been addressed by Winkler (1993b) and by Nigam et al. (1999). The caution may not apply to conventionally estimated parameters because the clerical review can better assure that estimated parameters are consistent with model assumptions. The EM probabilities are estimated using all pairs and often used in matching software that forces 1-1 matching. ....
[Article contains additional citation context not shown here]
Nigam, K., A. K. McCallum, S. Thrun, and T. Mitchell (1999), "Text Classification from Labeled and Unlabelled Documents using EM," Machine Learning, to appear.