G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 10, 29]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection ....
....$p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$, etc. to make the random variable explicit. 2. RELATED WORK Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....1. Introduction Clustering or grouping document collections into conceptually meaningful clusters is a well studied problem. A starting point for applying clustering algorithms to unstructured document collections is to create a vector space model, alternatively known as a bag of words model [16]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately ....
.... however this distance measure is often inappropriate for its application to document clustering [18]. An effective measure of similarity between documents, and one that is often used in information retrieval, is cosine similarity, which uses the cosine of the angle between document vectors [16]. The k-means algorithm can be adapted to use the cosine similarity metric to yield the spherical k-means algorithm, so named because the algorithm operates on vectors that lie on the unit sphere [4]. Since it uses cosine similarity, spherical k-means exploits the sparsity of document vectors and ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
.... but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 10, 29]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection ....
....$p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$, etc. to make the random variable explicit. 2. RELATED WORK Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....ers on the 20 Newsgroups data set and 5000 HTML documents collected from Dmoz Open Directory. 1. Introduction A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection can lead to a dimensionality in thousands. This high dimensionality can be a severe obstacle for classification algorithms based on Support Vector Machines, Linear Discriminant Analysis, k-nearest neighbor, etc. The problem is compounded when the ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....future navigation and search. Document clustering is a widely studied problem and many algorithms have been proposed for this task. A starting point for applying clustering algorithms to document collections is to create a vector space model, alternatively known as a bag of words model [32]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Thus the entire document collection may be treated as a ....
....weights on the edges, we can capture the degree of this association. One possibility is to have edge weights equal term frequencies, i.e. the number of times a word occurs in a document. In fact, most of the term weighting formulae used in information retrieval may be used as edge weights, see [31, 32, 26] for more details. One popular term weighting scheme is to have the edge weight $E_{ij}$ associated with the edge $\{w_i, d_j\}$ be $E_{ij} = t_{ij} \log\left(\frac{|D|}{|D_i|}\right)$, where $t_{ij}$ is the number of times word $w_i$ occurs in document $d_j$, $|D| = n$ is the total number of documents and $|D_i|$ is the ....
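The term weighting scheme quoted above, $E_{ij} = t_{ij} \log(|D| / |D_i|)$, can be illustrated with a minimal Python sketch. The snippet is cut off before $|D_i|$ is defined, so the sketch assumes the usual reading: the number of documents that contain word $w_i$. The function name `edge_weights`, the nested-dictionary input layout, and the toy counts are hypothetical choices made for illustration, not taken from the citing paper.

```python
import math


def edge_weights(term_counts, num_docs):
    """Compute E_ij = t_ij * log(|D| / |D_i|) for every (word, document) edge.

    term_counts maps word -> {doc_id: t_ij}, listing only documents in which
    the word occurs at least once (an assumed input layout).
    """
    weights = {}
    for word, per_doc in term_counts.items():
        doc_freq = len(per_doc)  # |D_i|, assumed: documents containing the word
        idf = math.log(num_docs / doc_freq)
        for doc_id, t_ij in per_doc.items():
            weights[(word, doc_id)] = t_ij * idf
    return weights


# Toy collection of three documents (hypothetical counts).
counts = {
    "cluster": {0: 2, 1: 1},
    "graph": {0: 3},
    "the": {0: 5, 1: 4, 2: 6},
}
E = edge_weights(counts, num_docs=3)
```

Under this reading, a word that occurs in every document (here "the") gets weight zero on all of its edges, while a word confined to few documents is weighted up.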
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....1. Introduction Clustering or grouping document collections into conceptually meaningful clusters is a well-studied problem. A starting point for applying clustering algorithms to unstructured document collections is to create a vector space model, alternatively known as a bag of words model [17]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately ....
.... this distance measure is often inappropriate for its application to clustering a collection of documents [21]. An effective measure of similarity between documents, and one that is often used in information retrieval, is cosine similarity, which uses the cosine of the angle between document vectors [17]. The k-means algorithm can be adapted to use the cosine similarity metric, see [16], to yield the spherical k-means algorithm, so named because the algorithm operates on vectors that lie on the unit sphere [4]. Since it uses cosine similarity, spherical k-means exploits the sparsity of document ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 30, 31]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection can lead to a dimensionality in thousands, for example, one of our test data sets contains 5,000 web pages from www.dmoz.org and has a dimensionality (vocabulary size) of 14,538. This high dimensionality can be a severe obstacle for classification ....
....by $p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$ to make the random variable explicit. 2 Related Work Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
.... stopwords such as "a", "and", "the", etc. For sample lists of stopwords, see [5, Chapter 7]. 3. For each document, count the number of occurrences of each word. 4. Using heuristic or information theoretic criteria, eliminate non-content-bearing high-frequency and low-frequency words [8]. 5. After the above elimination, suppose w unique words remain. Assign a unique identifier between 1 and w to each remaining word, and a unique identifier between 1 and d to each document. The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as New ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
.... assigning class labels to the data and has been widely studied in statistical pattern recognition and machine learning [DH73, Mit97]. A starting point for applying clustering algorithms to unstructured text data is to create a vector space model, alternatively known as a bag of words model [SM83]. The basic idea is (a) to extract unique content-bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text ....
....lists of stopwords, see [FBY92, Chapter 7]. 3. For each document, count the number of occurrences of each word. 4. Using heuristic or information theoretic criteria, eliminate non-content-bearing high-frequency and low-frequency words [SM83]. 5. After the above elimination, suppose w unique words remain. Assign a unique identifier between 1 and w to each remaining word, and a unique identifier between 1 and d to each document. The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as ....
[Article contains additional citation context not shown here]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....future navigation and search. Document clustering is a widely studied problem and many algorithms have been proposed for this task. A starting point for applying clustering algorithms to document collections is to create a vector space model, alternatively known as a bag of words model [32]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Thus the entire document collection may be treated as a ....
....weights on the edges, we can capture the degree of this association. One possibility is to have edge weights equal term frequencies, i.e. the number of times a word occurs in a document. In fact, most of the term weighting formulae used in information retrieval may be used as edge weights, see [31, 32, 26] for more details. One popular term weighting scheme is to have the edge weight $E_{ij}$ associated with the edge $\{w_i, d_j\}$ be $E_{ij} = t_{ij} \log\left(\frac{|D|}{|D_i|}\right)$, where $t_{ij}$ is the number of times word $w_i$ occurs in document $d_j$, $|D| = n$ is the total number of documents and $|D_i|$ is the ....
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
.... 1992, Hearst and Pedersen, 1996, Sahami et al. 1999, Schutze and Silverstein, 1997, Silverstein and Pedersen, 1997, Vaithyanathan and Dom, 1999, Zamir and Etzioni, 1998] A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data [Salton and McGill, 1983]. The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately sized ....
....$L_2$ norm, that is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents [Singhal et al. 1996]. It is natural to measure similarity between such vectors by their inner product, known as cosine similarity [Salton and McGill, 1983]. In this paper, we will use a variant of the well known Euclidean k-means algorithm [Duda and Hart, 1973, Hartigan, 1975] that uses cosine similarity [Rasmussen, 1992]. We shall show that this algorithm partitions the high dimensional unit sphere using a collection of great hypercircles, ....
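As a small illustration of the passage above, the following Python sketch normalizes document vectors to unit $L_2$ norm and takes their inner product, which is then the cosine of the angle between them. The helper names `l2_normalize` and `cosine_similarity` and the three toy vectors are hypothetical; the sketch shows only the similarity measure, not the clustering algorithm of the citing paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(u, v):
    """Inner product of the unit-normalized vectors, i.e. the cosine of their angle."""
    u_hat, v_hat = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))


# Two documents with the same word proportions but different lengths,
# and a third that shares no words with them (hypothetical counts).
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]
other_doc = [0.0, 0.0, 3.0]
```

Here `short_doc` and `long_doc` point in the same direction, so their similarity is 1 up to rounding despite the tenfold difference in length, which is the length-mitigating effect the passage describes; `other_doc` is orthogonal to both and scores 0.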
[Article contains additional citation context not shown here]
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
.... The data may naturally occur in this form, or vector space models of the underlying data may be constructed, e.g. voice, images or text documents may be treated as vectors in a multidimensional feature space, see [Fanty and Cole, 1991], [Alimoglu and Alpaydin, 1996], [Flickner et al. 1995] and [Salton and McGill, 1983]. We will also not worry about how the data is classified: the classification may be done manually as in the Yahoo hierarchy, or may be obtained by a clustering method such as the k-means or vector quantization algorithms [Duda and Hart, 1973, Hartigan, 1975, Gray and Neuhoff, 1998]. We assume ....
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
.... 1992; Hearst and Pedersen, 1996; Sahami et al. 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGill, 1983). The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text data ....
....$L_2$ norm, that is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents (Singhal et al. 1996). It is natural to measure similarity between such vectors by their inner product, known as cosine similarity (Salton and McGill, 1983). In this paper, we will use a variant of the well known Euclidean k-means algorithm (Duda and Hart, 1973; Hartigan, 1975) that uses cosine similarity (Rasmussen, 1992). We shall show that this algorithm partitions the high-dimensional unit sphere using a collection of great hypercircles, and ....
[Article contains additional citation context not shown here]
Salton, G. and M. J. McGill: 1983, Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
....learning and statistical algorithms such as clustering, classification, principal component analysis, and discriminant analysis to text data sets is of great practical interest. A starting point for applying such algorithms to unstructured text data is to create a vector space model for text data [Salton and McGill, 1983]. The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately sized ....
....is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents [Singhal et al. 1996]. It is natural to measure similarity between such vectors by the inner product known as cosine similarity between them [Salton and McGill, 1983]. In this paper, we will use a variant of the well known Euclidean k-means algorithm [Duda and Hart, 1973, Hartigan, 1975] that uses cosine similarity [Rasmussen, 1992]. We shall show that this algorithm partitions the high dimensional unit sphere using a collection of great hypercircles, and ....
[Article contains additional citation context not shown here]
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
....the document, the out-links originating at the document, and the in-links terminating at the document, respectively. We now show how to compute these triplets for each document in Q. Words: The creation of the first component D is a standard exercise in text mining or information retrieval, see [21]. The basic idea is to construct a word dictionary of all the words that appear in any of the documents in Q, and to prune or eliminate function words from this dictionary that do not help in semantically discriminating one cluster from another. For the present application, we eliminated those ....
....numbers such that $\alpha_d + \alpha_f + \alpha_b = 1$. Observe that for any two document triplets $x$ and $\tilde{x}$, $0 \le S(x, \tilde{x}) \le 1$. Also, observe that if we set $\alpha_d = 1$, $\alpha_f = 0$, and $\alpha_b = 0$, then we get the classical cosine similarity between document vectors that has been widely used in information retrieval [21]. The parameters $\alpha_d$, $\alpha_f$, and $\alpha_b$ are tunable in our algorithm to assign different weights to words, out-links, and in-links as desired. We will later discuss, in detail, the appropriate choice of these parameters. Concept Triplets: Suppose we are given n document vector triplets $x_1, x_2,$ ....
Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....Euclidean space $R^d$, and that proximity in $R^d$ implies similarity. The data may naturally occur in this form, or vector space models of the underlying data may be constructed, for example, voice, images or text documents may be treated as vectors in a multidimensional feature space, see [2, 3, 4, 5]. We will also not worry about how the data is classified: the classification may be done manually as in the Yahoo hierarchy, or may be obtained by clustering methods such as the k-means or vector quantization algorithms [6, 7, 8]. We assume that we know the representation of all the data ....
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
....classes. We will assume that the data is embedded in high-dimensional Euclidean space $R^d$. The data may naturally occur in this form, or vector space models of the underlying data may be formed, e.g. text documents or images may be treated as vectors in a multidimensional feature space, see [Salton and McGill, 1983] and [Flickner et al. 1995]. We will also not worry about how the data is classified: the classification may be done manually as in the Yahoo hierarchy, or may be obtained by a clustering method such as the k-means algorithm [Duda and Hart, 1973, Hartigan, 1975]. We assume that we know the ....
....text dataset, which we first describe. The raw data is in the form of 1412 book reviews, each of which is a short text document containing about 25 to 100 words. For our treatment, we transform this data into a vector space model, where each document is expressed as a numerical vector (of words) [Salton and McGill, 1983]. This is accomplished by the following steps: 1. convert all words to lower case; 2. for each document, count the number of occurrences of each word; 3. discard (a) common words, such as "and", "the", (b) uncommon words, such as "sioux", and (c) one- and two-letter words, such as "be" and "to"; 4. ....
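The first three preprocessing steps listed above (lower-casing, per-document word counts, and discarding common, uncommon, and very short words) can be sketched in Python as follows. The snippet is cut off at step 4, so the sketch stops at step 3. The stopword set `COMMON_WORDS`, the `min_doc_freq` threshold used to decide what counts as "uncommon", the regular-expression tokenizer, and the three sample reviews are all assumptions made for illustration, not details from the citing paper.

```python
import re
from collections import Counter

# Assumed stopword list; the source names only "and" and "the" as examples.
COMMON_WORDS = {"and", "the"}


def preprocess(documents, min_doc_freq=2):
    """Return (vocabulary, count_vectors) following steps 1 to 3 of the snippet.

    A word is treated as "uncommon" when it occurs in fewer than min_doc_freq
    documents; this threshold is a hypothetical criterion.
    """
    # Steps 1 and 2: lower-case each document and count word occurrences.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for c in counts for word in c)

    # Step 3: drop common words, uncommon words, and one- and two-letter words.
    def keep(word):
        return (
            word not in COMMON_WORDS
            and len(word) > 2
            and doc_freq[word] >= min_doc_freq
        )

    vocabulary = sorted(w for w in doc_freq if keep(w))
    vectors = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical sample reviews.
reviews = [
    "The plot and the prose are superb",
    "Superb prose, to be sure",
    "A sioux legend with a weak plot",
]
vocab, vecs = preprocess(reviews)
```

With these sample reviews only "plot", "prose", and "superb" survive the filters, and each review becomes a three-component count vector over that vocabulary.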
[Article contains additional citation context not shown here]
Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
No context found.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
No context found.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
No context found.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
No context found.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
No context found.
Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.