22 citations found.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


This paper is cited in the following contexts:
Enhanced Word Clustering for Hierarchical Text Classification - Dhillon, Mallela, Kumar   (4 citations)  (Correct)

....but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 10, 29]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection ....

....$p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$ etc. to make the random variable explicit. 2. RELATED WORK Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Iterative Clustering of High Dimensional Text Data.. - Dhillon, Guan, Kogan (2002)   (Correct)

....1. Introduction Clustering or grouping document collections into conceptually meaningful clusters is a well studied problem. A starting point for applying clustering algorithms to unstructured document collections is to create a vector space model, alternatively known as a bag of words model [16]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately ....

.... however this distance measure is often inappropriate for its application to document clustering [18]. An effective measure of similarity between documents, and one that is often used in information retrieval, is cosine similarity, which uses the cosine of the angle between document vectors [16]. The k-means algorithm can be adapted to use the cosine similarity metric to yield the spherical k-means algorithm, so named because the algorithm operates on vectors that lie on the unit sphere [4]. Since it uses cosine similarity, spherical k-means exploits the sparsity of document vectors and ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
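
The bag of words representation and the cosine measure described in the excerpts above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw word counts as weights; the function names `bag_of_words` and `cosine_similarity` are hypothetical and are not taken from any of the cited papers.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse word-count vector (word -> frequency)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only words present in both documents contribute to the dot product,
    # which is why sparse document vectors keep this cheap.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


doc_a = bag_of_words("clustering of text documents")
doc_b = bag_of_words("clustering of web documents")
doc_c = bag_of_words("great hypercircles")
```

Here `doc_a` and `doc_b` share three of their four words, so their similarity is high, while `doc_a` and `doc_c` share none and score zero.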


Enhanced Word Clustering for Hierarchical Text Classification - Dhillon, Mallela, Kumar (2002)   (4 citations)  (Correct)

.... but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 10, 29]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection ....

....$p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$ etc. to make the random variable explicit. 2. RELATED WORK Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Information Theoretic Feature Clustering for Text.. - Dhillon, Mallela, Kumar   (4 citations)  (Correct)

....ers on the 20 Newsgroups data set and 5000 HTML documents collected from Dmoz Open Directory. 1. Introduction A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection can lead to a dimensionality in thousands. This high dimensionality can be a severe obstacle for classification algorithms based on Support Vector Machines, Linear Discriminant Analysis, k-nearest neighbor etc. The problem is compounded when the ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Co-clustering documents and words using Bipartite Spectral Graph.. - Dhillon (2001)   (30 citations)  (Correct)

....future navigation and search. Document clustering is a widely studied problem and many algorithms have been proposed for this task. A starting point for applying clustering algorithms to document collections is to create a vector space model, alternatively known as a bag of words model [32]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Thus the entire document collection may be treated as a ....

....weights on the edges, we can capture the degree of this association. One possibility is to have edge weights equal term frequencies, i.e. the number of times a word occurs in a document. In fact, most of the term weighting formulae used in information retrieval may be used as edge weights, see [31, 32, 26] for more details. One popular term weighting scheme is to have the edge weight $E_{ij}$ associated with the edge $\{w_i, d_j\}$ be $E_{ij} = t_{ij} \log(|D| / |D_i|)$, where $t_{ij}$ is the number of times word $w_i$ occurs in document $d_j$, $|D| = n$ is the total number of documents and $|D_i|$ is the ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
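
The term weighting scheme quoted above (term frequency times the logarithm of the total number of documents over a per-word document count) might be computed as in the following sketch. The excerpt is cut off before the per-word count is defined, so the code assumes the usual inverse-document-frequency reading, namely that it is the number of documents containing the word. The function name `edge_weights` and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def edge_weights(documents):
    """Return {(word, doc_index): weight} with weight = t_ij * log(|D| / |D_i|)."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(counts)
    # Assumption: |D_i| is the number of documents that contain word w_i.
    doc_freq = Counter(word for c in counts for word in c)
    return {
        (word, j): t_ij * math.log(n_docs / doc_freq[word])
        for j, c in enumerate(counts)
        for word, t_ij in c.items()
    }


weights = edge_weights(["sparse text data", "text clustering", "sparse sparse graph"])
```

Under this reading, a word that occurs in every document gets weight zero on all of its edges, and a word concentrated in few documents is weighted up.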


Refining Clusters in High Dimensional Text Data - Dhillon, Guan, Kogan (2002)   (Correct)

....1. Introduction Clustering or grouping document collections into conceptually meaningful clusters is a well-studied problem. A starting point for applying clustering algorithms to unstructured document collections is to create a vector space model, alternatively known as a bag of words model [17]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately ....

.... this distance measure is often inappropriate for its application to clustering a collection of documents [21]. An effective measure of similarity between documents, and one that is often used in information retrieval, is cosine similarity, which uses the cosine of the angle between document vectors [17]. The k-means algorithm can be adapted to use the cosine similarity metric, see [16], to yield the spherical k-means algorithm, so named because the algorithm operates on vectors that lie on the unit sphere [4]. Since it uses cosine similarity, spherical k-means exploits the sparsity of document ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Enhanced Word Clustering for Hierarchical Text Classification - Dhillon, Mallela, Kumar (2002)   (4 citations)  (Correct)

....but effective Naive Bayes algorithm to the more computationally demanding Support Vector Machines [24, 30, 31]. A common, and often overwhelming, characteristic of text data is its extremely high dimensionality. Typically the document vectors are formed using a vector space or bag of words model [26]. Even a moderately sized document collection can lead to a dimensionality in thousands; for example, one of our test data sets contains 5,000 web pages from www.dmoz.org and has a dimensionality (vocabulary size) of 14,538. This high dimensionality can be a severe obstacle for classification ....

....by $p$, $q$, $p_1$, $p_2$, etc. when the random variable is obvious or by $p(X)$, $p(C \mid w_t)$ to make the random variable explicit. 2 Related Work Text classification has been extensively studied, especially since the emergence of the internet. Most algorithms are based on the bag of words model for text [26]. A simple but effective algorithm is the Naive Bayes method [24]. For text classification, different variants of Naive Bayes have been used, but McCallum and Nigam [21] showed that the variant based on the multinomial model leads to better results. For hierarchical text data, such as the topic ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


CS 395T Large-Scale Data Mining Fall 2001 - Lecture Lecturer Inderjit (2001)   (Correct)

.... stopwords such as "a", "and", "the", etc. For sample lists of stopwords, see [5, Chapter 7]. 3. For each document, count the number of occurrences of each word. 4. Using heuristic or information theoretic criteria, eliminate non-content-bearing high frequency and low frequency words [8]. 5. After the above elimination, suppose w unique words remain. Assign a unique identifier between 1 and w to each remaining word, and a unique identifier between 1 and d to each document. The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as New ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
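
The preprocessing steps listed in the excerpt above could be sketched as follows. Several details are assumptions made for illustration only: the three-word stopword set, whitespace tokenization, and the use of document-frequency thresholds (`min_df`, `max_df_ratio`) as the "heuristic criteria" for pruning. The excerpt does not say which criteria the lecture notes actually use.

```python
from collections import Counter

STOPWORDS = {"a", "and", "the"}  # tiny illustrative set, not a real stopword list


def preprocess(documents, min_df=2, max_df_ratio=0.9):
    """Return (word_ids, doc_ids, counts) following the outlined steps."""
    # Drop stopwords, then count word occurrences per document (step 3).
    counts = [
        Counter(w for w in doc.lower().split() if w not in STOPWORDS)
        for doc in documents
    ]
    # Step 4, using a simple document-frequency heuristic (an assumption).
    doc_freq = Counter(w for c in counts for w in c)
    max_df = max_df_ratio * len(documents)
    kept = sorted(w for w, df in doc_freq.items() if min_df <= df <= max_df)
    # Step 5: identifiers run from 1 to w for words and from 1 to d for documents.
    word_ids = {w: i for i, w in enumerate(kept, start=1)}
    doc_ids = list(range(1, len(documents) + 1))
    pruned = [{w: n for w, n in c.items() if w in word_ids} for c in counts]
    return word_ids, doc_ids, pruned


docs = [
    "the text and the data",
    "text clustering data",
    "text mining",
    "a graph of data",
]
word_ids, doc_ids, counts = preprocess(docs)
```

With these thresholds only words that occur in at least two documents, and in no more than 90% of them, survive as features.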


Efficient Clustering Of Very Large Document Collections - Dhillon, Fan, Guan (2001)   (6 citations)  (Correct)

.... assigning class labels to the data and has been widely studied in statistical pattern recognition and machine learning [DH73, Mit97]. A starting point for applying clustering algorithms to unstructured text data is to create a vector space model, alternatively known as a bag-of-words model [SM83]. The basic idea is (a) to extract unique content-bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text ....

....lists of stopwords, see [FBY92, Chapter 7]. 3. For each document, count the number of occurrences of each word. 4. Using heuristic or information theoretic criteria, eliminate non-content-bearing high frequency and low frequency words [SM83]. 5. After the above elimination, suppose w unique words remain. Assign a unique identifier between 1 and w to each remaining word, and a unique identifier between 1 and d to each document. The above steps outline a simple preprocessing scheme. In addition, one may extract word phrases such as ....

[Article contains additional citation context not shown here]

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Co-clustering documents and words using Bipartite Spectral Graph.. - Dhillon (2001)   (30 citations)  (Correct)

....future navigation and search. Document clustering is a widely studied problem and many algorithms have been proposed for this task. A starting point for applying clustering algorithms to document collections is to create a vector space model, alternatively known as a bag of words model [32]. The basic idea is (a) to extract unique content bearing words from the set of documents treating these words as features and (b) to then represent each document as a vector of certain weighted word frequencies in this feature space. Thus the entire document collection may be treated as a ....

....weights on the edges, we can capture the degree of this association. One possibility is to have edge weights equal term frequencies, i.e. the number of times a word occurs in a document. In fact, most of the term weighting formulae used in information retrieval may be used as edge weights, see [31, 32, 26] for more details. One popular term weighting scheme is to have the edge weight $E_{ij}$ associated with the edge $\{w_i, d_j\}$ be $E_{ij} = t_{ij} \log(|D| / |D_i|)$, where $t_{ij}$ is the number of times word $w_i$ occurs in document $d_j$, $|D| = n$ is the total number of documents and $|D_i|$ is the ....

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Concept Decompositions for Large Sparse Text Data using.. - Dhillon, Modha (2001)   (35 citations)  (Correct)

.... 1992, Hearst and Pedersen, 1996, Sahami et al. 1999, Schutze and Silverstein, 1997, Silverstein and Pedersen, 1997, Vaithyanathan and Dom, 1999, Zamir and Etzioni, 1998]. A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data [Salton and McGill, 1983]. The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately sized ....

....$L_2$ norm, that is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents [Singhal et al. 1996]. It is natural to measure similarity between such vectors by their inner product, known as cosine similarity [Salton and McGill, 1983]. In this paper, we will use a variant of the well known Euclidean k-means algorithm [Duda and Hart, 1973, Hartigan, 1975] that uses cosine similarity [Rasmussen, 1992]. We shall show that this algorithm partitions the high dimensional unit sphere using a collection of great hypercircles, ....

[Article contains additional citation context not shown here]

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.
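
The k-means variant described in the excerpt above (documents normalized to unit length, similarity measured by inner product) can be sketched roughly as below. This is an illustrative outline rather than the algorithm as published: it uses dense Python lists, seeds the centroids with the first k points, and runs a fixed number of iterations. All three choices are assumptions, and the name `spherical_kmeans` is hypothetical.

```python
import math


def normalize(v):
    """Scale a dense vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity; returns (labels, centroids)."""
    points = [normalize(v) for v in vectors]
    # Assumption: seed with the first k points; real implementations usually randomize.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: on the unit sphere the inner product equals the cosine.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update: sum the members of each cluster and project back onto the sphere.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


labels, centroids = spherical_kmeans([[1, 0], [0, 1], [2, 0.1], [0.1, 3]], k=2)
```

Because every centroid is renormalized after the update step, both the points and the centroids stay on the unit sphere throughout.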


Class Visualization of High-Dimensional Data with.. - Dhillon, Modha, Spangler (1999)   (1 citation)  (Correct)

.... The data may naturally occur in this form, or vector space models of the underlying data may be constructed, e.g. voice, images or text documents may be treated as vectors in a multidimensional feature space, see [Fanty and Cole, 1991], [Alimoglu and Alpaydin, 1996], [Flickner et al. 1995] and [Salton and McGill, 1983]. We will also not worry about how the data is classified; the classification may be done manually as in the Yahoo hierarchy, or may be obtained by a clustering method such as the k-means or vector quantization algorithms [Duda and Hart, 1973, Hartigan, 1975, Gray and Neuhoff, 1998]. We assume ....

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.


Concept Decompositions for Large Sparse Text Data using.. - Dhillon, Modha (2000)   (35 citations)  (Correct)

.... 1992; Hearst and Pedersen, 1996; Sahami et al. 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGill, 1983). The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Observe that we may regard the vector space model of a text data ....

....$L_2$ norm, that is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents (Singhal et al. 1996). It is natural to measure similarity between such vectors by their inner product, known as cosine similarity (Salton and McGill, 1983). In this paper, we will use a variant of the well known Euclidean k-means algorithm (Duda and Hart, 1973; Hartigan, 1975) that uses cosine similarity (Rasmussen, 1992). We shall show that this algorithm partitions the high-dimensional unit sphere using a collection of great hypercircles, and ....

[Article contains additional citation context not shown here]

Salton, G. and M. J. McGill: 1983, Introduction to Modern Information Retrieval. McGraw-Hill Book Company.


Concept Decompositions for Large Sparse Text Data using.. - Dhillon, Modha (1999)   (35 citations)  (Correct)

....learning and statistical algorithms such as clustering, classification, principal component analysis, and discriminant analysis to text data sets is of great practical interest. A starting point for applying such algorithms to unstructured text data is to create a vector space model for text data [Salton and McGill, 1983]. The basic idea is (a) to extract unique content bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word frequencies in this feature space. Typically, a large number of words exist in even a moderately sized ....

....is, they can be thought of as points on a high dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents [Singhal et al. 1996]. It is natural to measure similarity between such vectors by the inner product known as cosine similarity between them [Salton and McGill, 1983]. In this paper, we will use a variant of the well known Euclidean k-means algorithm [Duda and Hart, 1973, Hartigan, 1975] that uses cosine similarity [Rasmussen, 1992]. We shall show that this algorithm partitions the high dimensional unit sphere using a collection of great hypercircles, and ....

[Article contains additional citation context not shown here]

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.


Clustering Hypertext With Applications To Web Searching - Modha, Spangler (2000)   (10 citations)  (Correct)

....the document, the out-links originating at the document, and the in-links terminating at the document, respectively. We now show how to compute these triplets for each document in Q. Words. The creation of the first component D is a standard exercise in text mining or information retrieval, see [21]. The basic idea is to construct a word dictionary of all the words that appear in any of the documents in Q, and to prune or eliminate function words from this dictionary that do not help in semantically discriminating one cluster from another. For the present application, we eliminated those ....

....numbers such that $\alpha_d + \alpha_f + \alpha_b = 1$. Observe that for any two document triplets $x$ and $\tilde{x}$, $0 \le S(x, \tilde{x}) \le 1$. Also, observe that if we set $\alpha_d = 1$, $\alpha_f = 0$, and $\alpha_b = 0$, then we get the classical cosine similarity between document vectors that has been widely used in information retrieval [21]. The parameters $\alpha_d$, $\alpha_f$, and $\alpha_b$ are tunable in our algorithm to assign different weights to words, out-links, and in-links as desired. We will later discuss, in detail, the appropriate choice of these parameters. Concept Triplets. Suppose we are given n document vector triplets $x_1, x_2,$ ....

Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
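
The excerpt above does not show the definition of the similarity S itself, only that its three weights are non-negative, sum to 1, and reduce S to the classical cosine similarity when all weight is placed on words. One form consistent with those statements, assumed here purely for illustration, is a weighted sum of the cosine similarities of the word, out-link and in-link components. The names `unit`, `triplet_similarity`, `a_d`, `a_f` and `a_b` are hypothetical.

```python
import math


def unit(v):
    """Scale a dense vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def triplet_similarity(x, y, a_d=1.0, a_f=0.0, a_b=0.0):
    """Assumed form: convex combination of per-component cosine similarities.

    x and y are (words, out_links, in_links) triplets of dense vectors.
    """
    if min(a_d, a_f, a_b) < 0 or abs(a_d + a_f + a_b - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    cosines = [
        sum(p * q for p, q in zip(unit(u), unit(v)))
        for u, v in zip(x, y)
    ]
    return a_d * cosines[0] + a_f * cosines[1] + a_b * cosines[2]


x = ([1, 1], [1, 0], [0, 1])
y = ([1, 0], [1, 0], [1, 0])
```

With the default weights the function returns the plain cosine similarity of the word vectors; shifting weight to the link components changes which documents count as similar.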


Class Visualization of High-Dimensional Data with.. - Dhillon, Modha, Spangler (1999)   (1 citation)  (Correct)

....Euclidean space $\mathbb{R}^d$, and that proximity in $\mathbb{R}^d$ implies similarity. The data may naturally occur in this form, or vector space models of the underlying data may be constructed, for example, voice, images or text documents may be treated as vectors in a multidimensional feature space, see [2, 3, 4, 5]. We will also not worry about how the data is classified; the classification may be done manually as in the Yahoo hierarchy, or may be obtained by clustering methods such as the k-means or vector quantization algorithms [6, 7, 8]. We assume that we know the representation of all the data ....

G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Visualizing Class Structure of Multidimensional Data - Dhillon, Modha, Spangler (1998)   (5 citations)  (Correct)

....classes. We will assume that the data is embedded in high-dimensional Euclidean space $\mathbb{R}^d$. The data may naturally occur in this form, or vector space models of the underlying data may be formed, e.g. text documents or images may be treated as vectors in a multidimensional feature space, see [Salton and McGill, 1983] and [Flickner et al. 1995]. We will also not worry about how the data is classified; the classification may be done manually as in the Yahoo hierarchy, or may be obtained by a clustering method such as the k-means algorithm [Duda and Hart, 1973, Hartigan, 1975]. We assume that we know the ....

....text dataset, which we first describe. The raw data is in the form of 1412 book reviews, each of which is a short text document containing about 25-100 words. For our treatment, we transform this data into a vector space model, where each document is expressed as a numerical vector (of words) [Salton and McGill, 1983]. This is accomplished by the following steps: 1. convert all words to lower case; 2. for each document, count the number of occurrences of each word; 3. discard (a) common words, such as "and", "the", (b) uncommon words, such as "sioux", and (c) one- and two-letter words, such as "be" and "to"; 4. ....

[Article contains additional citation context not shown here]

Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill Book Company.


Journal of Machine Learning Research 3 (2003) 1265-1287.. - Algorithm For Text   (Correct)

No context found.

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


A Divisive Information-Theoretic Feature Clustering.. - Dhillon, Mallela, Kumar (2003)   (5 citations)  (Correct)

No context found.

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Class Visualization of High-Dimensional Data with.. - Dhillon, Modha, Spangler (1999)   (1 citation)  (Correct)

No context found.

G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


Iterative Clustering of High Dimensional Text Data.. - Dhillon, Guan, Kogan (2002)   (Correct)

No context found.

G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


ChatTrack: Chat Room Topic Detection Using Classification - Bengel, Gauch, Mittur.. (2004)   (2 citations)  (Correct)

No context found.

Salton, G., and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.


CiteSeer.IST - Copyright Penn State and NEC