| Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th Annual International ACM/SIGIR Conference, pages 76--84, Zurich, Switzerland, 1996. |
....on the basis of similarities rather than on a predefined set of categories. Several clustering-based projects are currently underway, but since most barely touch on the issue of cataloging, which is the crux of this article, we will cite only a few: Zamir and Etzioni [18, 19], Hearst and Pedersen [8], and Sahami et al. [14]. Classification is a third means of organizing documents into groups. It applies statistical techniques to documents for which a category has been defined by other means. The system learns the behavior of the documents with respect to the defined categories, and enables ....
Hearst, M., Pedersen, J., Reexamining the cluster hypothesis: Scatter/Gather on retrieval results, Proceedings of 19th annual international ACM/SIGIR conference (Zurich, Switzerland, August 1996), ACM Press (1996) 76-84.
....results [19]. Note that even if only the top ten keywords are used for the clustering and document representation, we might still display more keywords on the screen to assist the user in his or her search. 2.3. Document Clustering. Post-retrieval document clustering has been well studied, e.g., [9, 1, 15, 10, 31]. We deploy a variant of the Buckshot algorithm [9]. Each cluster contains a certain number of document vectors and is represented by their normalized arithmetic mean, the so-called centroid vector. In the first phase, hierarchical clustering with complete linkage operates on the best-ranked 150 ....
M Hearst and J Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In SIGIR, 1996.
.... in the collection, but this has never actually been proved [125]. The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [118, Ch.3] or, alternatively, relevant documents tend to be more similar to each other than to non-relevant documents [52]. The Scatter/Gather system [32] has shown that clustering can be employed to support browsing. It partitions a document collection into clusters, and then presents a list of these to the user, with representative terms and document titles from each. The user can select any number of the ....
....with a new list. Pirolli and his colleagues [90] found that this system helped its users to get an idea of what the collection contained, although when searching, querying was more effective than pure directed browsing as a means of locating relevant documents. A later study by Hearst and Pedersen [52] found that, in theory, clustering the results of a query (rather than the whole collection) using Scatter/Gather should be helpful for locating the relevant documents, as most of them are usually placed in the same cluster, in accordance with the cluster hypothesis. Kural, Robertson, and Jones ....
[Article contains additional citation context not shown here]
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR'96, pages 76--84. ACM, 1996.
....and viewing of retrieval results [6], to accelerate nearest neighbor search [1] and to generate Yahoo-like hierarchies [12]. Common characteristics of document clustering include: there is a large number of documents to be clustered; the number of output clusters may be large; each document has a large number of features; ....
....unintuitive results. Suppose the answer key consists of 20 equally sized classes with 1000 elements in each. Treating each element as its own cluster gets a misleadingly high score of 95%. The evaluation of document clustering algorithms in information retrieval often uses the embedded approach [6]. Suppose we cluster the documents returned by a search engine. Assuming the user is able to pick the most relevant cluster, the performance of the clustering algorithm can be measured by the average precision of the chosen cluster. Under this scheme, only the best cluster matters. The entropy ....
Hearst, M. A. and Pedersen, J. O. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR-96. pp. 76-84. Zurich, Switzerland.
.... the precision and recall of information retrieval systems [14]. Because clustering is often too slow for large corpora and has indifferent performance [7], document clustering has been used more recently in document browsing [3], to improve the organization and viewing of retrieval results [5], to accelerate nearest neighbor search [1] and to generate Yahoo-like hierarchies [10]. In this paper, we propose a clustering algorithm, CBC (Clustering By Committee), which produces higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. ....
....unintuitive results. Suppose the answer key consists of 20 equally sized classes with 1000 elements in each. Treating each element as its own cluster gets a misleadingly high score of 95%. The evaluation of document clustering algorithms in information retrieval often uses the embedded approach [5]. Suppose we cluster the documents returned by a search engine. Assuming the user is able to pick the most relevant cluster, the performance of the clustering algorithm can be measured by the average precision of the chosen cluster. Under this scheme, only the best cluster matters. The entropy ....
Hearst, M. A. and Pedersen, J. O. 1996. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR-96. pp. 76-84. Zurich, Switzerland.
....same query and user) than documents found outside of the cluster. Experimental systems have been developed which used document clustering to expand retrieval results to other relevant documents [42, 43, 44]. Evaluation and user testing have supported the conclusions of the cluster hypothesis [45]. Based on these results, patterns that co-occur in documents within the cluster may be more meaningful (i.e. related to the content of the cluster) than those patterns derived from the entire collection. Further, partitioning the document set into smaller groups, we can more efficiently search for ....
M.A. Hearst and J.O. Pedersen, "Reexamining the cluster hypothesis: Scatter/gather on retrieval results," in Proceedings of the 19th Annual International SIGIR Conference, Zurich, Switzerland, 1996.
....results clustering. However, their computational complexity, difficult tuning of parameters and sensitivity to malicious input data soon raised the need of improvements. The first proposal for on-line and dynamic search results clustering was perhaps presented in the Scatter/Gather system [5], where the non-hierarchical, partitioning Fractionation algorithm was used. Undesired and troublesome high dimensionality of term frequency vectors was addressed in [6], where two derivations of graph partitioning were presented. Simple terms were replaced with lexical affinities (pairs of ....
Hearst, M.A., Pedersen, J.O.: Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In: Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, Zürich, CH (1996) 76-84
....Both search engines put search results into folders, each of which represents a subtopic. In document clustering, there are in general two approaches. In the first, documents are categorized based on individual document attributes. An attribute might be the query term's frequency in each document [14, 29]. NorthernLight is an example of this approach. The retrieved documents are organized based on the size, source, topic or author of each document. Other examples include Envision [11] and GRIDL [26]. In the second approach, documents are classified based on inter-document similarities. This ....
Hearst, M. A. and Pedersen, J. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 76-84 (1996).
....only. From the experimental results on several cases, this is an effective way to represent the large number of webpages for a user query. Document clustering has been studied by many people (see [18] and references there). Clustering retrieval results has been examined by Hearst and Pedersen [10] based on textual information only, with emphasis on summarization. More recently this approach is taken in the Grouper web interface [19]. Exploring web link structures in the information retrieval context to identify topical themes is examined by Larson [13] and Pirolli et al. [15]. Most recently, Text ....
M. A. Hearst and J. O. Pedersen. Re-examining the cluster hypothesis: Scatter/gather on retrieval results. Proc. SIGIR'96, 1996.
....manipulation functionality has been the issue of many researchers before the advent of Web-based information retrieval modules. Based on the cluster hypothesis of van Rijsbergen [26], Bead [3] and Lyberworld [5] are two systems that provide extended functionality to search results. Scatter/Gather [4] goes one step further by observing that the same sets of documents behave differently in different contexts. The idea behind this system is to cluster retrieval results in k clusters, scatter the documents inside and then partition the document set into another k clusters. Some other systems ....
Hearst, M.A., Pedersen, J.O.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. in Proceedings of SIGIR '96 (Zurich, August 1996), 76-84
....data mining. The first grouping systems were based on applications of well known algorithms like Hierarchical Agglomerative Clustering (HAC), K-means and various other grouping techniques. One of the pioneer implementations of search results clustering was an application of the Scatter/Gather system [Hearst and Pedersen, 96] to this problem. [Post-retrieval clustering system requirements, source: Zamir and Etzioni, 99] The progress which has been made since then is in the realization that new classes of algorithms are needed to fulfill the post-retrieval document clustering system requirements, given in [Zamir ....
Hearst M. A., Pedersen J. O.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, In Proceedings of the Nineteenth Annual International ACM SIGIR Conference, Zurich, June 1996. Scatter/gather interface and algorithm explained.
....which match almost perfectly the existing topics of the corpus. 1. MOTIVATION Unsupervised document clustering is a central problem in information retrieval. Possible applications include the use of clustering for improving retrieval [19] and for navigating and browsing large document collections [3, 6, 20]. Several recent works suggest using clustering techniques for unsupervised document classification [15, 5, 17]. In this task, we are given a collection of unlabeled documents and requested to find clusters that are highly correlated with the true topics of the documents. This practical situation ....
M. A. Hearst and J. O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In ACM SIGIR 96, pages 76-84, 1996.
....on the basis of similarities rather than on a predefined set of categories. Several clustering-based projects are currently underway, but since most barely touch on the issue of cataloging, which is the crux of this article, we will cite only a few: Zamir and Etzioni [18, 19], Hearst and Pedersen [8], and Sahami et al. [14]. Classification is a third means of organizing documents into groups. It applies statistical techniques to documents for which a category has been defined by other means. The system learns the behavior of the documents with respect to the defined categories, and enables ....
Hearst, M., Pedersen, J., Reexamining the cluster hypothesis: Scatter/Gather on retrieval results, Proceedings of 19th annual international ACM/SIGIR conference (Zurich, Switzerland, August 1996), ACM Press, 1996, 76-84.
....approaches have been developed in recent years. Generally these visualizations are designed to present some type of patterns in a document set and they are considered to be browsing interfaces. The format of the presentation varies significantly from system to system. For example, Hearst et al. [13] suggest a clustering system that groups the retrieved documents into five (or another preselected number) clusters and displays them simultaneously as lists of titles. A similar presentation was developed by Leuski and Croft [15]; however, they do not limit the number of clusters and their display ....
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of ACM SIGIR, pages 76-84, August 1996.
.... include agglomerative clustering [40, 39, 29], the partitional k-means algorithm [8], projection-based methods including LSA [2, 33], self-organizing maps [25, 21], and multidimensional scaling [27, 22]. For the computational efficiency required in on-line clustering, hybrid approaches have been considered in [7, 19]. Recently there has been a flurry of activity in document clustering [3, 8, 30, 42]. Graph-theoretic techniques have also been considered for clustering; many earlier hierarchical agglomerative clustering algorithms [10] and some recent work [4, 37] model the similarity between documents by a graph ....
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In ACM SIGIR, pages 76-84, 1996.
....decades. Willett [31] gives an excellent overview of the existing algorithms and applications. The use of clustering is based mostly on the Cluster Hypothesis: closely associated documents tend to be relevant to the same requests [29, p. 45]. Croft [6] and, more recently, Hearst and Pedersen [13] showed that the Cluster Hypothesis also holds in a retrieved set of documents. However, they did not study how the clustering structure may help a user to find relevant information more quickly. In contrast to those studies, Voorhees [30] could not find any conclusive support for the Cluster ....
....may help a user to find relevant information more quickly. In contrast to those studies, Voorhees [30] could not find any conclusive support for the Cluster Hypothesis. Numerous studies and anecdotal evidence hint that document clustering can be a better way of organizing the retrieval results [13, 25, 16]. However, we could not find any strong experimental results that support this assumption. In this paper we describe a set of experiments that show the clustering to be a much more effective way of directing a user towards relevant documents among the retrieved set than the ranked list. 1 31. ....
[Article contains additional citation context not shown here]
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of ACM SIGIR, pages 76--84, Aug. 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th Annual International ACM/SIGIR Conference, pages 76--84, Zurich, Switzerland, 1996.
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR, pages 76--84, Zürich, CH, 1996.
No context found.
Hearst M. A., Pedersen J. O. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), Zurich, June 1996.
....back to the bipartite graph, 0) v u) Oc z Substituting into Eq. (24) we have we have 2 . 25) o g 1 D,I 2t3D 1 2. 27) The solutions to Eq. (26) are SVD of (that SVD is the solution to Eq. (24) for the bipartite graph is noted earlier [22, 3]). We emphasize that Eq. (26) is identical to Eq. (8) with the correspondence relationship (see also the similarity between Eq. (23) and Eq. (7)). Therefore, the net effect of MinMaxCut of Eq. (17) over the simple MinCut objective Eq. (6) or Eq. (16) is the scaling of the association matrix B in Eq. (27). However, with this scaling, the ....
....a concrete K 3 example. The solutions to Eq. (24) are ml 2 0 r22 er2 xO ) 1 0 x(2) 1 0 1 2 0 c22 c2 0 0 etc. Here D. pq diag(Bpqe, q) p, q 1, K) e. q e with the size of p th row block; Dcp q diag(pqecq) ecq e with the size of p th column block; and Spq 8(BRp,Cq) Note that 8pq 8qp. Let X (x( x( r, r any C dim vector y (V( q= D1 2Xy = Y(K) erK (2)l 2 (33) is an eigenvector of Eq. (23). Now any K orthonormal y, y leads to K eigenvectors q, q Q. Figure 2: Left top: adjacency matrix of a bipartite graph of ....
[Article contains additional citation context not shown here]
M. A. Hearst and J. O. Pedersen. Re-examining the cluster hypothesis: Scatter/gather on retrieval results. Proc. SIGIR '96, 1996.
....set is another way to deal with ambiguity: ideally documents covering different senses of a word will be placed in different clusters. Much work has been done on clustering, either investigating the clustering directly [1, 2, 14] or exploring issues related to clustering and interactive search [13, 18, 23]. The Web service Northern Light classifies returned documents into a set of labeled clusters, showing the clusters as well as the top ranked documents. All of these clustering techniques group documents by topic rather than by the way that the query word is used. A limited number of studies ....
Marti Hearst and Jan Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the ACM Conference on Research in Information Retrieval (SIGIR), pages 76--84, 1996.
No context found.
Hearst, M., and Pedersen, J.O.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In Proceedings of 19th Annual International ACM/SIGIR Conference. ACM Press (1996) 76-84
....unstructured nature. Researchers have developed many different techniques to address this challenging problem of locating relevant Web information efficiently. Examples of such techniques include Web search engines, meta-searching, post-retrieval analysis, and enhanced Web collection visualization [10,18,36,44,48]. A major problem with most such techniques is that they do not facilitate user collaboration, which has potential for greatly improving Web search quality and efficiency. Without collaboration, users must start from scratch every time they perform a search task, even if other users have done ....
M.A. Hearst, J.O. Pedersen, Reexamining the cluster hypothesis: scatter/gather on retrieval results, Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 1996, pp. 76--84.
....gap in web document retrieval. We intend to use both textual keywords and image features in an attempt to discover the latent semantic structure of web documents and to correlate keywords with image features. There have been various papers concerned with transforming web pages into concepts [9, 16, 39]. These papers show how to transform the set of pages returned by a standard search engine into a more browsable representation through the mediation of clustering, each cluster corresponding to one of the concepts. An important aspect of our study is to bring multimedia information into the ....
M. A. Hearst and J. O. Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, August 1996, pp. 76-84.
....in TREC [15]. It is a good example of how the overall performance of both the user and the system working together is measured. User studies are also applied to evaluate some particular aspects of the system. For example, in their user study of the Scatter/Gather system, Hearst and Pedersen [16] showed that users seem able to choose the cluster with the largest number of relevant documents using the textual summaries the system creates. User studies usually are very expensive, time-consuming and difficult to execute. Designing a good and informative user study is almost a work of art. A ....
....recent years. Generally these visualizations are designed to present some type of patterns in a document set and are considered to be browsing interfaces. To our knowledge there have been no studies on how such visualizations help the user locate relevant information. The Scatter/Gather interface [16] presents the document clusters as text. It groups the documents into five (or any preselected number) clusters and displays them simultaneously as lists. On a large enough screen, the top several documents from each cluster are clearly visible. Another text-based visualization is presented by ....
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of ACM SIGIR, pages 76--84, Aug. 1996.
No context found.
Marti A. Hearst, Jan O. Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results", Proceedings of ACM SIGIR 96, pp.76-84, 1996.
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. SIGIR-96, pages 76--84, Zurich, CH, ACM Press, 1996.
No context found.
Hearst, M. & Pedersen, J. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results.
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR-96.
No context found.
Hearst M. A. and Pedersen J. O. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th ACM SIGIR conference, August 1996, Zurich, Switzerland (pp. 76-84).
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In SIGIR-96.
No context found.
M.A. Hearst and J.O. Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results," Proc. of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), pp. 76--84, 1996.
No context found.
Hearst, M.A., Pedersen, J.O.: Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. Proceedings of SIGIR'96 (1996) 76-84
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 76--84. ACM Press, New York, 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 76--84, Zurich, CH, 1996.
No context found.
M. A. Hearst and J. O. Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results," Proc. of the 19th Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 76--84, 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 76--84, Zurich, CH, 1996.
No context found.
M. A. Hearst and J. O. Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proc. ACM SIGIR, Zurich, 1996.
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In Proceedings of the Nineteenth Annual International ACM SIGIR Conference, Zurich, June 1996.
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 76-84, 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 76--84, Zurich, CH, 1996.
No context found.
Hearst, M.A., Pedersen, J.O.: Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, in ACM/SIGIR (1996), 76-84
No context found.
M. A. Hearst and J. O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. In ACM SIGIR 96, pages 76--84, 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR, pages 76-84, 1996.
No context found.
Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages 76--84, Zurich, CH, 1996.
No context found.
Marti A. Hearst, Jan O. Pedersen, "Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results", Proceedings of ACM-SIGIR 96, pp.76-84, 1996.
No context found.
Marti Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th Annual ACM SIGIR Conference, pages 76--84, Zurich, 1996. ACM Press.
No context found.
Hearst, M. A. and Pedersen J. O. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proc. 19th ACM SIGIR Conference on Research and Development in Information Retrieval. 1996. 76-84
No context found.
Hearst, M., Pedersen, J., Reexamining the cluster hypothesis: Scatter/Gather on retrieval results, Proceedings of 19th annual international ACM/SIGIR conference (Zurich, Switzerland, August 1996), ACM Press, 1996, 76-84.
No context found.
Hearst, M. and Pedersen, J. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 76-84 (1996).