Results 1 - 10 of 131
Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
1992
Cited by 519 (12 self)
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.
1 Introduction: Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutua...
Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results
1996
Cited by 331 (5 self)
We present Scatter/Gather, a cluster-based document browsing method, as an alternative to ranked titles for the organization and viewing of retrieval results. We systematically evaluate Scatter/Gather in this context and find significant improvements over similarity search ranking alone. This result provides evidence validating the cluster hypothesis which states that relevant documents tend to be more similar to each other than to non-relevant documents. We describe a system employing Scatter/Gather and demonstrate that users are able to use this system close to its full potential.
1 Introduction: An important service offered by an information access system is the organization of retrieval results. Conventional systems rank results based on an automatic assessment of relevance to the query [20]. Alternatives include graphical displays of interdocument similarity (e.g., [1, 22, 7]), relationship to fixed attributes (e.g., [21, 14]), and query term distribution patterns (e.g., [12]). I...
Information Filtering and Information Retrieval: Two Sides of the Same Coin
COMMUNICATIONS OF THE ACM, 1992
Cited by 304 (5 self)
Information filtering systems are designed for unstructured or semistructured data, as opposed to database applications, which use very structured data. The systems also deal primarily with textual information, but they may also entail images, voice, video or other data types that are part of multimedia information systems. Information filtering systems also involve a large amount of data and streams of incoming data, whether broadcast from a remote source or sent directly by other sources. Filtering is based on descriptions of individual or group information preferences, or profiles, that typically represent long-term interests. Filtering also implies removal of data from an incoming stream rather than finding data in the stream; users see only the data that is extracted. Models of information retrieval and filtering, and lessons for filtering from retrieval research are presented.
Information Retrieval Interaction
1992
Cited by 158 (6 self)
this document, text or image about?' Gradually moving from the left to the right in Figure 3.1, different understandings of this concept evolve
Incremental Clustering and Dynamic Information Retrieval
1997
Cited by 129 (3 self)
Motivated by applications such as document and image classification in information retrieval, we consider the problem of clustering dynamic point sets in a metric space. We propose a model called incremental clustering which is based on a careful analysis of the requirements of the information retrieval application, and which should also be useful in other applications. The goal is to efficiently maintain clusters of small diameter as new points are inserted. We analyze several natural greedy algorithms and demonstrate that they perform poorly. We propose new deterministic and randomized incremental clustering algorithms which have a provably good performance. We complement our positive results with lower bounds on the performance of incremental algorithms. Finally, we consider the dual clustering problem where the clusters are of fixed diameter, and the goal is to minimize the number of clusters.
1 Introduction: We consider the following problem: as a sequence of points from a metric...
Document Clustering using Word Clusters via the Information Bottleneck Method
In ACM SIGIR 2000, 2000
Cited by 123 (16 self)
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x; y), we first cluster the words, Y, so that the obtained word clusters, Y_hat, maximally preserve the information on the documents. The resulting joint distribution, p(X; Y_hat), contains most of the original information about the documents, I(X; Y_hat) ~= I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word-clusters is preserved. Thus, we first find word-clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
Data Mining and Knowledge Discovery, 2002
Cited by 116 (4 self)
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
On-line New Event Detection and Tracking
1998
Cited by 106 (4 self)
We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting, i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single-pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our approach to tracking is similar to typical information filtering methods. We discuss the value of “surprising” features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news. New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.
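As a rough illustration of what a strict on-line, single-pass clustering loop can look like, here is a minimal Python sketch. It is not the authors' method: the whitespace tokenizer, the cosine similarity on raw term counts, the fixed `threshold` value, and the function names are all assumptions made for this example, and the paper's event-aware thresholding model is not reproduced.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.3):
    """Label stories in arrival order without revisiting earlier decisions.

    `threshold` is a hypothetical fixed cutoff chosen for illustration.
    """
    centroids = []  # one summed term-count Counter per cluster
    labels = []
    for text in stories:
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            centroids[best].update(vec)    # fold the story into its cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec)) # treat the story as a new event
            labels.append(len(centroids) - 1)
    return labels


labels = single_pass_cluster([
    "earthquake hits coastal city",
    "coastal city earthquake damage reported",
    "election results announced today",
])
```

In this sketch a story that opens a new cluster plays the role of a detected new event; each story is compared only against clusters built from earlier stories, which is what makes the pass on-line.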
A Study on Retrospective and On-Line Event Detection
1998
Cited by 104 (8 self)
This paper investigates the use and extension of text retrieval and clustering techniques for event detection. The task is to automatically detect novel events from a temporally-ordered stream of news stories, either retrospectively or as the stories arrive. We applied hierarchical and non-hierarchical document clustering algorithms to a corpus of 15,836 stories, focusing on the exploitation of both content and temporal information. We found the resulting cluster hierarchies highly informative for retrospective detection of previously unidentified events, effectively supporting both query-free and query-driven retrieval. We also found that temporal distribution patterns of document clusters provide useful information for improvement in both retrospective detection and on-line detection of novel events. In an evaluation using manually labelled events to judge the system-detected events, we obtained a result of 82% in the F1 measure for retrospective detection, and an F1 value of 42% for...
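For readers unfamiliar with the F1 measure cited in this abstract, it is the harmonic mean of precision and recall. The small Python sketch below shows the standard formula; the function name and the precision and recall values used are hypothetical and are not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs for illustration only, not figures from the paper.
example = f1_score(0.5, 0.5)
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two inputs, so a system cannot score well by maximizing recall alone.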
Bead: Explorations in Information Visualization
In Proceedings of ACM SIGIR, 1992
Cited by 94 (0 self)
We describe work on the visualization of bibliographic data and, to aid in this task, the application of numerical techniques for multidimensional scaling. Many areas of scientific research involve complex multivariate data. One example of this is Information Retrieval. Document comparisons may be done using a large number of variables. Such conditions do not favour the more well-known methods of visualization and graphical analysis, as it is rarely feasible to map each variable onto one aspect of even a three-dimensional, coloured and textured space. Bead is a prototype system for the graphically-based exploration of information. In this system, articles in a bibliography are represented by particles in 3-space. By using physically-based modelling techniques to take advantage of fast methods for the approximation of potential fields, we represent the relationships between articles by their relative spatial positions. Inter-particle forces tend to make similar articles move closer to on...

