Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions
Journal of Machine Learning Research, 2002
Cited by 603 (20 self)
This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data-sets.
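The first combiner described above (induce a similarity measure from the partitionings, then recluster) can be illustrated with a minimal sketch. The similarity here is the fraction of partitionings in which two objects share a cluster. The reclustering step, which thresholds that similarity and takes connected components, is a simplifying assumption chosen for brevity, not the procedure from the paper; the function names and the toy data are hypothetical.

```python
def co_association(partitionings):
    """Fraction of partitionings in which each pair of objects shares a cluster."""
    n = len(partitionings[0])
    r = len(partitionings)
    sim = [[0.0] * n for _ in range(n)]
    for labels in partitionings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / r
    return sim


def recluster(sim, threshold=0.5):
    """Assumed reclustering step: connected components of the graph that
    links objects whose induced similarity exceeds the threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three partitionings of six objects. Only the label vectors are used;
# no features or clustering algorithms are accessed.
partitionings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first, different label names
    [0, 0, 1, 1, 2, 2],
]
consensus = recluster(co_association(partitionings))
```

Because the similarity depends only on whether two objects share a label, the arbitrary label names used by each partitioning do not affect the result.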
Learning to cluster web search results
In Proc. of SIGIR ’04, 2004
Cited by 195 (7 self)
In web search, surfers often face the problem of selecting the information they most want from a potentially huge number of search results. Clustering web search results is a possible solution, but traditional content-based clustering is not sufficient since it ignores many unique features of web pages. The link structure, authority, quality, or trustworthiness of search results can play an even greater role in clustering than the actual contents of the web pages. These aspects are reflected by Google's PageRank algorithm, the HITS algorithm, and others. The main goal of this project is to integrate authoritative information such as PageRank and link structure (e.g., in-links and out-links) into the K-Means clustering of web search results. PageRank, in-links, and out-links can be used to extend the vector representation of web pages; PageRank can also be considered in the selection of initial centroids, or web pages with higher PageRank can influence the centroid computation to a higher degree. The relevance of the clusters produced by this modified K-Means algorithm needs to be compared with that obtained by content-based K-Means clustering, and the effects of different kinds of authoritative information also need to be analyzed.
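Two of the ideas in this abstract, extending the page vector with link features and letting higher-PageRank pages pull the centroid harder, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the unscaled concatenation of features, and the use of PageRank directly as a weight are all hypothetical choices, not details given in the abstract.

```python
def extend_vector(content_vector, pagerank, in_links, out_links):
    """Append link-based features to a page's content vector.
    (Assumption: plain concatenation, with no feature scaling.)"""
    return list(content_vector) + [pagerank, in_links, out_links]


def weighted_centroid(vectors, pageranks):
    """Centroid in which each page contributes in proportion to its PageRank.
    (Assumption: PageRank is used directly as the weight.)"""
    total = sum(pageranks)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, pageranks)) / total
        for d in range(dim)
    ]


# Two pages in one cluster; the first has three times the PageRank
# of the second, so the centroid sits closer to it.
pages = [[1.0, 0.0], [0.0, 1.0]]
ranks = [3.0, 1.0]
centroid = weighted_centroid(pages, ranks)  # [0.75, 0.25]
```

With equal weights this reduces to the ordinary K-Means centroid, so the content-based baseline mentioned in the abstract is the special case where every page has the same rank.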
Information retrieval on the Web
ACM Computing Surveys, 2000
Cited by 95 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
The connectivity sonar: detecting site functionality by structural patterns
In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, 2003
Cited by 62 (1 self)
Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural patterns, as the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivity-based features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several search-engine related applications that could make immediate use of such technology. We purposely limit our categorization algorithms by tapping connectivity and structural data alone, making no use of any content analysis whatsoever. When applying two classification algorithms to a set of 202 sites of the eight defined functional categories, the algorithms correctly classified between 54.5% and 59% of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites, by clustering sites with almost identical signatures.
Finding the Story - Broader Applicability of Semantics and Discourse for Hypermedia Generation
In Proceedings of the 14th ACM conference on Hypertext and Hypermedia, 2003
Cited by 45 (11 self)
Generating hypermedia presentations requires processing constituent material into coherent, unified presentations. One large challenge is creating a generic process for producing hypermedia presentations from the semantics of potentially unfamiliar domains. The resulting presentations must both respect the underlying semantics and appear as coherent, plausible and, if possible, pleasant to the user. Among the related unsolved problems is the inclusion of discourse knowledge in the generation process. One potential approach is generating a discourse structure derived from generic processing of the underlying domain semantics, transforming this to a structured progression and then using this to steer the choice of hypermedia communicative devices used to convey the actual information in the resulting presentation.
Link Analysis in Web Information Retrieval
IEEE Data Engineering Bulletin, 2000
Cited by 44 (0 self)
The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.
Clustering relational data using attribute and link information
In Proceedings of the Text Mining and Link Analysis Workshop, 18th International Joint Conference on Artificial Intelligence, 2003
Cited by 41 (0 self)
Clustering is a descriptive task that seeks to identify natural groupings in data. Relational data offer a wealth of information for identifying groups of similar items. Both attribute information and the structure of relationships can be used for clustering. Graph partitioning and data clustering techniques can be applied independently to relational data but a technique that exploits both sources of information simultaneously may produce more meaningful clusters. This paper will describe our work synthesizing data clustering and graph partitioning techniques into improved clustering algorithms for relational data.
Deterministic pivoting algorithms for constrained ranking and clustering problems
2007
Cited by 34 (4 self)
We consider ranking and clustering problems related to the aggregation of inconsistent information, in particular, rank aggregation, (weighted) feedback arc set in tournaments, consensus and correlation clustering, and hierarchical clustering. Ailon, Charikar, and Newman [4], Ailon and Charikar [3], and Ailon [2] proposed randomized constant factor approximation algorithms for these problems, which recursively generate a solution by choosing a random vertex as “pivot” and dividing the remaining vertices into two groups based on the pivot vertex. In this paper, we answer an open question in these works by giving deterministic approximation algorithms for these problems. The analysis of our algorithms is simpler than the analysis of the randomized algorithms in [4], [3] and [2]. In addition, we consider the problem of finding minimum-cost rankings and clusterings which must obey certain constraints (e.g. an input partial order in the case of ranking problems), which were introduced by Hegde and Jain [25] (see also [34]). We show that the first type of algorithms we propose can also handle these constrained problems. In addition, we show that in the case of a rank aggregation or consensus clustering problem, if the input rankings or clusterings obey the constraints, then we can always ensure that the output of
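The pivoting scheme described in this abstract can be sketched for the clustering case: pick a pivot, split the remaining vertices into those similar to the pivot (which join its cluster) and the rest, then repeat on the rest. The sketch below takes the pivot choice as a parameter. The deterministic rule used in the example (always take the first remaining vertex) is only a placeholder and is not the selection rule from the paper; the function names and toy data are hypothetical.

```python
import random


def pivot_cluster(vertices, similar, choose_pivot):
    """Cluster by repeated pivoting.

    similar(u, v) says whether two vertices should be grouped together;
    choose_pivot(remaining) returns one vertex from the remaining list.
    """
    clusters = []
    remaining = list(vertices)
    while remaining:
        pivot = choose_pivot(remaining)
        # Vertices similar to the pivot join its cluster; the others are
        # handled in later rounds (written as a loop instead of recursion).
        cluster = [v for v in remaining if v == pivot or similar(pivot, v)]
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters


# Toy similarity relation given as unordered pairs.
similar_pairs = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("d", "e")]}


def is_similar(u, v):
    return frozenset((u, v)) in similar_pairs


vertices = ["a", "b", "c", "d", "e"]

# Randomized variant: pick the pivot uniformly at random.
rng = random.Random(0)
random_clusters = pivot_cluster(vertices, is_similar, rng.choice)

# Placeholder deterministic variant: always pick the first remaining vertex.
first_clusters = pivot_cluster(vertices, is_similar, lambda rem: rem[0])
```

In this toy input, pivoting on "a" claims "b" before "c" can join it, which shows how the pivot order changes the result and why the choice of pivot matters for the approximation guarantee.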
Exploiting relational structure to understand publication patterns in high-energy physics
SIGKDD Explorations, 2003
Cited by 32 (8 self)
We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics.
Automatic Topic Identification Using Webpage Clustering
2001
Cited by 25 (3 self)
Grouping webpages into distinct topics is one way to organize the large amount of retrieved information on the web. In this paper, we report that, based on a similarity metric which incorporates textual information, hyperlink structure and co-citation relations, an unsupervised clustering method can automatically and effectively identify relevant topics, as shown in experiments on several retrieved sets of webpages. The clustering method is a state-of-the-art spectral graph partitioning method based on the normalized cut criterion first developed for image segmentation.
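The normalized cut criterion mentioned above scores a two-way split (A, B) of a weighted graph as cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where cut sums the weights crossing the split and assoc sums the weights from one side to all vertices. The sketch below only evaluates that criterion for a given split; it does not perform the spectral optimization. The weight matrix is made-up toy data standing in for the combined text, hyperlink, and co-citation similarity, whose exact form the abstract does not give.

```python
def normalized_cut(weights, group_a):
    """Normalized cut value of the split (group_a, everything else).

    weights is a symmetric matrix of non-negative page similarities.
    """
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


# Toy similarity matrix for four pages: pages 0 and 1 are strongly
# related, pages 2 and 3 are strongly related, and the two pairs are
# joined only by a weak link between pages 1 and 2.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

balanced = normalized_cut(W, [0, 1])  # cuts only the weak link
lopsided = normalized_cut(W, [0])     # cuts a strong link to isolate page 0
```

The balanced split scores much lower than the lopsided one, which is the behavior the criterion is designed for: it penalizes splits that isolate a small, weakly attached group.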