Results 1–10 of 28
Clustering aggregation
 In ICDE, 2005
Abstract

Cited by 110 (2 self)
We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a meta-clustering method to improve the robustness of clusterings. The problem formulation does not require a priori information about the number of clusters, and it gives a natural way of handling missing values. We give a formal statement of the clustering-aggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.
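For intuition, the disagreement-minimizing objective can be approximated by a naive majority-vote rule: merge two objects whenever more than half of the input clusterings place them together. This greedy sketch is a hypothetical simplification without the approximation guarantees of the paper's algorithms:

```python
from itertools import combinations

def aggregate(clusterings):
    """Greedy majority-vote sketch: merge objects i and j when more than
    half of the input clusterings assign them the same label."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        agree = sum(1 for c in clusterings if c[i] == c[j])
        if 2 * agree > len(clusterings):
            parent[find(i)] = find(j)
    return [find(x) for x in range(n)]

# Three clusterings of five objects; a majority puts the first three together.
C = [[0, 0, 0, 1, 1],
     [0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1]]
labels = aggregate(C)
```

Note that the rule handles missing values naturally: a clustering that does not cover objects i and j simply contributes no vote for that pair.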
Link analysis ranking: algorithms, theory, and experiments
 ACM Transactions on Internet Technology
, 2005
Abstract

Cited by 58 (1 self)
The explosive growth and the widespread accessibility of the Web have led to a surge of research activity in the area of information retrieval on the World Wide Web. The seminal papers of Kleinberg [1998, 1999] and Brin and Page [1998] introduced Link Analysis Ranking, where hyperlink structures are used to determine the relative authority of a Web page and produce improved algorithms for the ranking of Web search results. In this article we work within the hubs and authorities framework defined by Kleinberg, and we propose new families of algorithms. Two of the algorithms we propose use a Bayesian approach, as opposed to the usual algebraic and graph-theoretic approaches. We also introduce a theoretical framework for the study of Link Analysis Ranking algorithms. The framework allows for the definition of specific properties of Link Analysis Ranking algorithms, as well as for comparing different algorithms. We study the properties of the algorithms that we define, and we provide an axiomatic characterization of the INDEGREE heuristic, which ranks each node according to the number of incoming links. We conclude the article with an extensive experimental evaluation. We study the quality of the algorithms, and we examine how different structures in the graphs affect their performance.
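The INDEGREE heuristic, and the hubs-and-authorities iteration it is contrasted with, are simple to state in code. The sketch below implements Kleinberg's classical formulation, not the Bayesian variants the article proposes:

```python
def hits(edges, n, iters=50):
    """Classical hubs-and-authorities power iteration: a node's authority
    sums the hub scores of pages linking to it, and vice versa, with
    L2 normalization each round."""
    hubs = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        auth = [sum(hubs[u] for u, v in edges if v == i) for i in range(n)]
        hubs = [sum(auth[v] for u, v in edges if u == i) for i in range(n)]
        na = sum(a * a for a in auth) ** 0.5 or 1.0
        nh = sum(h * h for h in hubs) ** 0.5 or 1.0
        auth = [a / na for a in auth]
        hubs = [h / nh for h in hubs]
    return hubs, auth

def indegree(edges, n):
    """The INDEGREE heuristic: rank each node by its number of incoming links."""
    deg = [0] * n
    for _, v in edges:
        deg[v] += 1
    return deg

# Node 2 receives three links, so both rankings favour it.
edges = [(0, 2), (1, 2), (3, 2), (0, 4)]
```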
Large-scale deduplication with constraints using Dedupalog
 In Proceedings of the 25th International Conference on Data Engineering (ICDE)
Abstract

Cited by 47 (3 self)
We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication venue”; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad-hoc, domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication.
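The venue constraint from the abstract can be mimicked with two union-find structures, propagating each paper merge to the corresponding venue merge. This is only a toy reading of one rule, with hypothetical names; Dedupalog itself is a full Datalog-style language:

```python
class DSU:
    """Union-find with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def dedupe_with_constraint(paper_pairs, venue_of, n_papers, n_venues):
    """Sketch of the constraint "each paper has a unique publication venue":
    merging two paper references forces their venue references to merge too."""
    papers = DSU(n_papers)
    venues = DSU(n_venues)
    for a, b in paper_pairs:                    # candidate duplicate paper pairs
        papers.union(a, b)
        venues.union(venue_of[a], venue_of[b])  # propagate merge to venues
    return papers, venues
```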
Information-theoretic tools for mining database structure from large data sets
 In SIGMOD
, 2004
Abstract

Cited by 25 (3 self)
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets.
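The redundancy of a candidate functional dependency X → Y can be estimated from an instance via conditional entropy: H(Y|X) near zero means Y is (nearly) determined by X, so storing Y alongside X is redundant. This generic estimate is a sketch, not the paper's exact ranking measure:

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, x_idx, y_idx):
    """Estimate H(Y | X) from a table instance:
    H(Y|X) = sum over (x, y) of p(x, y) * log2(p(x) / p(x, y))."""
    n = len(rows)
    xy = Counter((r[x_idx], r[y_idx]) for r in rows)   # joint counts
    x = Counter(r[x_idx] for r in rows)                # marginal counts
    return sum(c / n * log2(x[k[0]] / c) for k, c in xy.items())

# X -> Y holds exactly here, so H(Y|X) = 0: Y is fully redundant given X.
holds = [("a", 1), ("a", 1), ("b", 2)]
# Here each X value maps to two Y values equally often, so H(Y|X) = 1 bit.
fails = [("a", 1), ("a", 2)]
```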
Data mining for case-based reasoning in high-dimensional biological domains
 Transactions on Knowledge and Data Engineering
Abstract

Cited by 15 (0 self)
Case-based reasoning (CBR) is a suitable paradigm for class discovery in molecular biology, where the rules that define the domain knowledge are difficult to obtain and the number and the complexity of the rules affecting the problem are too large for formal knowledge representation. To extend the capabilities of CBR, we propose the mixture of experts for case-based reasoning (MOE4CBR), a method that combines an ensemble of CBR classifiers with spectral clustering and logistic regression. Our approach not only achieves higher prediction accuracy, but also leads to the selection of a subset of features that have meaningful relationships with their class labels. We evaluate MOE4CBR by applying the method to a CBR system called TA3, a computational framework for CBR systems. For two ovarian mass spectrometry data sets, the prediction accuracy improves from 80 percent to 93 percent and from 90 percent to 98.4 percent, respectively. We also apply the method to leukemia and lung microarray data sets, with prediction accuracy improving from 65 percent to 74 percent and from 60 percent to 70 percent, respectively. Finally, we compare our list of discovered biomarkers with the lists of selected biomarkers from other studies for the mass spectrometry data sets. Index Terms: Machine learning, data mining, clustering, feature selection, case-based reasoning classifiers, microarray data analysis, mass spectrometry data analysis, biomarker discovery.
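At its core, a CBR classifier retrieves the most similar stored cases and reuses their labels. The sketch below shows only plain nearest-neighbour retrieval; MOE4CBR's ensemble of experts, spectral clustering, and logistic-regression weighting are not modelled here:

```python
def classify(case, case_base, k=3):
    """Minimal CBR retrieve-and-reuse step: classify a new case by the
    majority label of its k most similar stored cases (squared Euclidean
    distance over the feature vector)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(case_base, key=lambda cb: dist(case, cb[0]))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

# Toy case base of (feature vector, class label) pairs.
case_base = [((0, 0), "neg"), ((0, 1), "neg"),
             ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
```

In the high-dimensional settings the paper targets, such retrieval only works well after feature selection, which is exactly where the mining step comes in.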
Parameter-free spatial data mining using MDL
 In 5th International Conference on Data Mining (ICDM)
, 2005
Abstract

Cited by 7 (1 self)
Consider spatial data consisting of a set of binary features taking values over a collection of spatial extents (grid cells). We propose a method that simultaneously finds spatial correlation and feature co-occurrence patterns, without any parameters. In particular, we employ the Minimum Description Length (MDL) principle coupled with a natural way of compressing regions. This defines what “good” means: a feature co-occurrence pattern is good if it helps us better compress the set of locations for these features. Conversely, a spatial correlation is good if it helps us better compress the set of features in the corresponding region. Our approach is scalable for large datasets (in both the number of locations and of features). We evaluate our method on both real and synthetic datasets.
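The MDL scoring idea can be illustrated with a toy two-part code for a set of marked grid cells: transmit the count, then identify which cells are marked. The paper's region-based encoder is more elaborate; this only sketches how compressed size serves as a parameter-free quality score:

```python
from math import comb, log2

def description_length(cells, marked):
    """Toy two-part MDL cost L(model) + L(data | model) in bits:
    the 'model' is the count k of marked cells, and the data part is
    log2 of the number of ways to choose k cells out of `cells`."""
    k = len(marked)
    model_bits = log2(cells + 1)  # transmit k, an integer in [0, cells]
    data_bits = log2(comb(cells, k)) if 0 < k < cells else 0.0
    return model_bits + data_bits

# A concentrated pattern (one cell) costs fewer bits than a half-full grid,
# so under MDL it is the "better" (more compressible) structure.
sparse = description_length(16, {0})
dense = description_length(16, set(range(8)))
```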
Detecting the change of clustering structure in categorical data streams
 SIAM Data Mining Conference
, 2006
Abstract

Cited by 7 (3 self)
Analyzing clustering structures in data streams can provide critical information for making decisions in real time. Most research has focused on clustering algorithms for data streams. We argue that, more importantly, we need to monitor the change of clustering structure online. In this paper, we present a framework for detecting the change of critical clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework extends the work on determining the Best K for static datasets (the BkPlot method) to categorical data streams with the help of a Hierarchical Entropy Tree structure (HETree). The HETree can efficiently capture the entropy property of the categorical data stream and allows us to draw precise clustering information from the data stream for high-quality BkPlots. The experiments show that with the combination of the HETree and the BkPlot method we are able to efficiently and precisely detect the change of critical clustering structure in categorical data streams.
The “best K” for entropy-based categorical data clustering
 In International Conference on Scientific and Statistical Database Management
, 2005
Abstract

Cited by 5 (0 self)
With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed an important problem for categorical clustering: how can we determine the best number of clusters K for a categorical dataset? Since categorical data lacks an inherent distance function to serve as a similarity measure, the traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose a BkPlot method for determining a set of candidate “best Ks”. This method is implemented with a hierarchical clustering algorithm, HierEntro. The experimental results show that our approach can effectively identify the significant clustering structures.
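The entropy criterion behind this line of work can be sketched as the size-weighted expected entropy of a clustering: splitting mixed clusters into purer ones lowers it, and a BkPlot-style analysis examines how the quantity changes as K grows. HierEntro's actual incremental construction is in the paper; this is a minimal illustration:

```python
from collections import Counter
from math import log2

def expected_entropy(clusters):
    """Expected entropy of a categorical clustering: for each cluster,
    sum the entropies of its categorical attributes, weighted by the
    cluster's share of the data."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        for j in range(len(c[0])):          # each categorical attribute
            counts = Counter(row[j] for row in c)
            h = -sum(v / len(c) * log2(v / len(c)) for v in counts.values())
            total += len(c) / n * h
    return total

# One mixed cluster has entropy 1 bit; splitting it into two pure
# clusters drives the expected entropy to zero.
data = [("a",), ("a",), ("b",), ("b",)]
one_cluster = expected_entropy([data])
two_clusters = expected_entropy([data[:2], data[2:]])
```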
Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects
 In Proceedings of ICDE
, 2013
Abstract

Cited by 4 (2 self)
Clustering ensemble is a powerful approach for improving the accuracy and stability of individual (base) clustering algorithms. Most of the existing clustering ensemble methods obtain the final solutions by assuming that base clusterings perform independently of one another and that all objects are independent too. However, in real-world data sources, objects are more or less associated in terms of certain coupling relationships. Base clusterings trained on the source data are complementary to one another, since each of them may capture only a specific aspect rather than the full picture of the data. In this paper, we discuss the problem of explicating the dependency between base clusterings and between objects in clustering ensembles, and propose a framework for coupled clustering ensembles (CCE). CCE not only considers but also integrates the coupling relationships between base clusterings and between objects. Specifically, we involve both the intra-coupling within one base clustering (i.e., cluster label frequency distribution) and the inter-coupling between different base clusterings (i.e., cluster label co-occurrence dependency). Furthermore, we engage both the intra-coupling between two objects in terms of the base clustering aggregation and the inter-coupling among other objects in terms of neighborhood relationships. This is the first work to explicitly address the dependency between base clusterings and between objects, verified by the application of such couplings in three types of consensus functions: clustering-based, object-based, and cluster-based. Substantial experiments on synthetic and UCI data sets demonstrate that the CCE framework can effectively capture the interactions embedded in base clusterings and objects, with higher clustering accuracy and stability compared to several state-of-the-art techniques, which is also supported by statistical analysis.
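A simplified reading of the cluster-label co-occurrence dependency between two base clusterings: for each label in one clustering, compute the distribution of labels it co-occurs with in the other over the same objects. CCE's full coupling measures are richer than this sketch:

```python
from collections import Counter

def label_cooccurrence(c1, c2):
    """For two base clusterings (label lists over the same objects),
    return P(label b in c2 | label a in c1) for every observed pair:
    a simple measure of how the two clusterings depend on each other."""
    joint = Counter(zip(c1, c2))   # co-occurring label pairs
    marg = Counter(c1)             # label frequencies in c1
    return {(a, b): cnt / marg[a] for (a, b), cnt in joint.items()}

# Cluster 0 of c1 always falls in cluster 0 of c2; cluster 1 is split.
dep = label_cooccurrence([0, 0, 1, 1], [0, 0, 0, 1])
```

Two fully independent base clusterings would yield conditional distributions close to the marginals of c2; sharp, skewed distributions signal exactly the inter-coupling the framework exploits.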
HETree: a Framework for Detecting Changes in Clustering Structure for Categorical Data Streams
Abstract

Cited by 3 (0 self)
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few have proposed to monitor the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HETree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K in combination with our previously developed BKPlot method. The HETree can efficiently summarize the entropy property of a categorical data stream and allows us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop a time-decaying HETree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental results show that with the combination of the HETree and the BKPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams.
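The time-decaying idea can be sketched with exponential weights exp(-lam * age), so that recent stream items dominate the summary. The decay rate `lam` below is a hypothetical parameter; the paper applies decay inside the HETree nodes rather than over raw counts:

```python
from math import exp

def decayed_counts(stream, lam=0.1):
    """Exponentially time-decayed counts: an item seen at position t in a
    stream of length n contributes weight exp(-lam * (n - t)), so older
    occurrences fade and recent structure dominates."""
    now = len(stream)
    weights = {}
    for t, item in enumerate(stream):
        weights[item] = weights.get(item, 0.0) + exp(-lam * (now - t))
    return weights

# "a" and "b" occur equally often, but "b" is recent, so it outweighs "a".
w = decayed_counts(["a"] * 5 + ["b"] * 5, lam=1.0)
```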