Results 1–10 of 26
Survey of clustering data mining techniques, 2002
Cited by 400 (0 self)
Abstract:
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining, 2001
Cited by 239 (0 self)
Abstract:
Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite nicely; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally significantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
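As an illustration of the idea above, here is a minimal, hypothetical sketch (not the authors' code) of projecting vectors through a sparse random matrix, in the style of an Achlioptas-type {+1, 0, −1} construction:

```python
# Toy random projection: map d-dimensional vectors to a random k-dimensional
# subspace via a sparse matrix. Function names and constants are illustrative.
import random

def sparse_random_matrix(d, k, seed=0):
    """Entries are +1/-1 with probability 1/6 each and 0 with probability 2/3,
    scaled by sqrt(3/k) so expected squared norms are preserved."""
    rng = random.Random(seed)
    scale = (3.0 / k) ** 0.5
    def entry():
        r = rng.random()
        if r < 1 / 6:
            return scale
        if r < 1 / 3:
            return -scale
        return 0.0
    return [[entry() for _ in range(k)] for _ in range(d)]

def project(vec, R):
    """Multiply a length-d vector by the d x k matrix R."""
    k = len(R[0])
    return [sum(vec[i] * R[i][j] for i in range(len(vec))) for j in range(k)]

def l2(u, v):
    """Euclidean distance, to compare original and projected spaces."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Because two-thirds of the entries are zero, the projection costs roughly a third of a dense matrix multiply, which is the computational saving the abstract refers to.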
A.K.H.: Similarity evaluation on tree-structured data. In SIGMOD, 2005
Cited by 39 (3 self)
Abstract:
Tree-structured data are becoming ubiquitous nowadays, and manipulating them based on similarity is essential for many applications. The generally accepted similarity measure for trees is the edit distance. Although similarity search has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the tree edit distance. In this paper, we propose to transform tree-structured data into an approximate numerical multidimensional vector which encodes the original structure information. We prove that the L1 distance of the corresponding vectors, whose computational complexity is O(|T1| + |T2|), forms a lower bound for the edit distance between trees. Based on the theoretical analysis, we describe a novel algorithm which embeds the proposed distance into a filter-and-refine framework to process similarity search on tree-structured data. The experimental results show that our algorithm dramatically reduces the distance computation cost. Our method is especially suitable for accelerating similarity query processing on large trees in massive datasets.
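The paper's vector encoding captures structure; as a much simpler stand-in (an assumed toy, not the paper's embedding), a histogram of node labels already yields a cheap lower bound: each edit operation (insert, delete, relabel) changes the histogram L1 distance by at most 2, so L1/2 lower-bounds the tree edit distance.

```python
# Toy lower bound for tree edit distance via label histograms.
# This is NOT the paper's structural encoding, only an illustration of the
# filter principle: cheap vector distance first, exact edit distance later.
from collections import Counter

def label_histogram(tree):
    """tree is a nested tuple: (label, [children])."""
    label, children = tree
    h = Counter([label])
    for c in children:
        h.update(label_histogram(c))
    return h

def l1_lower_bound(t1, t2):
    """L1 distance of label histograms, halved: a valid edit-distance lower
    bound, since one edit changes the L1 distance by at most 2."""
    h1, h2 = label_histogram(t1), label_histogram(t2)
    keys = set(h1) | set(h2)
    return sum(abs(h1[k] - h2[k]) for k in keys) / 2
```

In a filter-and-refine pipeline, candidates whose lower bound already exceeds the query threshold are pruned without ever computing the exact edit distance.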
A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 2007
Cited by 25 (12 self)
Abstract:
We should not have to look at the entire corpus (e.g., the Web) to know whether two (or more) words are strongly associated. One can often obtain estimates of associations from a small sample. We develop a sketch-based algorithm that constructs a contingency table for a sample. One can estimate the contingency table for the entire population using straightforward scaling. However, one can do better by taking advantage of the margins (also known as document frequencies). The proposed method cuts the errors roughly in half over Broder’s sketches.
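For context, the Broder-style min-wise sketches that the paper improves on can be illustrated as follows (toy code; the MD5-based hash family and function names are assumptions, not the paper's construction):

```python
# Min-wise sketch of a word's posting list (set of document IDs): keep the
# minimum of the hashed IDs under k independent hash functions. Two words'
# sketches agree at position i with probability equal to the Jaccard
# resemblance of their posting lists.
import hashlib

def h(x, salt):
    """One member of a family of hash functions, indexed by salt."""
    return int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16)

def minhash_sketch(doc_ids, k=50):
    return [min(h(d, i) for d in doc_ids) for i in range(k)]

def estimate_resemblance(s1, s2):
    """Fraction of agreeing sketch positions estimates Jaccard resemblance."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)
```

The sketches are tiny relative to the posting lists, which is what makes association estimates possible without scanning the whole corpus.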
Algorithms for Storytelling. In Proc. KDD’06, 2006
Cited by 22 (12 self)
Abstract:
We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next-move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between gene sets in a bioinformatics dataset, and relating publications in the PubMed index of abstracts.
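A minimal sketch of the chaining idea (an assumed formulation using plain BFS over a set vocabulary with a Jaccard threshold, not the CARTwheels/A* implementation):

```python
# "Story" between two disjoint sets: a chain through vocabulary sets in which
# each adjacent pair overlaps with Jaccard similarity >= theta. BFS finds the
# shortest such chain; the real system uses A* with redescription operators.
from collections import deque

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def story(start, end, vocabulary, theta=0.2):
    queue = deque([[start]])
    seen = {frozenset(start)}
    while queue:
        path = queue.popleft()
        if jaccard(path[-1], end) >= theta:
            return path + [end]
        for s in vocabulary:
            fs = frozenset(s)
            if fs not in seen and jaccard(path[-1], s) >= theta:
                seen.add(fs)
                queue.append(path + [s])
    return None  # no chain of approximate redescriptions exists
```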
Efficient similarity search for market basket data. VLDB Journal, 2002
Cited by 17 (4 self)
Abstract:
Several organizations have developed very large market basket databases for the maintenance of customer transactions. New applications, e.g., Web recommendation systems, present the requirement for processing similarity queries in market basket databases. In this paper, we propose a novel scheme for similarity search queries in basket data. We develop a new representation method which, in contrast to existing approaches, is proven to provide correct results. New algorithms are proposed for the processing of similarity queries. Extensive experimental results, for a variety of factors, illustrate the superiority of the proposed scheme over the state-of-the-art method.
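A naive baseline for such queries (illustrative only; not the paper's representation method) indexes baskets by item and ranks candidate baskets by Jaccard similarity to the query basket:

```python
# Inverted index on items -> basket IDs; a k-NN query scores only baskets
# that share at least one item with the query. Names are illustrative.
from collections import defaultdict

def build_index(baskets):
    index = defaultdict(set)
    for bid, basket in enumerate(baskets):
        for item in basket:
            index[item].add(bid)
    return index

def knn(query, baskets, index, k=2):
    candidates = set()
    for item in query:
        candidates |= index.get(item, set())
    def score(bid):
        b = baskets[bid]
        return len(query & b) / len(query | b)
    return sorted(candidates, key=lambda bid: -score(bid))[:k]
```

The inverted index avoids scanning baskets that cannot possibly be similar, but unlike the paper's scheme it offers no pruning beyond the shared-item filter.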
Clustering categorical data. In Proc. of ICDE’00, 2000
Cited by 17 (0 self)
Abstract:
In this paper we propose two methods for clustering categorical data. The first method is based on a dynamical systems approach; the second is based on a graph partitioning approach. Dynamical systems approaches for clustering categorical data have been studied by some authors [1]. However, the proposed dynamic algorithm cannot guarantee convergence, so that execution may get into an infinite loop even for very simple data. We define a new configuration-updating algorithm for clustering categorical data sets. Consider a relational table with k fields, each of which can assume one of a number of possible values. We represent each possible value in each field by an abstract node, denoting the nodes by v_i (i = 1, …, m). A configuration is an assignment of a weight w_i to each node v_i. The new algorithm updates a configuration W as follows: create a temporary configuration W′ with weights w′_1, …, w′_m; for each weight w_{u_i} ∈ W, for each tuple τ = {v_{u_1}, v_{u_2}, …, v_{u_k}} containing v_{u_i}, compute x = w_{u_1} + ⋯ + ŵ_{u_i} + ⋯ + w_{u_k} (the hat denoting omission of that term) and set w′_{u_i} = Σ_τ x.
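A hedged reconstruction of this update rule in code (the L2 normalization step is an assumption; the extracted text only shows the summation):

```python
# Configuration update for categorical clustering: each node's new weight is
# the sum, over tuples containing the node, of the other weights in the tuple,
# followed by normalization (assumed here to be L2) to keep weights bounded.
def update(weights, tuples):
    new = {}
    for v in weights:
        total = 0.0
        for t in tuples:
            if v in t:
                total += sum(weights[u] for u in t if u != v)
        new[v] = total
    norm = sum(x * x for x in new.values()) ** 0.5 or 1.0
    return {v: x / norm for v, x in new.items()}
```

Iterating this map propagates weight along co-occurrence, so values that appear together in many tuples drift toward similar weights — the basis for reading clusters off the converged configuration.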
On indexing error-tolerant set containment. In SIGMOD, 2010
Cited by 7 (0 self)
Abstract:
Prior work has identified set-based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function, an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that reflect synonyms, such as “Bob” and “Robert” referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token sets, and a lookup algorithm using our index that is sensitive to the output size of the query. Our experiments over real-life data sets show the benefits of our techniques. To our knowledge, this is the first paper that studies the indexing problem for Jaccard containment in the presence of string transformations.
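The containment function itself is simple; a minimal sketch (the synonym-expansion semantics and all names are assumptions, not the paper's index or transformation framework) follows:

```python
# Jaccard containment of query Q in record R: |Q ∩ R| / |Q| -- asymmetric,
# unlike plain Jaccard similarity. A synonym map expands the record's tokens
# before matching, so "bob" in the query can match "robert" in the record.
def expand(tokens, synonyms):
    out = set()
    for t in tokens:
        out.add(t)
        out.update(synonyms.get(t, ()))
    return out

def containment(query, record, synonyms=None):
    r = expand(record, synonyms or {})
    return len(set(query) & r) / len(query) if query else 0.0
```

The asymmetry is the point: a short query fully contained in a long record scores 1.0 even though their symmetric Jaccard similarity is low.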
On the Use of Constrained Associations for Web Log Mining, 2002
Cited by 6 (0 self)
Abstract:
In this paper, we first present an approach based on association rule mining. Our algorithm discovers association rules that are constrained (and ordered) temporally. The approach relies on the simple premise that pages accessed recently have a greater influence on pages that will be accessed in the near future. The approach not only results in better predictions; it also prunes the rule space significantly, thus enabling faster online prediction. Further refinements based on sequential dominance are also evaluated, and prove to be quite effective. A detailed experimental evaluation shows that the approach is quite effective in capturing a web user's access patterns; consequently, our prediction model not only has good prediction accuracy but is also more efficient in terms of space and time complexity. The approach is also likely to generalize to e-commerce recommendation systems.
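The temporal-ordering premise can be illustrated with a toy counter (the window size `w` and all names are assumptions, not the paper's algorithm): only page pairs where B follows A within the last `w` views are counted as candidate rules "A then B".

```python
# Count temporally ordered page pairs within a recency window of a session.
# Ordering prunes the rule space: (A, B) and (B, A) are distinct, and pairs
# farther apart than w views are never counted.
from collections import Counter

def ordered_pairs(session, w=2):
    counts = Counter()
    for i, a in enumerate(session):
        for b in session[i + 1 : i + 1 + w]:
            counts[(a, b)] += 1
    return counts
```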
Redescription Mining: Algorithms and Applications in Bioinformatics, 2007
Cited by 5 (0 self)
Abstract:
Scientific data mining purports to extract useful knowledge from massive datasets curated through computational science efforts, e.g., in bioinformatics, cosmology, the geographic sciences, and computational chemistry. In the recent past, we have witnessed major transformations of these applied sciences into data-driven endeavors. In particular, scientists are now faced with an overload of vocabularies for describing domain entities. All of these vocabularies offer alternative and mostly complementary (sometimes even contradictory) ways to organize information, and each vocabulary provides a different perspective into the problem being studied. To further knowledge discovery, computational scientists need tools to help uniformly reason across vocabularies, integrate multiple forms of characterizing datasets, and situate knowledge gained from one study in terms of others. This dissertation defines a new pattern class called redescriptions that provides high-level capabilities for reasoning across domain vocabularies. A redescription is a shift of vocabulary, or a different way of communicating the same information; redescription mining finds concerted sets of objects that can be defined in (at least) two ways using given descriptors. We present
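A toy illustration of the core definition (an assumed conjunctive-descriptor formulation, not the dissertation's algorithms): a redescription is a pair of descriptors from different vocabularies that induce the same object set.

```python
# Objects map to feature sets; a descriptor is a conjunction of features.
# Two descriptors form a redescription when their extensions coincide.
def extension(descriptor, objects):
    """Objects satisfying every feature in the descriptor."""
    return {o for o, feats in objects.items() if descriptor <= feats}

def is_redescription(d1, d2, objects):
    return extension(d1, objects) == extension(d2, objects)
```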