Results 11-20 of 2,984
Hierarchical Document Clustering Using Frequent Itemsets
 In Proc. SIAM International Conference on Data Mining 2003 (SDM 2003), 2003
Cited by 126 (3 self)
A major challenge in document clustering is the extremely high dimensionality. For example, the vocabulary for a document set can easily be thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition of our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, for the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
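The idea of identifying a cluster by a set of words shared across its documents can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the brute-force enumeration of word sets (a real miner would prune candidates, as association rule mining does), and the rule that assigns each document to the largest frequent itemset it contains. None of it is the scoring function or tree construction used in the paper.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(docs, min_support, max_size=2):
    """Map each frequent word set to the indices of the documents containing it.

    A word set is frequent when it occurs in at least `min_support`
    (a fraction between 0 and 1) of the documents. Enumeration is brute
    force, so this is only suitable for tiny examples.
    """
    doc_words = [set(d.lower().split()) for d in docs]
    cover = defaultdict(set)
    for idx, words in enumerate(doc_words):
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                cover[frozenset(combo)].add(idx)
    n = len(doc_words)
    return {s: ids for s, ids in cover.items() if len(ids) / n >= min_support}


def assign_clusters(docs, itemsets):
    """Hypothetical assignment rule: largest contained itemset wins.

    Ties are broken alphabetically so the result is deterministic.
    """
    clusters = defaultdict(list)
    for idx, doc in enumerate(docs):
        words = set(doc.lower().split())
        candidates = [s for s in itemsets if s <= words]
        if candidates:
            best = max(candidates, key=lambda s: (len(s), sorted(s)))
            clusters[best].append(idx)
    return dict(clusters)


docs = [
    "apple banana fruit",
    "apple banana juice",
    "car engine wheel",
    "car engine oil",
]
itemsets = frequent_itemsets(docs, min_support=0.5)
clusters = assign_clusters(docs, itemsets)
```

With a support threshold of 0.5, only six word sets survive out of the eight distinct words, which is the dimensionality reduction the abstract refers to; documents 0 and 1 end up under {apple, banana} and documents 2 and 3 under {car, engine}.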
Mining Frequent Patterns with Counting Inference
 Sigkdd Explorations
, 2000
Cited by 112 (9 self)
... we propose the algorithm PASCAL which introduces a novel optimization of the well-known algorithm Apriori. This optimization is based on a new strategy called pattern counting inference that relies on the concept of key patterns. We show that the support of frequent non-key patterns can be inferred from frequent key patterns without accessing the database. Experiments comparing PASCAL to the three algorithms Apriori, Close and Max-Miner show that PASCAL is among the most efficient algorithms for mining frequent patterns.
deltaClusters: Capturing Subspace Correlation in a Large Data Set
 Proc. of 18th IEEE Intern. Conf. on Data Engineering
, 2002
Cited by 110 (4 self)
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bioinformatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bioinformatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. Here, we introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce near-optimal clustering results. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
Clustering aggregation
 in ICDE 2005, 2005
Cited by 110 (2 self)
We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a meta-clustering method to improve the robustness of clusterings. The problem formulation does not require a priori information about the number of clusters, and it gives a natural way of handling missing values. We give a formal statement of the clustering-aggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.
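One way to make "agrees as much as possible" concrete is to count the pairs of objects on which two clusterings disagree, that is, pairs that one clustering places together and the other separates. The sketch below assumes that pairwise measure and a deliberately simple baseline that returns whichever input clustering has the lowest total disagreement with all the inputs. The function names are hypothetical, and the baseline is shown only for illustration, not as one of the algorithms the abstract refers to.

```python
from itertools import combinations


def disagreement(c1, c2):
    """Count object pairs co-clustered by exactly one of the two clusterings.

    Each clustering is a list of labels, one per object; only equality of
    labels within a clustering matters, not the label values themselves.
    """
    if len(c1) != len(c2):
        raise ValueError("clusterings must cover the same objects")
    return sum(
        (c1[i] == c1[j]) != (c2[i] == c2[j])
        for i, j in combinations(range(len(c1)), 2)
    )


def total_disagreement(candidate, clusterings):
    """Sum of pairwise disagreements between a candidate and every input."""
    return sum(disagreement(candidate, c) for c in clusterings)


def best_of_inputs(clusterings):
    """Baseline aggregate: the input that disagrees least with all inputs."""
    return min(clusterings, key=lambda c: total_disagreement(c, clusterings))


# Three categorical attributes over four rows, each read as a clustering.
inputs = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 1, 2, 2],
]
consensus = best_of_inputs(inputs)
```

Because the measure depends only on whether two objects share a label, the inputs may use different numbers of clusters, which matches the abstract's remark that the number of clusters need not be known in advance.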
Rotation forest: A new classifier ensemble method
 IEEE TRANS. PATTERN ANALYSIS AND MACHINE INTELLIGENCE
, 2006
Cited by 104 (5 self)
We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name “forest.” Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier. Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest. The results were favorable to Rotation Forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than those in AdaBoost and Random Forest, and more diverse than those in Bagging, sometimes more accurate as well.
Clustering Intrusion Detection Alarms to Support Root Cause Analysis
 ACM Transactions on Information and System Security
, 2003
Cited by 96 (0 self)
It is a well-known problem that intrusion detection systems overload their human operators by triggering thousands of alarms per day. This paper presents a new approach for handling intrusion detection alarms more efficiently. Central to this approach is the notion that each alarm occurs for a reason, which is referred to as the alarm's root cause. This paper observes that a few dozen rather persistent root causes generally account for over 90% of the alarms that an intrusion detection system triggers. Therefore, we argue that alarms should be handled by identifying and removing the most predominant and persistent root causes. To make this paradigm practicable, we propose a novel alarm-clustering method that supports the human analyst in identifying root causes. We present experiments with real-world intrusion detection alarms to show how alarm clustering helped us identify root causes. Moreover, we show that the alarm load decreases quite substantially if the identified root causes are eliminated so that they can no longer trigger alarms in the future.
Streaming-Data Algorithms for High-Quality Clustering
, 2001
Cited by 95 (1 self)
As data gathering grows easier, and as researchers discover new ways to interpret data, streaming-data algorithms have become essential in many fields. Data stream computation precludes algorithms that require random access or large memory. In this paper, we consider the problem of clustering data streams, which is important in the analysis of a variety of sources of data streams, such as routing data, telephone records, web documents, and clickstreams. We provide a new clustering algorithm with theoretical guarantees on its performance. We give empirical evidence of its superiority over the commonly used k-Means algorithm. We then adapt our algorithm to be able to operate on data streams and experimentally demonstrate its superior performance in this context.
Closure-Tree: An Index Structure for Graph Queries
, 2006
Cited by 88 (1 self)
Graphs have become popular for modeling structured data. As a result, graph queries are becoming common and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph mapping methods. We implement two kinds of similarity queries: K-NN query and range query. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.
Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification
 In Proceedings of the 4th SIAM International Conference on Data Mining
, 2004
Cited by 87 (1 self)
... analysis technique that has found applications in various areas. In this paper, we study some multivariate statistical analysis methods in the Secure 2-party Computation (S2C) framework, illustrated by the following scenario: two parties, each having a secret data set, want to conduct the statistical analysis on their joint data, but neither party is willing to disclose its private data to the other party or any third party. The current statistical analysis techniques cannot be used directly to support this kind of computation because they require all parties to send the necessary data to a central place. In this paper, we define two Secure 2-party multivariate statistical analysis problems: the Secure 2-party Multivariate Linear Regression problem and the Secure 2-party Multivariate Classification problem. We have developed a practical security model, based on which we have developed a number of building blocks for solving these two problems.
Privacy Preserving Clustering By Data Transformation
 IN PROC. OF THE 18TH BRAZILIAN SYMPOSIUM ON DATABASES
, 2003
Cited by 77 (3 self)
Despite their benefits in a wide range of applications, data mining techniques have also raised a number of ethical issues, including privacy, data security, intellectual property rights, and many others. In this paper, we address the privacy problem against unauthorized secondary use of information. To do so, we introduce a family of geometric data transformation methods (GDTMs) which ensure that the mining process will not violate privacy up to a certain degree of security. We focus primarily on privacy preserving data clustering, notably on partition-based and hierarchical methods. Our proposed methods distort only confidential numerical attributes to meet privacy requirements, while preserving general features for clustering analysis. Our experiments demonstrate that our methods are effective and provide acceptable values in practice for balancing privacy and accuracy. We report the main results of our performance evaluation and discuss some open research issues.
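The abstract does not say which geometric transformations the GDTM family contains, so the sketch below assumes one plausible member: rotating a pair of confidential numerical attributes by a fixed angle. The function names, the example records, and the choice of angle are all hypothetical. The point it illustrates is that a rotation changes every stored value while leaving Euclidean distances between records unchanged, which is the kind of property a distance-based clustering method relies on.

```python
import math


def rotate_pair(records, i, j, theta_degrees):
    """Return copies of the records with attributes i and j rotated about the origin.

    Other attributes are left untouched. The angle plays the role of a
    secret parameter in this illustration.
    """
    t = math.radians(theta_degrees)
    c, s = math.cos(t), math.sin(t)
    rotated = []
    for record in records:
        row = list(record)
        x, y = row[i], row[j]
        row[i] = c * x - s * y
        row[j] = s * x + c * y
        rotated.append(row)
    return rotated


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length records."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical records: (age, salary in thousands, number of purchases).
records = [
    [25.0, 50.0, 3.0],
    [40.0, 82.0, 7.0],
    [31.0, 61.0, 4.0],
]
distorted = rotate_pair(records, 0, 1, theta_degrees=30.0)
```

Comparing `euclidean(records[0], records[1])` with `euclidean(distorted[0], distorted[1])` gives the same value up to floating-point error, even though the age and salary columns no longer hold their original values.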