O-cluster: scalable clustering of large high dimensional data sets
- In Data Mining, Proceedings from the IEEE International Conference on
, 2002
Cited by 8 (2 self)
Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to address either handling data sets with a very large number of records or data sets with a very high number of dimensions. This paper provides a discussion of the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the “curse of dimensionality” and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster’s excellent scalability.
The challenges of clustering high-dimensional data
- In New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition
, 2003
Cited by 8 (0 self)
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high-dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.
Hypergraph Models and Algorithms for Data-Pattern Based Clustering
- DATA MINING AND KNOWLEDGE DISCOVERY
, 2004
Cited by 6 (1 self)
In traditional approaches for clustering market-basket-type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns by modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph-partitioning-based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. We therefore propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
A Requirements Analysis for Parallel KDD Systems
, 2000
Cited by 6 (0 self)
The current generation of data mining tools has limited capacity and performance, since these tools tend to be sequential. This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelize data mining. Our analysis shows that parallel data mining solutions require the following components: parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service and pricing metrics. State-of-the-art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.
A subspace clustering framework for research group collaboration
- International Journal of Information Technology and Web Engineering
, 2006
Cited by 5 (1 self)
Researchers spend considerable time searching for relevant papers on the topic in which they are currently interested. Often, despite having similar interests, researchers in the same lab do not find it convenient to share results of bibliographic searches and thus conduct independent time-consuming searches. Research paper recommender systems can help the researcher avoid such time-consuming searches by allowing each researcher to automatically take advantage of previous searches performed by others in the lab. Existing recommender systems were developed for commercial domains to direct users towards products that interest them. Unlike those domains, the research paper domain has relatively few users when compared with the huge number of research papers. In this paper we present a novel system to recommend relevant research papers to a user based on the user’s recent querying and browsing habits. The core of the system is a scalable subspace clustering algorithm (SCuBA) that performs well on the sparse, high-dimensional data collected in this domain. Both synthetic and benchmark datasets are used to evaluate the recommendation system and to demonstrate that it performs better than the traditional collaborative filtering approaches when recommending research papers.
Distance Based Subspace Clustering with Flexible Dimension Partitioning
Cited by 3 (1 self)
Traditional similarity or distance measurements usually become meaningless when the dimensions of the datasets increase, which has detrimental effects on clustering performance. In this paper, we propose a distance-based subspace clustering model, called nCluster, to find groups of objects that have similar values on subsets of dimensions. Instead of using a grid-based approach to partition the data space into non-overlapping rectangular cells as in the density-based subspace clustering algorithms, the nCluster model uses a more flexible method to partition the dimensions to preserve meaningful and significant clusters. We develop an efficient algorithm to mine only maximal nClusters. A set of experiments is conducted to show the efficiency of the proposed algorithm and the effectiveness of the new model in preserving significant clusters.
The framework for approximate queries on simulation data
- Elsevier Sciences
, 2003
Cited by 3 (2 self)
AQSim is a system intended to enable scientists to query and analyze a large volume of scientific simulation data. The system uses state-of-the-art approximate query processing techniques to build a novel framework for progressive data analysis. These techniques are used to define a multi-resolution index, where each node contains multiple models of the data. The benefits of these models are two-fold: 1) they are compact representations, reconstructing only the information relevant to the analysis, and 2) the variety of models captures different aspects of the data which may be of interest to the user but are not readily apparent in their raw form. To be able to deal with the data interactively, AQSim allows the scientist to make an informed tradeoff between query response accuracy and time. In this paper, we present the framework of AQSim with a focus on its architectural design. We also show the results from an initial proof-of-concept prototype developed at LLNL. The presented framework is generic enough to handle more than just simulation data.
Research Paper Recommender Systems: A Subspace Clustering Approach
- IN INTERNATIONAL CONFERENCE ON WEB-AGE INFORMATION MANAGEMENT (WAIM)
, 2005
Cited by 2 (0 self)
Researchers from the same lab often spend a considerable amount of time searching for published articles relevant to their current project. Despite having similar interests, they conduct independent, time-consuming searches. While they may share the results afterwards, they are unable to leverage previous search results during the search process. We propose a research paper recommender system that avoids such time-consuming searches by augmenting existing search engines with recommendations based on previous searches performed by others in the lab. Most existing recommender systems were developed for commercial domains with millions of users. The research paper domain has relatively few users compared to the large number of online research papers. The two major challenges with this type of data are the large number of dimensions and the sparseness of the data. The novel contribution of the paper is a scalable subspace clustering algorithm (SCuBA) that tackles these problems. Both synthetic and benchmark datasets are used to evaluate the clustering algorithm and to demonstrate that it performs better than the traditional collaborative filtering approaches when recommending research papers.
QROCK: A Quick Version of the ROCK Algorithm for Clustering of Categorical Data
Cited by 2 (0 self)
The ROCK algorithm is an agglomerative hierarchical clustering algorithm for clustering categorical data [9]. In this paper we prove that under certain conditions, the final clusters obtained by the algorithm are nothing but the connected components of a certain graph with the input data-points as vertices. We propose a new algorithm, QROCK, which computes the clusters by determining the connected components of the graph. This leads to a very efficient method of obtaining the clusters, giving a drastic reduction of the computing time of the ROCK algorithm. We also argue that it is more practical to specify the similarity threshold than to specify the desired number of clusters a priori. The QROCK algorithm also detects the outliers in this process. We also discuss a new similarity measure for categorical attributes.
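The connected-components idea in the QROCK abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the graph, so the sketch assumes an edge joins two records whenever their Jaccard similarity is at least a threshold `theta`, and it uses a plain union-find structure to collect components. The names `jaccard`, `threshold_components`, and `theta` are hypothetical, and treating single-record components as outlier candidates is likewise an assumption.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two categorical records viewed as item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_components(records, theta):
    """Return the connected components (as sorted index lists) of the graph
    that links two records when their similarity is at least theta.
    Assumption: this graph stands in for the one the abstract refers to."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; fine for a small illustration.
    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= theta:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [frozenset(r) for r in (
    {"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
    {"x", "y"}, {"x", "y", "z"}, {"q"},
)]
clusters = threshold_components(records, theta=0.5)
# Single-record components are treated here as outlier candidates (assumption).
outliers = [c[0] for c in clusters if len(c) == 1]
print(clusters, outliers)
```

In this sketch only the threshold is supplied; the number of clusters falls out of the component structure, which mirrors the abstract's point about specifying a similarity threshold instead of a cluster count.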
LogView: Visualizing Event Log Clusters
Cited by 2 (2 self)
Event logs or log files form an essential part of any network management and administration setup. While log files are invaluable to a network administrator, the vast amount of data they sometimes contain can be overwhelming and can sometimes hinder rather than facilitate the tasks of a network administrator. For this reason several event clustering algorithms for log files have been proposed, one of which is the event clustering algorithm proposed by Risto Vaarandi, on which his Simple Log file Clustering Tool (SLCT) is based. The aim of this work is to develop a visualization tool that can be used to view log files based on the clusters produced by SLCT. The proposed visualization tool, which is called LogView, utilizes treemaps to visualize the hierarchical structure of the clusters produced by SLCT. Our results based on different application log files show that LogView can ease the summarization of the vast amount of data contained in the log files. This in turn can help to speed up the analysis of event data in order to detect any security issues on a given application.

