A Data-Clustering Algorithm On Distributed Memory Multiprocessors
- In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, 2000
Cited by 79 (1 self)
To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops. Keywords: k-means, data mining, massive data sets, message-passing, text mining. 1 Introduction Data sets measuring in gigabytes and even terabytes are now quite common in data and text minin...
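As an illustration of the data-parallelism this abstract refers to, the following minimal Python sketch simulates one k-means iteration over partitioned data. It is not the authors' implementation: the function names `local_stats` and `parallel_kmeans_step` are hypothetical, the partitions are processed in a plain loop rather than on separate nodes, and the summation of partial results only stands in for the message-passing reduction a real implementation would use.

```python
import numpy as np

def local_stats(points, centroids):
    """One worker's share (hypothetical helper): assign each local point to
    its nearest centroid and return per-cluster coordinate sums and counts."""
    k, d = centroids.shape
    # Squared Euclidean distance from every point to every centroid.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, points)   # unbuffered per-cluster accumulation
    np.add.at(counts, labels, 1)
    return sums, counts

def parallel_kmeans_step(partitions, centroids):
    """Simulate one iteration: each partition computes local statistics,
    then a global sum (standing in for an all-reduce) gives new centroids."""
    stats = [local_stats(p, centroids) for p in partitions]
    total_sums = sum(s for s, _ in stats)
    total_counts = sum(c for _, c in stats)
    new_centroids = centroids.astype(float)  # astype returns a copy
    nonempty = total_counts > 0              # empty clusters keep their centroid
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids
```

Only the per-cluster sums and counts are combined across partitions in this sketch, so the amount of data exchanged depends on the number of clusters and dimensions rather than on the number of points. For reference, the reported relative speedup of 15.62 on 16 nodes corresponds to a parallel efficiency of 15.62 / 16 ≈ 0.976.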
MineBench: A Benchmark Suite for Data Mining Workloads
- 2006 IEEE International Symposium on Workload Characterization, 2006
Cited by 35 (4 self)
Data mining constitutes an important class of scientific and commercial applications. Recent advances in data extraction techniques have created vast data sets, which require increasingly complex data mining algorithms to sift through them to generate meaningful information. The disproportionately slower rate of growth of computer systems has led to a sizeable performance gap between data mining systems and algorithms. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. With this knowledge, current computer architectures can be optimized for data mining applications. In this paper, we present MineBench, a publicly available benchmark suite containing fifteen representative data mining applications belonging to various categories such as clustering, classification, and association rule mining. We believe that MineBench will be of use to those looking to characterize and accelerate data mining workloads.
Parallel Formulations of Decision-Tree Classification Algorithms
- DATA MINING AND KNOWLEDGE DISCOVERY: AN INTERNATIONAL JOURNAL, 1998
Cited by 30 (1 self)
Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on the Synchronous Tree Construction Approach and the other is based on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide the analysis of the cost of computation and communication of the proposed hybr...
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
- 2008
Cited by 28 (4 self)
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
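The idea of estimating the long-term impact of each available action can be illustrated with a generic tabular Q-learning update, sketched below in Python. This is a textbook reinforcement-learning rule chosen for illustration, not the controller design from the paper: the class name `TabularQScheduler`, the state labels, the action names, and the values of `alpha` and `gamma` are all assumptions.

```python
from collections import defaultdict

class TabularQScheduler:
    """Hypothetical scheduler that keeps one value estimate per
    (state, action) pair and refines it from observed rewards."""

    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.actions = list(actions)
        self.alpha = alpha            # learning rate (assumed value)
        self.gamma = gamma            # discount factor (assumed value)
        self.q = defaultdict(float)   # unseen pairs start at 0.0

    def best_action(self, state):
        """Pick the action with the highest estimated long-term value."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update toward reward + discounted best next value."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

A caller would invoke `update` after each scheduling decision with whatever reward signal it defines, then use `best_action` (usually mixed with some exploration) to choose the next command.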
Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance
- In Proceedings of the second SIAM conference on Data Mining, 2002
Cited by 22 (7 self)
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms.
Strategies for parallel data mining
- IEEE Concurrency, 1999
Cited by 18 (1 self)
We present a set of cost measures that can be applied to parallel algorithms to predict their computation, data access, and communication performance. These measures make it possible to compare different possible parallel implementation strategies for data mining techniques without the necessity to benchmark each one. We give general cost expressions for three common parallelizing strategies, and show how to instantiate these cost expressions for a particular technique, neural networks.
A parallel scalable infrastructure for OLAP and data mining
- In International Database Engineering and Application Symposium, 1999
Cited by 14 (0 self)
Decision support systems are important in leveraging information present in data warehouses in businesses like banking, insurance, retail and health-care among many others. The multi-dimensional aspects of a business can be naturally expressed using a multi-dimensional data model. Data analysis and data mining on these warehouses pose new challenges for traditional database systems. OLAP and data mining operations require summary information on these multi-dimensional data sets. Query processing for these applications requires different views of data for analysis and effective decision making. Data mining techniques can be applied in conjunction with OLAP for an integrated business solution. As data warehouses grow, parallel processing techniques have been applied to enable the use of larger data sets and reduce the time for analysis, thereby enabling evaluation of many more options for decision making. In this paper we address (1) scalability in multidimensional systems for OLAP and multi-dimensional analysis, (2) integration of data mining with the OLAP framework, and (3) high performance by using parallel processing for OLAP and data mining. We describe our system PARSIMONY (Parallel and Scalable Infrastructure for Multidimensional Online analytical processing). This platform is used both for OLAP and data mining. Sparsity of data sets is handled with sparse chunks that use a bit-encoded sparse structure for compression, which enables aggregate operations on compressed data. Techniques for effectively using summary information available in data cubes for data mining are presented for mining Association rules and decision-tree based Classification. These take advantage of the data organization provided by the multidimensional data model. Performance results for high dimensional data sets on a distributed memory parallel machine (IBM SP-2) show good speedup and scalability.
PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining
- Journal of Parallel and Distributed Computing, 2001
Cited by 13 (0 self)
Multidimensional analysis and online analytical processing (OLAP) operations require summary information on multidimensional data sets. Most common are aggregate operations along one or more dimensions of numerical data values. Simultaneous calculation of multidimensional aggregates is provided by the Data Cube operator, used to calculate and store summary information on a number of dimensions. This is computed only partially if the number of dimensions is large. Query processing for these applications requires different views of data to gain insight and for effective decision support. Queries may either be answered from a materialized cube in the data cube or calculated on the fly. The multidimensionality of the underlying problem can be represented both in relational and in multidimensional databases, the latter being a better fit when query performance is the criterion for judgment. Relational databases are scalable in size for OLAP and multidimensional analysis, and efforts are under way to make their performance acceptable. On the other hand, multidimensional databases have proven to provide good performance for such queries, although they are not very scalable. In this article we address (1) scalability in multidimensional systems for OLAP and multidimensional analysis and (2) integration of data mining with the OLAP framework. We describe our ...
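To make the Data Cube operator concrete, the short Python sketch below computes sum aggregates for every subset of dimensions of a small dense array. It only illustrates the general idea of simultaneous multidimensional aggregation; it is not PARSIMONY's algorithm, it ignores sparsity and parallelism, and the function name `data_cube` and the dense NumPy representation are assumptions.

```python
from itertools import combinations
import numpy as np

def data_cube(fact):
    """Return {kept_dims: aggregate} for every subset of dimensions,
    summing the measure over the dimensions that are dropped."""
    dims = range(fact.ndim)
    cube = {}
    for r in range(fact.ndim + 1):
        for kept in combinations(dims, r):
            dropped = tuple(d for d in dims if d not in kept)
            # With nothing dropped, the "aggregate" is the base data itself.
            cube[kept] = fact.sum(axis=dropped) if dropped else fact.copy()
    return cube
```

For an array with n dimensions this produces 2^n aggregates, which is why, as the abstract notes, the cube is computed only partially when the number of dimensions is large.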
Performance and Memory-Access Characterization of Data Mining Applications
- WORKSHOP HELD IN CONJUNCTION WITH THE 31ST ANNUAL INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 1998
Cited by 13 (1 self)
This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution-driven simulator, for three uniprocessor models that exploit instruction level parallelism to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: 50% to 250% improvement in IPC for out-of-order dispatch versus in-order dispatch, and 5% to 120% improvement in IPC for four-way issue versus single issue. Multiple issue provides a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch provides a greater performance improvement for smaller L2 cache sizes. The program has a very small instruction footprint: for an 8-kB L1 instruction cache the instruction miss rate is below 0.1%. A small (8 kB) L1 data cache is sufficient to capture most of the locality of the data references, resulting in L1 miss rates between 10% and 20%. Increasing the size of the L2 data cache does not significantly improve performance until a significant fraction (over 1/4) of the dataset fits into the L2 cache. Lastly, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing the results for smaller datasets to be used to predict the performance of full-sized datasets.
High-Performance Data Mining with Skeleton-based Structured Parallel Programming
- PARALLEL COMPUTING, SPECIAL ISSUE ON PARALLEL DATA INTENSIVE COMPUTING, 2001
Cited by 13 (4 self)
We show how to apply a Structured Parallel Programming methodology based on skeletons to Data Mining problems, reporting several results about three commonly used mining techniques, namely association rules, decision tree induction and spatial clustering. We analyze the structural patterns common to these applications, looking at application performance and software engineering efficiency. Our aim is to clearly state what features a Structured Parallel Programming Environment (PPE) should have to be useful for parallel Data Mining. Within the skeleton-based PPE SkIE that we have developed, we study the different patterns of data access of parallel implementations of Apriori, C4.5 and DBSCAN. We need to address large partition reads, frequent and sparse access to small blocks, as well as an irregular mix of small and large transfers, to allow efficient development of applications on huge databases. We examine the addition of an object/component interface to the skeleton structured model, to simplify the development of environment-integrated, parallel Data Mining applications.

