Results 1 - 10 of 29
Collective Data Mining: A New Perspective Toward Distributed Data Analysis
Advances in Distributed and Parallel Knowledge Discovery, 1999
Cited by 105 (15 self)
This paper introduces the collective data mining (CDM) framework, a new approach toward distributed data mining (DDM) from heterogeneous sites. It points out that naive approaches to distributed data analysis in a heterogeneous environment may result in ambiguous or incorrect global data models. It also notes that any function can be expressed in a distributed fashion using a set of appropriate basis functions, and that orthogonal basis functions can be used effectively to develop a general DDM framework that guarantees correct local analysis and correct aggregation of local data models with minimal data communication. This paper develops the foundation of CDM, discusses decision tree learning and polynomial regression in CDM for discrete and continuous variables, and describes BODHI, a CDM-based experimental system for distributed knowledge discovery. 1 Introduction Distributed data mining (DDM) is a fast growing area that deals with the problem of finding data patterns in a...
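The basis-function claim in this abstract can be illustrated with a standard identity. The following is a minimal sketch in assumed notation (the symbols f, \psi_k, w_k, and the index set \Xi are chosen here for illustration and are not taken from the paper): a function on a finite domain is written as a weighted sum of basis functions, and when the basis is orthonormal each weight can be computed independently of the others.

```latex
% Assumed notation: f is the target function on a finite domain,
% \{\psi_k\}_{k \in \Xi} is a set of basis functions, w_k are the coefficients.
f(x) = \sum_{k \in \Xi} w_k \, \psi_k(x)

% If the basis is orthonormal,
\sum_{x} \psi_j(x) \, \psi_k(x) = \delta_{jk},

% then each coefficient follows from an inner product with f alone:
w_k = \sum_{x} f(x) \, \psi_k(x).
```

How the coefficients are estimated at individual sites and then aggregated is the subject of the paper itself and is not reproduced here.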
Distributed Data Mining: Algorithms, Systems, and Applications
2002
Cited by 70 (5 self)
This paper presents a brief overview of DDM algorithms, systems, applications, and emerging research directions. The paper is organized as follows. We first present the related research of DDM and illustrate data distribution scenarios. Then DDM algorithms are reviewed. Subsequently, the architectural issues in DDM systems and future directions are discussed.
Distributed Clustering Using Collective Principal Component Analysis
Knowledge and Information Systems, 1999
Cited by 65 (9 self)
This paper considers distributed clustering of high dimensional heterogeneous data using a distributed Principal Component Analysis (PCA) technique called the Collective PCA. It presents the Collective PCA technique, which can be used independently of the clustering application. It shows a way to integrate the Collective PCA with a given off-the-shelf clustering algorithm in order to develop a distributed clustering technique. It also presents experimental results using different test data sets, including an application for web mining.
Collective, Hierarchical Clustering from Distributed, Heterogeneous Data
1999
Cited by 53 (8 self)
This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. This algorithm first generates local cluster models and then combines them to generate the global cluster model of the data. The proposed algorithm runs in O(|S|n^2) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This approach shows significant improvement over naive methods, which incur O(n^2) communication costs when the entire distance matrix is transmitted and O(nm) communication costs to centralize the data, where m is the total number of features. A specific implementation based on single-link clustering and results comparing its performance with that of a centralized clustering algorithm are presented. An analysis of the algorithm complexity, in terms of overall computation time and communication requirements, is pres...
A survey on wavelet applications in data mining
SIGKDD Explor. Newsl.
Cited by 37 (4 self)
Recently there has been significant development in the use of wavelet methods in various data mining processes. However, no comprehensive survey is available on the topic. The goal of this paper is to fill the void. First, the paper presents a high-level data mining framework that reduces the overall process into smaller components. Then applications of wavelets for each component are reviewed. The paper concludes by discussing the impact of wavelets on data mining research and outlining potential future research directions and applications.
Privacy preserving regression modelling via distributed computation
In Proc. Tenth ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, 2004
"... www.niss.org ..."
Collective Mining of Bayesian Networks from Distributed Heterogeneous Data
2002
Cited by 25 (7 self)
We present a collective approach to learning a Bayesian network from distributed heterogeneous data. In this approach, we first learn a local Bayesian network at each site using the local data. Then each site identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. Another Bayesian network is learnt at the central site using the data transmitted from the local sites. The local and central Bayesian networks are combined to obtain a collective Bayesian network that models the entire data. Experimental results and theoretical justification that demonstrate the feasibility of our approach are presented.
Distributed Data Mining: Scaling up and beyond
In Advances in Distributed and Parallel Knowledge Discovery, 1999
Cited by 21 (0 self)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.
The Case for Datacentric Grids
In Workshop on Massively Parallel Programming, IPDPS2002, 2001
Cited by 16 (2 self)
We argue that the properties of online data make it effectively immovable. Massively parallel computations involving such data must therefore be constructed in a completely different way, one that replaces the processor-centric assumptions that underlie almost all programming models by data-centric assumptions. We discuss the implications of this change for grid architectures and their programming models.
A Scalable Local Algorithm for Distributed Multivariate Regression
2008
Cited by 13 (6 self)
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm can be used for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of peer-to-peer networks, and the dynamic nature of the data and the network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the “quality” of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.