Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245--260, 2000.
....the quality of clustering is heavily dependent on grid size and density threshold parameters. A survey of parallel algorithms for hierarchical clustering using distance-based metrics is given in [Ols95]. These are more theoretical PRAM algorithms. Recently, the k-means algorithm has been parallelized [DM99], but it is limited in its applicability, as it requires the user to specify k, the number of clusters, and also does not find clusters in subspaces. Clusters are unions of connected high-density cells. Two k-dimensional cells are connected if they have a common face in the k-dimensional ....
I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. Large-Scale Parallel KDD Systems, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
.... widely investigated problems in many fields (such as Machine Learning, Data Mining, Computational Geometry, and of course Information Retrieval). However, there is no clear indication as to whether or not existing algorithms could effectively be employed in large-scale web applications (see, e.g., [4, 5] for a discussion of the difficulties connected to the efficient clustering of very large document collections and for sequential and distributed algorithms with state-of-the-art performance). In this paper we isolate a problem, which we call the Minimum Redirections Problem, related to the ....
I. S. Dhillon and D. S. Modha, A data clustering algorithm on distributed memory multiprocessing, Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Volume 1759, pp. 245-260, 2000.
....fact, if any clusters are changed, J is reduced. As J is bounded from below it converges and as a consequence the algorithm converges. It is also known that k-means will always converge to a local minimum [17]. The k-means algorithm may be viewed as a variant of the EM algorithm [56]. In [26] a parallel k-means algorithm is proposed. The authors also provide a careful analysis of the algorithm's computational complexity. There are two Algorithm 7 k means algorithm Select k arbitrary data points z 1 , z k . repeat T i : z i ) z s ) s = 1, p z i ....
....Finding the minimum for each point requires a total of kN comparisons; then one needs to compute the new average for each cluster, which requires nd additions and kd divisions. In data mining the cost is usually dominated by the determination of all the distances, and thus the time is [26]: T = O(NkdI), where I is the number of iterations. For the parallel algorithm (shared-nothing) the data is initially distributed over the discs of all the processors. Then each processor computes the distances of its elements to all cluster centers. This is done in parallel and so the most ....
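The cost estimate in the snippet above can be illustrated with a small operation count. This is only a sketch under stated assumptions: the function name `distance_ops` and the parameter `P` (number of processors) are hypothetical, constants are ignored, communication is not modelled, and the points are assumed to be split evenly across processors.

```python
import math

def distance_ops(N, k, d, I, P=1):
    """Rough count of coordinate-level distance operations per processor.

    N: number of data points, k: number of clusters,
    d: dimensionality, I: number of iterations,
    P: number of processors (hypothetical, even data split assumed).
    """
    # Each processor holds about N/P points and compares each of them
    # with all k centers, touching d coordinates per comparison.
    local_points = math.ceil(N / P)
    return local_points * k * d * I

sequential = distance_ops(1000, 4, 10, 5)        # N*k*d*I
per_processor = distance_ops(1000, 4, 10, 5, 4)  # (N/P)*k*d*I
```

With P = 1 the count reduces to N*k*d*I, matching the O(NkdI) term quoted above; the per-processor share shrinks by roughly a factor of P, which is why the distance step parallelizes well.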
I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Zaki and Ho [73].
....tree [8]. Several researchers study techniques for parallelizing clustering algorithms, which can be considered as the unsupervised learning problem. Ruocco and Frieder [15] propose parallel single-link and single-pass algorithms for clustering documents, run on an Intel Paragon. Dhillon and Modha [3] introduce an effective parallelization of the k-means clustering algorithm implemented on an IBM POWERparallel SP2. Forman and Zhang [4] also present a general technique for parallelizing a class of center-based clustering algorithms including k-means, k-harmonic means, and the EM algorithm performed ....
Dhillon, I.S., and Modha, D.S. A data-clustering algorithm on distributed memory multiprocessors. Large-Scale Parallel Data Mining, pages 245-260, 1999.
....Clustering [2]. Since the volume of data stored and searched in today's information systems is vast, high-performance computing, in this case parallel processing, is needed to sustain acceptable clustering times. Some recent parallel text clustering efforts include the works by Dhillon et al. [4] and Ruocco and Frieder [6]. Dhillon et al. parallelized the spherical k-means partitioning algorithm and achieved near-linear speedup and scaleup when running on test collections of documents from 8-128 terms in length, the largest of which was 2GB [3]. Ruocco and Frieder [6] developed a near ....
Dhillon, I. S., and Modha, D. S. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence (2000), pp. 245--260.
....centroids of newly assembled groups. Iterations continue until a stopping criterion is achieved (for example, no reassignments happen). This version is known as Forgy's algorithm [For65] and has many advantages: it easily works with any Lp norm; it allows straightforward parallelization [DM99]; it is insensitive with respect to data ordering. Another version of k-means iterative optimization reassigns points based on more detailed analysis of effects on the objective function caused by moving a point from its current cluster to a potentially new one. If a move has a positive effect, ....
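The batch iteration described in the snippet above (assign every point, recompute centroids, stop when no reassignments happen) might be sketched as follows. This is an illustrative sketch, not code from any of the cited papers: the names `forgy_kmeans` and `squared_distance` are hypothetical, squared Euclidean distance is assumed (one choice among the Lp norms mentioned), and the initial centroids are assumed to be k randomly sampled data points.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance (an assumed choice of norm).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def forgy_kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(k),
                          key=lambda j: squared_distance(p, centroids[j]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changed = True
        # Stopping criterion: no point changed its cluster.
        if not changed:
            break
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = forgy_kmeans(points, 2)
```

Because all assignments are made before any centroid moves, the result of one iteration does not depend on the order in which points are visited, which is the ordering insensitivity the snippet refers to.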
Dhillon, I. and Modha, D. A data clustering algorithm on distributed memory multiprocessors. Proceedings of Large-scale Parallel KDD Systems Workshop, ACM SIGKDD, 1999.
....it, assign this point to the corresponding cluster, and then move the center of the cluster closer to this point, and repeat this process until the assignment of points to clusters does not change. This method can also be parallelized in a fashion very similar to the previous two techniques [6, 17, 38]. The data instances are partitioned among the nodes. Each node processes the data instances it owns. Instead of moving the center of the cluster immediately after the data instance is assigned to the cluster, the local sum of movements of each center due to all points owned on that node is ....
....research efforts. A significant amount of work has been done on parallelization of individual data mining techniques that can be parallelized through our approach. Most of the work has been on distributed memory machines, including association mining [4, 7, 22, 23], the k-means clustering technique [6, 17, 38], and Bayesian networks [20]. Our work is significantly different, because we offer an interface and runtime support to parallelize each of these algorithms. Shared memory parallelization of association mining rules has also been an area of attention. Parthasarathy et al. have developed a number ....
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47--56, August 1999.
....6 Related Work. We now compare our work with related research efforts. A significant amount of work has been done on parallelization of individual data mining techniques. Most of the work has been on distributed memory machines, including association mining [1, 11, 12, 29], k-means clustering [9], and decision tree classifiers [3, 10, 15, 24, 26]. Recent efforts have also focused on shared memory parallelization of data mining algorithms, including association mining [28, 19, 20] and decision tree construction [27]. Our work is significantly different, because we offer an interface and ....
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47--56, August 1999.
....processors. The technology trend is for processors to improve faster than networks are improving, making the network a greater bottleneck in the future. Also, their algorithm limits the number of computing units to the number of clusters to be found. The parallel algorithm by Dhillon and Modha [5] was discovered independently and is an example of the class of parallel algorithms we described in [17]. This paper extends their result to 8 times as many processors and covers two other algorithms. In our previous paper [17] we described a parallel decomposition of a class of iterative ....
Dhillon, I.S. and Modha, D.S. "A data clustering algorithm on distributed memory machines," ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems (with KDD99), August, 1999.
....are embedded. It also requires input of entropy thresholds, which is not intuitive for the user. A survey of parallel algorithms for hierarchical clustering using distance-based metrics is given in [15]. These are more theoretical PRAM algorithms. Recently, the k-means algorithm has been parallelized [5], but it is limited in its applicability, as it requires the user to specify k, the number of clusters, and also does not find clusters in subspaces. 3. Density and Grid-based Clustering. Density-based approaches regard clusters as higher density regions than their surroundings. A common way ....
I. Dhillon and D. Modha. A data-clustering algorithm on distributed memory multiprocessors. Large-Scale Parallel KDD Systems, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
....A survey of parallel algorithms for hierarchical clustering using distance-based metrics is given in [Ols95]. These are more theoretical PRAM algorithms, now more of academic interest than practical parallel algorithms for large data sets. Recently, the k-means algorithm has been parallelized [DM99], but it is limited in its applicability, as it requires the user to specify k, the number of clusters, and also does not find clusters in subspaces. To the best of our knowledge, pMAFIA is one of the first efforts in practical parallel clustering techniques for large data sets. 12 Chapter ....
I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. Large-Scale Parallel KDD Systems, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
....algorithms. In [45] the author shows an adaptation of the SLINK [47] and other agglomerative hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. The PADMA system [31] offers a distributed clustering system for homogeneous text data. In [15], the authors adapt the K-Means algorithm to run in a parallel distributed environment. The Collective Hierarchical Clustering algorithm was proposed elsewhere [30] for generating hierarchical clusters from distributed and heterogeneous data. To the best of our knowledge there does not exist any ....
I. Dhillon and D. Modha. A data clustering algorithm on distributed memory multiprocessors. In Workshop on Large-Scale Parallel KDD Systems, 1999.
....identified by a pair of boundaries, is processed by each task of the SPMD program. The number of tasks involved in the execution may be greater than the number of physical processors, thus exploiting multitasking. This parallel formulation of our test case is similar to those described in [12,4], and requires a new consistent global state to be established once each scan of the whole dataset is completed. Our global state corresponds with the new positions reached by the K centers. These positions are determined by summing the vectors corresponding with the centers' movements which were ....
I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory machines. In ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 1999.
....about the potential clusters in main memory as one scans the database, and then do further refinement of cluster centers within main memory. 2. Parallel implementations. k-means is readily parallelizable through data partitioning on distributed memory multicomputers, with little overhead [8]. At each iteration, the current locations of the k means are broadcast to all processors, which then independently perform the time-consuming operation of finding the closest mean for each (local) data point, and finally send the (local) updates to the mean positions to a central processor that does a ....
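The broadcast, local assignment, and central update pattern described in the snippet above can be simulated in a single process. This is a sketch under stated assumptions, not the cited implementation: `local_step` and `parallel_kmeans_iteration` are hypothetical names, each element of `chunks` stands in for the data owned by one processor, and a plain loop replaces real message passing.

```python
def local_step(chunk, means):
    """Work done independently by one processor on its local points."""
    k, d = len(means), len(means[0])
    sums = [[0.0] * d for _ in range(k)]  # per-cluster coordinate sums
    counts = [0] * k                      # per-cluster point counts
    for p in chunk:
        # Find the closest mean for this local point.
        j = min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    return sums, counts

def parallel_kmeans_iteration(chunks, means):
    """One iteration: broadcast means, gather partial sums, update centrally."""
    # In a real system these calls would run concurrently on separate nodes.
    partials = [local_step(chunk, means) for chunk in chunks]
    k, d = len(means), len(means[0])
    new_means = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new_means.append(list(means[j]))  # empty cluster: keep old mean
        else:
            new_means.append(
                [sum(sums[j][t] for sums, _ in partials) / total
                 for t in range(d)])
    return new_means

chunks = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]
new_means = parallel_kmeans_iteration(chunks, [[0, 0], [10, 10]])
```

Only k sums of length d and k counts leave each processor per iteration, independent of how many points it owns, which is consistent with the "little overhead" remark in the snippet.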
Dhillon, I., Modha, D.: A data clustering algorithm on distributed memory multiprocessors. KDD Workshop on Large-Scale Parallel Systems (1999)
....site a second-level clustering among the different models is performed to generate the overall cluster model. The second-level clustering is performed among different clusters by exploiting their statistical representations. A k-means clustering algorithm for distributed environments was reported in [10]. This algorithm notes the inherent data parallelism in the k-means algorithm and asymptotically approaches near-optimal performance. 3 The Fast Distributed Mining (FDM) algorithm [8] can be used for mining association rules from distributed, homogeneous data sets. FDM notes that in a ....
I. Dhillon and D. Modha. A data-clustering algorithm on distributed memory multiprocessors. Proceedings of the KDD'99 Workshop on High Performance Knowledge Discovery, 1999.
....parallel based approaches to the problem of scaling clustering algorithms up for use in KDD environments. In [7] the author shows an adaptation of the SLINK [2] and other agglomerative hierarchical clustering algorithms to a multiprocessor environment to parallelize the clustering process. In [8], the authors adapt the K Means algorithm to run in a parallel environment. The PADMA system [9, 10] achieves scalability by locating agents with the distributed data sources. An agent coordinating facilitator gives user requests to local agents that then access and analyze local data, returning ....
Dhillon, I., Modha, D.: A data clustering algorithm on distributed memory multiprocessors. In: Workshop on Large-Scale Parallel KDD Systems. (1999)
....the items are at the leaf levels in a hierarchy or taxonomy of items, and the goal is to discover rules involving concepts at multiple (and mixed) levels. They show that load balancing is crucial for performance on such large-scale clusters. Clustering and Sequences. Dhillon and Modha [DM99] parallelized the K-Means clustering algorithm on a 16-node IBM SP2 distributed memory system. They exploit the inherent data parallelism of the K-Means algorithm by performing the point-to-centroid distance calculations in parallel. They demonstrated linear speedup on a 2GB dataset. Zaki ....
I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory machines. In [ZH99].
No context found.
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245--260, 2000.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems (in conjunction with SIGKDD), pages 245--260, August 1999.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In M. Zaki and C. Ho, editors, Large Scale Parallel Data Mining, pages 245--260. LNCS vol 1759. Springer, 2000.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In KDD, pages 245--260, 1999.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In ACM SIGKDD, 1999.
No context found.
I.S. Dhillon and D.S. Modha, "A Data-Clustering Algorithm on Distributed Memory Multiprocessors," Proc. Workshop Large-Scale Parallel KDD Systems, in conjunction with the Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 47-56, Aug. 1999.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Large-scale Parallel KDD Systems Workshop, ACM SIGKDD, Aug. 15-18 1999.
No context found.
Dhillon, I.S. and Modha, D.S. "A data clustering algorithm on distributed memory machines," ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems (with KDD99), August 1999.
No context found.
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47--56, August 1999.
No context found.
Dhillon, I. S. and Modha, D. S., A Data Clustering Algorithm on Distributed Memory Multiprocessors, in Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, Volume 1759, pages 245-260, 2000.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In KDD, pages 245--260, 1999.
No context found.
I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Zaki and Ho [73].
No context found.
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47--56, August 1999.
No context found.
Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Proceedings of Workshop on Large-Scale Parallel KDD Systems, in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), pages 47--56, August 1999.
No context found.
I. Dhillon and D. Modha. A Data-clustering Algorithm on Distributed Memory Multiprocessors. In Proceedings of the KDD'99 Workshop on High Performance Knowledge Discovery, pages 245--260, 1999.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245--260, 2000.
No context found.
I. S. Dhillon and D. S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In M. J. Zaki and C.-T. Ho (eds), Large-Scale Parallel Data Mining, Springer-Verlag, LNCS 1759, pages 245--260, 1999.
No context found.
Dhillon I. S., Modha D. S.: "A Data-Clustering Algorithm On Distributed Memory Multiprocessors", Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD 99)
No context found.
I. Dhillon and D. Modha. A data clustering algorithm on distributed memory multiprocessors. In Workshop on Large-Scale Parallel KDD Systems, 1999.
No context found.
I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245--260, 2000.
No context found.
Dhillon, I. S. and Modha, D. S. (1999). A data-clustering algorithm on distributed memory multiprocessors. In Proc. Large-scale Parallel KDD Systems Workshop, ACM SIGKDD.
No context found.
I. Dhillon and D. Modha, "A data clustering algorithm on distributed memory machines," in ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, Aug. 1999.