Results 1 - 10
of
25,903
MapReduce: Simplified data processing on large clusters.
- In Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI-04),
, 2004
"... Abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of ..."
Abstract
-
Cited by 3439 (3 self)
- Add to MetaCart
Abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details
GPFS: A Shared-Disk File System for Large Computing Clusters
- In Proceedings of the 2002 Conference on File and Storage Technologies (FAST
, 2002
"... GPFS is IBM's parallel, shared-disk file system for cluster computers, available on the RS/6000 SP parallel supercomputer and on Linux clusters. GPFS is used on many of the largest supercomputers in the world. GPFS was built on many of the ideas that were developed in the academic community ove ..."
Abstract
-
Cited by 521 (3 self)
- Add to MetaCart
existing ideas scaled well, new approaches were necessary in many key areas. This paper describes GPFS, and discusses how distributed locking and recovery techniques were extended to scale to large clusters.
CURE: An Efficient Clustering Algorithm for Large Data sets
- Published in the Proceedings of the ACM SIGMOD Conference
, 1998
"... Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering ..."
Abstract
-
Cited by 722 (5 self)
- Add to MetaCart
and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination
BIRCH: an efficient data clustering method for very large databases
- In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD
, 1996
"... Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multi-dir nensional clataset. Prior work does not adequately address the problem of ..."
Abstract
-
Cited by 576 (2 self)
- Add to MetaCart
Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely st,udied problems in this area is the identification of clusters, or deusel y populated regions, in a multi-dir nensional clataset. Prior work does not adequately address the problem
Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
, 1992
"... Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably ..."
Abstract
-
Cited by 777 (12 self)
- Add to MetaCart
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably
A density-based algorithm for discovering clusters in large spatial databases with noise
, 1996
"... Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clu ..."
Abstract
-
Cited by 1786 (70 self)
- Add to MetaCart
Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery
Model-Based Clustering, Discriminant Analysis, and Density Estimation
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
"... Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little ..."
Abstract
-
Cited by 573 (29 self)
- Add to MetaCart
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However
Automatic Subspace Clustering of High Dimensional Data
- Data Mining and Knowledge Discovery
, 2005
"... Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the or ..."
Abstract
-
Cited by 724 (12 self)
- Add to MetaCart
identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
OPTICS: Ordering Points To Identify the Clustering Structure
, 1999
"... Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of ..."
Abstract
-
Cited by 527 (51 self)
- Add to MetaCart
.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration
Results 1 - 10
of
25,903