• Documents
  • Authors
  • Tables
  • Log in
  • Sign up
  • MetaCart
  • DMCA
  • Donate

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations

Data Mining, Concepts and Techniques (2001)

by J Han, M Kamber
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 3,141
Next 10 →

Survey of clustering data mining techniques

by Pavel Berkhin , 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract - Cited by 408 (0 self) - Add to MetaCart
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
(Show Context)

Citation Context

...space A, where each component xil ∈ Al, l = 1 : d, is a numerical or nominal categorical attribute (i.e., feature, variable, dimension, component, field). For a discussion of attribute data types see =-=[106]-=-. Such point-by-attribute data format conceptually corresponds to a N ×d matrix and is used by a majority of algorithms reviewed below. However, data of other formats, such as variable length sequence...

Mapping the Gnutella network: Properties of large-scale peer-to-peer systems and implications for system design

by Matei Ripeanu, Ian Foster, Adriana Iamnitchi - IEEE Internet Computing Journal , 2002
"... Despite recent excitement generated by the peer-to-peer (P2P) paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it ..."
Abstract - Cited by 361 (23 self) - Add to MetaCart
Despite recent excitement generated by the peer-to-peer (P2P) paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Like most other P2P applications, Gnutella builds, at the application level, a virtual network with its own routing mechanisms. The topology of this virtual network and the routing mechanisms used have a significant influence on application properties such as performance, reliability, and scalability. We have built a “crawler” to extract the topology of Gnutella’s application level network. In this paper we analyze the topology graph and evaluate generated network traffic. Our two major findings are that: (1) although Gnutella is not a pure power-law network, its current configuration has the benefits and drawbacks of a power-law structure, and (2) the Gnutella virtual network topology does not match well the underlying Internet topology, hence leading to ineffective use of the physical networking infrastructure. These findings guide us to propose changes to the Gnutella protocol and implementations that may bring significant performance and scalability improvements. We believe that our findings as well as our measurement and analysis techniques have broad applicability to P2P systems and provide unique insights into P2P system design tradeoffs.
(Show Context)

Citation Context

...formation contained in domain names, while the second uses the clustering heuristic. Note that a large class of data mining and machine learning algorithms based on information gains (ID3, C4.5, etc. =-=[17]-=-) use a similar argument to build their decision trees. We performed this analysis on 10 topology graphs collected during February/March 2001. We detected no significant decrease in entropy after perf...

Distributed Clustering in Ad-hoc Sensor Networks: A Hybrid, Energy-Efficient Approach

by Ossama Younis, Sonia Fahmy , 2004
"... Prolonged network lifetime, scalability, and load balancing are important requirements for many ad-hoc sensor network applications. Clustering sensor nodes is an effective technique for achieving these goals. In this work, we propose a new energy-efficient approach for clustering nodes in adhoc sens ..."
Abstract - Cited by 307 (12 self) - Add to MetaCart
Prolonged network lifetime, scalability, and load balancing are important requirements for many ad-hoc sensor network applications. Clustering sensor nodes is an effective technique for achieving these goals. In this work, we propose a new energy-efficient approach for clustering nodes in adhoc sensor networks. Based on this approach, we present a protocol, HEED (Hybrid Energy-Efficient Distributed clustering), that periodically selects cluster heads according to a hybrid of their residual energy and a secondary parameter, such as node proximity to its neighbors or node degree. HEED does not make any assumptions about the distribution or density of nodes, or about node capabilities, e.g., location-awareness. The clustering process terminates in O(1) iterations, and does not depend on the network topology or size. The protocol incurs low overhead in terms of processing cycles and messages exchanged. It also achieves fairly uniform cluster head distribution across the network. A careful selection of the secondary clustering parameter can balance load among cluster heads. Our simulation results demonstrate that HEED outperforms weight-based clustering protocols in terms of several cluster characteristics. We also apply our approach to a simple application to demonstrate its effectiveness in prolonging the network lifetime and supporting data aggregation.
(Show Context)

Citation Context

...energy. The Zone Routing Protocol (ZRP) [29] for MANETs divides the network into overlapping, variable-sized zones. Several clustering techniques, such as K-Means, G-Means, or hierarchical clustering =-=[30]-=- have been proposed for partitioning datasets based on a parameter, e.g., distance. These approaches are not directly applicable to our problem because they iteratively optimize a cost function. This ...

On Clustering Validation Techniques

by Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis - Journal of Intelligent Information Systems , 2001
"... Cluster analysis aims at identifying groups of similar objects and, therefore helps to discover distribution of patterns and interesting correlations in large data sets. It has been subject of wide research since it arises in many application domains in engineering, business and social sciences. Esp ..."
Abstract - Cited by 299 (3 self) - Add to MetaCart
Cluster analysis aims at identifying groups of similar objects and, therefore helps to discover distribution of patterns and interesting correlations in large data sets. It has been subject of wide research since it arises in many application domains in engineering, business and social sciences. Especially, in the last years the availability of huge transactional and experimental data sets and the arising requirements for data mining created needs for clustering algorithms that scale and can be applied in diverse domains.
(Show Context)

Citation Context

...fy the cluster in which he/she can be classified and based on this decision his/her medication can be made. More specifically, some typical applications of the clustering are in the following fields (=-=Han & Kamber, 2001):s•-=- Business. In business, clustering may help marketers discover significant groups in their customers’ database and characterize them based on purchasing patterns. • Biology. In biology, it can be ...

Data Clustering: 50 Years Beyond K-Means

by Anil K. Jain , 2008
"... Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and m ..."
Abstract - Cited by 294 (7 self) - Add to MetaCart
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the illposed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection, and data clustering and large scale data clustering.

Peer-to-Peer Architecture Case Study: Gnutella Network

by Matei Ripeanu , 2001
"... Despite recent excitement generated by the P2P paradigm and despite surprisingly fast deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. Due to its' open architecture and achieved scale, Gnutella is an interesting P2P architecture case study. Gn ..."
Abstract - Cited by 274 (1 self) - Add to MetaCart
Despite recent excitement generated by the P2P paradigm and despite surprisingly fast deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. Due to its' open architecture and achieved scale, Gnutella is an interesting P2P architecture case study. Gnutella, like most other P2P applications, builds' at the application level a virtual network with its' own routing mechanisms. The topology of this virtual network and the routing mechanisms used have a significant influence on application properties such as performance, reliability, and scalability. We built a 'crawler' to extract the topology of Gnutella's application level network. In this' paper we analyze the topology graph and evaluate generated network traffic. We find that although Gnutella is' not a pure power-law network, its' current configuration has the benefits' and drawbacks' of a power-law structure. These findings lead us to propose changes to Gnutella protocol and implementations that bring significant performance and scalability improvements'.

Toward integrating feature selection algorithms for classification and clustering

by Huan Liu, Lei Yu - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING , 2005
"... This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals ..."
Abstract - Cited by 267 (21 self) - Add to MetaCart
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
(Show Context)

Citation Context

...g Platform, Real-World Applications 11 Introduction As computer and database technologies advance rapidly, data accumulates in a speed unmatchable by human’s capacity of data processing. Data mining =-=[1, 29, 35, 36]-=-, as a multidisciplinary joint effort from databases, machine learning, and statistics, is championing in turning mountains of data into nuggets. Researchers and practitioners realize that in order to...

Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach

by Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, Alexander Tuzhilin - ACM Transactions on Information Systems , 2005
"... The paper presents a multidimensional (MD) approach to recommender systems that can provide recommendations based on additional contextual information besides the typical information on users and items used in most of the current recommender systems. This approach supports multiple dimensions, exten ..."
Abstract - Cited by 236 (9 self) - Add to MetaCart
The paper presents a multidimensional (MD) approach to recommender systems that can provide recommendations based on additional contextual information besides the typical information on users and items used in most of the current recommender systems. This approach supports multiple dimensions, extensive profiling, and hierarchical aggregation of recommendations. The paper also presents a multidimensional rating estimation method capable of selecting two-dimensional segments of ratings pertinent to the recommendation context and applying standard collaborative filtering or other traditional two-dimensional rating estimation techniques to these segments. A comparison of the multidimensional and two-dimensional rating estimation approaches is made, and the tradeoffs between the two are studied. Moreover, the paper introduces a combined rating estimation method that identifies the situations where the MD approach outperforms the standard two-dimensional approach and uses the MD approach in those situations and the standard two-dimensional approach elsewhere. Finally, the paper presents a pilot empirical study of the combined approach, using a multidimensional movie recommender system that was developed for implementing this approach and testing its performance. 1 1.

A Survey of Collaborative Filtering Techniques

by Xiaoyuan Su, Taghi M. Khoshgoftaar , 2009
"... As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenge ..."
Abstract - Cited by 216 (0 self) - Add to MetaCart
As one of the most successful approaches to building recommender systems, collaborative filtering (CF) uses the known preferences of a group of users to make recommendations or predictions of the unknown preferences for other users. In this paper, we first introduce CF tasks and their main challenges, such as data sparsity, scalability, synonymy, gray sheep, shilling attacks, privacy protection, etc., and their possible solutions. We then present three main categories of CF techniques: memory-based, model-based, and hybrid CF algorithms (that combine CF with other recommendation techniques), with examples for representative algorithms of each category, and analysis of their predictive performance and their ability to address the challenges. From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area.

Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures

by Hui Ding, Goce Trajcevski, Xiaoyue Wang, Eamonn Keogh
"... The last decade has witnessed a tremendous growths of interests in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each individual work introduci ..."
Abstract - Cited by 141 (24 self) - Add to MetaCart
The last decade has witnessed a tremendous growths of interests in applications that deal with querying and mining of time series data. Numerous representation methods for dimensionality reduction and similarity measures geared towards time series have been introduced. Each individual work introducing a particular method has made specific claims and, aside from the occasional theoretical justifications, provided quantitative experimental observations. However, for the most part, the comparative aspects of these experiments were too narrowly focused on demonstrating the benefits of the proposed methods over some of the previously introduced ones. In order to provide a comprehensive validation, we conducted an extensive set of time series experiments re-implementing 8 different representation methods and 9 similarity measures and their variants, and testing their effectiveness on 38 time series data sets from a wide variety of application domains. In this paper, we give an overview of these different techniques and present our comparative experimental findings regarding their effectiveness. Our experiments have provided both a unified validation of some of the existing achievements, and in some cases, suggested that certain claims in the literature may be unduly optimistic. 1.
(Show Context)

Citation Context

...of interest in querying and mining such data which, in turn, resulted in a large amount of work introducing new methodologies for indexing, classification, clustering and approximation of time series =-=[13,17,22]-=-. Two key aspects for achieving effectiveness and efficiency when managing time series data are representation methods and similarity measures. Time series are essentially high dimensional data [17] a...

Powered by: Apache Solr
  • About CiteSeerX
  • Submit and Index Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University