Meta-Learning in Distributed Data Mining Systems: Issues and Approaches
- Advances of Distributed Data Mining
, 2000
Cited by 103 (0 self)
Data mining systems aim to discover patterns and extract useful information from facts recorded in databases. A widely adopted approach to this objective is to apply various machine learning algorithms to compute descriptive models of the available data. Here, we explore one of the main challenges in this research area, the development of techniques that scale up to large and possibly physically distributed databases. Meta-learning is a technique that seeks to compute higher-level classifiers (or classification models), called meta-classifiers, that integrate in some principled fashion multiple classifiers computed separately over different databases. This study describes meta-learning and presents the JAM system (Java Agents for Meta-learning), an agent-based meta-learning system for large-scale data mining applications. Specifically, it identifies and addresses several important desiderata for distributed data mining systems that stem from their additional complexity co...
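The idea of a meta-classifier that integrates classifiers computed separately over different databases can be illustrated with a minimal sketch. It is not the JAM system's implementation: the threshold-rule base learner, the lookup-table meta-learner, and all names and data below (`train_stump`, `train_meta`, `site_a`, `site_b`, `validation`) are hypothetical choices made only for illustration.

```python
from collections import Counter, defaultdict

def train_stump(rows):
    """Hypothetical base learner: best single-feature threshold rule for one partition."""
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            for hi in (0, 1):  # label predicted when the feature value is >= t
                correct = sum((hi if xi[f] >= t else 1 - hi) == yi for xi, yi in rows)
                if best is None or correct > best[0]:
                    best = (correct, f, t, hi)
    _, f, t, hi = best
    return lambda x: hi if x[f] >= t else 1 - hi

def train_meta(base_classifiers, validation_rows):
    """Learn a table from tuples of base predictions to the most frequent true label."""
    votes = defaultdict(Counter)
    for x, y in validation_rows:
        votes[tuple(c(x) for c in base_classifiers)][y] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}

    def meta(x):
        key = tuple(c(x) for c in base_classifiers)
        # Fall back to a plain majority vote for unseen prediction patterns.
        return table.get(key, Counter(key).most_common(1)[0][0])
    return meta

# Two made-up "sites"; each holds data that reveals a different part of the concept.
site_a = [((1, 5), 0), ((2, 1), 0), ((8, 4), 1), ((9, 2), 1)]
site_b = [((5, 1), 0), ((4, 2), 0), ((5, 8), 1), ((4, 9), 1)]
validation = [((1, 1), 0), ((9, 1), 1), ((1, 9), 1), ((9, 9), 1)]

base = [train_stump(site_a), train_stump(site_b)]  # trained separately, no data shared
meta = train_meta(base, validation)
print(meta((9, 0)), meta((0, 9)), meta((0, 0)))
```

Only the trained base classifiers, not the raw partitions, are passed to the meta-learning step, which is the property that makes this style of combination attractive when the data cannot be moved to one place.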
Distributed Data Mining in Credit Card Fraud Detection
- IEEE Intelligent Systems
, 1999
Cited by 79 (4 self)
Credit card transactions continue to grow in number, taking a larger share of the US payment system, and have led to a higher rate of stolen account numbers and subsequent losses by banks. Hence, improved fraud detection has become essential to maintain the viability of the US payment system. Banks have been fielding early fraud warning systems for some years. We seek to improve upon the state-of-the-art in commercial practice via large scale data mining. Scalable techniques to analyze massive amounts of transaction data to compute efficient fraud detectors in a timely manner is an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud detection task exhibits technical problems that include skewed distributions of training data and non-uniform cost per error, both of which have not been widely studied in the KDD/DM community. In this article we survey and evaluate a number of techniques that we have proposed and implemented that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.
Distributed Data Mining: Scaling up and beyond
- In Advances in Distributed and Parallel Knowledge Discovery
, 1999
Cited by 21 (0 self)
In this chapter I begin by discussing Distributed Data Mining (DDM) for scaling up, beginning by asking what scaling up means, questioning whether it is necessary, and then presenting a brief survey of what has been done to date. I then provide motivation beyond scaling up, arguing that DDM is a more natural way to view data mining generally. DDM eliminates many difficulties encountered when coalescing already-distributed data for monolithic data mining, such as those associated with heterogeneity of data and with privacy restrictions. By viewing data mining as inherently distributed, important open research issues come into focus, issues that currently are obscured by the lack of explicit treatment of the process of producing monolithic data sets. I close with a discussion of the necessity of DDM for an efficient process of knowledge discovery.
Multi-Database Mining
, 2003
Cited by 15 (2 self)
Multi-database mining is an important research area because (1) there is an urgent need for analyzing data in different sources, (2) there are essential differences between mono- and multi-database mining, and (3) there are limitations in existing multi-database mining efforts. This paper designs a new multi-database mining process. Some research issues involving mining multi-databases, including database clustering and local pattern analysis, are discussed.
Cost Complexity-based Pruning of Ensemble Classifiers
, 1999
Cited by 15 (1 self)
In this paper we study methods that combine multiple classification models learned over separate data sets. Numerous studies posit that such approaches provide the means to efficiently scale learning to large datasets, while also boosting the accuracy of individual classifiers. These gains, however, come at the expense of an increased demand for run-time system resources. The final ensemble meta-classifier may consist of a large collection of base classifiers that require increased memory resources while also slowing down classification throughput. Here, we describe an algorithm for pruning (i.e. discarding a subset of the available base classifiers) the ensemble meta-classifier as a means to reduce its size while preserving its accuracy and we present a technique for measuring the tradeoff between predictive performance and available run time system resources. The algorithm is independent of the method used initially when computing the meta-classifier. It is based on decision tree pruning methods and relies on the mapping of an arbitrary ensemble meta-classifier to a decision tree model. Through an extensive empirical study on meta-classifiers computed over two real data sets, we illustrate our pruning algorithm to be a robust and competitive approach to discarding classification models without degrading the overall predictive performance of an ensemble computed over those that remain after pruning. Keywords: distributed data mining, meta-learning, classifier evaluation, pruning, ensembles of classifiers, credit card fraud detection.
A Comparative Evaluation of Meta-Learning Strategies over Large and Distributed Data Sets
- in Proceedings of the ICML-99 Workshop on Recent Advances in Meta-learning and Future Work
, 1999
Cited by 6 (3 self)
There has been considerable interest recently in various approaches to scaling up machine learning systems to large and distributed data sets. We have been studying approaches based upon the parallel application of multiple learning programs at distributed sites, followed by a meta-learning stage to combine the multiple models in a principled fashion. In this paper, we empirically determine the “best” data partitioning scheme for a selected data set to compose “appropriately-sized” subsets and we evaluate and compare three different strategies, Voting, Stacking and Stacking with Correspondence Analysis (SCANN) for combining classification models trained over these subsets. We seek to find ways to efficiently scale up to large data sets while maintaining or improving predictive performance measured by the error rate, a cost model, and the TP-FP spread.
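As a rough illustration of the simplest of the three strategies, the sketch below combines the predictions of several classifiers by unweighted majority voting and scores the result with two of the measures named in the abstract. It assumes the TP-FP spread means the true-positive rate minus the false-positive rate; the function names and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers for one example."""
    return Counter(predictions).most_common(1)[0][0]

def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def tp_fp_spread(y_true, y_pred, positive=1):
    """True-positive rate minus false-positive rate (assumed definition)."""
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / pos - fp / neg

# Made-up labels and per-classifier predictions for six examples.
y_true = [1, 1, 0, 0, 0, 1]
c1 = [1, 0, 0, 1, 0, 1]
c2 = [1, 1, 0, 0, 1, 0]
c3 = [0, 1, 0, 0, 0, 1]

# Vote example by example across the three classifiers.
combined = [majority_vote(votes) for votes in zip(c1, c2, c3)]
print(combined, error_rate(y_true, combined), tp_fp_spread(y_true, combined))
print(error_rate(y_true, c1), tp_fp_spread(y_true, c1))
```

In this toy data each classifier errs on different examples, so the vote corrects all of them; with correlated errors the gain would be smaller, which is one reason learned combiners such as Stacking are also worth comparing.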
Data Mining with Distributed Agents in E-Commerce Applications
In this paper we describe the prototype of a yellow page service for customers in a distributed cyber-shopping mall. This application combines distributed data mining with agent technologies. The paper focuses on a framework to support distributed data mining. Data mining approaches have dealt with finding interesting patterns; however, there is little research on developing a framework for effective and efficient distributed data mining. Our approach to providing such a framework combines a concept hierarchy and an efficient, distributed encoding of that concept hierarchy with existing data mining methods. This marriage results in a new distributed data representation for data mining, called Combined Hierarchical Set (CHS). CHS provides a framework for knowledge discovery including discovery of generalized associations, aggregated associations, and combined associations.
Distributed Data Mining in Credit Card Fraud Detection
Credit card transactions continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years. Large-scale data-mining techniques can improve on the state of the art in commercial practice. Scalable techniques to analyze massive amounts of transaction data that efficiently compute fraud detectors in a timely manner is an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, both of which have not been widely studied in the knowledge-discovery and data-mining community. In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a “cost model” are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.
unknown title
This extended abstract describes a pruning algorithm that is independent of the meta-learning algorithm and is used for discarding redundant classifiers without degrading the overall predictive performance of the pruned meta-classifier. To determine the most effective base classifiers, the algorithm takes advantage of the minimal cost-complexity pruning method of the CART learning algorithm [1], which is guaranteed to find the best (with respect to misclassification cost) pruned tree of a specific size (number of terminal nodes) of an initial unpruned decision tree.