Results 1–10 of 13
A Fourier Spectrum-based Approach to Represent Decision Trees for Mining Data Streams in Mobile Environments
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 16 (5 self)
This paper presents a novel Fourier analysis-based technique to aggregate, transmit, and visualize decision trees in a mobile environment. The Fourier representation of a decision tree has several interesting properties that are particularly useful for mining continuous data streams from small mobile computing devices. This paper presents algorithms to compute the Fourier spectrum of a decision tree and vice versa. It offers a framework to aggregate decision trees in their Fourier representations. It also describes MobiMine, a mobile data mining system for mining stock-market data from handheld devices connected over low-bandwidth wireless networks.
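To make the spectrum idea concrete, here is a minimal sketch (ours, not the paper's implementation) of the Walsh-Fourier transform of a small Boolean decision tree, its inverse, and the aggregation of two trees by averaging their coefficients; the linearity of the transform is what makes the aggregation work:

```python
from itertools import product

def walsh_spectrum(f, n):
    """Walsh-Fourier coefficients of a Boolean function f over {0,1}^n."""
    coeffs = {}
    for z in product((0, 1), repeat=n):
        total = 0.0
        for x in product((0, 1), repeat=n):
            parity = sum(xi * zi for xi, zi in zip(x, z)) % 2
            total += f(x) * (-1) ** parity
        coeffs[z] = total / 2 ** n
    return coeffs

def evaluate(coeffs, x):
    """Inverse transform: reconstruct f(x) from the spectrum."""
    return sum(c * (-1) ** (sum(xi * zi for xi, zi in zip(x, z)) % 2)
               for z, c in coeffs.items())

# A depth-2 decision tree over three binary features: if x0 then x1 else x2.
tree = lambda x: x[1] if x[0] else x[2]
spec = walsh_spectrum(tree, 3)

# The inverse transform recovers the tree's predictions exactly.
assert all(abs(evaluate(spec, x) - tree(x)) < 1e-9
           for x in product((0, 1), repeat=3))

# Linearity: averaging two trees' spectra gives the spectrum of the
# averaged ensemble, which is the basis for aggregating trees.
tree2 = lambda x: x[0]
spec2 = walsh_spectrum(tree2, 3)
avg = {z: (spec[z] + spec2[z]) / 2 for z in spec}
pt = (1, 0, 1)
assert abs(evaluate(avg, pt) - (tree(pt) + tree2(pt)) / 2) < 1e-9
```

The exponential enumeration is only workable for toy dimensionalities; the appeal of the representation in the paper's setting is that tree spectra tend to be dominated by a few low-order coefficients.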
Mining Databases with Different Schemas: Integrating Incompatible Classifiers
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998
Cited by 12 (5 self)
Distributed data mining systems aim to discover (and combine) useful information that is distributed across multiple databases. The JAM system, for example, applies machine learning algorithms to compute models over distributed data sets and employs meta-learning techniques to combine the multiple models. Occasionally, however, these models (or classifiers) are induced from databases that have (moderately) different schemas and hence are incompatible. In this paper, we investigate the problem of combining multiple models computed over distributed data sets with different schemas. Through experiments performed on actual credit card data provided by two different financial institutions, we evaluate the effectiveness of the proposed approaches and demonstrate their potential utility.
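As a rough illustration of combining classifiers trained over incompatible schemas (a simplified stand-in for JAM's meta-learning; all attribute names and rules below are invented), one can project each record onto the attributes a given classifier was trained on and combine the per-schema predictions by majority vote:

```python
def project(record, schema):
    """Restrict a record (as a dict) to the attributes in a classifier's schema."""
    return {a: record[a] for a in schema if a in record}

def combine(classifiers, record):
    """Each classifier, trained on its own schema, sees only the attributes
    it knows about; the meta-level combines predictions by majority vote."""
    votes = [clf(project(record, schema)) for schema, clf in classifiers]
    return max(set(votes), key=votes.count)

# Two hypothetical fraud detectors from institutions with different schemas.
clf_a = (("amount", "country"),
         lambda r: "fraud" if r["amount"] > 1000 else "ok")
clf_b = (("amount", "merchant"),
         lambda r: "fraud" if r["merchant"] == "unknown" else "ok")

record = {"amount": 1500, "country": "US", "merchant": "unknown"}
assert combine([clf_a, clf_b], record) == "fraud"
```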
Learning with Non-uniform Class and Cost Distributions: Effects and a Distributed Multi-classifier Approach
In Workshop Notes of the KDD-98 Workshop on Distributed Data Mining, 1998
Cited by 8 (0 self)
Many factors influence a learning process and the performance of a learned classifier. In this paper we investigate the effects of the class distribution in the training set on performance. We also study different methods of measuring performance based on cost models, and the effects of the training class distribution with respect to the different cost models. Observations from these effects help us devise a distributed multi-classifier meta-learning approach to learn in domains with skewed class distributions, non-uniform cost per error, and large amounts of data. One such domain is credit card fraud detection, and our empirical results indicate that our approach can significantly reduce loss due to illegitimate transactions.

Introduction
Inductive learning research has been focusing on devising algorithms that generate highly accurate classifiers. Many factors contribute to the success of a learning process and hence the quality of the learned classifier. One factor is the class d...
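A minimal sketch of the distributed multi-classifier idea under a skewed class distribution (a toy 1-D threshold learner of our own, not the paper's setup): split the majority class into minority-sized chunks, train one classifier per balanced chunk, and combine by vote:

```python
import random

def balanced_partitions(majority, minority):
    """Split the majority class into minority-sized chunks; each chunk
    plus the full minority set forms one balanced training set."""
    k = len(minority)
    majority = list(majority)
    random.shuffle(majority)
    return [(majority[i:i + k], minority) for i in range(0, len(majority), k)]

def train_threshold(neg, pos):
    """Toy base learner: threshold halfway between the class means
    of a single numeric feature."""
    t = (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2
    return lambda x: x > t

def vote(classifiers, x):
    """Multi-classifier combination by simple majority vote."""
    return sum(c(x) for c in classifiers) > len(classifiers) / 2

random.seed(0)
legit = [random.gauss(1.0, 0.3) for _ in range(100)]   # majority: legitimate
fraud = [random.gauss(3.0, 0.3) for _ in range(10)]    # minority: fraudulent

ensemble = [train_threshold(neg, pos)
            for neg, pos in balanced_partitions(legit, fraud)]
assert vote(ensemble, 3.2) is True    # clearly fraudulent feature value
assert vote(ensemble, 0.8) is False   # clearly legitimate
```

Each base classifier sees a balanced 10-vs-10 training set even though the overall data is 10:1 skewed, which is the core of the distributed approach the abstract describes.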
Racing committees for large datasets
In Proceedings of the International Conference on Discovery Science, 2002
Cited by 6 (3 self)
Abstract. This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split incoming data into chunks and build a committee from classifiers built on these individual chunks. Our method extends earlier work by introducing a technique for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm’s running time and memory consumption. It also makes it possible to efficiently “race” committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results demonstrate that pruning can also improve accuracy.
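The chunk-and-prune scheme can be sketched roughly as follows (a toy decision-stump learner and a simplified greedy pruning rule of our own; the paper's actual pruning operates inside the boosting process):

```python
def chunked(data, size):
    """Split an incoming stream into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def train_stump(chunk):
    """Toy base learner on (x, label) pairs: pick the best threshold."""
    xs = sorted(x for x, _ in chunk)
    best_t, best_acc = xs[0], 0.0
    for t in xs:
        acc = sum((x > t) == y for x, y in chunk) / len(chunk)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x, t=best_t: x > t

def committee_acc(members, holdout):
    """Holdout accuracy of the committee under majority voting."""
    hits = [(sum(m(x) for m in members) > len(members) / 2) == y
            for x, y in holdout]
    return sum(hits) / len(hits)

def prune(members, holdout, tol=0.0):
    """Greedy adaptive pruning: drop a member whenever committee holdout
    accuracy does not fall by more than tol; repeat until no drop helps."""
    members = list(members)
    dropped = True
    while dropped and len(members) > 1:
        dropped = False
        base = committee_acc(members, holdout)
        for i in range(len(members)):
            trial = members[:i] + members[i + 1:]
            if committee_acc(trial, holdout) >= base - tol:
                members, dropped = trial, True
                break
    return members

# Synthetic stream: 1-D points labelled by x > 0.5, split into chunks.
data = [(i / 100, i / 100 > 0.5) for i in range(200)]
holdout = data[::5]
stream = [d for i, d in enumerate(data) if i % 5]
committee = [train_stump(c) for c in chunked(stream, 40)]
pruned = prune(committee, holdout)
assert 1 <= len(pruned) <= len(committee)
assert committee_acc(pruned, holdout) >= committee_acc(committee, holdout)
```

With tol=0 the pruned committee's holdout accuracy can never fall below the original committee's, which mirrors the abstract's observation that pruning need not cost accuracy.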
Discovering Accurate and Interesting Classification Rules Using Genetic Algorithm
Cited by 6 (0 self)
Discovering accurate and interesting classification rules is a significant task in the post-processing stage of a data mining (DM) process, and a trade-off exists between the accuracy and interestingness metrics for post-processed rule sets. To achieve a balance, in this paper we propose two major post-processing tasks. In the first task, we use a genetic algorithm (GA) to find the combination of rules that maximizes predictive accuracy on the sample training set. In the second task, we rank the rules by assigning objective rule interestingness (RI) measures (or weights) to the rules in the rule set. We then propose a GA-based pruning strategy to find the combination of interesting rules with the maximized (or greater) accuracy. We tested our implementation on three data sets. The results are very encouraging; they demonstrate the applicability and effectiveness of our approach.
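A bare-bones version of the first task, with toy rules and a GA of our own design (bitmask chromosomes, elitist selection, one-point crossover, bit-flip mutation), might look like:

```python
import random

def classify(rule_subset, x, default=0):
    """First matching rule fires; otherwise predict the default class."""
    for cond, label in rule_subset:
        if cond(x):
            return label
    return default

def fitness(mask, rules, data):
    """Predictive accuracy of the rule subset selected by the bitmask."""
    subset = [r for bit, r in zip(mask, rules) if bit]
    return sum(classify(subset, x) == y for x, y in data) / len(data)

def ga_select(rules, data, pop=20, gens=30, mut=0.1):
    """GA over rule-subset bitmasks: elitist selection, one-point
    crossover, bit-flip mutation. Seeded with the full rule set."""
    random.seed(1)
    population = [[1] * len(rules)] + \
                 [[random.randint(0, 1) for _ in rules] for _ in range(pop - 1)]
    for _ in range(gens):
        scored = sorted(population, key=lambda m: fitness(m, rules, data),
                        reverse=True)
        parents = scored[:pop // 2]                 # elitism
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(rules))   # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if random.random() < mut else g
                             for g in child])
        population = parents + children
    return max(population, key=lambda m: fitness(m, rules, data))

# Toy rule set over 1-D inputs: two accurate rules and one misleading one.
rules = [
    (lambda x: x > 0.7, 1),
    (lambda x: x < 0.3, 0),
    (lambda x: 0.4 < x < 0.6, 1),   # hurts accuracy on this data
]
data = [(i / 10, int(i / 10 > 0.5)) for i in range(11)]
best = ga_select(rules, data)
# Elitism guarantees the result is at least as accurate as the full set.
assert fitness(best, rules, data) >= fitness([1, 1, 1], rules, data)
```

The second task described in the abstract would replace the fitness function with one that also weighs objective interestingness measures; this sketch only covers the accuracy-maximizing search.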
Dataset complexity in gene expression-based cancer classification using ensembles of k-nearest neighbors
Artificial Intelligence in Medicine (in this issue)
Cited by 4 (1 self)
Abstract. When applied to supervised classification problems, dataset complexity determines how difficult a given dataset is to classify. Since complexity is a non-trivial issue, it is typically characterized by a number of measures. In this paper, we explore the complexity of three gene expression datasets used for two-class cancer classification. We demonstrate that estimating the dataset complexity before performing the actual classification may provide a hint as to whether to apply a single best nearest-neighbour classifier or an ensemble of nearest-neighbour classifiers.
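One widely used measure of the kind this abstract alludes to is Fisher's discriminant ratio F1; the sketch below (our illustration, not tied to the paper's datasets) shows how it separates an easy dataset from a hard one:

```python
def fisher_ratio(class_a, class_b):
    """Fisher's discriminant ratio F1 for a single feature, a standard
    dataset complexity measure: higher means easier to separate."""
    mean = lambda v: sum(v) / len(v)
    var = lambda v, m: sum((x - m) ** 2 for x in v) / len(v)
    ma, mb = mean(class_a), mean(class_b)
    return (ma - mb) ** 2 / (var(class_a, ma) + var(class_b, mb))

easy_a, easy_b = [1.0, 1.1, 0.9], [5.0, 5.2, 4.8]   # well-separated classes
hard_a, hard_b = [1.0, 3.0, 5.0], [1.5, 3.5, 5.5]   # heavily overlapping

# A low F1 is the kind of hint that favours an ensemble of k-NN
# classifiers over a single best one.
assert fisher_ratio(easy_a, easy_b) > 1 > fisher_ratio(hard_a, hard_b)
```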
Using Correlation-Based Measures to Select Classifiers for Decision Fusion
Cited by 1 (0 self)
This paper explores classifier fusion problems where the task is to select a subset of classifiers from a larger set with the goal of achieving optimal performance. To aid in the selection process we propose the use of several correlation-based diversity measures. We define measures that capture the correlation for n classifiers, as opposed to pairs of classifiers only. We then suggest a sequence of steps for selecting classifiers. This method avoids the exhaustive evaluation of all classifier combinations, whose number can become very large for larger sets of classifiers. We then report on observations made after applying the method to a data set from a real-world application. The classifier set chosen achieves close to optimal performance with a drastically reduced number of evaluation steps.
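A hypothetical sketch of such a selection procedure (pairwise Pearson correlation on 0/1 output vectors plus a greedy sequential rule; the paper's measures generalize beyond pairs):

```python
def correlation(a, b):
    """Pearson correlation of two classifiers' 0/1 output vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return cov / (sa * sb) if sa and sb else 1.0

def select(outputs, k):
    """Sequential selection: start from the first classifier, then repeatedly
    add the one least correlated (in total) with those already chosen.
    Avoids scoring all C(n, k) combinations exhaustively."""
    chosen = [0]
    while len(chosen) < k:
        rest = [i for i in range(len(outputs)) if i not in chosen]
        score = lambda i: sum(correlation(outputs[i], outputs[j])
                              for j in chosen)
        chosen.append(min(rest, key=score))
    return sorted(chosen)

outputs = [
    [1, 1, 0, 0, 1, 0],  # classifier 0
    [1, 1, 0, 0, 1, 0],  # identical to 0 (fully correlated, redundant)
    [0, 1, 1, 0, 0, 1],  # quite different from 0
]
# Greedy selection of two classifiers skips the redundant duplicate.
assert select(outputs, 2) == [0, 2]
```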
Multiclass cancer classification using ensembles of classifiers: Preliminary results
 In Proceedings of the Workshop on Probabilistic Modeling and Machine Learning in Structural and Systems Biology
Cited by 1 (1 self)
Abstract. In our companion paper [10], we described in detail ensembles of nearest neighbours applied to cancer classification using serial analysis of gene expression (SAGE) data. Our results demonstrated the superiority of these ensembles over a single best classifier. Here, we supplement our machine learning experiments with an attempt to biologically interpret the obtained results. In particular, we are interested in verifying whether the selected genes are ubiquitous and associated with cancers.
Discovering Accurate and Interesting Classification Rules Using Genetic Algorithm
Proceedings of the International Conference on Data Mining (DMIN'06), p. 389
Choosing Classifiers for Decision Fusion
Proceedings of the Seventh International Conference on Information Fusion, 2004
This paper investigates the use of the r-correlation as a measure of classifier diversity to aid in the choice of classifiers for a fusion ensemble. Specifically, we define a measure that captures the correlation for n classifiers with binary output as well as for classifiers with continuous output. We then suggest the use of the r-correlation in classifier selection, where classifiers are picked sequentially from a larger pool of classifiers without the need to exhaustively evaluate the performance of all possible combinations. We show that this simple method gives close to optimal classification performance. We present examples from real applications for both binary- and continuous-output classifiers.