The PNC 2 Cluster Algorithm - An integrated learning algorithm for rule induction
, 2003
Abstract
This document describes the hierarchical agglomerative cluster algorithm Pnc 2 in the context of the direct generation of If-Then rules for classification tasks. As an agglomerative cluster algorithm, Pnc 2 initializes each learn data tuple as a single cluster. It then iteratively merges the two closest clusters that share the same output value, provided a merge test is passed. The merge test transforms the generalized cluster into a rule and evaluates it by a kind of hit rate. The rule's premise is the cuboid that encloses the input vectors of all learn data tuples merged in the cluster. This representation suffers in high-dimensional input spaces due to the COD problem, and thus a special mechanism is used to extend the cuboid during the merge test.
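The merge loop described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the Pnc 2 algorithm itself: the distance between cuboid centers, the `min_hit_rate` threshold, and all function names are assumptions made for the example, and the cuboid-extension mechanism mentioned in the abstract is left out.

```python
from itertools import combinations

def merge_boxes(a, b):
    """Smallest axis-aligned cuboid enclosing both cuboids."""
    lo = tuple(min(x, y) for x, y in zip(a[0], b[0]))
    hi = tuple(max(x, y) for x, y in zip(a[1], b[1]))
    return lo, hi

def contains(box, point):
    lo, hi = box
    return all(l <= p <= h for l, p, h in zip(lo, point, hi))

def center(box):
    lo, hi = box
    return tuple((l + h) / 2 for l, h in zip(lo, hi))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def hit_rate(box, label, data):
    """Fraction of training tuples covered by the cuboid that carry the rule's label."""
    covered = [y for x, y in data if contains(box, x)]
    return sum(y == label for y in covered) / len(covered)

def induce_rules(data, min_hit_rate=1.0):
    # Start with one cluster per training tuple: (cuboid, label).
    clusters = [((tuple(x), tuple(x)), y) for x, y in data]
    rejected = set()  # pairs that already failed the merge test
    while True:
        candidates = [
            (sq_dist(center(a[0]), center(b[0])), a, b)
            for a, b in combinations(clusters, 2)
            if a[1] == b[1] and frozenset((a, b)) not in rejected
        ]
        if not candidates:
            return clusters  # each remaining cluster is one If-Then rule
        _, a, b = min(candidates, key=lambda c: c[0])
        merged = merge_boxes(a[0], b[0])
        # Assumed merge test: the merged rule must keep a high enough hit rate.
        if hit_rate(merged, a[1], data) >= min_hit_rate:
            clusters.remove(a)
            clusters.remove(b)
            clusters.append((merged, a[1]))
        else:
            rejected.add(frozenset((a, b)))
```

Each returned pair `(cuboid, label)` reads as a rule: if the input vector lies inside the cuboid, predict the label.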
SPoID: Do Not Throw Meaningful Incomplete Sequences Away!
Abstract
Industrial databases often contain a large amount of unfilled information. During the knowledge discovery process, a processing step is often necessary to handle these incomplete data, either by deleting or by assessing them. When the data mining task consists of mining for frequent sequences, incomplete data are, most of the time, deleted, which leads to a significant loss of information. The extracted knowledge then becomes less representative of the database. We therefore propose a method that uses the partial information contained in incomplete records, only temporarily ignoring the missing part of the record. Experiments run on various synthetic datasets show the validity of our proposal in terms of both quality and robustness to the rate of missing values.
Improving Accuracy and Coverage of Data Mining Systems that are Built from Noisy Datasets: A New Model
Abstract
Problem statement: Noise within datasets has to be dealt with under most circumstances. This noise includes misclassified data or information as well as missing data or information. Simple human error is considered misclassification. These errors decrease the accuracy of the data mining system, making it unlikely to be used. The objective was to propose an effective algorithm to deal with noise in the form of missing data in datasets. Approach: A model for improving the accuracy and coverage of data mining systems was proposed, and the algorithm of this model was constructed. The algorithm deals with missing values in datasets. It splits the original dataset into two new datasets: one contains tuples that have no missing values, and the other contains tuples that have missing values. The proposed algorithm was applied to each of the two new datasets. It finds the reduct of each of them and then merges the new reducts into one new dataset, which is ready for training. Results: The results were interesting, as the algorithm increased the accuracy and coverage of the tested dataset compared to the traditional models. Conclusion: The proposed algorithm performs effectively and generates better results than the previous ones.
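The split-and-merge step in this abstract can be illustrated with a short sketch. The abstract does not say how the reduct is computed, so it appears here only as a caller-supplied placeholder (`reduce_fn`); that name, the function names, and the use of `None` as the missing-value marker are all assumptions made for the example.

```python
def split_by_missing(rows, missing=None):
    """Split tuples into those without and those with missing values."""
    complete = [r for r in rows if all(v is not missing for v in r)]
    incomplete = [r for r in rows if any(v is missing for v in r)]
    return complete, incomplete

def build_training_set(rows, reduce_fn, missing=None):
    """Apply the (hypothetical) reduct step to each part, then merge the results."""
    complete, incomplete = split_by_missing(rows, missing)
    return reduce_fn(complete) + reduce_fn(incomplete)
```

For instance, passing `reduce_fn=list` leaves both parts unchanged and simply returns the complete tuples followed by the incomplete ones.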
Learning Classifiers from Large Databases Using Statistical Queries
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
Abstract
We describe an approach to learning predictive models from large databases in settings where direct access to data is not available because of the massive size of the data, access restrictions, or bandwidth requirements. We outline some techniques for minimizing the number of statistical queries needed and for efficiently coping with missing values in the data. We provide an open source implementation of the decision tree and Naive Bayes algorithms to demonstrate the feasibility of the proposed approach.
1 Learning Using Statistical Queries
Advances in virtually every area of human endeavor are being increasingly driven by our ability to acquire knowledge from vast amounts of data. Most current approaches to learning from data assume direct access to data. However, in many practical applications, the large size, access restrictions, memory and bandwidth constraints, and in some instances, privacy considerations prohibit direct access to data. Hence, there is an urgent need for a scalable approach to learning predictive models from large datasets (that cannot fit in the memory available on the device where the learning algorithm is executed). To address this need, especially in settings where the data reside in distributed repositories, Caragea et al. [3, 4] have introduced a general strategy for transforming a broad class of standard learning algorithms that assume in-memory access to a dataset into algorithms that interact with the data source(s) only through statistical queries or procedures that can be executed on the remote data sources. This involves separating a learning algorithm into two components: (i) a statistical query generation
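The idea of a learner that touches the data only through statistical queries can be sketched for Naive Bayes as follows. This is an illustration under stated assumptions, not the authors' implementation: the `DataSource` class, its `count` method, and the add-one (Laplace) smoothing are all hypothetical choices made for the example, and missing values are not handled.

```python
class DataSource:
    """Hypothetical remote source that answers count queries only."""

    def __init__(self, rows):
        self._rows = rows  # list of dicts; never handed to the learner

    def count(self, **conditions):
        """Number of rows matching every attribute=value condition."""
        return sum(
            all(r.get(k) == v for k, v in conditions.items())
            for r in self._rows
        )

def learn_naive_bayes(source, class_attr, attr_values, class_values):
    """Build a Naive Bayes model from count queries alone."""
    total = source.count()
    model = {}
    for c in class_values:
        n_c = source.count(**{class_attr: c})
        # Add-one smoothing is an assumption made for this sketch.
        cond = {
            (a, v): (source.count(**{class_attr: c, a: v}) + 1) / (n_c + len(vals))
            for a, vals in attr_values.items()
            for v in vals
        }
        model[c] = (n_c / total, cond)
    return model

def predict(model, instance):
    """Return the class with the highest prior times likelihood."""
    def score(c):
        prior, cond = model[c]
        p = prior
        for a, v in instance.items():
            p *= cond[(a, v)]
        return p
    return max(model, key=score)
```

The learner issues one count query per class and one per (class, attribute, value) combination, so the number of queries depends on the attribute domains rather than on the number of rows.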

