Automated Detection of Outliers in Real-World Data
- Proc. of the Second International Conference on Intelligent Technologies, 2001
Cited by 7 (0 self)
Most real-world databases include a certain amount of exceptional values, generally termed "outliers". Isolating outliers is important both for improving the quality of the original data and for reducing the impact of outlying values on the process of knowledge discovery in databases. Most existing methods of outlier detection are based on manual inspection of graphically represented data. In this paper, we present a new approach to automating the process of detecting and isolating outliers. The process is based on modeling the human perception of exceptional values by using fuzzy set theory. Separate procedures are developed for detecting outliers in discrete and continuous univariate data. The outlier detection procedures are demonstrated on several standard datasets of varying data quality.
Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System
The paper presents a new approach to the sequential multi-stage combination of instance-based classifiers. Each classifier in the sequence requires more computational effort than the preceding classifier because it uses a larger subset of features. A more complex classifier is activated only if the confidence level of the preceding classifier is below a pre-defined threshold. The optimal threshold is found by maximizing a customizable fuzzy-based measure, called the Performance Index (PI), which expresses the task-specific trade-off between classification accuracy and computational complexity. The approach is evaluated on a two-stage combination of k-Nearest Neighbor classifiers. The features to be used by the first classifier in the combination are found by a novel feature selection method, called “IFN + Relief.” The PI measure is shown empirically to be an efficient tool for integrating accuracy and complexity considerations in the design of a multi-stage classification system.
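The two-stage activation rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence measure (the fraction of the k neighbors voting for the winning class), the function names `knn_predict` and `two_stage_predict`, and the toy data are all assumptions made for illustration, and the threshold is passed in by hand rather than optimized with the PI measure, which the abstract does not define.

```python
import math
from collections import Counter

def knn_predict(train, query, feature_idx, k=3):
    """Classify `query` with k-NN using only the features in `feature_idx`.

    Returns (label, confidence). Confidence here is the share of the k
    neighbors that voted for the winning label (an assumed measure).
    """
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in feature_idx))

    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

def two_stage_predict(train, query, cheap_idx, full_idx, threshold, k=3):
    """Run the cheap classifier first; fall back to the costlier one
    only when the cheap classifier's confidence is below `threshold`.

    Returns (label, stage_used).
    """
    label, conf = knn_predict(train, query, cheap_idx, k)
    if conf >= threshold:
        return label, 1
    label, _ = knn_predict(train, query, full_idx, k)
    return label, 2

# Hypothetical toy data: (features, label).
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.1), "a"), ((0.4, 0.0), "a"),
    ((0.5, 1.0), "b"), ((0.9, 1.0), "b"), ((1.0, 0.9), "b"),
]
easy = two_stage_predict(train, (0.05, 0.05), [0], [0, 1], threshold=0.9)
hard = two_stage_predict(train, (0.5, 0.95), [0], [0, 1], threshold=0.9)
```

In this toy setup the first query is resolved by the one-feature classifier, while the second query is ambiguous on the first feature alone and triggers the two-feature classifier. Raising the threshold sends more instances to the second stage, which is the accuracy versus cost trade-off the abstract refers to.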
Fuzzy Kernel Clustering with Outliers
Outliers are data values that lie away from the general clusters of other data values. An outlier may reflect the most important feature of a dataset. In this paper, a new fuzzy kernel clustering algorithm is presented to locate the critical areas that are often represented by only a few outliers. Through Mercer kernel functions, the data in the original space are first mapped to a high-dimensional feature space. Then a modified objective function for fuzzy clustering is introduced in the feature space. An additional weighting factor is assigned to each vector in the feature space, and the weight value is updated using the iterative functions derived from the objective function. The final weight of a datum indicates how representative that datum is. With these weights, experts can identify the outliers easily. The simulations demonstrate the feasibility of this method.
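The mapping step in this abstract relies on a standard property of Mercer kernels: squared distances in the feature space can be computed from kernel evaluations alone, since ||φ(x) − φ(v)||² = K(x, x) − 2K(x, v) + K(v, v). The sketch below shows only that identity. The Gaussian kernel, the `sigma` parameter, and the function names are assumptions for illustration; the abstract does not state which kernel, objective function, or weight update the paper uses.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, one common Mercer kernel (assumed choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def feature_space_sq_dist(x, v, kernel=gaussian_kernel):
    """Squared distance between the images of x and v in feature space,
    computed without ever forming the high-dimensional mapping."""
    return kernel(x, x) - 2 * kernel(x, v) + kernel(v, v)

near = feature_space_sq_dist((0.0, 0.0), (0.1, 0.0))
far = feature_space_sq_dist((0.0, 0.0), (10.0, 10.0))
```

With a Gaussian kernel the feature-space distance is bounded above by 2, so a point far from every cluster saturates near that bound instead of dominating the objective.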
Computing Temporal Trends in Web Documents
- EUSFLAT-LFA 2005
Most existing methods of web content mining assume that web documents are static. This approach is inadequate for long-term monitoring and analysis of web content, since both the users' interests and the content of most web sites are subject to continuous changes over time. In this research, we are interested in developing computationally intelligent and efficient text mining techniques that will enable continuous comparison between documents provided by the same source (website, institute, organization, cult, author, etc.) or viewed by the same group of users (e.g., university students) and timely detection of temporal trends in those documents. Our approach builds upon the recently developed methodology for fuzzy comparison of frequency distributions. The proposed techniques are evaluated on a real-world stream of web traffic.
Fuzzification and Reduction of Information-Theoretic Rule Sets
If-then rules are one of the most common forms of knowledge discovered by data mining methods. The number and the length of extracted rules tend to increase with the size of a database, making the rulesets less interpretable and useful. Existing methods of extracting fuzzy rules from numerical data improve interpretability, but the dimensionality of fuzzy rulesets remains high. In this paper, we present a new methodology for reducing the dimensionality of rulesets discovered in data. Our method builds upon the information-theoretic fuzzy approach to knowledge discovery. We start by constructing an information-theoretic network from a data table and extracting a set of association rules based on the network connections. The set of information-theoretic rules is fuzzified and significantly reduced by using the principles of the Computational Theory of Perception (CTP). We demonstrate the method on a real-world database from the semiconductor industry.