Class Noise Mitigation through Instance Weighting
Cited by 8 (2 self)
We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance and uses the confidence in the current label as a weight during training. The probability vector should be calculated such that clean instances have high confidence in their current labels, while mislabeled instances have low confidence in their current labels and high confidence in their correct labels. Past research has focused on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting and are thus part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.
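The claim that discarding and correcting are special cases of instance weighting can be illustrated with a minimal sketch. It assumes the per-instance class probabilities are already available (the abstract obtains them by clustering, which is not shown here). The function names and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def weight_instance(probs: Dict[str, float], label: str) -> Tuple[str, float]:
    """General case: keep the current label, weighted by its confidence."""
    return label, probs[label]


def discard_instance(probs: Dict[str, float], label: str,
                     threshold: float = 0.5) -> Tuple[str, float]:
    """Discarding as weighting: the weight is forced to 0 or 1.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    return label, 1.0 if probs[label] >= threshold else 0.0


def correct_instance(probs: Dict[str, float], label: str) -> Tuple[str, float]:
    """Correcting as weighting: all weight moves to the most probable label."""
    best = max(probs, key=probs.get)
    return best, 1.0


# A likely mislabeled instance: currently labeled "a", but "b" is more probable.
probs = {"a": 0.2, "b": 0.8}
weighted = weight_instance(probs, "a")
discarded = discard_instance(probs, "a")
corrected = correct_instance(probs, "a")
```

In this sketch, weighting keeps the suspicious instance with a reduced influence, discarding removes its influence entirely, and correcting relabels it with full influence.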
Robust Support Vector Machine Training via Convex Outlier Ablation
Cited by 6 (2 self)
One of the well known risks of large margin training methods, such as boosting and support vector machines (SVMs), is their sensitivity to outliers. These risks are normally mitigated by using a soft margin criterion, such as hinge loss, to reduce outlier sensitivity. In this paper, we present a more direct approach that explicitly incorporates outlier suppression in the training process. In particular, we show how outlier detection can be encoded in the large margin training principle of support vector machines. By expressing a convex relaxation of the joint training problem as a semidefinite program, one can use this approach to robustly train a support vector machine while suppressing outliers. We demonstrate that our approach can yield superior results to the standard soft margin approach in the presence of outliers.
Learning Naive Bayes Classifier from Noisy Data
, 2003
Cited by 4 (0 self)
Classification is one of the major tasks in knowledge discovery and data mining. Naive Bayes classifier, in spite of its simplicity, has proven surprisingly effective in many practical applications. In real datasets, noise is inevitable, because of the imprecision of measurement or privacy preserving mechanisms. In this paper, we develop a new approach, LinEar-Equation-based noise-aWare bAYes classifier (LEEWAY), for learning the underlying naive Bayes classifier from noisy observations. Using
Dealing with predictive-but-unpredictable attributes in noisy data sources
- In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 04)
, 2004
Cited by 4 (1 self)
Attribute noise can affect classification learning. Previous work on handling attribute noise has focused on predictable attributes, that is, attributes that can be predicted from the class and the other attributes. However, attributes can often be predictive but unpredictable. Being predictive, they are essential to classification learning, so it is important to handle their noise. Being unpredictable, they require strategies different from those used for predictable attributes. This paper presents a study on identifying, cleansing and measuring noise for predictive-but-unpredictable attributes. New strategies are accordingly proposed. Both theoretical analysis and empirical evidence suggest that these strategies are more effective and more efficient than previous alternatives.
Impact of learning set quality and size on decision tree performances
- Int. Journal of Computers, Systems and Signals
, 2000
Cited by 3 (1 self)
The quality of a decision tree is usually evaluated through its complexity and its generalization accuracy. Tree-simplification procedures aim at optimizing these two performance criteria. Among them, data reduction techniques differ from pruning in their simplification strategy. While pruning algorithms directly control tree size to combat the overfitting problem, data reduction techniques preprocess the data prior to decision tree construction to improve the quality of the learning set. Recent experimental results have shown that randomly manipulating training set size has a direct impact on tree size, and therefore recommend the use of the latter simplification strategy. In this paper, we provide theoretical arguments justifying data preprocessing in favor of tree simplification. We also investigate new data reduction techniques, usually used in the field of prototype selection. From experiments with 22 datasets, we show that some of them are very effective at improving standard post-pruning performance.
Outlier management in intelligent data analysis
, 2000
Cited by 2 (0 self)
In spite of many statistical methods for outlier detection and for robust analysis, there is little work on further analysis of outliers themselves to determine their origins. For example, there are “good” outliers that provide useful information that can lead to the discovery of new knowledge, or “bad” outliers that include noisy data points. Successfully distinguishing between different types of outliers is an important issue in many applications, including fraud detection, medical tests, process analysis and scientific discovery. It requires not only an understanding of the mathematical properties of data but also relevant knowledge in the domain context in which the outliers occur. This thesis presents a novel attempt at automating the use of domain knowledge to help distinguish between different types of outliers. Two complementary knowledge-based outlier analysis strategies are proposed: one using knowledge regarding how “normal data” should be distributed in a domain of interest in order to identify “good” outliers, and the other using the understanding of “bad” outliers. This kind of knowledge-based outlier analysis is a useful extension to existing work in both statistical and computing communities on outlier detection.
Improvement of the State Merging Rule on Noisy Data in Probabilistic Grammatical Inference
- 10th European Conference on Machine Learning. Number 2837 in LNAI, Springer-Verlag (2003) 169–1180
, 2003
Cited by 2 (0 self)
In this paper we study the influence of noise in probabilistic grammatical inference. We bring out the paradoxical idea that specialized automata deal better with noisy data than more general ones. We then propose to replace the statistical test of the Alergia algorithm with a more restrictive merging rule based on a test of proportion comparison.
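To make the difference between the two kinds of merging tests concrete, here is a hedged sketch. The first function is the Hoeffding-bound compatibility check commonly associated with Alergia. The second is a standard pooled two-proportion z-test, used here only as a stand-in: the abstract does not say which proportion test the authors adopt, so the function names, the `alpha` default, and the choice of z-test are assumptions.

```python
import math
from statistics import NormalDist


def hoeffding_compatible(f1: int, n1: int, f2: int, n2: int,
                         alpha: float = 0.05) -> bool:
    """Alergia-style check: two observed frequencies are compatible when
    their difference is below a Hoeffding bound."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound


def proportion_compatible(f1: int, n1: int, f2: int, n2: int,
                          alpha: float = 0.05) -> bool:
    """Pooled two-proportion z-test (an assumed stand-in, see lead-in)."""
    pooled = (f1 + f2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        return True  # both proportions are identical and degenerate
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = abs(f1 / n1 - f2 / n2) / std_err
    return z < NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Two states whose outgoing-symbol frequencies are 30/100 and 45/100.
merge_by_hoeffding = hoeffding_compatible(30, 100, 45, 100)
merge_by_proportion = proportion_compatible(30, 100, 45, 100)
```

On this example the Hoeffding check accepts the merge while the z-test rejects it, which is the sense in which a proportion test can be the more restrictive rule.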
Data Mining for Prediction. Financial Series Case, Doctoral Thesis, The Royal
, 2003
Cited by 1 (0 self)
Hard problems force innovative approaches and attention to detail, and their exploration often contributes beyond the area initially attempted. This thesis investigates the data mining process resulting in a predictor for numerical series. The series experimented with come from financial data, which are usually hard to forecast. One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship. Because this approach does not assume any special knowledge or form of the regularities, the method is quite general and applicable to other time series, not just financial ones. However, the generality puts strong demands on the pattern detection, which must notice regularities in any of the many possible forms. The thesis's quest for automated pattern-spotting involves numerous data mining and optimization techniques: neural networks, decision trees, nearest neighbors, regression, genetic algorithms and others. Comparison of their performance on stock exchange index data is one of the contributions. As no single technique performed sufficiently well, a number of predictors have been put together, forming a voting ensemble. The vote is diversified not only by different training data, as is usually done, but also by the learning method and its parameters. An approach to speeding up predictor fine-tuning is also proposed. The algorithm development goes still further: a prediction can only be as good as the training data, hence the need for good data preprocessing. In particular, new multivariate discretization and attribute selection algorithms are presented. The thesis also includes overviews of prediction pitfalls and possible solutions, as well as of ensemble-building for series data with financial characteristics, such as noise and many attributes. The Ph.D. thesis consists of an extended background on financial prediction, 7 papers, and 2 appendices.
Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora
, 2000
Cited by 1 (0 self)
In this paper we apply the ensemble approach to the identification of incorrectly annotated items (noise) in a training set. In a controlled experiment, memory-based, decision tree-based and transformation-based classifiers are used as a filter to detect and remove noise deliberately introduced into a manually tagged corpus. The results indicate that the method can be successfully applied to automatically detect errors in a corpus.
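The ensemble-as-filter idea can be sketched as a vote over classifier outputs. The sketch assumes each classifier's predictions for the training items are already available (for example from cross-validation) and flags an item when enough classifiers disagree with its annotated tag. The abstract does not state the voting scheme, so the majority and consensus variants, the function name `flag_noise`, and the toy tags are all illustrative assumptions.

```python
from typing import List, Sequence


def flag_noise(labels: Sequence[str],
               predictions: Sequence[Sequence[str]],
               consensus: bool = False) -> List[int]:
    """Return indices of items whose annotated label is disputed.

    predictions[k][i] is classifier k's prediction for item i.
    With consensus=True every classifier must disagree with the label;
    otherwise a strict majority is enough.
    """
    n_classifiers = len(predictions)
    needed = n_classifiers if consensus else n_classifiers // 2 + 1
    flagged = []
    for i, label in enumerate(labels):
        disagreements = sum(1 for preds in predictions if preds[i] != label)
        if disagreements >= needed:
            flagged.append(i)
    return flagged


# Toy part-of-speech tags for four tokens and three hypothetical classifiers.
labels = ["NN", "VB", "DT", "NN"]
predictions = [
    ["NN", "NN", "DT", "VB"],
    ["NN", "NN", "DT", "VB"],
    ["NN", "NN", "JJ", "NN"],
]
majority_flags = flag_noise(labels, predictions)
consensus_flags = flag_noise(labels, predictions, consensus=True)
```

The consensus variant flags fewer items and is therefore less likely to remove correctly annotated data, at the cost of missing some errors.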
A Comparison of Outlier Detection Algorithms for Machine Learning
, 2005
Cited by 1 (0 self)
This paper presents a comparison of outlier detection algorithms, with an overview of outlier detection methods and experimental results for six implemented methods. We applied these methods to the prediction of stellar population parameters as well as to machine learning benchmark data, inserting artificial noise and outliers. We used kernel principal component analysis to reduce the dimensionality of the spectral data. Experiments on noisy and noiseless data were performed.

