Toward integrating feature selection algorithms for classification and clustering
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2005
Cited by 267 (21 self)
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines in selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
A selective sampling approach to active feature selection
- Artificial Intelligence 159(1-2)
, 2004
Cited by 22 (0 self)
Feature selection, as a preprocessing step to machine learning, has been very effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. Traditional feature selection methods resort to random sampling in dealing with data sets with a huge number of instances. In this paper, we introduce the concept of active feature selection, and investigate a selective sampling approach to active feature selection in a filter model setting. We present a formalism of selective sampling based on data variance, and apply it to a widely used feature selection algorithm Relief. Further, we show how it realizes active feature selection and reduces the required number of training instances to achieve time savings without performance deterioration. We design objective evaluation measures of performance, conduct extensive experiments using both synthetic and benchmark data sets, and observe consistent and significant improvement. We suggest some further work based on our study and experiments.
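For orientation, the baseline Relief algorithm named in the abstract can be sketched as follows. This is a minimal sketch of standard Relief with plain random sampling, not the paper's selective sampling variant. It assumes two classes, numeric features scaled to [0, 1], and Manhattan distance; the function name `relief_weights` and the toy data are illustrative assumptions.

```python
import random


def relief_weights(X, y, n_samples, seed=0):
    """Basic two-class Relief: score features by how well they separate
    each sampled instance from its nearest miss versus its nearest hit."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        # Manhattan distance (an assumption; other metrics are possible)
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        # Nearest hit: closest other instance with the same class
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        # Nearest miss: closest instance with a different class
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            diff_miss = abs(X[i][f] - X[miss][f])
            diff_hit = abs(X[i][f] - X[hit][f])
            weights[f] += (diff_miss - diff_hit) / n_samples
    return weights


# Toy data (hypothetical): feature 0 tracks the class, feature 1 is noise.
X = [[0.0, 0.1], [0.1, 0.9], [0.05, 0.5], [0.9, 0.2], [1.0, 0.8], [0.95, 0.45]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y, n_samples=20)
```

On this toy data the weight of feature 0 comes out positive and the weight of feature 1 negative, so a threshold on the weights would keep only the informative feature.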
Feature Selection with Selective Sampling
- In Proceedings of the Nineteenth International Conference on Machine Learning
, 2002
Cited by 20 (6 self)
Feature selection, as a preprocessing step to machine learning, has been shown to be very effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving comprehensibility. In this paper, we consider the problem of active feature selection in a filter model setting. We describe a formalism of active feature selection called selective sampling, demonstrate it by applying it to the widely used feature selection algorithm Relief, and show how it realizes active feature selection and reduces the amount of training data Relief requires, achieving time savings without performance deterioration. We design objective evaluation measures, conduct extensive experiments using benchmark data sets, and observe consistent and significant improvement.
Opcode sequences as representation of executables for data-mining-based unknown malware detection
- INFORMATION SCIENCES 227
, 2013
Cited by 12 (0 self)
Malware can be defined as any type of malicious code that has the potential to harm a computer or network. The volume of malware is growing faster every year and poses a serious global security threat. Consequently, malware detection has become a critical topic in computer security. Currently, signature-based detection is the most widespread method used in commercial antivirus. In spite of the broad use of this method, it can detect malware only after the malicious executable has already caused damage and provided the malware is adequately documented. Therefore, the signature-based method consistently fails to detect new malware. In this paper, we propose a new method to detect unknown malware families. This model is based on the frequency of the appearance of opcode sequences. Furthermore, we describe a technique to mine the relevance of each opcode and assess the frequency of each opcode sequence. In addition, we provide empirical validation that this new method is capable of detecting unknown malware.
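The frequency-based representation mentioned in the abstract can be illustrated with a small sketch. It assumes the opcodes have already been extracted from an executable by a disassembler and covers only the sequence-frequency step, not the paper's opcode relevance weighting. The function name `opcode_sequence_frequencies` and the sample opcodes are hypothetical.

```python
from collections import Counter


def opcode_sequence_frequencies(opcodes, n=2):
    """Return the relative frequency of each length-n opcode sequence."""
    # Slide a window of length n over the opcode list
    sequences = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    total = len(sequences)
    if total == 0:
        # Program is shorter than n opcodes: no sequences to count
        return {}
    counts = Counter(sequences)
    return {seq: count / total for seq, count in counts.items()}


# Hypothetical opcode trace of a tiny program
program = ["mov", "add", "mov", "add", "push"]
freqs = opcode_sequence_frequencies(program, n=2)
```

Each executable then becomes a vector of such frequencies, which a standard classifier can consume.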
Instances Selection Using Advance Data Mining Techniques
- 53, ISSN Print: 0976 – 6367, ISSN Online: 0976 –
, 2012
Cited by 6 (0 self)
Genetic algorithms (GAs) are optimization techniques inspired by natural evolution processes. They handle a population of individuals that evolve with the help of information exchange procedures. In this paper, we propose a GA approach that optimizes connection weights and instance selection for artificial neural networks (ANNs) to predict the stock price index. ANNs have preeminent learning ability, but often exhibit inconsistent and unpredictable performance on noisy data. In this paper, the GA is employed not only to improve the learning algorithm, but also to reduce the complexity of the feature space. The GA simultaneously optimizes the connection weights between layers and a selection of relevant instances. This study applies the proposed model to India Cements Stock Price Index (ICSPI) analysis. Experimental results show that the GA approach is a promising method for selecting instances and optimizing the connection weights between layers.
Parallel Distributed Genetic Fuzzy Rule Selection
- SOFT COMPUTING (SPECIAL ISSUE ON GENETIC FUZZY SYSTEMS)
Cited by 6 (1 self)
Genetic fuzzy rule selection has been successfully used to design accurate and compact fuzzy rule-based classifiers. It is, however, very difficult to handle large data sets due to the increase in computational costs. This paper proposes a simple but effective idea to improve the scalability of genetic fuzzy rule selection to large data sets. Our idea is based on its parallel distributed implementation. Both a training data set and a population are divided into subgroups (i.e., into training data subsets and sub-populations, respectively) for the use of multiple processors. We compare seven variants of the parallel distributed implementation with the original non-parallel algorithm through computational experiments on some benchmark data sets.
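The division step described in the abstract can be sketched as follows. This shows only one plausible way to partition the data and the population (round-robin assignment) and to pair each sub-population with a data subset. The helper `split_into_subgroups`, the placeholder individuals, and the pairing scheme are assumptions for illustration, not the paper's actual variants.

```python
def split_into_subgroups(items, n_groups):
    """Divide a list into n_groups subgroups by round-robin assignment."""
    return [items[i::n_groups] for i in range(n_groups)]


# Hypothetical stand-ins: integers for training patterns,
# strings for individuals (candidate rule sets) in the population.
training_data = list(range(10))
population = [f"rule_set_{i}" for i in range(6)]
n_processors = 3

data_subsets = split_into_subgroups(training_data, n_processors)
sub_populations = split_into_subgroups(population, n_processors)

# Each processor evolves one sub-population against one training data subset.
assignments = list(zip(sub_populations, data_subsets))
```

Because each processor evaluates fewer individuals on fewer patterns, the fitness evaluation cost per processor shrinks on both axes.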
A first study on the use of coevolutionary algorithms for instance and feature selection
Cited by 4 (0 self)
Cooperative Coevolution is a technique in the area of Evolutionary Computation. It has been applied to many combinatorial problems with great success. This contribution proposes a Cooperative Coevolution model for simultaneously performing several data reduction processes in classification with nearest neighbour methods through feature and instance selection. To check its performance, we have compared the proposal with other evolutionary approaches for performing data reduction. Results have been analyzed and contrasted by using non-parametric statistical tests, showing that the proposed model outperforms the noncooperative evolutionary techniques.
Integrating Instance Selection, Instance Weighting, and Feature Weighting for Nearest Neighbor Classifiers by Coevolutionary Algorithms
Cited by 3 (1 self)
Cooperative coevolution is a successful trend of evolutionary computation which allows us to define partitions of the domain of a given problem, or to integrate several related techniques into one, by the use of evolutionary algorithms. It can be applied to the development of advanced classification methods, which integrate several machine learning techniques into a single proposal. A novel approach integrating instance selection, instance weighting, and feature weighting into the framework of a coevolutionary model is presented in this paper. We compare it with a wide range of evolutionary and nonevolutionary related methods, in order to show the benefits of using coevolution to apply the considered techniques simultaneously. The results obtained, contrasted through nonparametric statistical tests, show that our proposal outperforms the other methods in the comparison, making it a suitable tool for enhancing the nearest neighbor classifier. Index Terms: Cooperative coevolution, feature weighting (FW), instance selection (IS), instance weighting (IW), nearest neighbor rule.
Active learning for classifying a spectrally variable subject
- In 2nd International Workshop on Pattern Recognition for Remote Sensing (PRRS 2002), Niagara Falls
, 2002
Cited by 2 (2 self)
infrared (CIR) airphotos by automated methods presents a challenging problem due to a number of variable and unfavorable conditions: changes in imaging conditions, problems associated with water-related subjects, and other environmental changes as well as expected lack of spectral separation between Egeria and other land cover classes in CIR imagery. To address these challenges, we are developing an interactive computer system based on data mining techniques with Active Learning capabilities. The key components of this system are: feature extraction, automatic classification, active learning, and experimental evaluation. We anticipate creating an interactive learning system that can learn from human analysts by relating results to extracted objects and that can learn analytic rules for classification. In this paper, we report the concept of the system, preliminary experimental results, and anticipated future work.
I. BACKGROUND AND PROBLEM
Rapid advances in remote sensing and in data storage have made huge quantities of image data available for analysis. The spatial, spectral, and radiometric resolutions of digital remotely-sensed imagery have greatly improved in recent years. However, automated
COLLECTIVE CLASSIFICATION FOR UNKNOWN MALWARE DETECTION
Cited by 2 (0 self)
Malware is any type of computer software harmful to computers and networks. The amount of malware is increasing every year and poses a serious global security threat. Signature-based detection is the most broadly used commercial antivirus method; however, it fails to detect new and previously unseen malware. Supervised machine-learning models have been proposed to solve this issue, but supervised learning is far from perfect because it requires a significant amount of malicious code and benign software to be identified and labelled beforehand. In this paper, we propose a new method that adopts a collective learning approach to detect unknown malware. Collective classification is a type of semi-supervised learning that presents an interesting method for optimising the classification of partially-labelled data. We propose here, for the first time, collective classification algorithms to build different machine-learning classifiers using a set of labelled (as malware and legitimate software) and unlabelled instances. We perform an empirical validation demonstrating that the labelling effort is lower than when supervised learning is used, while maintaining high accuracy rates.