Results 1 - 10 of 1,065
Learning from imbalanced data
- IEEE Transactions on Knowledge and Data Engineering
, 2009
"... Abstract—With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-m ..."
Abstract
-
Cited by 260 (6 self)
- Add to MetaCart
Abstract—With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
Index Terms—Imbalanced learning, classification, sampling methods, cost-sensitive learning, kernel-based learning, active learning, assessment metrics.
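Among the sampling methods this survey covers, random oversampling is the simplest: duplicate minority-class examples until the class counts match. A minimal sketch, assuming a binary 0/1 label vector (the helper below is mine, not code from the paper):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample the minority class with replacement until both
    classes have equal counts. X: (n_samples, n_features), y: 0/1."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)  # sampled duplicates
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```

Random undersampling is the mirror image (discard majority examples instead of duplicating minority ones), and cost-sensitive learning reweights errors rather than resampling.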
Nonparametric Latent Feature Models for Link Prediction
"... As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks ..."
Abstract
-
Cited by 106 (1 self)
- Add to MetaCart
As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets.
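The scoring step in this family of models is bilinear: entity i has a binary feature vector Z[i], a weight matrix W encodes how feature pairs interact, and the link probability is a logistic function of Z[i] W Z[j]^T plus covariate terms. A minimal sketch, with the covariates collapsed into a single offset c and the nonparametric inference over the number of features (the part the paper automates) omitted:

```python
import numpy as np

def link_probabilities(Z, W, c=0.0):
    """Pairwise link probabilities sigma(Z[i] @ W @ Z[j] + c).
    Z: (n, K) binary feature matrix, W: (K, K) real weights,
    c: stand-in for the known-covariate contribution."""
    logits = Z @ W @ Z.T + c
    return 1.0 / (1.0 + np.exp(-logits))  # (n, n) matrix of probabilities
```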
Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings
- IEEE Transactions on Software Engineering
"... Abstract—Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due ..."
Abstract
-
Cited by 86 (1 self)
- Add to MetaCart
Abstract—Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due to inconsistent findings regarding the superiority of one classifier over another and the usefulness of metric-based classification in general, more research is needed to improve convergence across studies and further advance confidence in experimental results. We consider three potential sources for bias: comparing classifiers over one or a small number of proprietary data sets, relying on accuracy indicators that are conceptually inappropriate for software defect prediction and cross-study comparisons, and, finally, limited use of statistical testing procedures to secure empirical findings. To remedy these problems, a framework for comparative software defect prediction experiments is proposed and applied in a large-scale empirical comparison of 22 classifiers over 10 public domain data sets from the NASA Metrics Data repository. Overall, an appealing degree of predictive accuracy is observed, which supports the view that metric-based classification is useful. However, our results indicate that the importance of the particular classification algorithm may be less than previously assumed since no significant performance differences could be detected among the top 17 classifiers.
Index Terms—Complexity measures, data mining, formal methods, statistical methods, software defect prediction.
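The statistical-testing remedy the framework calls for is typically a Friedman test over a classifiers-by-data-sets performance matrix, followed by post-hoc comparisons. A sketch with made-up AUC values (four data sets, three classifiers); a non-significant p-value would mirror the paper's finding that the top classifiers are indistinguishable:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical AUC results: rows = data sets, columns = classifiers.
auc = np.array([
    [0.78, 0.80, 0.79],
    [0.71, 0.74, 0.73],
    [0.83, 0.85, 0.84],
    [0.69, 0.70, 0.72],
])

# Friedman test on the per-data-set rankings of the classifiers.
stat, p = friedmanchisquare(*auc.T)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```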
How much anonymity does network latency leak?
- In CCS ’07: Proceedings of the 14th ACM Conference on Computer and Communications Security. ACM
, 2007
"... Low-latency anonymity systems such as Tor, AN.ON, Crowds, and Anonymizer.com aim to provide anonymous connections that are both untraceable by “local ” adversaries who control only a few machines, and have low enough delay to support anonymous use of network services like web browsing and remote log ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
Low-latency anonymity systems such as Tor, AN.ON, Crowds, and Anonymizer.com aim to provide anonymous connections that are both untraceable by “local” adversaries who control only a few machines, and have low enough delay to support anonymous use of network services like web browsing and remote login. One consequence of these goals is that these services leak some information about the network latency between the sender and one or more nodes in the system. We present two attacks on low-latency anonymity schemes using this information. The first attack allows a pair of colluding web sites to predict, based on local timing information and with no additional resources, whether two connections from the same Tor exit node are using the same circuit with high confidence. The second attack requires more resources but allows a malicious website to gain several bits of information about a client each time he visits the site. We evaluate both attacks against two low-latency anonymity protocols – the Tor network and the MultiProxy proxy aggregator service – and conclude that both are highly vulnerable to these attacks.
Categories and Subject Descriptors: C.2.0 [Computer Networks]: General—Security and protection
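At its core, the first attack is a statistical decision: do the latencies observed on two connections look like draws from the same circuit's round-trip-time distribution? A toy sketch of that decision, using a two-sample Kolmogorov-Smirnov test purely as a stand-in for the paper's actual procedure:

```python
from scipy.stats import ks_2samp

def plausibly_same_circuit(rtts_a, rtts_b, alpha=0.05):
    """Compare two lists of round-trip-time samples (seconds).
    If we cannot reject 'same distribution', the two connections
    plausibly share a circuit. Illustrative only."""
    _, p = ks_2samp(rtts_a, rtts_b)
    return p > alpha
```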
Introduction and summary
- Social Security Programs and Retirement around the World: Micro-Estimation. National Bureau of Economic Research
, 2004
"... respiratory syncytial (RS) virus. The status of GP1 has been uncertain, because a cellular glycoprotein migrates at the same position when Laemmli's discontinuous buffer system is used for PAGE, and because BSC-1 cells infected with the RSN-2 strain of RS virus appear not to contain GP1. Howeve ..."
Abstract
-
Cited by 45 (4 self)
- Add to MetaCart
(Show Context)
respiratory syncytial (RS) virus. The status of GP1 has been uncertain, because a cellular glycoprotein migrates at the same position when Laemmli's discontinuous buffer system is used for PAGE, and because BSC-1 cells infected with the RSN-2 strain of RS virus appear not to contain GP1. However, additional evidence suggests that GP1 is a viral structural protein. (i) It is removed from cells by trypsin, while the cellular glycoprotein is not; (ii) it is separated from the cellular glycoprotein when the infected cells are analysed by neutral SDS-PAGE; (iii) it is present in the purified RSN-2 strain of RS virus produced by BSC-1 cells; (iv) it is also present in the purified Long strain of RS virus produced by either human or monkey cells. When purified Long strain virus is analysed by PAGE under non-reducing conditions, the glycoproteins VGP48 and GP26 migrate together, and VPM27 separates into two proteins, which one-dimensional peptide mapping suggests are not different proteins. These observations suggest that VGP48 and GP26 exist in the virion as a single molecule joined by disulphide bonds, and so resemble a paramyxovirus fusion protein, and that probably there are two forms of VPM27 which differ in either position or number of disulphide bonds.
The Dendritic Cell Algorithm
, 2007
"... Abstract. The Dendritic Cell Algorithm is an immune-inspired algorithm originally based on the function of natural dendritic cells. The original instantiation of the algorithm is a highly stochastic algorithm. While the performance of the algorithm is good when applied to large real-time datasets, i ..."
Abstract
-
Cited by 40 (14 self)
- Add to MetaCart
Abstract. The Dendritic Cell Algorithm is an immune-inspired algorithm originally based on the function of natural dendritic cells. The original instantiation of the algorithm is highly stochastic. While the performance of the algorithm is good when applied to large real-time datasets, it is difficult to analyse due to the number of random elements. In this paper a deterministic version of the algorithm is proposed, implemented and tested using a port scan dataset to provide a controllable system. This version has a controllable number of parameters, which are experimented with in this paper. In addition, the effects of time windows and of varying the number of cells are examined, both of which are shown to influence the algorithm. Finally, a novel metric for assessing the algorithm's output is introduced and proves to be more sensitive than the metric used with the original Dendritic Cell Algorithm.
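A minimal sketch of the deterministic cell-update loop, assuming the commonly published deterministic DCA signal weights (csm = safe + danger, k = danger - 2*safe) rather than anything quoted from this paper; the spread of migration thresholds across the cell population is what implements the time windows:

```python
def ddca(signals, thresholds=(5.0, 10.0, 15.0)):
    """signals: iterable of (safe, danger) pairs, one per time step.
    Each cell accumulates csm and k; when its cumulative csm passes
    its threshold it 'migrates', emitting its k sum (k > 0 suggests
    an anomalous context), and resets."""
    cells = [{"limit": t, "csm": 0.0, "k": 0.0} for t in thresholds]
    outputs = []
    for safe, danger in signals:
        csm, k = safe + danger, danger - 2.0 * safe
        for cell in cells:
            cell["csm"] += csm
            cell["k"] += k
            if cell["csm"] >= cell["limit"]:
                outputs.append(cell["k"])
                cell["csm"] = cell["k"] = 0.0
    return outputs
```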
M Tangermann, A New Auditory Multi-class Brain-Computer Interface Paradigm: Spatial Hearing as an Informative Cue
- PLoS ONE 5(4):e9813
, 2010
"... Most P300-based brain-computer interface (BCI) approaches use the visual modality for stimulation. For use with patients suffering from amyotrophic lateral sclerosis (ALS) this might not be the preferable choice because of sight deterioration. Moreover, using a modality different from the visual one ..."
Abstract
-
Cited by 40 (12 self)
- Add to MetaCart
(Show Context)
Most P300-based brain-computer interface (BCI) approaches use the visual modality for stimulation. For use with patients suffering from amyotrophic lateral sclerosis (ALS) this might not be the preferable choice because of sight deterioration. Moreover, using a modality different from the visual one minimizes interference with possible visual feedback. Therefore, a multi-class BCI paradigm is proposed that uses spatially distributed, auditory cues. Ten healthy subjects participated in an offline oddball task with the spatial location of the stimuli being a discriminating cue. Experiments were done in free field, with an individual speaker for each location. Different inter-stimulus intervals of 1000 ms, 300 ms and 175 ms were tested. With averaging over multiple repetitions, selection scores went over 90% for most conditions, i.e., in over 90% of the trials the correct location was selected. One subject reached a 100% correct score. Corresponding information transfer rates were high, up to an average score of 17.39 bits/minute for the 175 ms condition (best subject 25.20 bits/minute). When presenting the stimuli through a single speaker, thus effectively canceling the spatial properties of the cue, selection scores went down below 70% for most subjects. We conclude that the proposed spatial auditory paradigm is successful for healthy subjects.
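The bits/minute figures are information transfer rates. The standard way to compute them is the Wolpaw formula, sketched below; whether this exact formula produced the reported 17.39 bits/minute is an assumption:

```python
import math

def wolpaw_itr(n_classes, accuracy, selections_per_min):
    """Bits per minute from B = log2 N + P*log2 P
    + (1 - P)*log2((1 - P)/(N - 1)), for accuracy 0 < P <= 1."""
    n, p = n_classes, accuracy
    if p >= 1.0:
        bits = math.log2(n)  # perfect accuracy transmits log2 N bits
    else:
        bits = (math.log2(n) + p * math.log2(p)
                + (1 - p) * math.log2((1 - p) / (n - 1)))
    return bits * selections_per_min

# e.g. a hypothetical 5-class selection at 90% accuracy, 10 per minute:
# wolpaw_itr(5, 0.90, 10.0) -> about 16.5 bits/minute
```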
Real-time classification of complex patterns using spike-based learning in neuromorphic VLSI
- IEEE Transactions on Biomedical Circuits and Systems
, 2009
"... Abstract—Real-time classification of patterns of spike trains is a difficult computational problem that both natural and artificial networks of spiking neurons are confronted with. The solution to this problem not only could contribute to understanding the fundamental mechanisms of computation used ..."
Abstract
-
Cited by 36 (12 self)
- Add to MetaCart
(Show Context)
Abstract—Real-time classification of patterns of spike trains is a difficult computational problem that both natural and artificial networks of spiking neurons are confronted with. The solution to this problem not only could contribute to understanding the fundamental mechanisms of computation used in the biological brain, but could also lead to efficient hardware implementations of a wide range of applications ranging from autonomous sensory-motor systems to brain-machine interfaces. Here we demonstrate real-time classification of complex patterns of mean firing rates, using a VLSI network of spiking neurons and dynamic synapses which implement a robust spike-driven plasticity mechanism. The learning rule implemented is a supervised one: a teacher signal provides the output neuron with an extra input spike-train during training, in parallel to the spike-trains that represent the input pattern. The teacher signal simply indicates if the neuron should respond to the input pattern with a high rate or with a low one. The learning mechanism modifies the synaptic weights only as long as the current generated by all the stimulated plastic synapses does not match the output desired by the teacher, as in the perceptron learning rule. We describe the implementation of this learning mechanism and present experimental data that demonstrate how the VLSI neural network can learn to classify patterns of neural activities, also in the case in which they are highly correlated.
Index Terms—Classification, learning, neuromorphic VLSI, silicon neuron, silicon synapse, spike-based plasticity, synaptic dynamics.
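A software caricature of the plasticity rule described here, assuming a classic perceptron-style mismatch-driven update on mean firing rates; bounded weights stand in for the chip's plastic synapses, and all spike-timing and analog-VLSI detail is omitted:

```python
import numpy as np

def train_rates(patterns, labels, epochs=50, lr=0.05, theta=1.0, seed=0):
    """patterns: (m, d) array of input mean firing rates;
    labels: m teacher targets (1 = respond with a high rate).
    Weights change only while the output disagrees with the teacher,
    as in the perceptron rule the abstract invokes."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, 1.0, size=patterns.shape[1])
    for _ in range(epochs):
        for x, target in zip(patterns, labels):
            out = 1 if w @ x > theta else 0
            if out != target:               # mismatch: update
                w += lr * (target - out) * x
                w = np.clip(w, 0.0, 1.0)    # synaptic weights are bounded
    return w
```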
Improving automatic music tag annotation using stacked generalization of probabilistic SVM outputs
- In Proc. of the 17th ACM Int. Conf. on Multimedia (MM ’09)
, 2009
"... Music listeners frequently use words to describe music. Personalized music recommendation systems such as Last.fm and Pandora rely on manual annotations (tags) as a mechanism for querying and navigating large music collections. A well-known issue in such recommendation systems is known as the cold-s ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
(Show Context)
Music listeners frequently use words to describe music. Personalized music recommendation systems such as Last.fm and Pandora rely on manual annotations (tags) as a mechanism for querying and navigating large music collections. A well-known issue in such recommendation systems is the cold-start problem: it is not possible to recommend new songs/tracks until those songs/tracks have been manually annotated. Automatic tag annotation based on content analysis is a potential solution to this problem and has recently been gaining attention. We describe how stacked generalization can be used to improve the performance of a state-of-the-art automatic tag annotation system for music based on audio content analysis and report results on two publicly available datasets.
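A sketch of the stacking idea in scikit-learn terms (the model choices are illustrative, not the paper's exact setup): a first stage of per-tag probabilistic SVMs maps audio features to a vector of tag probabilities, and a second stage re-predicts each tag from that whole vector, letting it exploit correlations between tags. Out-of-fold probabilities keep the second stage from learning the first stage's overfit:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fit_stacked_annotator(X, Y):
    """X: (n_tracks, n_audio_features); Y: binary (n_tracks, n_tags)."""
    n_tags = Y.shape[1]
    stage1 = [SVC(probability=True) for _ in range(n_tags)]
    # Out-of-fold tag probabilities become the meta-features.
    P = np.column_stack([
        cross_val_predict(stage1[t], X, Y[:, t], cv=5,
                          method="predict_proba")[:, 1]
        for t in range(n_tags)
    ])
    for t in range(n_tags):
        stage1[t].fit(X, Y[:, t])  # refit on all data for later use
    stage2 = [LogisticRegression().fit(P, Y[:, t]) for t in range(n_tags)]
    return stage1, stage2
```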