Spam filtering using statistical data compression models
Journal of Machine Learning Research, 2006
Cited by 72 (12 self)
Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. By modeling messages as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
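The idea of using a sequence model as a classifier can be illustrated with a minimal sketch. It is not dynamic Markov compression or prediction by partial matching: it swaps in a much simpler fixed-order character model with add-one smoothing, and the class names, the order of 2, the alphabet size of 256, and the toy training messages are all assumptions made for illustration. A message is assigned to the class whose model would need the fewest bits to encode it.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Order-k character model with add-one smoothing (illustrative, not DMC or PPM)."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed alphabet size used for smoothing
        self.context_counts = defaultdict(int)
        self.symbol_counts = defaultdict(int)

    def _steps(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model."""
        for context, symbol in self._steps(text):
            self.context_counts[context] += 1
            self.symbol_counts[(context, symbol)] += 1

    def code_length(self, text):
        """Bits an ideal coder driven by this model would spend on text."""
        bits = 0.0
        for context, symbol in self._steps(text):
            seen = self.symbol_counts.get((context, symbol), 0)
            total = self.context_counts.get(context, 0)
            bits -= math.log2((seen + 1) / (total + self.alphabet_size))
        return bits


def classify(models, message):
    """Return the label whose model encodes the message in the fewest bits."""
    return min(models, key=lambda label: models[label].code_length(message))


def train_demo_models():
    """Build two toy models from hypothetical training messages."""
    training = {
        "spam": ["win free money now", "cheap pills, click here", "free offer, click now"],
        "ham": ["meeting agenda for tomorrow", "please review the report", "lunch at noon tomorrow"],
    }
    models = {}
    for label, messages in training.items():
        models[label] = CharMarkovModel()
        for message in messages:
            models[label].update(message)
    return models
```

Because `update` only increments counts, user feedback on a new message can be folded in without retraining, and no tokenizer is involved at any point.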
Understanding Complex Network Attack Graphs through Clustered Adjacency Matrices
Proceedings of the 21st Annual Computer Security Applications Conference (ACSAC), 2005
Cited by 28 (4 self)
We apply adjacency matrix clustering to network attack graphs for attack correlation, prediction, and hypothesizing. We self-multiply the clustered adjacency matrices to show attacker reachability across the network for a given number of attack steps, culminating in transitive closure for attack prediction over all possible numbers of steps. This reachability analysis provides a concise summary of the impact of network configuration changes on the attack graph. Using our framework, we also place intrusion alarms in the context of vulnerability-based attack graphs, so that false alarms become apparent and missed detections can be inferred. We introduce a graphical technique that shows multiple-step attacks by matching rows and columns of the clustered adjacency matrix. This allows attack impact/responses to be identified and prioritized according to the number of attack steps to victim machines, and allows attack origins to be determined. Our techniques have quadratic complexity in the size of the attack graph.
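The reachability part of this abstract can be sketched on a toy graph. The following is a generic illustration, not the authors' implementation: it leaves out the clustering step, uses plain Boolean matrices, and computes the transitive closure with Warshall's algorithm, which does not reflect the complexity claimed in the abstract. The function names and the four-machine example are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean product: entry (i, j) is True if some k links i -> k and k -> j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reachable_in_exactly(adj, steps):
    """Self-multiply the adjacency matrix to get reachability in exactly `steps` steps."""
    result = adj
    for _ in range(steps - 1):
        result = bool_matmul(result, adj)
    return result


def transitive_closure(adj):
    """Reachability over any number of steps (Warshall's algorithm)."""
    n = len(adj)
    closure = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = True
    return closure


def demo_attack_graph():
    """Hypothetical chain of exploits: machine 0 -> 1 -> 2 -> 3."""
    n = 4
    adj = [[False] * n for _ in range(n)]
    for src, dst in [(0, 1), (1, 2), (2, 3)]:
        adj[src][dst] = True
    return adj
```

In the demo graph, row 0 of the closure lists every machine an attacker starting at machine 0 can eventually reach, and column 3 lists every possible origin of an attack on machine 3.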
A Classification for Community Discovery Methods in Complex Networks
2011
Cited by 16 (6 self)
Many real-world networks are intimately organized according to a community structure. Much research effort has been devoted to developing methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature on what is today called community detection. Since network representation can be very complex and can contain different variants in the traditional graph model, each algorithm in the literature focuses on some of these properties and establishes, explicitly or implicitly, its own definition of community. According to this definition, each proposed algorithm then extracts the communities, which typically reflect only part of the features of real communities. The aim of this survey is to provide a ‘user manual’ for the community discovery problem. Given a meta definition of what a community in a social network is, our aim is to organize the main categories of community discovery methods based on the definition of community they adopt. Given a desired definition of community and the features of a problem (size of network, direction of edges, multidimensionality, and so on), this review paper is designed to provide a set of approaches that researchers could focus on. The proposed classification of community discovery methods is also useful for putting into perspective the many open
Even an Ant Can Create an XSD
2008
Cited by 13 (4 self)
XML has undoubtedly become a standard for data representation and manipulation. But most XML documents are still created without a description of their structure, i.e. an XML schema. Hence, in this paper we focus on the problem of automatically inferring an XML schema for a given sample set of XML documents. In particular, we focus on new features of the XML Schema language and we propose an algorithm that improves on a combination of verified approaches and that is, at the same time, general enough and can be further enhanced. Using a set of experiments, we illustrate the behavior of the algorithm on both real-world and artificial XML data.
Robust information-theoretic clustering
In: KDD, 2006
Cited by 11 (4 self)
How do we find a natural clustering of a real-world point set, which contains an unknown number of clusters with different shapes, and which may be contaminated by noise? Most clustering algorithms were designed with certain assumptions (Gaussianity); they often require the user to give input parameters, and they are sensitive to noise. In this paper, we propose a robust framework for determining a natural clustering of a given data set, based on the minimum description length (MDL) principle. The proposed framework, Robust Information-theoretic Clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies these clusters from noise, and adjusts the clusterings such that it simultaneously determines the most natural amount and shape (subspace) of the clusters. Our RIC method can be combined with any clustering technique ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC is even able to purify and improve an initial coarse clustering, even if we start with very simple methods such as grid-based space partitioning. Moreover, RIC scales well with the data set size. Extensive experiments on synthetic and real-world data sets validate the proposed RIC framework.
MDL denoising revisited
IEEE Transactions on Signal Processing, 57(9):3347–3360, 2009
Cited by 9 (2 self)
We refine and extend an earlier MDL denoising criterion for wavelet-based denoising. We start by showing that the denoising problem can be reformulated as a clustering problem, where the goal is to obtain separate clusters for informative and non-informative wavelet coefficients, respectively. This suggests two refinements: adding a codelength for the model index, and extending the model in order to account for subband-dependent coefficient distributions. A third refinement is the derivation of soft thresholding inspired by predictive universal coding with weighted mixtures. We propose a practical method incorporating all three refinements, which is shown to achieve good performance and robustness in denoising both artificial and natural signals. Index Terms: Minimum description length (MDL) principle, wavelets, denoising.
Adaptive design optimization: A mutual information based approach to model discrimination in cognitive science
Neural Computation, 2010
Cited by 9 (4 self)
Discriminating among competing statistical models is a pressing issue for many experimentalists in the field of cognitive science. Resolving this issue begins with designing maximally informative experiments. To this end, the problem to be solved in adaptive design optimization is identifying experimental designs under which one can infer the underlying model in the fewest possible steps. When the models under consideration are nonlinear, as is often the case in cognitive science, this problem can be impossible to solve analytically without simplifying assumptions. However, as we show in this paper, a full solution can be found numerically with the help of a Bayesian computational trick derived from the statistics literature, which recasts the problem as a probability density simulation in which the optimal design is the mode of the density. We use a utility function based on mutual information, and give three intuitive interpretations of the utility function in terms of Bayesian posterior estimates. As a proof of concept, we offer a simple example application to an experiment on memory retention.
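A mutual-information utility for choosing a design can be sketched in a heavily simplified setting. This is not the paper's simulation-based method: it assumes two retention models with fixed, hypothetical parameter values (the paper treats parameters as unknown), a single binary recall outcome, and a small grid of candidate retention delays searched exhaustively. The utility of a delay is the mutual information between the model indicator and the outcome.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def mutual_information(recall_probs, prior):
    """I(model; outcome) = H(outcome) - sum_m P(m) * H(outcome | m)."""
    marginal = sum(w * p for w, p in zip(prior, recall_probs))
    conditional = sum(w * binary_entropy(p) for w, p in zip(prior, recall_probs))
    return binary_entropy(marginal) - conditional


def power_model(delay, a=0.9, b=0.4):
    """Hypothetical power-law retention curve with assumed parameters."""
    return a * (delay + 1) ** (-b)


def exponential_model(delay, a=0.9, b=0.1):
    """Hypothetical exponential retention curve with assumed parameters."""
    return a * math.exp(-b * delay)


def design_utility(delay, prior=(0.5, 0.5)):
    """Utility of testing recall after `delay` time units."""
    return mutual_information([power_model(delay), exponential_model(delay)], prior)


def best_design(delays, prior=(0.5, 0.5)):
    """Pick the candidate delay with the highest mutual-information utility."""
    return max(delays, key=lambda delay: design_utility(delay, prior))
```

With these assumed parameters both curves predict the same recall probability at delay 0, so that design carries no information about which model is correct, and the search prefers a delay at which the predictions diverge.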
Internet Traffic Classification Demystified: On the Sources of the Discriminative Power
Cited by 9 (0 self)
Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of “why” some algorithms achieve high accuracies. In pursuit of empirically grounded answers to the “why” question, which is critical in understanding and establishing a scientific ground for traffic classification research, this paper reveals the three sources of the discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one-two (for UDP flows) or four-five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs the best under any circumstances, and we identify the reason why: the algorithm discretizes input features during classification operations. We also find that the entropy-based Minimum Description Length discretization on ports and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%!) and makes all of them achieve >93% accuracy on average without any algorithm-specific tuning processes. Our results indicate that dealing with the ports and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of the classification algorithm used.
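One common form of entropy-based MDL discretization is the criterion of Fayyad and Irani. The abstract does not say which variant the authors used, so the sketch below assumes that one. It finds the single best cut point for one numeric feature and then applies the MDL test that decides whether the cut is worth keeping. A full discretizer would recurse on both sides of every accepted cut. The function names and the sample packet sizes are hypothetical.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (bits) of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut, gain, left_labels, right_labels) for the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    ordered = [label for _, label in pairs]
    n = len(pairs)
    base = class_entropy(ordered)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        left, right = ordered[:i], ordered[i:]
        gain = (base
                - (i / n) * class_entropy(left)
                - ((n - i) / n) * class_entropy(right))
        if best is None or gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, gain, left, right)
    return best


def mdl_accepts(labels, gain, left, right):
    """Fayyad-Irani style test: keep the cut only if its gain pays for its coding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * class_entropy(labels)
        - k1 * class_entropy(left)
        - k2 * class_entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n
```

For example, with first-packet sizes `[40, 44, 48, 52, 1200, 1300, 1400, 1500]` labeled as four `"dns"` flows followed by four `"video"` flows, the best cut falls midway between 52 and 1200 and passes the MDL test, so the feature would be treated as two nominal intervals rather than as a continuous number.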
Hierarchical, Parameter-Free Community Discovery
Cited by 8 (0 self)
Given a large bipartite graph (like document-term, or user-product graph), how can we find meaningful communities, quickly, and automatically? We propose to look for community hierarchies, with communities-within-communities. Our proposed method, the Context-specific Cluster Tree (CCT), finds such communities at multiple levels, with no user intervention, based on information-theoretic principles (MDL). More specifically, it partitions the graph into progressively more refined subgraphs, allowing users to quickly navigate from the global, coarse structure of a graph to more focused and local patterns. As a fringe benefit, and also as an additional indication of its quality, it also achieves better compression than typical, non-hierarchical methods. We demonstrate its scalability and effectiveness on real, large graphs.