Mining Top-K Patterns from Binary Datasets in presence of Noise
Cited by 8 (0 self)
The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, Web usage logging, etc. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in presence of noise as the minimization of a novel cost function. According to the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
V.: Distributed Algorithm for Computing Formal Concepts Using MapReduce Framework, IDA ’09
, 2009
Cited by 7 (0 self)
Searching for interesting patterns in binary matrices plays an important role in data mining and, in particular, in formal concept analysis and related disciplines. Several algorithms for computing particular patterns represented by maximal rectangles in binary matrices have been proposed, but their major drawback is their computational complexity, which limits their application to relatively small datasets. In this paper we introduce a scalable distributed algorithm for computing maximal rectangles that uses the MapReduce approach to data processing.
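To make the notion of a maximal rectangle concrete, here is a minimal brute-force sketch in Python that enumerates the formal concepts of a tiny binary matrix. It is not the distributed algorithm of the paper; the function name `formal_concepts` and the example matrix are assumptions chosen for illustration, and the exhaustive search over row subsets is only feasible for very small inputs.

```python
from itertools import combinations


def formal_concepts(matrix):
    """Enumerate all formal concepts (maximal rectangles of 1s) by brute force.

    Illustrative only: tries every subset of rows, so it is exponential
    in the number of rows.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    concepts = set()
    for size in range(n_rows + 1):
        for rows in combinations(range(n_rows), size):
            # Intent: columns that contain a 1 in every chosen row.
            intent = frozenset(
                j for j in range(n_cols) if all(matrix[i][j] for i in rows)
            )
            # Extent: every row that has a 1 in all intent columns (closure).
            extent = frozenset(
                i for i in range(n_rows) if all(matrix[i][j] for j in intent)
            )
            concepts.add((extent, intent))
    return concepts


# Hypothetical 3x3 example matrix.
example = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
concepts = formal_concepts(example)
for extent, intent in sorted(concepts, key=lambda c: sorted(c[0])):
    print(sorted(extent), sorted(intent))
```

Each printed pair is a set of rows and a set of columns that together form a rectangle of 1s which cannot be enlarged in either direction.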
Mining compression sequential patterns
Cited by 6 (0 self)
Compression-based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues with the required candidate generation, we propose GoKrimp, an effective greedy algorithm that directly mines compressing patterns. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy using the patterns found as features for SVM classifiers.
Vychodil: Factor Analysis of incidence data via Novel Decomposition of Matrices
 In: S. Ferré and S. Rudolph (Eds.): ICFCA 2009, LNAI 5548
, 2009
Cited by 6 (3 self)
Matrix decomposition methods provide representations of an object-variable data matrix by a product of two different matrices, one describing the relationship between objects and hidden variables or factors, and the other describing the relationship between the factors and the original variables. We present a novel approach to decomposition and factor analysis of matrices with incidence data. The matrix entries are grades to which objects represented by rows satisfy attributes represented by columns, e.g. grades to which an image is red or a person performs well in a test. We assume that the grades belong to a scale bounded by 0 and 1 which is equipped with certain aggregation operators and forms a complete residuated lattice. We present an approximation algorithm for the problem of decomposing such matrices with grades into products of two matrices with grades, with the number of factors as small as possible. Decomposition of binary matrices into Boolean products of binary matrices is a special case of this problem in which 0 and 1 are the only grades. Our algorithm is based on a geometric insight provided by a theorem identifying particular rectangular-shaped submatrices as optimal factors for the decompositions. These factors correspond to formal concepts of the input data and allow for an easy interpretation of the decomposition. We present the problem formulation, basic geometric insight, algorithm, illustrative example, and experimental evaluation.
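The binary special case mentioned in the abstract can be illustrated with a short Python sketch of the Boolean matrix product, in which addition is replaced by logical OR. This covers only the case where 0 and 1 are the only grades, not the graded (residuated lattice) setting of the paper; the function name `boolean_product` and the example matrices are assumptions made for illustration.

```python
def boolean_product(A, B):
    """Boolean matrix product: result[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    inner = len(B)
    n_cols = len(B[0]) if B else 0
    return [
        [int(any(A[i][k] and B[k][j] for k in range(inner))) for j in range(n_cols)]
        for i in range(len(A))
    ]


# Hypothetical example: 3 objects, 2 factors, 3 attributes.
object_factor = [
    [1, 0],
    [1, 1],
    [0, 1],
]
factor_attribute = [
    [1, 1, 0],
    [0, 1, 1],
]
product = boolean_product(object_factor, factor_attribute)
print(product)
```

In the middle row both factors contribute to the middle column; under ordinary arithmetic that entry would be 2, whereas the Boolean product keeps it at 1.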
On Finding Joint Subspace Boolean Matrix Factorizations
Cited by 5 (2 self)
Finding latent factors of the data using matrix factorizations is a tried-and-tested approach in data mining. But finding shared factors over multiple matrices is a more novel problem. Specifically, given two matrices, we want to find a set of factors shared by these two matrices and sets of factors specific to each matrix. Not only does such a decomposition reveal what is common between the two matrices, it also eliminates the need to explain that common part twice, thus concentrating the non-shared factors on uniquely specific parts of the data. This paper studies a problem called Joint Subspace Boolean Matrix Factorization asking exactly that: a set of shared factors and sets of specific factors. Furthermore, the matrix factorization is based on Boolean arithmetic. This restricts the presented approach to binary matrices only. The benefits, however, include much sparser factor matrices and greater interpretability of the results. The paper presents three algorithms for finding the Joint Subspace Boolean Matrix Factorization, an MDL-based method for selecting the subspaces' dimensionality, and a thorough experimental evaluation of the proposed algorithms.
Fast and Reliable Anomaly Detection in Categorical Data
Cited by 4 (2 self)
Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm (with high compression cost) as anomalies. Our approach exhibits four key features: 1) it is parameter-free; it builds dictionaries directly from data, and requires no user-specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3) it is scalable; its running time grows linearly with respect to both database size and number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression and precision in anomaly detection, outperforming its state-of-the-art competitors.
The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives
, 2013
Trnecka M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm. http://arxiv.org/abs/1306.4905
, 2013
Cited by 2 (1 self)
We present new results on Boolean matrix factorization and a new algorithm based on these results. The results emphasize the significance of factorizations that provide from-below approximations of the input matrix. While the previously proposed algorithms do not consider the possibly different significance of different matrix entries, our results help measure such significance and suggest where to focus when computing factors. An experimental evaluation of the new algorithm on both synthetic and real data demonstrates its good performance in terms of good coverage by the first k factors as well as a small number of factors needed for exact decomposition and indicates that the algorithm outperforms the available ones in these terms. We also propose future research topics.
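The two evaluation notions named in the abstract, from-below approximation and coverage by the first k factors, can be sketched in Python as follows. This is an illustrative reading, not the paper's algorithm: each factor is assumed to be a (rows, columns) rectangle, a factorization is taken to be from-below when no rectangle touches a 0 of the input, and coverage is taken to be the fraction of input 1s covered. All names and the example data are hypothetical.

```python
def covered_cells(factors):
    """All (row, column) cells set to 1 by the given rectangular factors."""
    return {(i, j) for rows, cols in factors for i in rows for j in cols}


def is_from_below(matrix, factors):
    """True if the factors never place a 1 where the input matrix has a 0."""
    return all(matrix[i][j] == 1 for i, j in covered_cells(factors))


def coverage(matrix, factors, k):
    """Fraction of the input's 1s covered by the first k factors."""
    ones = {
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    }
    if not ones:
        return 1.0
    return len(ones & covered_cells(factors[:k])) / len(ones)


# Hypothetical input matrix and two rectangular factors.
matrix = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
factors = [({0, 1}, {0, 1}), ({1, 2}, {1, 2})]

print(is_from_below(matrix, factors))
print(coverage(matrix, factors, 1), coverage(matrix, factors, 2))
```

A rectangle such as rows {0, 1, 2} with columns {0, 1} would not be from-below here, because it covers the 0 in the last row.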
Dynamic Boolean Matrix Factorizations
Cited by 2 (1 self)
Boolean matrix factorization is a method to decompose a binary matrix into two binary factor matrices. Akin to other matrix factorizations, the factor matrices can be used for various data analysis tasks. Many (if not most) real-world data sets are dynamic, though, meaning that new information is recorded over time. Incorporating this new information into the factorization can require a recomputation of the factorization, something we cannot do if we want to keep our factorization up-to-date after each update. This paper proposes a method to dynamically update the Boolean matrix factorization when new data is added to the database. This method is extended with a mechanism to improve the factorization with a trade-off in speed of computation. The method is tested with a number of real-world and synthetic data sets, including studying its efficiency against off-line methods. The results show that with good initialization the proposed on-line and dynamic methods can beat the state-of-the-art off-line Boolean matrix factorization algorithms. Keywords: Boolean matrix factorization; On-line algorithms; Dynamic algorithms
Fully Dynamic Quasi-Biclique Edge Covers via Boolean Matrix Factorizations
Cited by 2 (2 self)
An important way of summarizing a bipartite graph is to give a set of (quasi-)bicliques that contain (almost) all of its edges. These quasi-bicliques are somewhat similar to a clustering of the nodes, giving sets of similar nodes. Unlike clustering, however, the quasi-bicliques are not required to partition the nodes, allowing greater flexibility when creating them. When we identify the bipartite graph with its bi-adjacency matrix, the problem of finding these quasi-bicliques turns into the problem of finding the Boolean matrix factorization of the bi-adjacency matrix, a problem that has received increasing research interest in data mining in recent years. But many real-world graphs are dynamic and evolve over time. How can we update our bicliques without having to recompute them from scratch? An algorithm was recently proposed for this task (Miettinen, ICDM 2012). The algorithm, however, is only able to handle the case where new 1s are added to the matrix; it cannot handle the removal of existing 1s. Furthermore, the algorithm cannot adjust the rank of the factorization. This paper extends said algorithm with the capability of working in a fully dynamic setting (with both additions and deletions) and with the capability of adjusting its rank dynamically as well. The behaviour and performance of the algorithm are studied in experiments conducted with both real-world and synthetic data.
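As a rough illustration of what it means for quasi-bicliques to contain (almost) all edges, the following Python sketch measures how well a given set of quasi-bicliques covers a bipartite graph. It is not the dynamic algorithm of the paper; the function name `cover_errors`, the edge-set representation, and the example graph are assumptions for illustration.

```python
def cover_errors(edges, bicliques):
    """Compare a bipartite edge set with the edges implied by (quasi-)bicliques.

    Each biclique is a (left_nodes, right_nodes) pair and implies every
    left-right pair. Returns (uncovered, spurious): real edges that no
    biclique contains, and implied pairs that are not edges of the graph.
    """
    implied = {(u, v) for left, right in bicliques for u in left for v in right}
    return edges - implied, implied - edges


# Hypothetical bipartite graph with left nodes a, b, c and right nodes x, y, z.
edges = {
    ("a", "x"), ("a", "y"),
    ("b", "x"), ("b", "y"),
    ("c", "y"), ("c", "z"),
}

# Two exact bicliques that overlap on node y and cover every edge.
exact = [({"a", "b"}, {"x", "y"}), ({"c"}, {"y", "z"})]
print(cover_errors(edges, exact))

# One quasi-biclique: misses the edge (c, z) and implies the non-edge (c, x).
quasi = [({"a", "b", "c"}, {"x", "y"})]
print(cover_errors(edges, quasi))
```

Viewed through the bi-adjacency matrix, each biclique corresponds to one rank-1 Boolean factor, and the two returned sets correspond to the uncovered 1s and the wrongly covered 0s of the factorization.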