Results 1–10 of 11
Model order selection for Boolean matrix factorization
In KDD, 2011
Cited by 13 (9 self)
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable to any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
On Local Intrinsic Dimension Estimation and Its Applications
Cited by 7 (1 self)
Abstract—In this paper, we present multiple novel applications for local intrinsic dimension estimation. There has been much work done on estimating the global dimension of a data set, typically for the purposes of dimensionality reduction. We show that by estimating dimension locally, we are able to extend the uses of dimension estimation to many applications that are not possible with global dimension estimation. Additionally, we show that local dimension estimation can be used to obtain a better global dimension estimate, alleviating the negative bias that is common to all known dimension estimation algorithms. We illustrate local dimension estimation's uses in additional applications, such as learning on statistical manifolds, network anomaly detection, clustering, and image segmentation. Index Terms—Geodesics, image segmentation, intrinsic dimension, manifold learning, nearest neighbor graph.
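As a concrete illustration of estimating dimension locally, the sketch below implements one widely used local estimator, the Levina–Bickel maximum-likelihood estimate over the k nearest neighbours of a point. The paper may use a different estimator; the function name is hypothetical:

```python
import numpy as np

def local_id_mle(X, i, k=10):
    # Levina–Bickel MLE of intrinsic dimension at point X[i]:
    # inverse of the mean log-ratio of the k-th neighbour distance
    # to the distances of the closer neighbours.
    d = np.sort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]  # drop the self-distance
    return (k - 1) / np.sum(np.log(d[-1] / d[:-1]))

# Points on a 1-D curve embedded in 3-D: local estimates should be near 1.
rng = np.random.default_rng(0)
X = np.outer(np.sort(rng.random(300)), np.array([1.0, 2.0, 3.0]))
print(local_id_mle(X, 150, k=20))
```

Averaging the per-point estimates yields a global estimate, while the local values themselves drive the applications listed in the abstract, such as clustering and anomaly detection.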
Vychodil: Factor Analysis of Incidence Data via Novel Decomposition of Matrices
In: S. Ferré and S. Rudolph (Eds.): ICFCA 2009, LNAI 5548, 2009
Cited by 6 (3 self)
Abstract. Matrix decomposition methods provide representations of an object-variable data matrix by a product of two different matrices, one describing the relationship between objects and hidden variables or factors, and the other describing the relationship between the factors and the original variables. We present a novel approach to decomposition and factor analysis of matrices with incidence data. The matrix entries are grades to which objects represented by rows satisfy attributes represented by columns, e.g. grades to which an image is red or a person performs well in a test. We assume that the grades belong to a scale bounded by 0 and 1 which is equipped with certain aggregation operators and forms a complete residuated lattice. We present an approximation algorithm for the problem of decomposing such matrices with grades into products of two matrices with grades, with the number of factors as small as possible. Decomposition of binary matrices into Boolean products of binary matrices is a special case of this problem in which 0 and 1 are the only grades. Our algorithm is based on a geometric insight provided by a theorem identifying particular rectangular-shaped submatrices as optimal factors for the decompositions. These factors correspond to formal concepts of the input data and allow for an easy interpretation of the decomposition. We present the problem formulation, the basic geometric insight, the algorithm, an illustrative example, and an experimental evaluation.
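In the binary special case, the rectangular-shaped optimal factors mentioned above are formal concepts, obtained by closing an attribute set under the Galois connection of the matrix. A minimal sketch for binary matrices only (not the paper's graded setting; all names are hypothetical):

```python
import numpy as np

def concept_from_attributes(A, attrs):
    # Close an attribute set under the Galois connection of binary matrix A:
    # extent = objects possessing every attribute in `attrs`,
    # intent = attributes shared by every object in that extent.
    extent = np.where(A[:, attrs].all(axis=1))[0]
    intent = np.where(A[extent, :].all(axis=0))[0]
    return extent, intent

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
extent, intent = concept_from_attributes(A, [0])
print(extent, intent)  # the concept generated by attribute 0
```

The submatrix A[np.ix_(extent, intent)] of any such concept is an all-ones rectangle, which is what makes concepts natural candidate factors.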
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework
When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some structure in the data. A fundamental piece of information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called the intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in the last decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension that are nonlinearly embedded in higher-dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators.
PrF
Abstract. The paper explores the use of Boolean factorization as a method for data preprocessing in classification of Boolean data. In previous papers, we demonstrated that data preprocessing consisting of replacing the original Boolean attributes by factors, i.e. new Boolean attributes obtained from the original ones by Boolean factorization, improves the quality of classification. The aim of this paper is to explore the question of how the various Boolean factorization methods proposed in the literature impact the quality of classification. In particular, we compare three factorization methods, present experimental results, and outline issues for future research.

Problem Setting. In classification of Boolean data, the objects to classify are described by Boolean (binary, yes-no) attributes. As with other classification problems, one may be interested in preprocessing the input attributes to improve the quality of classification. With Boolean input attributes, we might want to limit ourselves to preprocessing with a clear semantics. Namely, as it is known, see e.g.
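The preprocessing step described above, replacing attributes by factors, can be sketched under one common convention: an object exhibits a factor iff it has every attribute belonging to that factor. This is an illustrative assumption, not necessarily the paper's exact construction; the function and matrix names are hypothetical:

```python
import numpy as np

def factor_features(A, C):
    # A: object-attribute Boolean matrix; C: factor-attribute matrix from
    # some Boolean factorization of A. Each object is re-described by the
    # factors whose full attribute set it contains.
    return np.array([[int(a[c == 1].all()) for c in C] for a in A])

A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]])
C = np.array([[1, 1, 0],   # factor 1 covers attributes 0 and 1
              [0, 1, 1]])  # factor 2 covers attributes 1 and 2
print(factor_features(A, C))
```

The rows of the resulting matrix then replace the original attribute vectors as input to any standard classifier.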
Analyzing social networks using FCA: complexity aspects
Abstract—Since the availability of social networks data and the range of these data have significantly grown in recent years, new aspects have to be considered. In this paper we address the computational complexity of social network analysis and the clarity of its visualization. Our approach uses a combination of Formal Concept Analysis and well-known matrix factorization methods. The goal is to reduce the dimension of social network data and to measure the amount of information which is lost during the reduction. Keywords—concept lattice, two-mode social network, matrix factorization, correlation dimension
On Social Networks Reduction
Abstract. Since the availability of social networks data and the range of these data have significantly grown in recent years, new aspects have to be considered. In this paper, we use a combination of Formal Concept Analysis and well-known matrix factorization methods to address the computational complexity of social network analysis and the clarity of its visualization. The goal is to reduce the dimension of social network data and to measure the amount of information which is lost during the reduction. A presented example containing real data demonstrates the feasibility of our approach.
MDL4BMF: Minimum Description Length for Boolean Matrix Factorization
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining the proper rank of the factorization, i.e., to answer where fine-grained structure stops and noise starts. Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
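The simple and intuitive starting point mentioned above can be sketched as a naive two-part code: encode both factor matrices plus the error matrix that corrects their Boolean product back to the data, then pick the rank minimizing the total. This is only an illustrative encoding, not the refined data-to-model encoding the paper develops; all names are hypothetical:

```python
import numpy as np

def bits(M):
    # Naive code length for a binary matrix: first its number of 1s,
    # then every cell under the optimal code for the resulting density.
    n, ones = M.size, int(M.sum())
    p = ones / n
    h = 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return np.log2(n + 1) + n * h

def description_length(A, B, C):
    # Total cost L(B) + L(C) + L(E), where the error matrix E marks the
    # cells in which the Boolean product of B and C disagrees with A.
    E = (A != (((B.astype(int) @ C.astype(int)) > 0).astype(int))).astype(int)
    return bits(B) + bits(C) + bits(E)
```

Under MDL, candidate factorizations of different ranks are compared by this total length: extra factors only pay off when they shrink the error matrix by more than their own encoding cost.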
Capturing Truthiness: Mining Truth Tables in Binary Datasets
We introduce a new data mining problem: mining truth tables in binary datasets. Given a matrix of objects and the properties they satisfy, a truth table identifies a subset of properties that exhibit maximal variability (and hence, complete independence) in occurrence patterns over the underlying objects. This problem is relevant in many domains, e.g., bioinformatics where we seek to identify and model independent components of combinatorial regulatory pathways, and in social/economic demographics where we desire to determine independent behavioral attributes of populations. Besides intrinsic interest in such patterns, we show how the problem of mining truth tables is dual to the problem of mining redescriptions, in that a set of properties involved in a truth table cannot participate in any possible redescription. This allows us to adapt our algorithm to the problem of mining redescriptions as well, by first identifying regions where redescriptions cannot happen, and then pursuing a divide and conquer strategy around these regions. Furthermore, our work suggests dual mining strategies where both classes of algorithms can be brought to bear upon either data mining task. We outline a family of levelwise approaches adapted to mining truth tables, algorithmic optimizations, and applications to bioinformatics and political datasets.
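The core definition, a subset of k properties over which all 2^k value combinations occur, can be checked directly. A brute-force sketch of the levelwise idea, without the optimizations the paper outlines (names are hypothetical):

```python
from itertools import combinations

def is_truth_table(rows, cols):
    # A set of k Boolean properties forms a truth table iff all 2^k
    # possible value combinations occur among the objects.
    patterns = {tuple(row[c] for c in cols) for row in rows}
    return len(patterns) == 2 ** len(cols)

def mine_truth_tables(rows, k):
    # Exhaustive scan of one level: every k-subset of columns that
    # forms a truth table over the given 0/1 rows.
    m = len(rows[0])
    return [cols for cols in combinations(range(m), k) if is_truth_table(rows, cols)]

rows = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
print(mine_truth_tables(rows, 2))
```

Per the duality noted in the abstract, columns inside such a maximally varying subset can never co-occur in a redescription, so the mined truth tables delimit regions the redescription search can skip.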
Authors' Addresses, 2012
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate.