A survey on condensed representations for frequent sets
In: Constraint Based Mining and Inductive Databases, Springer-Verlag, LNAI, 2005
Cited by 38 (4 self)
Abstract. Solving inductive queries that must return complete collections of patterns satisfying a given predicate has been studied extensively over the last few years. The specific problem of frequent set mining from potentially huge Boolean matrices has given rise to tens of efficient solvers. Frequent sets are indeed useful for many data mining tasks, including the popular association rule mining task but also feature construction, association-based classification, clustering, etc. Research in this area has been boosted by the fascinating concept of condensed representations w.r.t. frequency queries. Such representations can be used to support the discovery of every frequent set and its support without looking back at the data. Interestingly, the size of condensed representations can be several orders of magnitude smaller than the size of frequent set collections. Most of the proposals concern exact representations, while it is also possible to consider approximate ones, i.e., to trade computational complexity for a bounded approximation of the computed support values. This paper surveys the core concepts used in recent work on condensed representations for frequent sets.
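The idea of recovering every frequent set and its support from a smaller collection can be illustrated with closed sets, one well-known exact condensed representation. The following is a minimal brute-force sketch, not the survey's own algorithms; the function names (`frequent_itemsets`, `closed_itemsets`, `support_from_closed`) and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all non-empty frequent itemsets (toy scale only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def support_from_closed(itemset, closed):
    """Derive a support from the closed sets alone, without reading the data.

    The support of a frequent itemset equals the largest support among its
    closed supersets. Returns None when no closed superset exists, i.e. the
    itemset is not frequent.
    """
    candidates = [s for c, s in closed.items() if itemset <= c]
    return max(candidates) if candidates else None


# Hypothetical toy data: four transactions over the items a, b, c.
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("a")]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
print(len(frequent), "frequent itemsets,", len(closed), "closed itemsets")
```

On realistic data the enumeration would be replaced by a dedicated miner; the sketch only shows that the closed sets are enough to answer support queries for every frequent set.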
Mining concepts from large SAGE gene expression matrices
In: Proceedings KDID’03 co-located with ECML-PKDD 2003, Cavtat-Dubrovnik (Croatia), 2003
Cited by 13 (8 self)
Abstract. One of the crucial needs in post-genomic research is to analyze expression matrices (e.g., SAGE and microarray data) to identify a priori interesting sets of genes, e.g., sets of genes that are frequently co-regulated. Such matrices provide expression values for given biological situations (the lines) and given genes (the columns). The inductive database framework supports knowledge discovery processes by means of sequences of queries that concern both data processing and pattern querying (extraction, post-processing). We provide a simple formalization of a relevant pattern domain (language of patterns, evaluation functions and primitive constraints) that has proved useful for specifying various analysis tasks. Recent algorithmic results w.r.t. the efficient evaluation (constraint-based mining) of the so-called inductive queries are emphasized and illustrated on a 90 × 12 636 human SAGE expression matrix.
Looking for monotonicity properties of a similarity constraint on sequences
In ACM Symposium on Applied Computing SAC’2006, Special Track on Data Mining, 2006
Cited by 9 (4 self)
Constraint-based mining techniques on sequence databases have been studied extensively over the last few years, and efficient algorithms make it possible to compute complete collections of patterns (e.g., sequences) which satisfy conjunctions of monotonic and/or anti-monotonic constraints. Studying new applications of these techniques, we believe that a primitive constraint which enforces enough similarity w.r.t. a given reference sequence would be extremely useful and should benefit from such a recent algorithmic breakthrough. A nontrivial similarity constraint is, however, neither monotonic nor anti-monotonic. Therefore, we have studied its definition as a conjunction of two constraints which satisfy the desired monotonicity properties: a pattern is called similar to a reference pattern x when its longest common subsequence with x (LCS) is large enough (i.e., a monotonic part) and when the number of deletions such that it becomes the LCS is small enough (i.e., an anti-monotonic part). We provide an experimental validation which confirms the added value of this approach on a biological database. Classical issues like scalability and pruning efficiency are discussed.
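The two-part similarity constraint described in this abstract can be sketched as a simple check on a single pattern. This is an illustrative reading of the abstract, not the authors' implementation; the threshold names `min_lcs` and `max_deletions` and the function names are assumptions.

```python
def lcs_length(pattern, reference):
    """Length of the longest common subsequence, by standard dynamic programming."""
    prev = [0] * (len(reference) + 1)
    for a in pattern:
        curr = [0]
        for j, b in enumerate(reference, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_similar(pattern, reference, min_lcs, max_deletions):
    """Conjunction of the two constraints on `pattern`.

    Monotonic part: the LCS with the reference is at least `min_lcs` long.
    Anti-monotonic part: the number of symbols that must be deleted from
    `pattern` to obtain the LCS is at most `max_deletions`.
    """
    lcs = lcs_length(pattern, reference)
    deletions = len(pattern) - lcs
    return lcs >= min_lcs and deletions <= max_deletions


# Hypothetical example: "ACGT" shares the subsequence "AGT" with the reference.
print(is_similar("ACGT", "AGT", min_lcs=3, max_deletions=1))
```

Because extending a pattern can only keep or increase its LCS with the reference, and shrinking it to a subsequence can only keep or decrease the deletion count, each half can be pushed into a miner that exploits the corresponding monotonicity property.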
Data mining query languages
In The Data Mining and Knowledge Discovery Handbook, 2005
Cited by 8 (0 self)
Summary. Many data mining algorithms make it possible to extract different types of patterns from data (e.g., local patterns like itemsets and association rules, models like classifiers). To support the whole knowledge discovery process, we need integrated systems which can deal with both patterns and data. The inductive database approach has emerged as a unifying framework for such systems. Following this database perspective, knowledge discovery processes become querying processes for which query languages have to be designed. In the prolific field of association rule mining, different proposals of query languages have been made to support the more or less declarative specification of both data and pattern manipulations. In this chapter, we survey some of these proposals. This makes it possible to identify current shortcomings and to point out some promising directions of research in this area.
J.F.: Constraint-based mining of fault-tolerant patterns from boolean data
In Revised Selected and Invited Papers KDID’05, volume 3933 of LNCS, 2006
Cited by 7 (2 self)
Abstract. Thanks to an important research effort over the last few years, inductive queries on local patterns (e.g., set patterns) and complete solvers which can evaluate them on large data sets have proved extremely useful. The more we use such queries on real-life data, e.g., biological data (and thus intrinsically dirty and noisy), the more we are convinced that inductive queries should return fault-tolerant patterns. In this work, we consider user-defined constraints for a declarative specification of fault-tolerance. We discuss the design of such constraints on bi-sets extracted from Boolean data sets. Our starting point is the fundamental limitation of formal concept discovery (i.e., closed set mining) from noisy data, and we propose a constraint-based mining approach for relevant fault-tolerant bi-set mining. Formalizing three recent proposals, our framework enables a better understanding of the needed trade-off between extraction feasibility, completeness, relevancy, and ease of interpretation of these fault-tolerant patterns. An original empirical evaluation on both synthetic and real-life medical data is given. It enables a comparison of the various proposals and it motivates further directions of research.
Actionability and Formal Concepts: A Data Mining Perspective
Cited by 7 (3 self)
Abstract. Over the last few years, we have studied different set pattern mining techniques from binary data. This includes the computation of formal concepts to support various knowledge discovery processes. For instance, when considering post-genomics, we can exploit Boolean data sets that encode a relation between some genes and the proteins that may regulate them. In such a context, it appears interesting to exploit the analogy between a putative transcriptional module (i.e., a typically important hypothesis for gene regulation understanding) and a formal concept that holds within such data. In this paper, we assume that knowledge nuggets can be captured by collections of formal concepts and we discuss the challenging issue of mining/selecting actionable patterns from these collections, i.e., looking for relevant patterns that really support knowledge discovery. Therefore, a major issue concerns the computation of complete collections of formal concepts that satisfy user-defined constraints. This is useful not only to avoid the computation of too small patterns that might be due to noise (e.g., using size constraints on both their intents and extents) but also to introduce some fault-tolerance. We discuss the pros and cons of some recent proposals in that direction.
Introducing Softness into Inductive Queries on String Databases
Modelling and Operation Issues for Pattern Base Management Systems
PhD thesis, Knowledge and Database Systems Laboratory, School of Electrical and Computer Engineering, NTUA, 2007
Transaction databases, frequent itemsets, and their condensed representations
Cited by 2 (1 self)
Abstract. Mining frequent itemsets is a fundamental task in data mining. Unfortunately, the number of frequent itemsets describing the data is often too large to comprehend. This problem has been attacked by condensed representations of frequent itemsets, which are subcollections of frequent itemsets containing only the frequent itemsets that cannot be deduced from other frequent itemsets in the subcollection, using some deduction rules. In this paper we review the most popular condensed representations of frequent itemsets, study their relationship to transaction databases and to each other, examine their combinatorial and computational complexity, and describe their relationship to other important concepts in combinatorial data analysis, such as Vapnik-Chervonenkis dimension and hypergraph transversals.
Itemset support queries using frequent itemsets and their condensed representations
In: Proceedings of the 9th International Conference Discovery Science (DS 2006), Springer-Verlag, LNCS, 2006
Cited by 1 (0 self)
Abstract. The purpose of this paper is twofold. First, we give efficient algorithms for answering itemset support queries for collections of itemsets from various representations of the frequency information. As index structures we use itemset tries of transaction databases, frequent itemsets and their condensed representations. Second, we evaluate the usefulness of condensed representations of frequent itemsets to answer itemset support queries using the proposed query algorithms and index structures. We study analytically the worst-case time complexities of querying condensed representations and evaluate experimentally the query efficiency with random itemset queries to several benchmark transaction databases.
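As a rough illustration of an itemset trie used as an index for support queries, the sketch below stores each itemset as a path of its items in sorted order and keeps the support at the final node. It is a simplified assumption about what such an index might look like, not the paper's query algorithms; the class and method names are hypothetical.

```python
class TrieNode:
    """One node of the itemset trie; `support` is None for unstored prefixes."""

    def __init__(self):
        self.children = {}
        self.support = None


class ItemsetTrie:
    """Index that maps itemsets to supports via paths of sorted items."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, itemset, support):
        """Store `support` at the node reached by the sorted items of `itemset`."""
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.support = support

    def query(self, itemset):
        """Return the stored support, or None if the itemset is not in the index."""
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return None
        return node.support


# Hypothetical usage with two stored frequent itemsets.
trie = ItemsetTrie()
trie.insert({"a"}, 4)
trie.insert({"a", "b"}, 3)
print(trie.query({"b", "a"}))
```

A lookup costs one dictionary access per item of the query. When the trie holds a condensed representation instead of all frequent itemsets, a query additionally needs a derivation step over the stored sets, which is the trade-off the abstract sets out to evaluate.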