Results 1–10 of 113
CHARM: An efficient algorithm for closed itemset mining, 2002
Abstract

Cited by 320 (14 self)
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree and an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any “non-closed” sets found during computation. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM significantly outperforms previous methods. It is also linearly scalable in the number of transactions.
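The closure idea behind closed itemset mining can be stated in a few lines. The sketch below is illustrative, not CHARM itself (which searches a dual itemset-tidset tree with diffsets): the closure of an itemset X is the intersection of all transactions containing X, and X is closed exactly when it equals its closure. The toy database and function names are my own.

```python
def closure(itemset, transactions):
    """Return the closure of `itemset`: the intersection of all
    transactions that contain it (empty set if none do)."""
    covering = [t for t in transactions if itemset <= t]
    return set.intersection(*covering) if covering else set()

def is_closed(itemset, transactions):
    return closure(itemset, transactions) == itemset

# Toy database: every transaction containing b also contains a,
# so {b} is not closed, while {a, b} is.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}]
```

Since closure({b}) = {a, b}, the supports of {b} and {a, b} coincide; this is why reporting only closed itemsets loses no frequency information.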
Mining All Non-Derivable Frequent Itemsets, 2002
Abstract

Cited by 127 (12 self)
Recent studies on frequent itemset mining algorithms resulted in significant performance improvements. However, if the minimal support threshold is set too low, or the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets. The main goal of this paper is to identify redundancies in the set of all frequent itemsets and to exploit these redundancies in order to reduce the result of a mining operation. We present deduction rules to derive tight bounds on the support of candidate itemsets. We show how the deduction rules allow for constructing a minimal representation for all frequent itemsets. We also present connections between our proposal and recent proposals for concise representations and we give the results of experiments on real-life datasets that show the effectiveness of the deduction rules. In fact, the experiments even show that in many cases, first mining the concise representation, and then creating the frequent itemsets from this representation outperforms existing frequent set mining algorithms.
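The deduction rules can be made concrete. Below is an illustrative, deliberately brute-force sketch (my own names and toy data, not the paper's implementation) of the inclusion-exclusion bounds: for each proper subset X of an itemset I, summing signed supports of the sets J with X ⊆ J ⊊ I yields an upper bound on sup(I) when |I \ X| is odd and a lower bound when it is even; I is derivable when the best bounds coincide.

```python
from itertools import combinations

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def ndi_bounds(I, db):
    """Bounds on support(I) deduced from the supports of all proper
    subsets of I (brute-force version of the deduction rules)."""
    I = frozenset(I)
    lows, highs = [0], [len(db)]          # trivial bounds
    for r in range(len(I)):               # every proper subset X of I
        for Xs in combinations(sorted(I), r):
            X = frozenset(Xs)
            rest = sorted(I - X)
            total = 0
            for k in range(len(rest)):    # every J with X <= J < I
                for extra in combinations(rest, k):
                    J = X | frozenset(extra)
                    total += (-1) ** (len(I - J) + 1) * support(J, db)
            # odd |I \ X| gives an upper bound, even a lower bound
            (highs if len(I - X) % 2 else lows).append(total)
    return max(lows), min(highs)

# For I = {a, b} this recovers the familiar bounds
# sup(a) + sup(b) - |db|  <=  sup(ab)  <=  min(sup(a), sup(b)).
db = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
```

On this toy database the deduced bounds for {a, b} are (2, 3) while the true support is 2, so {a, b} is frequent but not derivable.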
Computing Iceberg Concept Lattices with TITANIC, 2002
Abstract

Cited by 112 (15 self)
We introduce the notion of iceberg concept lattices...
Efficient Algorithms for Mining Closed Itemsets and Their Lattice Structure
IEEE Transactions on Knowledge and Data Engineering, 2005
Abstract

Cited by 85 (7 self)
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree and an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "non-closed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.
Association Mining, 2006
Abstract

Cited by 61 (1 self)
The task of finding correlations between items in a dataset, association mining, has received considerable attention over the last decade. This article presents a survey of association mining fundamentals, detailing the evolution of association mining algorithms from the seminal to the state-of-the-art. This survey focuses on the fundamental principles of association mining, that is, itemset identification, rule generation, and their generic optimizations.
Depth-first non-derivable itemset mining
In SIAM Int. Conf. on Data Mining (SDM'05), 2005
Abstract

Cited by 58 (7 self)
Mining frequent itemsets is one of the main problems in data mining. Much effort went into developing efficient and scalable algorithms for this problem. When the support threshold is set too low, however, or the data is highly correlated, the number of frequent itemsets can become too large, independently of the algorithm used. Therefore, it is often more interesting to mine a reduced collection of interesting itemsets, i.e., a condensed representation. Recently, in this context, the non-derivable itemsets were proposed as an important class of itemsets. An itemset is called derivable when its support is completely determined by the support of its subsets. As such, derivable itemsets represent redundant information and can be pruned from the collection of frequent itemsets. It was shown both theoretically and experimentally that the collection of non-derivable frequent itemsets is in general much smaller than the complete set of frequent itemsets. A breadth-first, Apriori-based algorithm, called NDI, to find all non-derivable itemsets was proposed. In this paper we present a depth-first algorithm, dfNDI, that is based on Eclat for mining the non-derivable itemsets. dfNDI is evaluated on real-life datasets, and experiments show that dfNDI outperforms NDI by an order of magnitude.
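dfNDI builds on Eclat's vertical, depth-first search. The following is a minimal Eclat sketch, not dfNDI itself (which additionally applies the derivability pruning): each item carries its tidset, and a candidate extension's support is simply the size of the intersected tidsets. The toy database is my own.

```python
def eclat(prefix, items, minsup, out):
    """Depth-first vertical mining: `items` is a list of
    (item, tidset) pairs, each already frequent under `minsup`."""
    for i, (item, tids) in enumerate(items):
        out.append((prefix | {item}, len(tids)))
        # intersect tidsets with the remaining items to form extensions
        ext = [(other, tids & otids)
               for other, otids in items[i + 1:]
               if len(tids & otids) >= minsup]
        if ext:
            eclat(prefix | {item}, ext, minsup, out)
    return out

# Build the vertical (item -> tidset) layout for a toy database.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
tidsets = {}
for tid, t in enumerate(db):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

minsup = 2
start = sorted((i, s) for i, s in tidsets.items() if len(s) >= minsup)
frequent = eclat(frozenset(), start, minsup, [])
```

Here {a, b, c} is pruned during the descent because its tidset {0} falls below minsup, without ever scanning the horizontal database again.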
Generating a condensed representation for association rules
Journal of Intelligent Information Systems, Kluwer Academic Publishers, 2005
Abstract

Cited by 49 (6 self)
Association rule extraction from operational datasets often produces several tens of thousands, and even millions, of association rules. Moreover, many of these rules are redundant and thus useless. Using a semantics based on the closure of the Galois connection, we define a condensed representation for association rules. This representation is characterized by frequent closed itemsets and their generators. It contains the non-redundant association rules having minimal antecedent and maximal consequent, called min-max association rules. We think that these rules are the most relevant since they are the most general non-redundant association rules. Furthermore, this representation is a basis, i.e., a generating set for all association rules together with their supports and confidences, and all of them can be retrieved without accessing the data. We introduce algorithms for extracting this basis and for reconstructing all association rules. Results of experiments carried out on real datasets show the usefulness of this approach. In order to generate this basis when an algorithm for extracting frequent itemsets, such as APRIORI, is used, we also present an algorithm for deriving frequent closed itemsets and their generators from frequent itemsets without using the dataset.
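The generator/closed-itemset pairing is easy to state in code. An illustrative sketch under my own toy data (not the paper's algorithm): a generator is an itemset none of whose proper subsets has the same support, and an exact min-max rule sends a generator to the rest of its closure.

```python
def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def closure(itemset, db):
    """Intersection of all transactions containing `itemset`."""
    covering = [t for t in db if itemset <= t]
    return set.intersection(*covering) if covering else set()

def is_generator(itemset, db):
    """Minimal in its equivalence class: no proper subset has the
    same support (i.e., the same set of covering transactions)."""
    s = support(itemset, db)
    return all(support(itemset - {i}, db) != s for i in itemset)

# Every transaction with b also contains a, so {b} generates the
# closed itemset {a, b}: the exact rule  b => a  has confidence 1.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
```

Note that {a} is not a generator here: its support equals that of the empty set, so it is not minimal in its class.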
A survey on condensed representations for frequent sets
In: Constraint-Based Mining and Inductive Databases, Springer-Verlag, LNAI, 2005
Abstract

Cited by 38 (4 self)
Solving inductive queries which have to return complete collections of patterns satisfying a given predicate has been studied extensively over the last few years. The specific problem of frequent set mining from potentially huge boolean matrices has given rise to tens of efficient solvers. Frequent sets are indeed useful for many data mining tasks, including the popular association rule mining task but also feature construction, association-based classification, clustering, etc. The research in this area has been boosted by the fascinating concept of condensed representations w.r.t. frequency queries. Such representations can be used to support the discovery of every frequent set and its support without looking back at the data. Interestingly, the size of condensed representations can be several orders of magnitude smaller than the size of frequent set collections. Most of the proposals concern exact representations, while it is also possible to consider approximate ones, i.e., to trade computational complexity for a bounded approximation on the computed support values. This paper surveys the core concepts used in the recent works on condensed representations for frequent sets.
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns, 2007
Abstract

Cited by 38 (2 self)
The support-confidence framework is the most common measure used in itemset mining algorithms, for its anti-monotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine just the closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner, which is the most recent algorithm for mining a type of relative risk patterns known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and the efficiency have a high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
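Because all itemsets in an equivalence class occur in exactly the same transactions, they share one contingency table against any class label, and hence one value of any such test statistic. A generic chi-square computation over a 2x2 table (the standard textbook formula, not code from the paper):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 contingency table:
    rows = pattern present/absent, columns = class positive/negative."""
    n = n11 + n10 + n01 + n00
    r1, r0 = n11 + n10, n01 + n00      # row totals
    c1, c0 = n11 + n01, n10 + n00      # column totals
    stat = 0.0
    for observed, row, col in [(n11, r1, c1), (n10, r1, c0),
                               (n01, r0, c1), (n00, r0, c0)]:
        expected = row * col / n       # expected count under independence
        stat += (observed - expected) ** 2 / expected
    return stat
```

Computing this once per equivalence class, represented by its closed pattern and generators, is what lets the authors rank whole classes instead of individual itemsets.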
Adaptive and Resource-Aware Mining of Frequent Sets, 2002
Abstract

Cited by 35 (8 self)
The performance of an algorithm that mines frequent sets from transactional databases may severely depend on the specific features of the data being analyzed. Moreover, some architectural characteristics of the computational platform used, e.g., the available main memory, can dramatically change the runtime behavior of the algorithm. In this paper we present DCI (Direct Count & Intersect), an efficient data mining algorithm for discovering frequent sets from large databases, which effectively addresses the issues mentioned above. DCI adopts a classical level-wise approach based on candidate generation to extract frequent sets, but uses a hybrid method to determine candidate supports. The most innovative contribution of DCI lies in the multiple heuristic strategies employed, which permit DCI to adapt its behavior not only to the features of the specific computing platform, but also to the features of the dataset being mined, so that it is effective in mining both short and long patterns from sparse and dense datasets. The large number of tests conducted permits us to state that DCI considerably outperforms state-of-the-art algorithms on both synthetic and real-world datasets. Finally, we also discuss the parallelization strategies adopted in the design of ParDCI, a distributed and multithreaded implementation of DCI.
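The level-wise, candidate-generation skeleton that DCI builds on is the classic Apriori one. A sketch of the generation step (generic Apriori, not DCI's hybrid counting or its platform-adaptive heuristics): join frequent k-itemsets that share a (k-1)-prefix, then prune any candidate with an infrequent k-subset.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets."""
    ordered = sorted(tuple(sorted(s)) for s in frequent_k)
    known = set(map(frozenset, ordered))
    candidates = set()
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:                       # join step: shared prefix
            cand = frozenset(a) | frozenset(b)
            if all(frozenset(sub) in known         # prune step: all k-subsets
                   for sub in combinations(sorted(cand), len(a))):
                candidates.add(cand)
    return candidates
```

For example, from the frequent 2-itemsets {ab, ac, bc, bd} the join step proposes abc and bcd, and the prune step discards bcd because cd is not frequent.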