Results 1–10 of 20
Efficient incremental mining of top-K frequent closed itemsets
In Proc. of the 10th International Conference on Discovery Science (DS '07), LNCS, 2007
Cited by 5 (1 self)
Abstract. In this work we study the mining of top-K frequent closed itemsets, a recently proposed variant of the classical problem of mining frequent closed itemsets in which the support threshold is chosen as the maximum value sufficient to guarantee that at least K itemsets are returned in output. We discuss the effectiveness of the parameter K in controlling the output size and develop an efficient algorithm for mining top-K frequent closed itemsets in order of decreasing support, which exhibits consistently better performance than the best previously known one, attaining substantial improvements in some cases. A distinctive feature of our algorithm is that it allows the user to dynamically raise the value of K with no need to restart the computation from scratch.
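The top-K semantics described above can be sketched with a deliberately naive baseline: enumerate short itemsets, count supports, and keep the K most frequent, so the support threshold is implicitly the support of the K-th result. This is only an illustration of the problem definition, not the paper's algorithm (which mines *closed* itemsets in decreasing support order without exhaustive enumeration); the function name and `max_len` cap are assumptions for the sketch.

```python
from itertools import combinations
from collections import Counter

def top_k_itemsets(transactions, k, max_len=2):
    """Brute-force top-K frequent itemsets: enumerate all itemsets
    up to max_len items, count their supports, keep the k largest.
    Illustrates the top-K semantics only; real miners avoid this
    exhaustive enumeration."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for n in range(1, max_len + 1):
            for combo in combinations(items, n):
                counts[combo] += 1
    # The implicit support threshold is the count of the k-th entry.
    return counts.most_common(k)
```

For example, on four small transactions, `top_k_itemsets([{'a','b'}, {'a','b','c'}, {'a','c'}, {'b','c'}], 3)` returns the three singletons, each with support 3.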
Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees
 ACM Trans. Knowl. Disc. from Data
Cited by 5 (5 self)
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is O((1/ε²)(d + log(1/δ))) (resp. O(((2+ε)/(ε²(2−ε)θ))(d log …
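The d-index defined in the abstract is easy to compute exactly, and the absolute-approximation sample size follows directly from it. A minimal sketch, where the universal constant `c` hidden inside the O(·) is an assumed placeholder value (the abstract does not state it):

```python
import math

def d_index(transactions):
    """d-index: the largest d such that the dataset contains
    at least d transactions of length at least d."""
    lengths = sorted((len(t) for t in transactions), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i        # the i longest transactions all have length >= i
        else:
            break
    return d

def abs_sample_size(d, eps, delta, c=0.5):
    """Sample size of the form (c/eps^2) * (d + ln(1/delta)) for an
    absolute (eps, delta)-approximation; c is an unspecified universal
    constant (the value here is an assumption for illustration)."""
    return math.ceil((c / eps**2) * (d + math.log(1 / delta)))
```

Note the key property the bound exploits: the sample size depends on the dataset only through d, not through the (unknown) number of frequent itemsets.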
Mining top-K itemsets over a sliding window based on Zipfian distribution
In SIAM International Conference on Data Mining, 2005
Cited by 4 (0 self)
Frequent pattern discovery in data streams can be very useful in different applications. In time-critical applications, a sliding window model is needed to discount stale data. In this paper, we adopt this model to mine the K most interesting itemsets, or to estimate the K most frequent itemsets of different sizes, in a data stream. In our method, the sliding window is partitioned into buckets, and we maintain statistics on the frequency counts of the itemsets for the transactions in each bucket. We prove that our algorithm guarantees no false negatives for any data distribution. We also show that the number of false positives returned is typically small under a Zipfian distribution. Our experiments on synthetic data show that the memory used by our method is tens of times smaller than that of a naive approach, and that the false positives are negligible.
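The bucketed sliding-window mechanism described above can be sketched as follows. This sketch tracks only exact singleton counts to show the window mechanics (the class name and parameters are assumptions); the paper's contribution — bounding false positives for itemsets under a Zipfian distribution — is not reproduced here.

```python
from collections import Counter, deque

class BucketedWindow:
    """Sliding window split into fixed-size buckets: per-bucket
    counts are maintained, and the oldest bucket is discarded
    whole when the window slides past it (discounting stale data)."""
    def __init__(self, num_buckets, bucket_size):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.buckets = deque([Counter()])
        self.in_current = 0          # transactions in newest bucket

    def add(self, transaction):
        if self.in_current == self.bucket_size:
            self.buckets.append(Counter())     # start a new bucket
            self.in_current = 0
            if len(self.buckets) > self.num_buckets:
                self.buckets.popleft()         # expire stale bucket
        for item in transaction:
            self.buckets[-1][item] += 1
        self.in_current += 1

    def top_k(self, k):
        total = Counter()
        for b in self.buckets:
            total.update(b)                    # merge bucket statistics
        return total.most_common(k)
```

Expiring a whole bucket at a time is what keeps per-update cost low: no per-transaction bookkeeping is needed to age out old data.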
Discovering Interesting Association Rules: A Multi-objective Genetic Algorithm Approach
Cited by 1 (0 self)
Association rule mining is considered one of the important tasks of data mining supporting the decision-making process. It has been developed mainly to identify interesting associations and/or correlation relationships between frequent itemsets in datasets. A multi-objective genetic algorithm approach is proposed in this paper for the discovery of interesting association rules under multiple criteria, i.e., support, confidence and simplicity (comprehensibility). With a Genetic Algorithm (GA), a global search can be achieved and the system automated, because the proposed algorithm can identify interesting association rules from a dataset without user-specified thresholds of minimum support and minimum confidence. Experimental results on various types of datasets show the usefulness and effectiveness of the proposed algorithm.
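The three objectives named in the abstract can each be evaluated for a candidate rule A → C; a GA would then rank chromosomes by these values rather than by fixed thresholds. A minimal sketch, where the comprehensibility measure (inverse rule length) is an assumed stand-in — the paper's exact measure may differ:

```python
def rule_objectives(antecedent, consequent, transactions):
    """Evaluate a candidate rule A -> C on the three criteria:
    support, confidence, and a simple comprehensibility proxy
    (shorter rules score higher). Inputs are sets of items."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    comprehensibility = 1.0 / (len(antecedent) + len(consequent))
    return support, confidence, comprehensibility
```

A multi-objective GA would keep the Pareto-optimal rules under these three values instead of collapsing them into one fitness score.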
Mining Top-K Strongly Correlated Item Pairs Without Minimum Correlation Threshold
Cited by 1 (0 self)
Abstract. Given a user-specified minimum correlation threshold and a transaction database, the problem of mining strongly correlated item pairs is to find all item pairs whose Pearson's correlation coefficients are above the threshold. However, setting such a threshold is by no means an easy task. In this paper, we consider a more practical problem: mining top-k strongly correlated item pairs, where k is the desired number of item pairs with the largest correlation values. Based on the FP-tree data structure, we propose an efficient algorithm, called Tkcp, for mining such patterns without a minimum correlation threshold. Our experimental results show that the Tkcp algorithm outperforms the Taper algorithm, an efficient algorithm for mining correlated item pairs, even under the assumption of an optimally chosen correlation threshold. Thus, we conclude that mining top-k strongly correlated pairs without a minimum correlation threshold is preferable to the original threshold-based mining.
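The problem definition above can be stated in code with a naive pairwise scan: Pearson's correlation for binary item occurrences reduces to the phi coefficient. This is only the baseline the paper improves upon — Tkcp avoids the full pairwise computation via an FP-tree — and the function name is an assumption.

```python
import math
from itertools import combinations

def top_k_correlated_pairs(transactions, k):
    """Naive top-k correlated pairs: compute the phi coefficient
    (Pearson's correlation for binary item occurrences) for every
    item pair and return the k pairs with the largest values."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    occ = {i: sum(1 for t in transactions if i in t) for i in items}
    scored = []
    for a, b in combinations(items, 2):
        n_ab = sum(1 for t in transactions if a in t and b in t)
        denom = math.sqrt(occ[a] * occ[b] * (n - occ[a]) * (n - occ[b]))
        if denom:
            phi = (n * n_ab - occ[a] * occ[b]) / denom
            scored.append(((a, b), phi))
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]
```

Note that no threshold parameter appears: k alone controls the output, which is exactly the usability argument the abstract makes.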
Index Support for Item Set Mining: A Case Study
Abstract: The huge increase in the amount of data is clearly seen in present days because of the requirement to store more information. Extracting specific data from such a large database is a very difficult task, which leads researchers to develop better techniques to mine the required data. Various techniques have been proposed by several researchers to deal with this difficulty; among them, association rule mining for extracting the required data from the database is found to be effective. This paper presents the IMine index, a general and compact structure which provides tight integration of item set extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability, also for large data sets.
Frequent Closed Itemsets Based Condensed Representations for Association Rules
Mining High Utility Itemsets from Large Dynamic Dataset by Eliminating Unusual Items
Utility-based data mining is a new research area concerned with all types of utility factors in data mining processes [1]. The basic meaning of utility is the quantity sold, interest, importance and profitability of items to the users. The utility of items in a transaction database consists of two aspects: 1. the importance of distinct items, which is called external utility; 2. the importance of the items within the transaction, which is called internal utility. Mining high utility itemsets from a database is not an easy task: pruning the search space is difficult because a superset of a low utility itemset may still be a high utility itemset. Existing studies [2,4,9] applied overestimation methods to improve the performance of utility mining. In these methods, potential high utility itemsets are generated first, and an additional database scan is then performed to identify their exact utilities. However, the existing methods often generate a huge number of candidate itemsets, and mining performance is degraded as a consequence. In this paper we propose eliminating unusual itemsets, i.e., low utility itemsets, to reduce the search space. The proposed method not only reduces the number of candidate itemsets, but also significantly improves the performance of the mining process.
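The two utility aspects and the overestimate-then-prune idea can be sketched concretely. Below, transaction-weighted utility (TWU) stands in for the overestimation step used by the two-phase style of method the abstract cites: an item whose TWU is already below the threshold cannot appear in any high utility itemset, so it can be eliminated up front. Function names and the dict-based transaction encoding (item → quantity) are assumptions of this sketch.

```python
def transaction_utility(t, external):
    """Utility of one transaction: sum over items of
    internal utility (quantity) x external utility (e.g. unit profit)."""
    return sum(qty * external[item] for item, qty in t.items())

def twu_prune(transactions, external, min_util):
    """Overestimate-then-prune: compute each item's transaction-weighted
    utility (sum of utilities of transactions containing it) and keep
    only items whose TWU reaches min_util. TWU overestimates the true
    utility of any itemset containing the item, so pruned items are
    safe to eliminate."""
    twu = {}
    for t in transactions:
        tu = transaction_utility(t, external)
        for item in t:
            twu[item] = twu.get(item, 0) + tu
    return {i for i, u in twu.items() if u >= min_util}
```

The surviving items then define a reduced search space over which candidate high utility itemsets are generated and verified with one more database scan.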
for Item Set Mining (IMine)
Abstract — The huge increase in the amount of data is clearly seen in present days because of the requirement to store more information. Extracting specific data from such a large database is a very difficult task, which leads researchers to develop better techniques to mine the required data. Various techniques have been proposed by several researchers to deal with this difficulty; among them, association rule mining for extracting the required data from the database is found to be effective. This paper presents the IMine index, a general and compact structure which provides tight integration of item set extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, IMine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability, also for large data sets.
Mining top-K frequent itemsets from data streams
DOI 10.1007/s10618-006-0042-x
Abstract. Frequent pattern mining on data streams has attracted interest recently. However, it is not easy for users to determine a proper frequency threshold; it is more reasonable to ask users to set a bound on the result size. We study the problem of mining top-K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee on the output quality and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting algorithm. In most of the experiments with the two proposed algorithms, we obtain perfect solutions, and the memory space occupied by our algorithms is very small. Besides, we also propose adapted versions of these two algorithms to handle the case in which we are interested in mining the data in a sliding window. The experiments show that the results are accurate.
Keywords: Data mining algorithm · Data stream · Top-K frequent itemset mining · Sliding window · Chernoff bound · Probabilistic algorithm