Results 1–10 of 66
An efficient algorithm for discovering frequent subgraphs
IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 120 (7 self)
Abstract — Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used, because the transaction framework these algorithms assume cannot effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that, despite the underlying complexity of frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions, and it scales linearly with the size of the dataset. Index Terms — Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets.
A fast apriori implementation
Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), volume 90 of Workshop Proceedings, 2003
Cited by 71 (2 self)
The efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used, and the implementation details. Most papers focus on the first factor, some describe the underlying data structures, but implementation details are almost always neglected. In this paper we show that the effect of implementation can be more important than the selection of the algorithm. Ideas that seem quite promising may turn out to be ineffective once we descend to the implementation level. We theoretically and experimentally analyze APRIORI, the most established algorithm for frequent itemset mining. Several implementations of the algorithm have been put forward in the last decade. Although they implement the very same algorithm, they display large differences in running time and memory requirements. In this paper we describe an implementation of APRIORI that outperforms all implementations known to us. We analyze, theoretically and experimentally, the principal data structure of our solution. This data structure is the main factor in the efficiency of our implementation. Moreover, we present a simple modification of APRIORI that appears to be faster than the original algorithm.
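As context for the implementation discussion above, a minimal Python sketch of the level-wise APRIORI scheme (join, prune, count) may be useful; the function name and structure are illustrative and deliberately unoptimized, not the paper's tuned implementation:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise APRIORI: generate k-candidates from frequent
    (k-1)-itemsets, prune by the downward-closure property, count support."""
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    result = set(freq)
    k = 2
    while freq:
        # Join step: union pairs of frequent (k-1)-itemsets into k-sets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Count step: scan the database once per level.
        freq = {c for c in candidates
                if sum(c <= set(t) for t in transactions) >= minsup}
        result |= freq
        k += 1
    return result
```

Real implementations differ exactly where this sketch is naive: how candidates are stored (tries, hash trees) and how support counting is organized dominate the running time, which is the paper's point.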
Discovering Frequent Geometric Subgraphs
In IEEE Intl. Conference on Data Mining '02, 2002
Cited by 38 (1 self)
As data mining techniques are increasingly applied to nontraditional domains, existing approaches for finding frequent itemsets cannot be used, as they cannot model the requirements of these domains. An alternate way of modeling the objects in these datasets is to use a graph to model the database objects. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper we present a computationally efficient algorithm for finding frequent geometric subgraphs in a large collection of geometric graphs. Our algorithm can discover geometric subgraphs that are invariant to rotation, scaling, and translation, and it can accommodate inherent errors in the coordinates of the vertices. We evaluated the performance of the algorithm using a large database of over 20,000 real two-dimensional chemical structures, and our experimental results show that the algorithm requires relatively little time, can accommodate low support values, and scales linearly with the number of transactions.
Discovering frequent patterns in sensitive data
Cited by 37 (1 self)
Discovering frequent patterns from data is a popular exploratory technique in data mining. However, if the data are sensitive (e.g., patient health records, user behavior records), releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns, along with their frequencies, in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there. We present two efficient algorithms for discovering the K most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning 'noisy' lists of patterns that are close to the actual list of the K most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-K pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately 'robust' measure of interest.
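As a rough illustration of the 'noisy list' idea described above (not the paper's actual mechanism — the noise distribution, scale, and selection step here are simplifying assumptions), one can perturb each pattern's count and report the K largest noisy values:

```python
import math
import random

def noisy_topk(counts, k, epsilon):
    """Sketch of a noisy top-K release: add Laplace-distributed noise to
    each pattern's frequency, then report the K largest noisy counts.
    The scale 2k/epsilon is a plausible choice, not the paper's analysis."""
    scale = 2.0 * k / epsilon

    def laplace():
        # Inverse-CDF sampling of a Laplace(0, scale) variate.
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    noisy = {p: c + laplace() for p, c in counts.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]
```

The point of the noise is that two databases differing in one record yield nearly indistinguishable output distributions; the utility question the abstract raises is how often the noisy list still matches the true top-K.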
Efficient Hardware Data Mining with the Apriori Algorithm on FPGAs
Cited by 22 (1 self)
The Apriori algorithm is a popular correlation-based data-mining kernel. However, it is computationally expensive, and running times can stretch to days for large databases, as database sizes can extend to gigabytes. Through a new extension to the systolic array architecture, the time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 provides a performance improvement that can be orders of magnitude faster than state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.
Fast Algorithm for Mining Association Rules
Cited by 15 (1 self)
One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction consists of a set of items. The most time-consuming operation in this discovery process is computing the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions. Can one develop a method that avoids or reduces candidate generation and testing, and that uses novel data structures to reduce the cost of frequent pattern mining? This is the motivation of my study. A fast algorithm has been proposed for solving this problem. Our algorithm uses the "TreeMap", a data structure from the Java standard library. We also present an "ArrayList" technique that greatly reduces the need to traverse the database. Moreover, we present experimental results showing that our structure outperforms the existing available algorithms on common data mining problems. Keywords: data mining, association rules, TreeMap, ArrayList.
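The "ArrayList" idea of avoiding repeated database traversal resembles a vertical, tid-list representation; here is a hypothetical Python sketch of that general technique (the paper's actual Java TreeMap/ArrayList design will differ in detail):

```python
from collections import defaultdict

def vertical_tidlists(transactions):
    """Build per-item transaction-id lists. Once built, the support of
    any itemset is the size of an intersection of lists, so the raw
    database never needs to be re-scanned."""
    tids = defaultdict(list)
    for tid, t in enumerate(transactions):
        for item in t:
            tids[item].append(tid)
    return tids

def support(tids, itemset):
    """Support of an itemset = number of transactions containing all its
    items, computed purely from the tid-lists."""
    sets = [set(tids[i]) for i in itemset]
    return len(set.intersection(*sets))
```

A sorted map keyed on items (the TreeMap analogue) additionally keeps the lists in a fixed item order, which simplifies candidate enumeration.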
nonordfp: An FP-Growth Variation without Rebuilding the FP-Tree
Cited by 12 (0 self)
We describe a frequent itemset mining algorithm and implementation based on the well-known algorithm FP-growth. The theoretical difference is the main data structure (tree), which is more compact and which we do not need to rebuild for each conditional step. We thoroughly deal with implementation issues, data structures, memory layout, I/O, and the library functions we use to achieve performance comparable to the best implementations of the 1st Frequent Itemset Mining Implementations (FIMI) Workshop.
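For reference, a minimal sketch of the vanilla FP-tree construction that FP-growth-style algorithms start from (names are illustrative; nonordfp's contribution is precisely a more compact, array-based layout that avoids rebuilding this tree for each conditional step):

```python
from collections import defaultdict

class FPNode:
    __slots__ = ("item", "count", "children")

    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fptree(transactions, minsup):
    """Build a basic FP-tree: filter infrequent items, then insert each
    transaction with its items sorted by descending global frequency so
    that shared prefixes collapse into shared paths."""
    freq = defaultdict(int)
    for t in transactions:
        for i in t:
            freq[i] += 1
    order = {i: f for i, f in freq.items() if f >= minsup}
    root = FPNode(None)
    for t in transactions:
        path = sorted((i for i in t if i in order),
                      key=lambda i: (-order[i], i))
        node = root
        for i in path:
            node = node.children.setdefault(i, FPNode(i))
            node.count += 1
    return root
```

Classic FP-growth then extracts a conditional tree per item and recurses; the abstract's claim is that with the right layout this per-item rebuild can be skipped.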
Memory Issues in Frequent Itemset Mining
Proceedings of the 2004 ACM Symposium on Applied Computing, 2004
Cited by 11 (1 self)
During the past decade, many algorithms have been proposed to solve the frequent itemset mining problem, i.e., find all sets of items that frequently occur together in a given database of transactions. Although very efficient techniques have been presented, they still suffer from the same problem: they are all inherently dependent on the amount of main memory available. Moreover, if this amount is not enough, the presented techniques are simply no longer applicable, or pay significantly in performance. In this paper, we give a rigorous comparison of current state-of-the-art techniques and present a new and simple technique, based on sorting the transaction database, resulting in a sometimes more efficient algorithm for frequent itemset mining that uses less memory.
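The sorting idea can be illustrated with a tiny sketch, assuming the goal is to canonicalize transactions so that duplicates become adjacent and can be merged with multiplicities (names are illustrative, not the paper's code):

```python
def compress_database(transactions):
    """Sort each transaction, then sort the database itself: identical
    transactions become adjacent and collapse into (transaction,
    multiplicity) pairs, shrinking the in-memory footprint."""
    canon = sorted(tuple(sorted(t)) for t in transactions)
    merged = []
    for t in canon:
        if merged and merged[-1][0] == t:
            merged[-1][1] += 1
        else:
            merged.append([t, 1])
    return merged
```

Beyond deduplication, a sorted database also gives long shared prefixes between consecutive transactions, which counting passes can exploit.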
An architecture for efficient hardware data mining using reconfigurable computing systems
Brigham Young University, 2006
Cited by 10 (0 self)
The Apriori algorithm is a fundamental correlation-based data mining kernel used in a variety of fields. The innovation in this paper is a highly parallel custom architecture implemented on a reconfigurable computing system. Using this "bitmapped CAM," the time and area required for executing the subset operations fundamental to data mining can be significantly reduced. The bitmapped CAM architecture implementation on an FPGA-accelerated high-performance workstation provides a performance acceleration of orders of magnitude over software-based systems. The bitmapped CAM exploits redundancy within the candidate data to efficiently store and process many subset operations simultaneously. The efficiency of this operation allows 140 units to process about 2,240 subset operations simultaneously. Using industry-standard benchmarking databases, we have tested the bitmapped CAM architecture on the SRC-6E reconfigurable hardware system. The platform provides a minimum 24x (and often much higher) performance advantage over the fastest software Apriori implementations.
Apriori, a depth first implementation
Proc. of the Workshop on Frequent Itemset Mining Implementations, 2003
Cited by 9 (1 self)
We will discuss DF, the depth-first implementation of APRIORI as devised in 1999 (see [8]). Given a database, this algorithm builds a trie in memory that contains all frequent itemsets, i.e., all sets contained in at least minsup transactions from the original database, where minsup is a threshold value given in advance. In the trie, which is constructed by adding one item at a time, every path corresponds to a unique frequent itemset. We describe the algorithm in detail, derive theoretical formulas, and provide experiments.
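A small sketch of the depth-first, one-item-at-a-time trie construction described above (illustrative only; DF's actual counting strategy and memory layout differ):

```python
def df_mine(transactions, minsup, items=None, prefix=()):
    """Depth-first enumeration: extend the current itemset (a trie path)
    by one item at a time, recursing only while the extended path is
    still frequent. Transactions are sets of items. The returned nested
    dict is the trie: every path from the root is a frequent itemset."""
    if items is None:
        items = sorted({i for t in transactions for i in t})
    trie = {}
    for idx, item in enumerate(items):
        candidate = prefix + (item,)
        sup = sum(set(candidate) <= t for t in transactions)
        if sup >= minsup:
            # Only items after the current one may extend this path,
            # so each itemset appears exactly once in the trie.
            trie[item] = df_mine(transactions, minsup,
                                 items[idx + 1:], candidate)
    return trie
```

Because infrequent paths are cut off immediately, the trie never holds a candidate whose prefix is infrequent, which is the same downward-closure pruning APRIORI uses level by level.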