Results 1 - 10
of
3,612
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach
- DATA MINING AND KNOWLEDGE DISCOVERY
, 2004
"... Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still co ..."
Abstract
-
Cited by 1752 (64 self)
- Add to MetaCart
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. In this study, we propose a novel
frequent-pattern tree
(FP-tree) structure, which is an extended prefix-tree
structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-
based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth.
Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed,
smaller data structure, FP-tree which avoids costly, repeated database scans, (2) our FP-tree-based mining adopts
a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a
partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for
mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance
study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns,
and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported
new frequent-pattern mining methods
Mining Sequential Patterns
, 1995
"... We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empiri ..."
Abstract
-
Cited by 1568 (6 self)
- Add to MetaCart
(Show Context)
We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.
Mining Sequential Patterns: Generalizations and Performance Improvements
- RESEARCH REPORT RJ 9994, IBM ALMADEN RESEARCH
, 1995
"... The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified ..."
Abstract
-
Cited by 759 (5 self)
- Add to MetaCart
The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is "5 % of customers bought `Foundation' and `Ringworld' in one transaction, followed by `Second Foundation ' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
gSpan: Graph-Based Substructure Pattern Mining
, 2002
"... We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining) , which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and ..."
Abstract
-
Cited by 650 (34 self)
- Add to MetaCart
We investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining) , which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently. Our performance study shows that gSpan substantially outperforms previous algorithms, sometimes by an order of magnitude.
Dynamic Itemset Counting and Implication Rules for Market Basket Data
, 1997
"... We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We in ..."
Abstract
-
Cited by 615 (6 self)
- Add to MetaCart
(Show Context)
We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating "implication rules," which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of co-occurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results. 1 Introduction Within the area of data mining, the problem of deriving associations from data has recently received a great deal of attention. The prob...
Mining Generalized Association Rules
, 1995
"... We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy th ..."
Abstract
-
Cited by 591 (7 self)
- Add to MetaCart
(Show Context)
We introduce the problem of mining generalized association rules. Given a large database of transactions, where each transaction consists of a set of items, and a taxonomy (is-a hierarchy) on the items, we find associations between items at any level of the taxonomy. For example, given a taxonomy that says that jackets is-a outerwear is-a clothes, we may infer a rule that "people who buy outerwear tend to buy shoes". This rule may hold even if rules that "people who buy jackets tend to buy shoes", and "people who buy clothes tend to buy shoes" do not hold. An obvious solution to the problem is to add all ancestors of each item in a transaction to the transaction, and then run any of the algorithms for mining association rules on these "extended transactions ". However, this "Basic" algorithm is not very fast; we present two algorithms, Cumulate and EstMerge, which run 2 to 5 times faster than Basic (and more than 100 times faster on one real-life dataset). We also present a new interes...
Data Preparation for Mining World Wide Web Browsing Patterns
- KNOWLEDGE AND INFORMATION SYSTEMS
, 1999
"... The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of tra#c and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site have increased along with this growth. An i ..."
Abstract
-
Cited by 567 (43 self)
- Add to MetaCart
(Show Context)
The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of tra#c and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site have increased along with this growth. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of data mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, there are several preprocessing tasks that must be performed prior to applying data mining algorithms to the data collected from server logs. This paper presents several data preparation techniques in order to identify unique users and user sessions. Also, a method to divide user sessions into semantically meaningful transactions is defined and successfully tested against two other methods. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [15].
Data Mining: An Overview from Database Perspective
- IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 1996
"... Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have sh ..."
Abstract
-
Cited by 532 (26 self)
- Add to MetaCart
Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity of major revenues. Researchers in many different fields have shown great interest in data mining. Several emerging applications in information providing services, such as data warehousing and on-line services over the Internet, also call for various data mining techniques to better understand user behavior, to improve the service provided, and to increase the business opportunities. In response to such a demand, this article is to provide a survey, from a database researcher's point of view, on the data mining techniques developed recently. A classification of the available data mining techniques is provided and a comparative study of such techniques is presented.
Discovery of Multiple-Level Association Rules from Large Databases
- In Proc. 1995 Int. Conf. Very Large Data Bases
, 1995
"... Previous studies on mining association rules find rules at single concept level, however, mining association rules at multiple concept levels may lead to the discovery of more specific and concrete knowledge from data. In this study, a top-down progressive deepening method is developed for mining mu ..."
Abstract
-
Cited by 463 (34 self)
- Add to MetaCart
(Show Context)
Previous studies on mining association rules find rules at single concept level, however, mining association rules at multiple concept levels may lead to the discovery of more specific and concrete knowledge from data. In this study, a top-down progressive deepening method is developed for mining multiplelevel association rules from large transaction databases by extension of some existing association rule mining techniques. A group of variant algorithms are proposed based on the ways of sharing intermediate results, with the relative performance tested on different kinds of data. Relaxation of the rule conditions for finding "level-crossing" association rules is also discussed in the paper.
Efficiently mining long patterns from databases
, 1998
"... We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data ..."
Abstract
-
Cited by 457 (3 self)
- Add to MetaCart
(Show Context)
We present a pattern-mining algorithm that scales roughly linearly in the number of maximal patterns embedded in a database irrespective of the length of the longest pattern. In comparison, previous algorithms based on Apriori scale exponentially with longest pattern length. Experiments on real data show that when the patterns are long, our algorithm is more efficient by an order of magnimaximal frequent itemset, Max-Miner’s output implicitly and concisely represents all frequent itemsets. Max-Miner is shown to result in two or more orders of magnitude in performance improvements over Apriori on some data-sets. On other data-sets where the patterns are not so long, the gains are more modest. In practice, Max-Miner is demonstrated to run in time that is roughly linear in the number of maximal frequent itemsets and the size of the database, irrespective of the size of the longest frequent itemset. tude or more. 1.