Results 1 - 10 of 65
Mining Sequential Patterns
, 1995
Cited by 931 (5 self)
We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction. 1 Introduction Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it po...
Mining Sequential Patterns: Generalizations and Performance Improvements
- Research Report RJ 9994, IBM Almaden Research
, 1995
Cited by 446 (3 self)
The problem of mining sequential patterns was recently introduced in [3]. We are given a database of sequences, where each sequence is a list of transactions ordered by transaction-time, and each transaction is a set of items. The problem is to discover all sequential patterns with a user-specified minimum support, where the support of a pattern is the number of data-sequences that contain the pattern. An example of a sequential pattern is "5% of customers bought 'Foundation' and 'Ringworld' in one transaction, followed by 'Second Foundation' in a later transaction". We generalize the problem as follows. First, we add time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern. Second, we relax the restriction that the items in an element of a sequential pattern must come from the same transaction, instead allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Third, given a user-defined taxonomy (is-a hierarchy) on items, we allow sequential patterns to include items across all levels of the taxonomy. We present GSP, a new algorithm that discovers these generalized sequential patterns. Empirical evaluation using synthetic and real-life data indicates that GSP is much faster than the AprioriAll algorithm presented in [3]. GSP scales linearly with the number of data-sequences, and has very good scale-up properties with respect to the average data-sequence size.
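The support definition in this abstract can be illustrated with a minimal Python sketch. It covers only the basic problem, where each pattern element must fit inside a single transaction. The time constraints, time window, and taxonomy generalizations are deliberately left out, and this is not the GSP algorithm. The names `contains` and `support` and the sample database are hypothetical.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
DataSequence = List[Itemset]  # transactions ordered by transaction-time


def contains(data_sequence: DataSequence, pattern: List[Itemset]) -> bool:
    """Return True if every pattern element is a subset of some transaction,
    with the matching transactions appearing in strictly increasing order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest transaction that holds this element.
        while pos < len(data_sequence) and not element <= data_sequence[pos]:
            pos += 1
        if pos == len(data_sequence):
            return False
        pos += 1  # the next element must come from a later transaction
    return True


def support(database: List[DataSequence], pattern: List[Itemset]) -> int:
    """Count the data-sequences that contain the pattern."""
    return sum(1 for seq in database if contains(seq, pattern))


# Hypothetical database: one data-sequence per customer.
database = [
    [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Foundation"}), frozenset({"Ringworld"}), frozenset({"Second Foundation"})],
    [frozenset({"Second Foundation"}), frozenset({"Foundation", "Ringworld"})],
    [frozenset({"Foundation", "Ringworld", "Dune"}), frozenset({"Dune"}),
     frozenset({"Second Foundation", "Dune"})],
]
pattern = [frozenset({"Foundation", "Ringworld"}), frozenset({"Second Foundation"})]
print(support(database, pattern))  # prints 2
```

Only the first and fourth customers support the pattern: the second never buys both books in one transaction, and the third buys them in the wrong order.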
An efficient algorithm for mining association rules in large databases
, 1995
Cited by 330 (0 self)
Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large size databases.
Discovery of Frequent Episodes in Event Sequences
- DATA MINING AND KNOWLEDGE DISCOVERY
, 1997
Cited by 250 (14 self)
Sequences of events describing the behavior and actions of users or systems can be collected in several domains. We consider the problem of discovering frequently occurring episodes in such sequences. An episode is defined to be a collection of events that occur relatively close to each other in a given partial order. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present extensive experimental results. The methods are in use in telecommunication alarm management.
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases
- In VLDB
, 1995
Cited by 182 (6 self)
We introduce a new model of similarity of time sequences that captures the intuitive notion that two sequences should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar. The model allows the amplitude of one of the two sequences to be scaled by any suitable amount and its offset adjusted appropriately. Two subsequences are considered similar if one can be enclosed within an envelope of a specified width drawn around the other. The model also allows non-matching gaps in the matching subsequences. The matching subsequences need not be aligned along the time axis. Given this model of similarity, we present fast search techniques for discovering all similar sequences in a set of sequences. These techniques can also be used to find all (sub)sequences similar to a given sequence. We applied this matching system to the U.S. mutual funds data and discovered interesting matches.
SPIRIT: Sequential Pattern Mining with Regular Expression Constraints
, 1999
Cited by 131 (2 self)
Discovering sequential patterns is an important problem in data mining with a host of application domains including medicine, telecommunications, and the World Wide Web. Conventional
Efficient data mining for path traversal patterns
- IEEE Transactions on Knowledge and Data Engineering
, 1998
Cited by 128 (10 self)
In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted. Index Terms: Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.
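The first step described in this abstract, turning a traversal log into maximal forward references, might look roughly like the Python sketch below. The abstract does not spell out the procedure, so the sketch rests on an assumption: a backward reference is a revisit of a page already on the current path, and a forward path is emitted whenever a forward move is followed by a backward one. The function name and the sample traversal are hypothetical.

```python
from typing import List


def maximal_forward_references(traversal: List[str]) -> List[List[str]]:
    """Split one user's page traversal into maximal forward references.

    Assumption: revisiting a page already on the current path is a
    backward reference, which ends the current forward path.
    """
    path: List[str] = []
    result: List[List[str]] = []
    moving_forward = False
    for page in traversal:
        if page in path:
            # Backward reference: emit the path only if we were moving forward.
            if moving_forward:
                result.append(list(path))
            # Backtrack to the revisited page.
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # A traversal that ends on a forward move leaves one last path to emit.
    if moving_forward and path:
        result.append(list(path))
    return result


# Illustrative traversal: each letter is a page.
log = list("ABCDCBEGHGWAOUOV")
for ref in maximal_forward_references(log):
    print("".join(ref))
# prints ABCD, ABEGH, ABEGW, AOU, AOV (one per line)
```

The second step, finding frequent traversal patterns, would then count consecutive subsequences across the forward references of all users. That step is not shown here.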
Data Mining for Path Traversal Patterns in a Web Environment
, 1996
Cited by 98 (1 self)
In this paper, we explore a new data mining capability which involves mining path traversal patterns in a distributed information-providing environment such as the World Wide Web. First, we convert the original sequence of log data into a set of maximal forward references and filter out the effect of some backward references which are mainly made for ease of traveling. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences: one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed.
Using a Hash-Based Method with Transaction Trimming and Database Scan Reduction for Mining Association Rules
- IEEE Transactions on Knowledge and Data Engineering
, 1997
Cited by 58 (8 self)
In this paper, we examine the issue of mining association rules among items in a large database of sales transactions. Mining association rules means that given a database of sales transactions, to discover all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. The mining of association rules can be mapped into the problem of discovering large itemsets where a large itemset is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing a candidate set of itemsets first and then, identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally this is done iteratively for each large k-itemset in increasing order of k where a large k-itemset is a large itemset with k items. To determine large itemsets from a huge number of candidate large itemsets in early iterations is usual...
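The iterative scheme outlined in this abstract (build candidate k-itemsets, then keep those that appear in enough transactions) can be sketched in Python as follows. This is a plain level-wise sketch under stated assumptions, not the paper's hash-based method: it does no hashing, transaction trimming, or scan reduction. The function name, the `min_support` parameter (an absolute transaction count), and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def large_itemsets(transactions: Iterable[Iterable[str]],
                   min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset found in at least min_support transactions,
    mapped to its transaction count, working level by level on k."""
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts: Dict[FrozenSet[str], int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current: Set[FrozenSet[str]] = {s for s, c in counts.items() if c >= min_support}
    result = {s: counts[s] for s in current}

    k = 2
    while current:
        # Candidates: k-item unions of two large (k-1)-itemsets whose
        # (k-1)-item subsets are all large.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass over the database per level to count the candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in current})
        k += 1
    return result


sample = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
found = large_itemsets(sample, min_support=2)
print(found[frozenset({"B", "C", "E"})])  # prints 2
```

With a minimum support of two transactions, the sample yields four large 1-itemsets, four large 2-itemsets, and the single large 3-itemset {B, C, E}.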
Comparing hierarchical data in external memory
- In 25th Very Large Data Base Conference (VLDB
, 1999
Cited by 45 (2 self)
We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, $4mn + 7m + 5n$, $6S$, and $O(MN + (M+N)S^{1.5})$, where $M$ and $N$ are the input tree sizes, $S$ is the block size, $m = M/S$, and $n = N/S$. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs.

