Fast Algorithms for Mining Association Rules, 1994
Cited by 3612 (15 self)
We consider the problem of discovering association rules between items in a large database of sales transactions. We present two new algorithms for solving this problem that are fundamentally different from the known algorithms. Empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. We also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scaleup experiments show that AprioriHybrid scales linearly with the number of transactions. AprioriHybrid also has excellent scaleup properties with respect to the transaction size and the number of items in the database.
An efficient algorithm for mining association rules in large databases, 1995
Cited by 437 (0 self)
Mining for association rules between items in a large database of sales transactions has been described as an important database mining problem. In this paper we present an efficient algorithm for mining association rules that is fundamentally different from known algorithms. Compared to previous algorithms, our algorithm not only reduces the I/O overhead significantly but also has lower CPU overhead for most cases. We have performed extensive experiments and compared the performance of our algorithm with one of the best existing algorithms. It was found that for large databases, the CPU overhead was reduced by as much as a factor of four and I/O was reduced by almost an order of magnitude. Hence this algorithm is especially suitable for very large databases.
Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, 1993
Cited by 345 (13 self)
We present our perspective of database mining as the confluence of machine learning techniques and the performance emphasis of database technology. We describe three classes of database mining problems involving classification, associations, and sequences, and argue that these problems can be uniformly viewed as requiring discovery of rules embedded in massive data. We describe a model and some basic operations for the process of rule discovery. We show how the database mining problems we consider map to this model and how they can be solved by using the basic operations we propose. We give an example of an algorithm for classification obtained by combining the basic rule discovery operations. This algorithm not only is efficient in discovering classification rules but also has accuracy comparable to ID3, one of the current best classifiers. Index Terms: database mining, knowledge discovery, classification, associations, sequences, decision trees.
An effective hash-based algorithm for mining association rules, 1995
Cited by 283 (3 self)
In this paper, we examine the issue of mining association rules among items in a large database of sales transactions. The mining of association rules can be mapped into the problem of discovering large itemsets, where a large itemset is a group of items which appear in a sufficient number of transactions. The problem of discovering large itemsets can be solved by constructing a candidate set of itemsets first and then identifying, within this candidate set, those itemsets that meet the large itemset requirement. Generally this is done iteratively for each large k-itemset in increasing order of k, where a large k-itemset is a large itemset with k items. To determine large itemsets from a huge number of candidate large itemsets in early iterations is usually the dominating factor for the overall data mining performance. To address this issue, we propose an effective hash-based algorithm for the candidate set generation. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is, in orders of magnitude, smaller than that by previous methods, thus resolving the performance bottleneck. Note that the generation of smaller candidate sets enables us to effectively trim the transaction database size at a much earlier stage of the iterations, thereby reducing the computational cost for later iterations significantly. Extensive simulation study is conducted to evaluate performance of the proposed algorithm.
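The abstract above does not spell out how hashing shrinks the candidate set, so the following is only a minimal sketch of one common reading: while counting single items, every pair in each transaction is hashed into a bucket, and a pair is kept as a candidate 2-itemset only if both items are large and its bucket count reaches the minimum support. The function name `candidate_pairs`, the `num_buckets` parameter, and the use of Python's built-in `hash()` are assumptions made for illustration, not details from the paper.

```python
from collections import Counter
from itertools import combinations


def candidate_pairs(transactions, min_support, num_buckets=7):
    """Return candidate 2-itemsets that survive a hash-bucket filter.

    Hypothetical helper: the bucket count and the hash function are
    illustrative choices, not taken from the paper.
    """
    item_counts = Counter()
    bucket_counts = [0] * num_buckets

    def bucket(pair):
        return hash(pair) % num_buckets

    # Pass 1: count single items and hash every pair in each transaction.
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1

    large_items = sorted(i for i, c in item_counts.items() if c >= min_support)

    # A bucket count is an upper bound on the support of every pair hashed
    # into it, so a pair in a light bucket cannot be large and is dropped.
    return [
        pair
        for pair in combinations(large_items, 2)
        if bucket_counts[bucket(pair)] >= min_support
    ]
```

Because the bucket count only over-estimates support, the filter never discards a truly large pair; hash collisions can let some non-large pairs through, and those are eliminated by the usual counting pass.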
Finding Interesting Rules from Large Sets of Discovered Association Rules, 1994
Cited by 240 (9 self)
Association rules, introduced by Agrawal, Imielinski, and Swami, are rules of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set W, then it has 1 also in column B". Efficient methods exist for discovering association rules from large collections of data. The number of discovered rules can, however, be so large that browsing the rule set and finding interesting rules from it can be quite difficult for the user. We show how a simple formalism of rule templates makes it possible to easily describe the structure of interesting rules. We also give examples of visualization of rules, and show how a visualization tool interfaces with rule templates. 1 Introduction Data mining (knowledge discovery in databases) is a field of increasing interest combining databases, artificial intelligence, and machine learning. The purpose of data mining is to facilitate understanding large amounts of data by discovering interesting regularities or exceptions (see e...
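To make the rule form quoted in the abstract above concrete, here is a small sketch that computes, for a 0/1 relation, the fraction of rows with value 1 in every column of W that also have 1 in column B (the "90%" figure in the quoted form), along with the fraction of all rows that satisfy both sides. The row layout (a list of dicts) and the name `rule_stats` are assumptions for illustration only.

```python
def rule_stats(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    rows: list of dicts mapping column name to 0 or 1 (assumed layout).
    antecedent: iterable of column names (the set W).
    consequent: a single column name (the column B).
    """
    # Rows that have value 1 in every column of W.
    matching = [r for r in rows if all(r[c] == 1 for c in antecedent)]
    # Of those, the rows that also have 1 in column B.
    both = [r for r in matching if r[consequent] == 1]
    support = len(both) / len(rows) if rows else 0.0
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence


rows = [
    {"bread": 1, "butter": 1, "milk": 1},
    {"bread": 1, "butter": 1, "milk": 0},
    {"bread": 1, "butter": 0, "milk": 1},
    {"bread": 1, "butter": 1, "milk": 1},
]
support, confidence = rule_stats(rows, {"bread", "butter"}, "milk")
```

In this toy relation three rows have 1 in both columns of W and two of them also have 1 in B, so the confidence is 2/3 and the support is 2/4.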
Efficient data mining for path traversal patterns, IEEE Transactions on Knowledge and Data Engineering, 1998
Cited by 217 (16 self)
In this paper, we explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. Our solution procedure consists of two steps. First, we derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, we can filter out the effect of some backward references, which are mainly made for ease of traveling, and concentrate on mining meaningful user access sequences. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted. Index Terms: Data mining, traversal patterns, distributed information system, World Wide Web, performance analysis.
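The first step described above, converting a traversal log into maximal forward references, can be sketched as follows. This is one plausible reading of the abstract rather than the authors' exact procedure: a visit to a page already on the current path is treated as a backward reference, and the path built so far is emitted only if the previous move was a forward one. The function name and the representation of a log as a sequence of page identifiers are assumptions.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal into maximal forward references.

    traversal: iterable of page identifiers in visit order (assumed format).
    """
    path = []         # current forward path from the starting page
    forward = False   # True while the most recent move extended the path
    result = []
    for page in traversal:
        if page in path:
            # Backward reference: the path so far is maximal only if we
            # were moving forward immediately before turning back.
            if forward:
                result.append(list(path))
            # Back up to the revisited page.
            del path[path.index(page) + 1:]
            forward = False
        else:
            path.append(page)
            forward = True
    # A traversal that ends on a forward move leaves one more reference.
    if forward and path:
        result.append(list(path))
    return result


refs = maximal_forward_references("ABCDCBEGHGWAOUOV")
# refs, joined per path: ABCD, ABEGH, ABEGW, AOU, AOV
```

The backward moves (D to C, C to B, and so on) produce no output of their own; they only shorten the current path, which is how this sketch filters out references made for ease of traveling.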
ProbView: A Flexible Probabilistic Database System, ACM Transactions on Database Systems, 1997
Cited by 202 (14 self)
... In this article, we characterize, using postulates, whole classes of strategies for conjunction, disjunction, and negation, meaningful from the viewpoint of probability theory. (1) We propose a probabilistic relational data model and a generic probabilistic relational algebra that neatly captures various strategies satisfying the postulates, within a single unified framework. (2) We show that as long as the chosen strategies can be computed in polynomial time, queries in the positive fragment of the probabilistic relational algebra have essentially the same data complexity as classical relational algebra. (3) We establish various containments and equivalences between algebraic expressions, similar in spirit to those in classical algebra. (4) We develop algorithms for maintaining materialized probabilistic views. (5) Based on these ideas, we have developed
CLARANS: A Method for Clustering Objects for Spatial Data Mining, IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 142 (0 self)
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. To this end, this paper has three main contributions. First, we propose a new clustering method called CLARANS, whose aim is to identify spatial structures that may be present in the data. Experimental results indicate that, when compared with existing clustering methods, CLARANS is very efficient and effective. Second, we investigate how CLARANS can handle not only point objects, but also polygon objects efficiently. One of the methods considered, called the IR-approximation, is very efficient in clustering convex and nonconvex polygon objects. Third, building on top of CLARANS, we develop two spatial data mining algorithms that aim to discover relationships between spatial and nonspatial attributes. Both algorithms can discover knowledge that is difficult to find with existing spatial data mining algorithms.
Data Mining for Path Traversal Patterns in a Web Environment, 1996
Cited by 130 (1 self)
In this paper, we explore a new data mining capability which involves mining path traversal patterns in a distributed information-providing environment like the World Wide Web. First, we convert the original sequence of log data into a set of maximal forward references and filter out the effect of some backward references which are mainly made for ease of traveling. Second, we derive algorithms to determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences: one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed.
Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization, 1996
Cited by 128 (9 self)
We discuss data mining based on association rules for two numeric attributes and one Boolean attribute. For example, in a database of bank customers, “Age” and “Balance” are two numeric attributes, and “CardLoan” is a Boolean attribute. Taking the pair (Age, Balance) as a point in two-dimensional space, we consider an association rule of the form $((\mathit{Age}, \mathit{Balance}) \in P) \Rightarrow (\mathit{CardLoan} = \mathit{Yes})$, which implies that bank customers whose ages and balances fall in a planar region P tend to use card loan with a high probability. We consider two classes of regions, rectangles and admissible (i.e. connected and x-monotone) regions. For each class, we propose efficient algorithms for computing the regions that give optimal association rules for gain, support, and confidence, respectively. We have implemented the algorithms for admissible regions, and constructed a system for visualizing the rules.