Sampling Large Databases for Association Rules
, 1996
Abstract - Cited by 330 (4 self)
Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
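The sample-then-verify idea in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. The function names, the brute-force itemset enumeration, the `lowered_support` parameter, and the `max_size` cap are all assumptions made for illustration, and the second pass for itemsets missed by the sample is omitted. The sketch mines candidate itemsets from a random sample at a lowered threshold, then counts those candidates exactly in one pass over the full database.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration of itemsets whose relative support
    in `transactions` is at least `min_support` (illustrative only)."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_support}


def sample_then_verify(db, min_support, sample_size, lowered_support, seed=0):
    """Hypothetical helper: find candidates in a sample, then verify
    them with exact counts from a single pass over the whole database."""
    rng = random.Random(seed)
    sample = rng.sample(db, sample_size)
    # Lowering the threshold on the sample makes it less likely that a
    # truly frequent itemset is missed because of sampling error.
    candidates = frequent_itemsets(sample, lowered_support)

    # Single pass over the full database: exact counts for candidates only.
    counts = dict.fromkeys(candidates, 0)
    for t in db:
        for c in candidates:
            if c <= t:
                counts[c] += 1

    n = len(db)
    return {c: counts[c] / n for c in candidates if counts[c] / n >= min_support}


db = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}] * 25
result = sample_then_verify(db, min_support=0.5, sample_size=40, lowered_support=0.4)
```

Every itemset in `result` carries its exact support in the full database, so nothing reported is an approximation. What the sample can cause is an omission, which is the case the abstract says is handled in a second pass.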
Mining optimized gain rules for numeric attributes
- Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
, 1999
Abstract - Cited by 11 (0 self)
Association rules are useful for determining correlations between attributes of a relation and have applications in marketing, financial and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes and the problem is to determine instantiations such that either the support, confidence or gain of the rule is maximized. In this paper, we generalize the optimized gain association rule problem by permitting rules to contain disjunctions over uninstantiated numeric attributes. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving the uninstantiated attribute. For rules containing a single numeric attribute, we present an algorithm with linear complexity for computing optimized gain rules. Furthermore, we propose a bucketing technique that can result in a significant reduction in input size by coalescing contiguous values without sacrificing optimality. We also present an approximation algorithm based on dynamic programming for two numeric attributes. Using recent results on binary space partitioning trees, we show that the approximations are within a constant factor of the optimal optimized gain rules. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithm scales up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a population survey real-life data set enables us to discover interesting underlying correlations among the attributes.
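The abstract does not define gain, so the sketch below rests on an assumption: the gain of a rule is taken to be the support of condition and consequent together minus a minimum-confidence threshold times the support of the condition, measured here in raw counts. Under that assumption, finding the single best interval of a numeric attribute becomes a maximum-sum contiguous range problem, solvable in one linear scan (Kadane's algorithm). This covers only the one-interval case and is not the paper's algorithm for multiple disjunctions. The function name and the bucket layout are hypothetical.

```python
def best_gain_interval(buckets, min_conf):
    """buckets: list of (value, n_condition, n_both) sorted by value, where
    n_condition counts tuples with that attribute value and n_both counts
    those that also satisfy the consequent.
    Returns (gain, low_value, high_value) for the contiguous value range
    with the largest total gain under the assumed definition."""
    best = None
    run_gain = 0.0
    run_start = None
    for value, n_cond, n_both in buckets:
        # Assumed per-bucket gain: n_both - min_conf * n_condition.
        g = n_both - min_conf * n_cond
        if run_start is None or run_gain <= 0:
            # A non-positive running total never helps; restart here.
            run_gain, run_start = g, value
        else:
            run_gain += g
        if best is None or run_gain > best[0]:
            best = (run_gain, run_start, value)
    return best


buckets = [(1, 10, 2), (2, 10, 8), (3, 10, 9), (4, 10, 1)]
print(best_gain_interval(buckets, min_conf=0.5))  # (7.0, 2, 3)
```

In this toy input only the values 2 and 3 have confidence above 0.5, so the scan selects the range 2 to 3.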
unknown title
Abstract
Discovery of association rules is an important data mining task. Several algorithms have been proposed to solve this problem. Most of them require repeated passes over the database, which incurs huge I/O overhead and, in parallel settings, high synchronization cost. A few algorithms try to reduce these costs, but they have weaknesses, such as often requiring a high pre-processing cost to obtain a vertical database layout and performing much redundant computation in parallel settings. We propose new association mining algorithms that overcome these drawbacks by minimizing the I/O cost and effectively controlling the computation cost. Experiments on well-known synthetic data show that our algorithms consistently outperform Apriori, one of the best algorithms for association mining, by factors ranging from 2 to 4 in most cases. Our algorithms are also easy to parallelize, and we present a parallelization for them based on a shared-nothing architecture. We observe that our parallel approach exploits parallelism more fully than two of the best existing parallel algorithms.

