Sampling-based Data Mining Algorithms: Modern Techniques and Case Studies
"... Abstract. Sampling a dataset for faster analysis and looking at it as a sample from an unknown distribution are two faces of the same coin. We discuss the use of modern techniques involving the Vapnik-Chervonenkis (VC) dimension to study the trade-off between sample size and accuracy of data mining ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract. Sampling a dataset for faster analysis and looking at it as a sample from an unknown distribution are two faces of the same coin. We discuss the use of modern techniques involving the Vapnik-Chervonenkis (VC) dimension to study the trade-off between sample size and accuracy of data mining results that can be obtained from a sample. We report two case studies where we and collaborators employed these techniques to develop efficient sampling-based algorithms for the problems of betweenness centrality computation in large graphs and extracting statistically significant frequent itemsets from transactional datasets.

1 Sampling the data and data as samples

There are two possible uses of sampling in data mining. On the one hand, sampling means selecting a small random portion of the data, which is then given as input to an algorithm. The output is an approximation of the results that would have been obtained if all available data had been analyzed but, thanks to the small size of the selected portion, the approximation can be obtained much more quickly. On the other hand, from a more statistically inclined point of view, the entire dataset can be seen as a collection of samples from an unknown distribution. In this case the goal of analyzing the data is to gain a better understanding of the unknown distribution. Both scenarios share the same underlying question: how well does the sample resemble the entire dataset or the unknown distribution? There is a trade-off between the size of the sample and the quality of the approximation that can be obtained from it. Given the randomness involved in the sampling process, this trade-off must be studied in a probabilistic setting. In this Nectar paper we discuss the use of techniques related to the Vapnik-Chervonenkis (VC) dimension of the problem at hand to analyze the trade-off between sample size and approximation quality, and we report two case studies where we and collaborators successfully employed these techniques to develop efficient algorithms for the problems of betweenness centrality computation in large graphs [8] (the "sampling the data" scenario) and extracting statistically significant frequent itemsets [10] (the "data as samples" scenario).
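To make the trade-off concrete, the sketch below shows the general shape of a VC-dimension-based sampling algorithm for betweenness centrality: a sample size is computed from the classic ε-approximation bound, and each sampled shortest path credits its interior vertices. The function names (`vc_sample_size`, `approx_betweenness`), the constant `c`, and the logarithmic stand-in for the VC dimension are assumptions made for illustration only; the algorithm in [8] uses a sharper, problem-specific analysis based on the vertex diameter of the graph.

```python
import math
import random
import networkx as nx  # assumed available; any shortest-path routine would do


def vc_sample_size(vd_bound, eps, delta, c=0.5):
    """Sample size sufficient for an eps-approximation of a range space whose
    VC dimension is at most vd_bound (classic bound; the constant c is an
    assumption, and the papers use a tighter, problem-specific analysis)."""
    return math.ceil((c / eps ** 2) * (vd_bound + math.log(1.0 / delta)))


def approx_betweenness(G, eps, delta):
    """Illustrative sampling-based betweenness estimate: repeatedly pick a
    random shortest path and credit its interior vertices. This mirrors the
    general idea of [8]; the actual algorithm and its VC bound differ."""
    # A logarithmic stand-in for the problem's VC dimension (assumption).
    vd_bound = math.floor(math.log2(len(G))) + 1
    r = vc_sample_size(vd_bound, eps, delta)
    btw = {v: 0.0 for v in G}
    nodes = list(G)
    for _ in range(r):
        u, w = random.sample(nodes, 2)
        try:
            path = random.choice(list(nx.all_shortest_paths(G, u, w)))
        except nx.NetworkXNoPath:
            continue
        for v in path[1:-1]:          # interior vertices of the sampled path
            btw[v] += 1.0 / r
    return btw
```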
Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages∗
, 2015
"... We present an algorithm to extract an high-quality approximation of the (top-k) Frequent itemsets (FIs) from random samples of a transactional dataset. With high probability the approximation is a superset of the FIs, and no itemset with frequency much lower than the threshold is included in it. The ..."
Abstract
- Add to MetaCart
We present an algorithm to extract a high-quality approximation of the (top-k) frequent itemsets (FIs) from random samples of a transactional dataset. With high probability the approximation is a superset of the FIs, and no itemset with frequency much lower than the threshold is included in it. The algorithm employs progressive sampling, with a stopping condition based on bounds on the empirical Rademacher average, a key concept from statistical learning theory. The computation of the bounds uses characteristic quantities that can be obtained efficiently with a single scan of the sample. Therefore, evaluating the stopping condition is fast and does not require expensive mining of each sample. Our experimental evaluation confirms the practicality of our approach on real datasets, outperforming approaches based on one-shot static sampling.
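As an illustration of the idea, the sketch below pairs a geometric progressive-sampling schedule with a stopping condition computed from a single pass over the sample. The Rademacher bound here applies Massart's finite-class lemma to single items only, which is a deliberate simplification: the paper derives a tighter bound from characteristic quantities of the sample. All function names, the deviation-bound constants, and the doubling schedule are assumptions for the sketch, not the paper's algorithm.

```python
import math
import random


def rademacher_bound_massart(sample, items):
    """Crude upper bound on the empirical Rademacher average via Massart's
    finite-class lemma, restricted (as a simplification) to the single items.
    The paper's bound is tighter and also comes from one scan of the sample."""
    m = len(sample)
    if m == 0:
        return float("inf")
    # Largest empirical L2 norm of an item's indicator function over the sample.
    max_norm = max(math.sqrt(sum(1 for t in sample if i in t)) for i in items)
    return max_norm * math.sqrt(2.0 * math.log(max(len(items), 2))) / m


def progressive_sampling_fis(dataset, eps_target, delta, initial_size=1000):
    """Sketch of the progressive-sampling loop: grow the sample until the
    deviation bound derived from the Rademacher average is small enough,
    then let the caller mine the final sample at a lowered frequency
    threshold. Names and constants are illustrative, not the paper's."""
    items = {i for t in dataset for i in t}
    m = initial_size
    while True:
        sample = [random.choice(dataset) for _ in range(m)]
        r_bound = rademacher_bound_massart(sample, items)
        # Standard-form uniform deviation bound (constants are an assumption):
        # sup |true frequency - sample frequency| <= eps with prob >= 1 - delta.
        eps = 2.0 * r_bound + math.sqrt(2.0 * math.log(2.0 / delta) / m)
        if eps <= eps_target:
            return sample, eps   # caller mines FIs at a threshold lowered by eps/2
        m *= 2                   # geometric sampling schedule
```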