Interestingness measures for data mining: a survey
- ACM Computing Surveys
"... Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to ..."
Abstract
-
Cited by 158 (2 self)
- Add to MetaCart
Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to be reduced. This survey reviews the interestingness measures for rules and summaries, classifies them from several perspectives, compares their properties, identifies their roles in the data mining process, gives strategies for selecting appropriate measures for applications, and identifies opportunities for future research in this area.
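To make the idea concrete, here is a minimal Python sketch (not taken from the survey) of three classic objective measures for an association rule A -> B, computed from the four cells of a 2x2 contingency table of transaction counts; the counts in the example are invented.

    def rule_measures(n_ab, n_a_notb, n_nota_b, n_nota_notb):
        """Counts of transactions with/without antecedent A and consequent B."""
        n = n_ab + n_a_notb + n_nota_b + n_nota_notb
        p_a = (n_ab + n_a_notb) / n     # P(A)
        p_b = (n_ab + n_nota_b) / n     # P(B)
        p_ab = n_ab / n                 # P(A and B)
        support = p_ab
        confidence = p_ab / p_a         # P(B | A)
        lift = p_ab / (p_a * p_b)       # 1.0 = statistical independence
        return support, confidence, lift

    # Hypothetical rule "bread -> butter" over 1000 transactions:
    print(rule_measures(200, 300, 300, 200))   # (0.2, 0.4, 0.8)

Here lift below 1 flags the rule as negatively correlated even though its 40% confidence looks respectable, exactly the kind of ranking judgment such measures are meant to automate.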
Detecting group differences: Mining contrast sets
- Data Mining and Knowledge Discovery, 2001
"... A fundamental task in data analysis is understanding the differences between several con-trasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mini ..."
Abstract
-
Cited by 109 (3 self)
- Add to MetaCart
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
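A minimal sketch of the statistical core, assuming a chi-square test of independence as the group-difference test and a Bonferroni-corrected significance cutoff; the paper's search and pruning machinery is not reproduced, and all names and counts below are invented.

    from scipy.stats import chi2_contingency

    def significant_contrast(group_counts, group_totals, n_tests, alpha=0.05):
        """group_counts[i]: rows of group i matching the contrast set;
        group_totals[i]: size of group i; n_tests: hypotheses tested."""
        table = [group_counts,
                 [t - c for c, t in zip(group_counts, group_totals)]]
        _, p_value, _, _ = chi2_contingency(table)
        return p_value < alpha / n_tests      # Bonferroni-corrected cutoff

    # Hypothetical contrast set matching 120/1000 males vs 210/1000 females,
    # with 500 candidate sets examined during the search:
    print(significant_contrast([120, 210], [1000, 1000], n_tests=500))  # True

Dividing alpha by the number of tests is what bounds the error rate of the entire analysis rather than of each test in isolation.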
Bayesian Network Anomaly Pattern Detection for Disease Outbreaks
- In Proceedings of the Twentieth International Conference on Machine Learning, 2003
"... Early disease outbreak detection systems typically monitor health care data for irregularities by comparing the distribution of recent data against a baseline distribution. Determining the baseline is dicult due to the presence of dierent trends in health care data, such as trends caused by th ..."
Abstract
-
Cited by 48 (6 self)
- Add to MetaCart
Early disease outbreak detection systems typically monitor health care data for irregularities by comparing the distribution of recent data against a baseline distribution. Determining the baseline is difficult due to the presence of different trends in health care data, such as trends caused by the day of week and by seasonal variations in temperature and weather. Creating the baseline distribution without taking these trends into account can lead to unacceptably high false positive counts and slow detection times.
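A toy illustration of the baseline problem (invented counts, and a simple z-score rather than the paper's Bayesian-network method): pooling quiet weekdays with busy weekends inflates the baseline variance, so a genuine weekday spike barely registers.

    from statistics import mean, stdev

    # Invented daily emergency-department visit counts.
    weekday = [40, 42, 39, 41, 38, 40, 43, 39, 41, 40]
    weekend = [70, 72, 69, 71, 68, 73, 70, 69, 72, 71]

    def z_score(count, baseline):
        return (count - mean(baseline)) / stdev(baseline)

    spike = 60                                  # outbreak-sized count for a weekday
    print(z_score(spike, weekday + weekend))    # ~0.3: masked by the pooled baseline
    print(z_score(spike, weekday))              # ~13: obvious against weekday history

Ordinary weekends also sit well above the pooled mean of about 55, which is the false-positive side of the same failure.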
Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks
- In Proceedings of the 18th National Conference on Artificial Intelligence, 2002
"... This paper presents an algorithm for performing early detection of disease outbreaks by searching a database of emergency department cases for anomalous patterns. ..."
Abstract
-
Cited by 42 (4 self)
- Add to MetaCart
(Show Context)
This paper presents an algorithm for performing early detection of disease outbreaks by searching a database of emergency department cases for anomalous patterns.
TFP: An Efficient Algorithm for Mining Top-K Frequent Closed Itemsets
- IEEE Trans. on Knowledge and Data Engineering, 2005
"... Abstract—Frequent itemset mining has been studied extensively in literature. Most previous studies require the specification of a min_support threshold and aim at mining a complete set of frequent itemsets satisfying min_support. However, in practice, it is difficult for users to provide an appropri ..."
Abstract
-
Cited by 31 (2 self)
- Add to MetaCart
(Show Context)
Frequent itemset mining has been studied extensively in the literature. Most previous studies require the specification of a min_support threshold and aim at mining a complete set of frequent itemsets satisfying min_support. However, in practice, it is difficult for users to provide an appropriate min_support threshold. In addition, a complete set of frequent itemsets is much less compact than a set of frequent closed itemsets. In this paper, we propose an alternative mining task: mining top-k frequent closed itemsets of length no less than min_l, where k is the desired number of frequent closed itemsets to be mined, and min_l is the minimal length of each itemset. An efficient algorithm, called TFP, is developed for mining such itemsets without min_support. Starting at min_support = 0 and by making use of the length constraint and the properties of top-k frequent closed itemsets, min_support can be raised effectively and the FP-tree can be pruned dynamically both during and after the construction of the tree using our two proposed methods: closed node count and descendant_sum. Moreover, mining is further sped up by employing a top-down and bottom-up combined FP-tree traversing strategy, a set of search space pruning methods, a fast two-level hash-indexed result tree, and a novel closed itemset verification scheme. Our extensive performance study shows that TFP has high performance and linear scalability in terms of the database size.
Index Terms: Data mining, frequent itemset, association rules, mining methods and algorithms.
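A heavily simplified sketch of the support-raising idea alone, using naive depth-first enumeration over an in-memory transaction list: a min-heap holds the current top k, its smallest support acts as the dynamically raised min_support, and any branch at or below that threshold is pruned because support is anti-monotone. The FP-tree, closedness checking, and the closed node count and descendant_sum methods are not reproduced here; all names and the toy database are invented.

    import heapq

    def top_k_frequent(transactions, k, min_l=1):
        items = sorted({i for t in transactions for i in t})
        heap = []                         # min-heap of (support, itemset), size <= k

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)

        def threshold():                  # the dynamically raised min_support
            return heap[0][0] if len(heap) == k else 0

        def dfs(itemset, start, sup):
            if len(itemset) >= min_l:
                entry = (sup, tuple(sorted(itemset)))
                if len(heap) < k:
                    heapq.heappush(heap, entry)
                elif sup > heap[0][0]:
                    heapq.heapreplace(heap, entry)
            for j in range(start, len(items)):
                ext = itemset | {items[j]}
                s = support(ext)
                if s > threshold():       # supersets can only have lower support
                    dfs(ext, j + 1, s)

        dfs(frozenset(), 0, len(transactions))
        return sorted(heap, reverse=True)

    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(top_k_frequent(db, k=3, min_l=2))   # three itemsets, each with support 3

The point of the sketch is visible in threshold(): once k results are in hand, the weakest of them becomes the effective min_support, so the search space shrinks as mining proceeds even though the user never supplied a threshold.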
Condensing Uncertainty via Incremental Treatment Learning
- Annals of Software Engineering, special issue on Computational Intelligence (to appear), 2002
"... Models constrain the range of possible behaviors de£ned for a domain. When parts of a model are uncertain, the possible behaviors may be a data cloud: i.e. an overwhelming range of possibilities that bewilder an analyst. Faced with large data clouds, it is hard to demonstrate that any particular de ..."
Abstract
-
Cited by 24 (20 self)
- Add to MetaCart
Models constrain the range of possible behaviors defined for a domain. When parts of a model are uncertain, the possible behaviors may be a data cloud: i.e. an overwhelming range of possibilities that bewilder an analyst. Faced with large data clouds, it is hard to demonstrate that any particular decision leads to a particular outcome. Even if we can’t make definite decisions from such models, it is possible to find decisions that reduce the variance of values within a data cloud. Also, it is possible to change the range of these future behaviors such that the cloud condenses to some improved mode. Our approach uses two tools. Firstly, a model simulator is constructed that knows the range of possible values for uncertain parameters. Secondly, the TAR2 treatment learner uses the output from the simulator to incrementally learn better constraints. In our incremental treatment learning cycle, users review newly discovered treatments before they are added to a growing pool of constraints used by the model simulator.
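A toy sketch of that incremental cycle, not of TAR2 itself: the model, its parameters, and the half-range candidate treatments below are all invented. Each round simulates under the current constraints, picks the single parameter restriction whose runs have the best mean output, and adds it to the constraint pool (where a user would review it first).

    import random, statistics

    def model(x, y, z):                       # invented model with uncertain inputs
        return x * y - 4 * abs(z - 0.5)

    PARAMS = {"x": (0, 1), "y": (0, 1), "z": (0, 1)}

    def simulate(constraints, n=2000):
        runs = []
        for _ in range(n):
            v = {p: random.uniform(*constraints.get(p, PARAMS[p])) for p in PARAMS}
            runs.append((v, model(**v)))
        return runs

    def best_treatment(runs):
        """Pick the single half-range restriction with the best mean output."""
        base = statistics.mean(out for _, out in runs)
        best, best_gain = None, 0.0
        for p, (lo, hi) in PARAMS.items():
            mid = (lo + hi) / 2
            for band in ((lo, mid), (mid, hi)):
                sub = [out for v, out in runs if band[0] <= v[p] < band[1]]
                if len(sub) > 30 and statistics.mean(sub) - base > best_gain:
                    best, best_gain = (p, band), statistics.mean(sub) - base
        return best

    constraints = {}
    for round_no in range(3):                 # the incremental cycle
        treatment = best_treatment(simulate(constraints))
        if treatment is None:                 # nothing condenses the cloud further
            break
        p, band = treatment                   # a user would review this first
        constraints[p] = band
        outs = [out for _, out in simulate(constraints)]
        print(round_no, p, band, round(statistics.mean(outs), 3))

Each accepted treatment narrows one input range, so the simulated output cloud condenses round by round toward an improved mode, which is the behavior the abstract describes.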
Multivariate discretization of continuous variables for set mining
- In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000
"... Many algorithms in data mining can be formulated as a set mining problem where the goal is to nd conjunctions (or disjunctions) of terms that meet user speci ed constraints. Set mining techniques have been largely designed for categorical or discrete data where variables can only take on a xed numbe ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
(Show Context)
Many algorithms in data mining can be formulated as a set mining problem where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set mining techniques have been largely designed for categorical or discrete data where variables can only take on a fixed number of values. However, many data sets also contain continuous variables, and a common method of dealing with these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with the class variable). We argue that this is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in data. Discretization should consider the effects on all variables in the analysis: two regions X and Y should be in the same cell after discretization only if the instances in those regions have similar multivariate distributions (F_X ≈ F_Y) across all variables and combinations of variables. We present a bottom-up merging algorithm to discretize continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it does not destroy hidden patterns, and that it generates meaningful intervals.
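A hedged sketch of the bottom-up merging rule, testing distributional similarity per remaining variable with two-sample Kolmogorov-Smirnov tests rather than the paper's multivariate criterion; the function, the alpha cutoff, and the synthetic data are all assumptions for illustration.

    import numpy as np
    from scipy.stats import ks_2samp

    def merge_adjacent_bins(X, col, edges, alpha=0.05):
        """X: (n, d) array; col: index of the variable being discretized."""
        edges = list(edges)
        i = 0
        while i < len(edges) - 2:
            left = X[(X[:, col] >= edges[i]) & (X[:, col] < edges[i + 1])]
            right = X[(X[:, col] >= edges[i + 1]) & (X[:, col] < edges[i + 2])]
            similar = (
                len(left) > 0 and len(right) > 0 and
                all(ks_2samp(left[:, j], right[:, j]).pvalue > alpha
                    for j in range(X.shape[1]) if j != col)
            )
            if similar:
                del edges[i + 1]          # drop the shared boundary: merge
            else:
                i += 1
        return edges

    rng = np.random.default_rng(0)
    # Synthetic data: variable 1 changes distribution only where variable 0 > 0.5.
    x0 = rng.uniform(0, 1, 1000)
    x1 = np.where(x0 > 0.5, rng.normal(2, 1, 1000), rng.normal(0, 1, 1000))
    X = np.column_stack([x0, x1])
    print(merge_adjacent_bins(X, col=0, edges=np.linspace(0, 1, 11)))
    # most interior cut points merge away; the boundary near 0.5 survives

Note how a univariate discretizer looking only at x0 (which is uniform) would see no reason to cut at 0.5 at all; the hidden pattern lives in the joint distribution.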
Temporal sequence associations for rare events
- In Proceedings of PAKDD’04, 2004
"... Abstract. In many real world applications, systematic analysis of rare events, such as credit card frauds and adverse drug reactions, is very important. Their low occurrence rate in large databases often makes it difficult to identify the risk factors from straightforward application of associations ..."
Abstract
-
Cited by 18 (9 self)
- Add to MetaCart
(Show Context)
In many real-world applications, systematic analysis of rare events, such as credit card frauds and adverse drug reactions, is very important. Their low occurrence rate in large databases often makes it difficult to identify the risk factors from straightforward application of association and sequential pattern discovery. In this paper we introduce a heuristic to guide the search for interesting patterns associated with rare events from large temporal event sequences. Our approach combines association and sequential pattern discovery with a measure of risk borrowed from epidemiology to assess the interestingness of the discovered patterns. In the experiments, we successfully identify a known drug and several new drug combinations with high risk of adverse reactions. The approach is also applicable to other applications where rare events are of primary interest.
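A minimal sketch of the epidemiological idea, assuming relative risk as the concrete measure (the paper's exact choice may differ in detail); all counts below are hypothetical.

    def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
        """Risk of the adverse event among sequences containing the pattern,
        divided by the risk among sequences without it."""
        risk_exposed = exposed_cases / exposed_total
        risk_unexposed = unexposed_cases / unexposed_total
        return risk_exposed / risk_unexposed

    # Example: a drug-pair pattern appears in 400 patient histories, 24 of which
    # show the adverse reaction, versus 48 reactions in 9600 histories without it.
    print(relative_risk(24, 400, 48, 9600))   # 12.0: a strong risk signal

Because it is a ratio of rates rather than a raw support count, this kind of measure stays informative even when the event itself is rare, which is exactly what plain support thresholds fail at.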
Optimizing Requirements Decisions With KEYS
- 2008
"... ... for external access to five of JPL’s real-world requirements models, anonymized to conceal proprietary information, but retaining their computational nature. Experimentation with these models, reported herein, demonstrates a dramatic speedup in the computations performed on them. These models ha ..."
Abstract
-
Cited by 17 (8 self)
- Add to MetaCart
... for external access to five of JPL’s real-world requirements models, anonymized to conceal proprietary information, but retaining their computational nature. Experimentation with these models, reported herein, demonstrates a dramatic speedup in the computations performed on them. These models have a well-defined goal: select mitigations that retire risks, which in turn increases the number of attainable requirements. Such a non-linear optimization is a well-studied problem. However, identification of not only (a) the optimal solution(s) but also (b) the key factors leading to them is less well studied. Our technique, called KEYS, shows a rapid way of simultaneously identifying the solutions and their key factors. KEYS improves on prior work by several orders of magnitude. Prior experiments with simulated annealing or treatment learning took tens of minutes to hours to terminate. KEYS runs much faster than that; e.g., for one model, KEYS ran 13,000 times faster than treatment learning (40 minutes versus 0.18 seconds). Processing these JPL models is a non-linear optimization problem: the fewest mitigations must be selected while achieving the most requirements. With this paper, we challenge other members of the PROMISE community to improve on our results with other techniques.
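To pin down the shape of the optimization problem (not the KEYS algorithm itself), here is a toy greedy sketch with invented mitigations, risks, and requirements: each round picks the unchosen mitigation whose retired risks unlock the most additional requirements.

    def greedy_mitigations(blocks, retires, budget):
        """blocks[req]: risks blocking req; retires[m]: risks mitigation m retires."""
        retired, chosen = set(), []

        def attained(risk_set):
            return sum(1 for risks in blocks.values() if risks <= risk_set)

        for _ in range(budget):
            best, best_gain = None, 0
            for m, risks in retires.items():
                if m not in chosen:
                    gain = attained(retired | risks) - attained(retired)
                    if gain > best_gain:
                        best, best_gain = m, gain
            if best is None:          # no single mitigation helps any further
                break
            chosen.append(best)
            retired |= retires[best]
        return chosen, attained(retired)

    blocks = {"r1": {"a"}, "r2": {"a", "b"}, "r3": {"c"}, "r4": {"d"}}
    retires = {"m1": {"a", "b"}, "m2": {"c"}, "m3": {"d"}}
    print(greedy_mitigations(blocks, retires, budget=2))  # (['m1', 'm2'], 3)

Greedy selection can miss optima when mitigations interact, which is one reason search-based methods such as simulated annealing, treatment learning, and KEYS are used on these models instead.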
Characterizing model errors and differences
- Proceedings of the Seventeenth International Conference on Machine Learning, 2000
"... A critical component of applying machine learning algorithms is evaluating the performance of the models induced and using the evaluation to guide further development. Traditionally the most common evaluation metric is error or loss, however this provides very little information for the designer to ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
(Show Context)
A critical component of applying machine learning algorithms is evaluating the performance of the models induced and using the evaluation to guide further development. Traditionally, the most common evaluation metric is error or loss; however, this provides very little information for the designer to use when constructing a system. We argue that an evaluation method should provide detailed feedback on the performance of an algorithm and that this feedback should be in the language of the problem: our goal is to characterize model errors or the differences between models in the feature space. We provide a framework for this that allows different algorithms to be used as the discovery engine, and we consider two approaches: (1) a classification strategy where we use a standard rule learner such as C5; (2) a descriptive paradigm where we use a new discovery algorithm: a contrast set miner. We show that C5 suffers from several problems that make it unsuitable for this task.
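One hedged way to instantiate the framework, substituting scikit-learn models and a shallow decision tree for the paper's C5 and contrast-set miner: label each instance by whether two trained models disagree on it, then learn rules over the original features that describe the disagreement region. The dataset and model choices are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_breast_cancer(return_X_y=True)
    model_a = RandomForestClassifier(random_state=0).fit(X, y)
    model_b = LogisticRegression(max_iter=10000).fit(X, y)

    # Meta-label: 1 where the two models disagree, 0 where they agree.
    disagree = (model_a.predict(X) != model_b.predict(X)).astype(int)

    # A shallow tree on the meta-label yields readable feature-space rules
    # describing where the models differ.
    explainer = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, disagree)
    print(export_text(explainer, feature_names=list(load_breast_cancer().feature_names)))

The output is feedback "in the language of the problem": regions of feature space where the models diverge, rather than a single aggregate error number.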