Results 1–10 of 58
Local probabilistic models for link prediction
In ICDM, 2007
Cited by 56 (0 self)
One of the core tasks in social network analysis is to predict the formation of links (i.e., various types of relationships) over time. Previous research has generally represented the social network in the form of a graph and has leveraged topological and semantic measures of similarity between two nodes to evaluate the probability of link formation. Here we introduce a novel local probabilistic graphical model method that can scale to large graphs to estimate the joint co-occurrence probability of two nodes. Such a probability measure captures information that is not captured by either topological measures or measures of semantic similarity, which are the dominant measures used for link prediction. We demonstrate the effectiveness of the co-occurrence probability feature by using it both in isolation and in combination with other topological and semantic features for predicting co-authorship collaborations on three real datasets.
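For contrast with the paper's co-occurrence probability feature, the topological similarity measures it mentions are easy to sketch. The graph, edge list, and scoring functions below are illustrative assumptions, not the paper's implementation; common neighbors and Adamic–Adar are standard baselines from the link-prediction literature.

```python
# Sketch of two standard topological link-prediction scores on a toy
# undirected graph; this is NOT the paper's local probabilistic model.
import math
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    # Common neighbors weighted inversely by their log-degree.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
adj = build_adjacency(edges)
print(common_neighbors(adj, "a", "d"))  # b and c -> 2
```

A learned model would use such scores as features for each candidate node pair; the paper's point is that its co-occurrence probability adds signal these purely topological scores miss.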
A survey on condensed representations for frequent sets
In: Constraint-Based Mining and Inductive Databases, Springer-Verlag, LNAI, 2005
Cited by 38 (4 self)
Solving inductive queries that must return complete collections of patterns satisfying a given predicate has been studied extensively over the last few years. The specific problem of frequent set mining from potentially huge Boolean matrices has given rise to tens of efficient solvers. Frequent sets are indeed useful for many data mining tasks, including the popular association rule mining task but also feature construction, association-based classification, clustering, etc. Research in this area has been boosted by the fascinating concept of condensed representations w.r.t. frequency queries. Such representations can be used to support the discovery of every frequent set and its support without looking back at the data. Interestingly, the size of a condensed representation can be several orders of magnitude smaller than the size of the frequent set collection. Most proposals concern exact representations, while it is also possible to consider approximate ones, i.e., to trade computational complexity for a bounded approximation on the computed support values. This paper surveys the core concepts used in recent work on condensed representations for frequent sets.
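The core trick behind one such condensed representation (closed sets) can be sketched in a few lines: the support of any frequent set equals the largest support among its closed supersets, so storing only the closed sets loses nothing. The toy closed sets and supports below are assumed for illustration.

```python
# Sketch of support derivation from a condensed (closed-set)
# representation; the closed sets here are a hand-picked toy example.
closed_sets = {
    frozenset("a"): 4,
    frozenset("bc"): 3,
    frozenset("abc"): 2,
}

def derived_support(itemset):
    """Support of any set = max support over its closed supersets."""
    x = frozenset(itemset)
    return max(s for c, s in closed_sets.items() if x <= c)

print(derived_support("b"))   # 3: the closure of {b} is {b, c}
print(derived_support("ab"))  # 2: the closure of {a, b} is {a, b, c}
```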
Mining Statistically Important Equivalence Classes and Delta-Discriminative Emerging Patterns
2007
Cited by 38 (2 self)
The support-confidence framework is the most common measure used in itemset mining algorithms, for its anti-monotonicity that effectively simplifies the search lattice. This computational convenience brings both quality and statistical flaws to the results, as observed by many previous studies. In this paper, we introduce a novel algorithm that produces itemsets with ranked statistical merits under sophisticated test statistics such as chi-square, risk ratio, odds ratio, etc. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, itemsets within an equivalence class all share the same level of statistical significance regardless of the variety of test statistics. As an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine just closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm on two aspects. In general, we compare to LCM and FPclose, the best algorithms tailored for mining only closed patterns. In particular, we compare to epMiner, the most recent algorithm for mining a type of relative risk patterns known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes even multiple orders of magnitude faster. These statistically ranked patterns and this efficiency have high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
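The equivalence-class notion above is easy to make concrete: itemsets with the same set of supporting transactions form one class, and its closed pattern is the intersection of those transactions. The toy transactions below are an assumption for illustration; this is not the paper's mining algorithm.

```python
# Sketch of equivalence classes of itemsets sharing one tidset;
# the transactions are a toy example, not from the paper.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "d"},
]

def tidset(itemset):
    """Indices of transactions containing every item of the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)

def closure(itemset):
    """Closed pattern of the class: intersect the supporting transactions."""
    tids = tidset(itemset)
    return frozenset(set.intersection(*(transactions[i] for i in tids)))

# {b} and {c} occur in exactly transactions 0 and 1, so they are
# generators of the same class, whose closed pattern is {a, b, c}.
print(sorted(closure({"b"})))  # ['a', 'b', 'c']
```

Because every member of a class has the same support (and the same contingency table against any class label), any test statistic assigns the whole class one score, which is the paper's key observation.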
Summarizing Itemset Patterns Using Probabilistic Models
2006
Cited by 31 (2 self)
In this paper, we propose a novel probabilistic approach to summarize frequent itemset patterns. Such techniques are useful for summarization, post-processing, and end-user interpretation, particularly for problems where the resulting set of patterns is huge. In our approach, items in the dataset are modeled as random variables. We then construct a Markov Random Field (MRF) on these variables based on frequent itemsets and their occurrence statistics. The summarization proceeds in a level-wise iterative fashion. Occurrence statistics of itemsets at the lowest level are used to construct an initial MRF. Statistics of itemsets at the next level can then be inferred from the model. We use those patterns whose occurrence cannot be accurately inferred from the model to augment the model, repeating the procedure until all frequent itemsets can be modeled. The resulting MRF model affords a concise and useful representation of the original collection of itemsets. An extensive empirical study on real datasets shows that the new approach can effectively summarize a large number of itemsets and typically significantly outperforms extant approaches.
Mining rank-correlated sets of numerical attributes
In KDD’06, 2006
Cited by 20 (2 self)
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank-based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall’s tau and of Spearman’s footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.
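The rank-based idea can be illustrated with plain Kendall's tau on a pair of attributes: score how often object pairs are ordered the same way under both attributes. This is the textbook statistic only, not the paper's extended multi-attribute support measures.

```python
# Sketch of Kendall's tau over object pairs (ties ignored);
# the paper generalizes such rank measures to sets of attributes.
from itertools import combinations

def kendall_tau(xs, ys):
    """(concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0, identical rankings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0, reversed rankings
```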
Tell me what I need to know: Succinctly summarizing data with itemsets
In Proc. KDD, 2011
Cited by 17 (3 self)
Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly summarizing data with a collection of itemsets; using a probabilistic maximum entropy model, we iteratively find the most interesting itemset, and in turn update our model of the data accordingly. As we only include itemsets that are surprising with regard to the current model, the summary is guaranteed to be both descriptive and non-redundant. The algorithm that we present can either mine the top-k most interesting itemsets, or use the Bayesian Information Criterion to automatically identify the model containing only the itemsets most important for describing the data. In other words, it will ‘tell you what you need to know’. Experiments on synthetic and benchmark data show that the discovered summaries are succinct, and correctly identify the key patterns in the data. The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet non-redundant itemsets.
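The iterative loop the abstract describes can be sketched with a simple independence model standing in for the paper's maximum entropy model: repeatedly pick the candidate itemset whose observed frequency deviates most from the model's expectation. The transactions, candidates, and "update by recording" step are all simplifying assumptions.

```python
# Toy version of iterative "most surprising first" summarization;
# a real maximum entropy model replaces the independence assumption.
from functools import reduce

transactions = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
candidates = [frozenset(s) for s in ({"a", "b"}, {"a", "c"}, {"b", "c"})]

def freq(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def expected(itemset):
    # Independence model: product of single-item frequencies.
    return reduce(lambda p, i: p * freq({i}), itemset, 1.0)

summary, pool = [], list(candidates)
for _ in range(2):  # mine the two most surprising itemsets
    best = max(pool, key=lambda s: abs(freq(s) - expected(s)))
    summary.append(best)
    pool.remove(best)

print([sorted(s) for s in summary])  # [['a', 'c'], ['a', 'b']]
```

In the real algorithm the model is re-fit after each pick, so an itemset explained by earlier picks stops being surprising; that is what makes the summary non-redundant.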
Tell me something I don’t know: randomization strategies for iterative data mining.
In Proc. KDD’09, 2009
Cited by 16 (2 self)
There is a wide variety of data mining methods available, and in exploratory data analysis it is generally useful to apply many different methods to the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, clustering can give an indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.
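The basic local-swap move on binary data is simple to sketch: exchange a 2x2 "checkerboard" submatrix, which preserves every row and column sum. The paper wraps such swaps in Metropolis sampling to also respect other statistics; the plain version below, with a toy matrix, only illustrates the margin-preserving move.

```python
# Sketch of margin-preserving local swaps on a 0/1 matrix;
# NOT the full Metropolis sampler of the paper.
import random

def swap_randomize(matrix, n_swaps, seed=0):
    rng = random.Random(seed)
    m = [row[:] for row in matrix]          # work on a copy
    rows, cols = len(m), len(m[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(rows), 2)
        c1, c2 = rng.sample(range(cols), 2)
        # Swap only on a checkerboard: 1 0 / 0 1 or 0 1 / 1 0.
        if (m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1]
                and m[r1][c1] != m[r1][c2]):
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m

data = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 0]]
shuffled = swap_randomize(data, 1000)
# Row and column sums are unchanged by construction.
print([sum(r) for r in shuffled], [sum(c) for c in zip(*shuffled)])
```

Comparing a statistic on the original matrix against its distribution over many such randomized copies gives the significance test the paper builds on.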
Directly Mining Descriptive Patterns
In SIAM SDM, 2012
Cited by 15 (4 self)
Mining small, useful, high-quality sets of patterns has recently become an important topic in data mining. The standard approach is to first mine many candidates and then select a good subset. However, the pattern explosion generates such enormous numbers of candidates that by post-processing it is virtually impossible to analyse dense or large databases in any detail. We introduce Slim, an anytime algorithm for mining high-quality sets of itemsets directly from data. We use MDL to identify the best set of itemsets as the set that describes the data best. To approximate this optimum, we iteratively use the current solution to determine which itemset would provide the most gain, estimating quality using an accurate heuristic. Without requiring a pre-mined candidate collection, Slim is parameter-free in both theory and practice. Experiments show we mine high-quality pattern sets; while evaluating orders of magnitude fewer candidates than our closest competitor, Krimp, we obtain much better compression ratios, closely approximating the locally optimal strategy. Classification experiments independently verify that we characterise data very well.
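The MDL idea behind the Krimp/Slim line of work can be sketched in a heavily simplified form: encode each transaction with codes from a code table, give each code a Shannon-optimal length from its usage, and score a candidate itemset by how much it shrinks the total. The greedy cover, the toy data, and the simple length function below are all assumptions; the real scheme also charges for the code table itself.

```python
# Toy MDL scoring of a candidate itemset; the real Krimp/Slim
# encoding is more involved (it also prices the code table).
import math
from collections import Counter

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]

def encoded_bits(code_table):
    """Greedy cover + Shannon code lengths from code usages."""
    usage = Counter()
    for t in transactions:
        rest = set(t)
        for code in code_table:           # larger codes listed first
            if code <= rest:
                usage[code] += 1
                rest -= code
    total = sum(usage.values())
    return sum(u * -math.log2(u / total) for u in usage.values())

singletons = [frozenset(i) for i in ("a", "b", "c")]
base = encoded_bits(singletons)
with_ab = encoded_bits([frozenset("ab")] + singletons)
print(with_ab < base)   # True: coding {a, b} jointly compresses this data
```

Slim's contribution is to estimate such gains for candidate merges directly from the current cover, so no pre-mined candidate collection is needed.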
ZART: A Multifunctional Itemset Mining Algorithm
Cited by 13 (8 self)
In this paper, we present the Coron platform, a domain-independent, multi-purpose data mining platform incorporating a rich collection of data mining algorithms. One of these algorithms is a multifunctional itemset mining algorithm called Zart, which is based on the Pascal algorithm, with some additional features. In particular, Zart is able to perform the following, usually independent, tasks: identify frequent closed itemsets and associate generators with their closures. This allows one to find minimal non-redundant association rules. At present, Coron appears to be an original working platform, integrating efficient algorithms for both itemset and association rule extraction, and allowing a number of auxiliary operations for preparing and filtering data and for interpreting the extracted units of knowledge.
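Why generator/closure pairs give minimal non-redundant rules can be sketched directly: each pair yields a rule generator => (closure \ generator) whose antecedent is as small, and whose consequent as large, as possible for that support and confidence. The toy transactions and helper below are illustrative assumptions, not Zart's implementation.

```python
# Sketch of a minimal non-redundant rule from a generator/closure
# pair; the transactions are a toy example.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "d"}, {"b", "c"}]

def support(itemset):
    return sum(itemset <= t for t in transactions)

def rule(generator, closure):
    antecedent = frozenset(generator)
    consequent = frozenset(closure) - antecedent
    conf = support(closure) / support(antecedent)
    return antecedent, consequent, conf

# {a, b} is a generator of the closed itemset {a, b, c} here:
# no strict subset of it has the same supporting transactions.
ante, cons, conf = rule({"a", "b"}, {"a", "b", "c"})
print(sorted(ante), "=>", sorted(cons), conf)   # ['a', 'b'] => ['c'] 1.0
```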
Revisiting Numerical Pattern Mining with Formal Concept Analysis
Cited by 12 (4 self)
We propose a definition of interval patterns for numerical data. Intuitively, each object of a numerical dataset corre…
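Although the abstract is cut short in the source, the interval-pattern notion it introduces can be sketched from the formal concept analysis setting it names: a set of numerical objects is described, per attribute, by the smallest interval covering all their values. The data below is an assumed toy example.

```python
# Sketch of an interval pattern: per-attribute [min, max] intervals
# covering a set of numerical objects (toy data, not from the paper).
def interval_pattern(objects):
    """Componentwise smallest covering interval per attribute."""
    return [(min(col), max(col)) for col in zip(*objects)]

# Three objects over two numerical attributes.
objects = [(5, 7), (6, 8), (4, 8)]
print(interval_pattern(objects))   # [(4, 6), (7, 8)]
```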