Results 11-20 of 127
Blocking anonymity threats raised by frequent itemset mining
In ICDM, 2005
Cited by 28 (3 self)
In this paper we study when the disclosure of data mining results represents, per se, a threat to the anonymity of the individuals recorded in the analyzed database. The novelty of our approach is that we focus on an objective definition of privacy compliance of patterns without any reference to a preconceived knowledge of what is sensitive and what is not, on the basis of the rather intuitive and realistic constraint that the anonymity of individuals should be guaranteed. In particular, the problem addressed here arises from the possibility of inferring from the output of frequent itemset mining (i.e., a set of itemsets with support larger than a threshold σ), the existence of patterns with very low support (smaller than an anonymity threshold k)[3]. In the following we develop a simple methodology to block such inference opportunities by introducing distortion on the dangerous patterns.
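The inference risk described in this abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes one simple kind of inference, where the published supports of an itemset X and a superset Y reveal that supp(X) - supp(Y) transactions contain X but not all of Y. The function names, the brute-force miner, and the use of `>=` for the support threshold are all assumptions made for illustration.

```python
from itertools import combinations


def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, sigma):
    """Brute-force miner: all itemsets with support >= sigma (assumed convention)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(transactions, candidate)
            if s >= sigma:
                result[candidate] = s
    return result


def low_support_inferences(frequent, k):
    """Find pairs X < Y of published itemsets whose supports differ by d, 0 < d < k.

    d is the number of transactions that contain X but not all of Y,
    so a small positive d points at fewer than k individuals.
    """
    threats = []
    for x, sx in frequent.items():
        for y, sy in frequent.items():
            if x < y and 0 < sx - sy < k:
                threats.append((x, y, sx - sy))
    return threats
```

For example, with five transactions {a, b} and one transaction {a}, mining at sigma = 3 publishes supp(a) = 6 and supp(ab) = 5, from which a reader can deduce that exactly one record contains a without b.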
Discovering Shared Conceptualizations in Folksonomies
Cited by 27 (0 self)
Social bookmark tools are rapidly emerging on the Web. In such systems users are setting up lightweight conceptual structures called folksonomies. Unlike ontologies, shared conceptualisations are not formalised, but rather implicit. We present a new data mining task, the mining of all frequent triconcepts, together with an efficient algorithm, for discovering these implicit shared conceptualisations. Our approach extends the data mining task of discovering all closed itemsets to three-dimensional data structures to allow for mining folksonomies. We provide a formal definition of the problem, and present an efficient algorithm for its solution. Finally, we show the applicability of our approach on three large real-world examples.
Anonymity preserving pattern discovery
 The International Journal on Very Large Data Bases (VLDB
, 2008
Cited by 26 (4 self)
Towards low-perturbation anonymity preserving pattern discovery
Summarization - compressing data into an informative representation
In Proc. 5th IEEE Int'l Conf. Data Mining (ICDM), IEEE CS, 2005
Cited by 26 (3 self)
In this paper, we formulate the problem of summarization of a dataset of transactions with categorical attributes as an optimization problem involving two objective functions: compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent itemsets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Offline Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.
On Inverse Frequent Set Mining
2003
Cited by 26 (2 self)
Frequent set mining is a well-known technique to summarize binary data. However, it is an open problem how difficult it is to invert frequent set mining, i.e., how difficult it is to find a binary data set that is compatible with the results of frequent set mining, the frequent sets. This inverse data mining problem is related to the questions of how well privacy is preserved in the frequent sets and how well the frequent sets characterize the original data set. In this paper we analyze the computational complexity of the problem of finding a binary data set compatible with a given collection of frequent sets and show that in many cases the problem is computationally very difficult.
A Survey on Algorithms for Mining Frequent Itemsets over Data Streams
Cited by 24 (0 self)
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.
Geometric and Combinatorial Tiles in 0-1 Data
In: Proceedings PKDD’04. Volume 3202 of LNAI, 2004
Cited by 22 (0 self)
In this paper we introduce a simple probabilistic model, hierarchical tiles, for 0-1 data. A basic tile (X,Y,p) specifies a subset X of the rows and a subset Y of the columns of the data, i.e., a rectangle, and gives a probability p for the occurrence of 1s in the cells of X × Y. A hierarchical tile has additionally a set of exception tiles that specify the probabilities for subrectangles of the original rectangle. If the rows and columns are ordered and X and Y consist of consecutive elements in those orderings, then the tile is geometric; otherwise it is combinatorial. We give a simple randomized algorithm for finding good geometric tiles. Our main result shows that using spectral ordering techniques one can find good orderings that turn combinatorial tiles into geometric tiles. We give empirical results on the performance of the methods.
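The tile model in this abstract can be pictured with a small data-structure sketch. The `Tile` class and its method names are hypothetical, and the rule that the first exception tile covering a cell overrides the parent's probability is an assumption; the abstract only says that exception tiles specify probabilities for subrectangles.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """A tile (X, Y, p) with optional exception tiles (illustrative only)."""
    rows: frozenset   # X: subset of row indices
    cols: frozenset   # Y: subset of column indices
    p: float          # probability of a 1 in any cell of X x Y
    exceptions: list = field(default_factory=list)  # nested Tile objects

    def covers(self, r, c):
        """True if cell (r, c) lies inside the rectangle X x Y."""
        return r in self.rows and c in self.cols

    def prob(self, r, c):
        """Probability of a 1 at (r, c); a covering exception tile overrides p."""
        if not self.covers(r, c):
            raise ValueError("cell is outside this tile")
        for exc in self.exceptions:
            if exc.covers(r, c):
                return exc.prob(r, c)  # assumed rule: first match wins
        return self.p
```

With a 4 × 4 base tile of probability 0.1 and one 2 × 2 exception tile of probability 0.9 in its corner, `prob(0, 0)` would return 0.9 and `prob(3, 3)` would return 0.1.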
Dense Itemsets
In SIGKDD 2004, 2004
Cited by 22 (0 self)
Frequent itemset mining has been the subject of a lot of work in data mining research ever since association rules were introduced. In this paper we address a problem with frequent itemsets: that they only count rows where all their attributes are present, and do not allow for any noise. We show that generalizing the concept of frequency while preserving the performance of mining algorithms is nontrivial, and introduce a generalization of frequent itemsets, dense itemsets. Dense itemsets do not require all attributes to be present at the same time; instead, the itemset needs to define a sufficiently large submatrix that exceeds a given density threshold of attributes present.
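The contrast between strict frequency and density can be sketched as follows. This is a simplified reading of the abstract, not the paper's formal definition: density is taken here as the fraction of 1s in the submatrix spanned by a chosen set of rows and the itemset, and the names `density`, `is_dense`, `min_rows`, and `min_density` are hypothetical.

```python
def density(rows, itemset):
    """Fraction of cells equal to 1 in the submatrix rows x itemset.

    Each row is a set of the attributes present in that row.
    """
    cells = len(rows) * len(itemset)
    if cells == 0:
        return 0.0
    ones = sum(len(row & itemset) for row in rows)
    return ones / cells


def is_dense(rows, itemset, min_rows, min_density):
    """True if the submatrix is large enough and its density exceeds the threshold."""
    return len(rows) >= min_rows and density(rows, itemset) > min_density
```

For rows {a, b, c}, {a, b}, and {a, c}, the itemset {a, b, c} has ordinary support 1, since only one row contains all three attributes, yet the three rows together have density 7/9 over that itemset.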
Time series knowledge mining
2006
Cited by 20 (2 self)
An important goal of knowledge discovery is the search for patterns in data that can help explain the underlying process that generated the data. The patterns are required to be new, useful, and understandable to humans. In this work we present a new method for the understandable description of local temporal relationships in multivariate data, called Time Series Knowledge Mining (TSKM). We define the Time Series Knowledge Representation (TSKR) as a new language for expressing temporal knowledge. The patterns have a hierarchical structure in which each level corresponds to a single temporal concept. On the lowest level, intervals are used to represent duration. Overlapping parts of intervals represent coincidence on the next level. Several such blocks of intervals are connected with a partial order relation on the highest level. Each pattern element consists of a semiotic triple to connect syntactic and semantic information with pragmatics. The patterns are very compact, but offer details for each element on demand. In comparison with related approaches, the TSKR is shown to have advantages in robustness, expressivity, and comprehensibility. Efficient algorithms for the discovery of the patterns are proposed. The search for coincidence as well as partial order can be formulated as variants of the well-known frequent itemset problem. One of the best-known algorithms for this problem is therefore adapted for our purposes. Human interaction is used during the mining to analyze and validate partial results as early as possible and guide further processing steps. The efficacy of the methods is demonstrated using several data sets. In an application to sports medicine the results were recognized as valid and useful by an expert in the field.
Inductive databases and multiple uses of frequent itemsets: the cInQ approach
In Database Technologies for Data Mining - Discovering Knowledge with Inductive Queries, volume 2682 of LNCS, 2004
Cited by 17 (9 self)
Inductive databases (IDBs) have been proposed to address the problem of knowledge discovery from huge databases. With an IDB, the user/analyst performs a set of very different operations on data using a query language powerful enough to perform all the required elaborations, such as data preprocessing, pattern discovery and pattern postprocessing. We present a synthetic view on important concepts that have been studied within the cInQ European project when considering the pattern domain of itemsets. Mining itemsets has proved useful not only for association rule mining but also for feature construction, classification, clustering, etc. We introduce the concepts of pattern domain, evaluation functions, primitive constraints, inductive queries and solvers for itemsets. We focus on simple high-level definitions that set aside technical details, which the interested reader will find, among others, in cInQ publications.