Results 1–10 of 61
H.: Association rule ontology matching approach. International Journal on Semantic Web and Information Systems, 2007.
Cited by 24 (3 self)
This paper presents a hybrid, extensional and asymmetric matching approach designed to discover semantic relations (equivalence and subsumption) between entities drawn from two textual taxonomies (web directories or OWL ontologies). Using the association rule paradigm and a statistical measure developed in this context, the method relies on the following idea: “An entity A will be more specific than or equivalent to an entity B if the vocabulary (i.e. terms and data) used to describe A and its instances tends to be included in that of B and its instances”. The approach is divided into two parts: (1) the representation of each entity by a set of relevant terms and data; (2) the discovery of binary association rules between entities. Rule selection uses two criteria. The first assesses implication quality using an implication-intensity measure. The second verifies the generativity of the rule and thereby reduces redundancy. Finally, the proposed method is evaluated on two benchmarks: the first contains two conceptual hierarchies of textual documents, and the second is composed of OWL ontologies. The experiments show that the method obtains good precision and also discovers meaningful subsumptions that similarity-based approaches do not capture.
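As a rough illustration of the extensional idea above, the sketch below scores the asymmetric inclusion of one entity's vocabulary in another's. All names and the threshold are hypothetical, and a plain set-inclusion ratio stands in for the paper's implication-intensity measure:

```python
def vocabulary(entity_docs):
    """Collect the set of terms describing an entity and its instances."""
    return {term for doc in entity_docs for term in doc.lower().split()}

def subsumption_score(voc_a, voc_b):
    """Fraction of A's vocabulary also found in B's vocabulary.
    A high score suggests A is more specific than (or equivalent to) B."""
    if not voc_a:
        return 0.0
    return len(voc_a & voc_b) / len(voc_a)

def match(tax_a, tax_b, threshold=0.8):
    """Return candidate relations between entities of two taxonomies:
    'subsumed-by' when inclusion is one-way, 'equivalent' when mutual."""
    relations = []
    for a, docs_a in tax_a.items():
        for b, docs_b in tax_b.items():
            va, vb = vocabulary(docs_a), vocabulary(docs_b)
            ab = subsumption_score(va, vb)
            ba = subsumption_score(vb, va)
            if ab >= threshold and ba >= threshold:
                relations.append((a, "equivalent", b))
            elif ab >= threshold:
                relations.append((a, "subsumed-by", b))
    return relations
```

The asymmetry is the point: a "sedan" described entirely in "vehicle" vocabulary is subsumed, not equivalent, when the reverse inclusion fails.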
Q.: Mining High Utility Itemsets without Candidate Generation. Proceedings of CIKM '12, 2012.
Cited by 18 (0 self)
High-utility itemsets are sets of items with high utility (e.g. profit) in a database; mining them efficiently plays a crucial role in many real-life applications and is an important research issue in data mining. To identify high-utility itemsets, most existing algorithms first generate candidate itemsets by overestimating their utilities, and subsequently compute the exact utilities of these candidates. These algorithms suffer from generating a very large number of candidates, most of which turn out not to be of high utility once their exact utilities are computed. In this paper, we propose an algorithm, called HUI-Miner (High-Utility Itemset Miner), for high-utility itemset mining. HUI-Miner uses a novel structure, called a utility-list, to store both the utility information about an itemset and the heuristic information for pruning its search space. By avoiding the costly generation and utility computation of numerous candidate itemsets, HUI-Miner can efficiently mine high-utility itemsets from the utility-lists constructed from a mined database. We compared HUI-Miner with state-of-the-art algorithms on various databases; experimental results show that HUI-Miner outperforms them in both running time and memory consumption.
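To make the problem concrete, the sketch below computes itemset utilities by brute force over a toy database (the database and function names are invented). It illustrates the exponential baseline that HUI-Miner's utility-lists are designed to avoid, not HUI-Miner itself:

```python
from itertools import combinations

# Toy transaction database: tid -> {item: utility, e.g. profit contribution}.
DB = {
    1: {"a": 5, "b": 2, "c": 1},
    2: {"a": 10, "c": 3},
    3: {"b": 4, "c": 2, "d": 6},
}

def utility(itemset, db):
    """Sum of the itemset's utilities over transactions containing all its items."""
    total = 0
    for t in db.values():
        if all(i in t for i in itemset):
            total += sum(t[i] for i in itemset)
    return total

def high_utility_itemsets(db, min_util):
    """Brute-force enumeration of all itemsets with utility >= min_util.
    Exponential in the number of items -- exactly the cost HUI-Miner avoids."""
    items = sorted({i for t in db.values() for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, db)
            if u >= min_util:
                result[combo] = u
    return result
```

Note that utility, unlike support, is neither monotone nor anti-monotone, which is why candidate-based algorithms resort to overestimates.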
Drawbacks and solutions of applying association rule mining in learning management systems.
Discovering coherent value bicliques in genetic interaction data. In Proceedings of the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD '10), 2010.
Cited by 6 (0 self)
Genetic interaction (GI) data provide a means for exploring the structure and function of pathways in a cell. Coherent-value bicliques (submatrices) in GI data represent functionally similar gene modules or protein complexes. However, no systematic approach has been proposed for exhaustively enumerating all coherent-value submatrices in such data sets, which is the problem addressed in this paper. Using a monotonic range measure to capture the coherence of values in a submatrix of an input data matrix, we propose a two-step Apriori-based algorithm for discovering all nearly-constant-value submatrices, referred to as Range Constrained Blocks. By systematic evaluation on an extensive genetic interaction data set, we show that the coherent-value submatrices represent groups of genes that are more functionally related than submatrices with diverse values. We also show that our approach exhaustively finds all submatrices with a range below a given threshold, while competing approaches cannot find all such submatrices.
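A minimal sketch of the range measure and its Apriori-style use (hypothetical names, not the authors' implementation): because a block's range can only grow as columns are added, any column set exceeding the threshold prunes all of its supersets, enabling level-wise enumeration:

```python
def block_range(matrix, rows, cols):
    """Range (max - min) of the values in the submatrix rows x cols."""
    vals = [matrix[r][c] for r in rows for c in cols]
    return max(vals) - min(vals)

def is_range_constrained(matrix, rows, cols, alpha):
    """A block is coherent when its range stays within threshold alpha.
    The measure is monotonic: every sub-block of a coherent block is coherent."""
    return block_range(matrix, rows, cols) <= alpha

def coherent_column_sets(matrix, rows, alpha):
    """Level-wise (Apriori-style) enumeration of column sets whose block over
    `rows` has range <= alpha; supersets of failing sets are never generated."""
    ncols = len(matrix[0])
    level = [frozenset([c]) for c in range(ncols)
             if is_range_constrained(matrix, rows, [c], alpha)]
    all_ok = set(level)
    while level:
        nxt = set()
        for s1 in level:
            for s2 in level:
                u = s1 | s2
                # Join two size-k sets into a size-(k+1) candidate, then test it.
                if len(u) == len(s1) + 1 and is_range_constrained(matrix, rows, sorted(u), alpha):
                    nxt.add(u)
        all_ok |= nxt
        level = list(nxt)
    return all_ok
```

The full algorithm would search over row sets as well; this fixes the rows to keep the monotonicity argument visible.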
Engineering of software-intensive systems: State of the art and research challenges. In Software-Intensive Systems and New Computing Paradigms, ser. Lecture Notes in Computer Science.
Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Transactions on Knowledge Discovery from Data.
Cited by 5 (5 self)
The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper-bounded by an easy-to-compute characteristic quantity of the dataset, which we call the d-index: the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI’s is O((1/ε²)(d + log(1/δ))) (resp. O(((2+ε)/(ε²(2−ε)θ))(d log …
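The d-index defined above is straightforward to compute. The sketch below derives it and a corresponding absolute-approximation sample size; the constant factor is not given in the abstract, so it appears as a placeholder parameter:

```python
import math

def d_index(dataset):
    """Largest d such that the dataset contains at least d transactions of
    length at least d (per the abstract, an upper bound on the VC-dimension
    of the range space associated with the dataset)."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i  # the i longest transactions all have length >= i
        else:
            break
    return d

def absolute_sample_size(d, eps, delta, c=1.0):
    """Sample size (c/eps^2) * (d + log(1/delta)) for an absolute
    (eps, delta)-approximation; the constant c is a placeholder."""
    return math.ceil(c / eps**2 * (d + math.log(1 / delta)))
```

The sample size depends on the dataset only through d, not through the (unknown) number of frequent itemsets, which is what makes the bound usable in practice.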
Analysis of Sampling Techniques for Association Rule Mining, 2009.
Cited by 4 (0 self)
In this paper, we present a comprehensive theoretical analysis of the sampling technique for the association rule mining problem. Most previous works have concentrated only on the empirical evaluation of the effectiveness of sampling for the step of finding frequent itemsets. To the best of our knowledge, a theoretical framework to analyze the quality of the solutions obtained by sampling has not been studied. Our contributions are twofold. First, we present the notions of ɛ-close frequent itemset mining and ɛ-close association rule mining, which help assess the quality of the solutions obtained by sampling. Second, we show that both the frequent itemset mining and association rule mining problems can be solved satisfactorily with a sample size that is independent of both the number of transactions and the number of items. Let θ be the required support, ɛ the closeness parameter, and 1/h the desired bound on the probability of failure. We show that the sampling-based analysis succeeds in solving both ɛ-close frequent itemset mining and ɛ-close association rule mining with probability at least (1 − 1/h) with a sample of size S = O((Δ + log h)/(ɛ²(1−ɛ)θ)), where Δ is the maximum number of items present in any transaction. Thus, we establish that it is possible to speed up the entire process of association rule mining for massive databases by working with a small sample while retaining any desired degree of accuracy. Our work gives a comprehensive explanation for the well-known empirical successes of sampling for association rule mining.
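A minimal sketch of the scheme this analysis justifies (hypothetical names; in practice the sample size would come from the bound above): estimate supports on a random sample, then keep itemsets whose estimated support clears a slightly relaxed threshold, so true frequent itemsets are unlikely to be missed:

```python
import random

def sample_supports(transactions, itemsets, sample_size, seed=0):
    """Estimate supports of candidate itemsets from a random sample
    (drawn with replacement) of the transaction database."""
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    est = {}
    for its in itemsets:
        est[its] = sum(1 for t in sample if its <= t) / sample_size
    return est

def eps_close_frequent(est, theta, eps):
    """ɛ-close output: keep every itemset whose estimated support is at
    least (1 - ɛ) * θ; reported itemsets have true support near θ."""
    return {its for its, s in est.items() if s >= (1 - eps) * theta}
```

The relaxed threshold (1 − ɛ)θ is what "ɛ-close" buys: a one-sided cushion against sampling error in exchange for possibly reporting some itemsets slightly below θ.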
Effective Elimination of Redundant Association Rules
Cited by 4 (1 self)
It is well recognized that the main factor hindering applications of Association Rules (ARs) is the huge number of ARs returned by the mining process. In this paper, we propose an effective solution that presents concise mining results by eliminating the redundancy in the set of ARs. We adopt the concept of δ-tolerance to define the set of δ-Tolerance ARs (δ-TARs), a concise representation of the set of ARs. δ-tolerance is a relaxation on the closure defined on the support of frequent itemsets, allowing us to effectively prune redundant ARs. We then devise a set of inference rules, with which we prove that the set of δ-TARs is a non-redundant representation of the ARs. In addition, we prove that the set of ARs derived from the δ-TARs by the inference rules is sound and complete. We also develop a compact tree structure called the δ-TAR tree, which facilitates the efficient generation of the δ-TARs and the derivation of other ARs. Experimental results verify the efficiency of using the δ-TAR tree to generate the δ-TARs and to query the ARs. The set of δ-TARs is also shown to be drastically smaller than the state-of-the-art concise representations of ARs.
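As a simplified illustration of tolerance-based pruning (hypothetical names; the paper's δ-tolerance is defined via a relaxed closure on frequent-itemset supports plus inference rules, not this direct comparison), a rule is dropped when a more general rule with nearly the same support can stand in for it:

```python
def delta_prune(rules, delta):
    """Drop a rule when a more general rule (antecedent is a proper subset,
    same consequent) exists whose support is within a delta fraction of it,
    so the specific rule carries almost no extra information.
    rules: list of (antecedent frozenset, consequent frozenset, support)."""
    kept = []
    for a, c, s in rules:
        redundant = any(
            a2 < a and c2 == c and abs(s2 - s) <= delta * s2
            for a2, c2, s2 in rules
        )
        if not redundant:
            kept.append((a, c, s))
    return kept
```

The retained rules play the role of a concise representation: the pruned ones are recoverable, up to the δ tolerance, from their more general survivors.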
Mining Associations for Interface Design. In Proc. International Conference on Rough Sets and Knowledge Technology (RSKT), Joint Rough Set Symposium (JRS), 2007.
Cited by 3 (3 self)
Consumer research has indicated that consumers use compensatory and non-compensatory decision strategies when formulating their purchasing decisions. Compensatory decision-making strategies are used when the consumer fully rationalizes the decision outcome, whereas non-compensatory strategies are used when the consumer considers only the information that carries the most meaning for them at the time of decision. When designing online shopping support tools, incorporating these decision-making strategies with the goal of personalizing the user interface may enhance the overall quality and satisfaction of the consumer's shopping experience. This paper presents work towards this goal. The authors describe research that refines a previously developed procedure, using techniques from cluster analysis and rough sets, to obtain the consumer information needed to design customizable and personalized user interface enhancements. The authors further refine their procedure by examining and evaluating techniques in traditional association mining, specifically conducting experiments with the Eclat algorithm on the authors' previous work. A summary discussing previous work in relation to the new evaluation is provided. Results are analyzed and opportunities for future work are described.
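For reference, a compact sketch of the Eclat algorithm the authors evaluate (function names are ours): it mines frequent itemsets depth-first over a vertical layout, where each item maps to its transaction-id list and extending an itemset intersects the lists:

```python
def eclat(transactions, min_support):
    """Eclat: depth-first frequent-itemset mining over vertical tid-lists."""
    # Build the vertical representation: item -> set of transaction ids.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    minsup = min_support * len(transactions)
    frequent = {}

    def recurse(prefix, items):
        for i, (item, tids) in enumerate(items):
            if len(tids) >= minsup:
                itemset = prefix | {item}
                frequent[frozenset(itemset)] = len(tids) / len(transactions)
                # Extend only with later items to avoid generating duplicates;
                # intersecting tid-lists gives the extension's support directly.
                suffix = [(j, tids & tj) for j, tj in items[i + 1:]]
                recurse(itemset, suffix)

    recurse(set(), sorted(tidlists.items()))
    return frequent
```

The tid-list intersection is what lets Eclat count support without rescanning the database, which is why it suits the repeated experimentation described above.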
Efficient Techniques for Document Sanitization
Cited by 3 (0 self)
Sanitization of a document involves removing sensitive information from the document so that it may be distributed to a broader audience. Such sanitization is needed when declassifying documents involving sensitive or confidential information such as corporate emails, intelligence reports, medical records, etc., and to prevent disclosure of proprietary information while sharing data with outsourced operations. In this paper, we present the ERASE framework for performing document sanitization in an automated manner. ERASE can sanitize a document dynamically, so that different users get different views of the same document based on what they are authorized to know. We formalize the problem and present the algorithms used in ERASE for finding the appropriate terms to remove from the document. Our preliminary experimental study demonstrates the efficiency and efficacy of the proposed algorithms. Example: Figure 1 shows an example U.S. government document that was sanitized prior to release [16]. The sanitized document gives limited information (such as the purpose and the funding amount) on an erstwhile secret medical research project, while hiding the names of the funding sources, the principal investigators and their affiliation.