Sampling Large Databases for Association Rules
, 1996
Cited by 329 (4 self)
Discovery of association rules is an important database mining problem. Current algorithms for finding association rules require several passes over the analyzed database, and obviously the role of I/O overhead is very significant for very large databases. We present new algorithms that reduce the database activity considerably. The idea is to pick a random sample, to find using this sample all association rules that probably hold in the whole database, and then to verify the results with the rest of the database. The algorithms thus produce exact association rules, not approximations based on a sample. The approach is, however, probabilistic, and in those rare cases where our sampling method does not produce all association rules, the missing rules can be found in a second pass. Our experiments show that the proposed algorithms can find association rules very efficiently in only one database pass.
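The sample-then-verify idea in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the function names, the brute-force miner, the lowered threshold used on the sample, and the representation of rows as frozensets are all hypothetical choices made for the example, and the second pass that recovers missed sets is left out.

```python
import random
from itertools import combinations


def support(itemset, rows):
    """Fraction of rows that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def frequent_itemsets(rows, min_support):
    """Brute-force miner (hypothetical helper): test every non-empty item combination."""
    items = sorted(set().union(*rows))
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), rows) >= min_support
    }


def sample_then_verify(rows, min_support, sample_size, lowered_support, seed=0):
    """Mine candidates on a random sample, then keep those frequent in the full data."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    # Assumption: a lowered threshold on the sample makes it less likely
    # that a set that is frequent in the full data is missed.
    candidates = frequent_itemsets(sample, lowered_support)
    # Exact supports for the candidates are computed against the full data.
    return {c for c in candidates if support(c, rows) >= min_support}
```

Because every returned set is checked against the full data, the result never contains a set that is infrequent overall; what the sample can cause is a missed set, which is the case the abstract says a second pass handles.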
An Algorithm for Multi-Relational Discovery of Subgroups
, 1997
Cited by 105 (8 self)
We consider the problem of finding statistically unusual subgroups in a multi-relation database, and extend previous work on single-relation subgroup discovery. We give a precise definition of the multi-relation subgroup discovery task, propose a specific form of declarative bias based on foreign links as a means of specifying the hypothesis space, and show how propositional evaluation functions can be adapted to the multi-relation setting. We then describe an algorithm for this problem setting that uses optimistic estimate and minimal support pruning, an optimal refinement operator and sampling to ensure efficiency and can easily be parallelized.
Multiple uses of frequent sets and condensed representations (Extended Abstract)
- In Proc. KDD Int. Conf. Knowledge Discovery in Databases
, 1996
Cited by 84 (7 self)
In interactive data mining it is advantageous to have condensed representations of data that can be used to efficiently answer different queries. In this paper we show how frequent sets can be used as a condensed representation for answering various types of queries. Given a table r with 0/1 values and a threshold σ, a frequent set of r is a set X of columns of r such that at least a fraction σ of the rows of r have a 1 in all the columns of X. Finding frequent sets is a first step in finding association rules, and there exist several efficient algorithms for finding the frequent sets. We show that frequent sets have wider applications than just finding association rules. We show that using the inclusion-exclusion principle one can obtain approximate confidences of arbitrary boolean rules. We derive bounds for the errors in the confidences, and show that information collected during the computation of frequent sets can also be used to provide individual error bounds for each clause...
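The definition of a frequent set and the inclusion-exclusion step can be illustrated with a small sketch. The function names, the list-of-dicts table layout, and the particular rule shape (a AND NOT b) => c are assumptions made for this example. The sketch also computes every frequency directly from the table, so it gives an exact confidence and does not model the approximation or error bounds the abstract mentions.

```python
def frequency(columns, table):
    """Fraction of rows of the 0/1 table that have a 1 in every column of `columns`."""
    return sum(all(row[c] == 1 for c in columns) for row in table) / len(table)


def is_frequent(columns, table, sigma):
    """True if at least a fraction `sigma` of the rows have a 1 in all of `columns`."""
    return frequency(columns, table) >= sigma


def confidence_with_negation(table, a, b, c):
    """Confidence of the rule (a AND NOT b) => c, using only frequencies of column sets.

    Inclusion-exclusion: fr(a, not b) = fr(a) - fr(a, b), and likewise with c added.
    Raises ZeroDivisionError if no row satisfies (a AND NOT b).
    """
    body = frequency([a], table) - frequency([a, b], table)
    body_and_head = frequency([a, c], table) - frequency([a, b, c], table)
    return body_and_head / body
```

The point of the rewrite inside `confidence_with_negation` is that a query with a negated column is answered from frequencies of plain column sets alone, which is the kind of information a collection of frequent sets stores.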
Methods and Problems in Data Mining
, 1997
Cited by 64 (2 self)
Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We consider some methods used in data mining, concentrating on levelwise search for all frequently occurring patterns. We show how this technique can be used in various applications. We also discuss possibilities for compiling data mining queries into algorithms, and look at the use of sampling in data mining. We conclude by listing several open research problems in data mining and knowledge discovery.
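A levelwise search for frequently occurring patterns can be sketched for the special case of itemsets. This is a minimal illustration, assuming rows are given as frozensets and patterns are itemsets; the function name and the candidate-generation details are choices made for the example and are not taken from the paper.

```python
from itertools import combinations


def levelwise_frequent_sets(rows, min_support):
    """Evaluate size-k candidates, then build size-(k+1) candidates from the survivors."""
    n = len(rows)
    items = sorted(set().union(*rows))
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        # One pass over the data per level: count the candidates of the current size.
        counts = {c: sum(1 for row in rows if c <= row) for c in level}
        survivors = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update({c: counts[c] / n for c in survivors})
        # A (k+1)-set becomes a candidate only if all of its k-subsets are frequent.
        k = len(level[0])
        next_level = set()
        for a in survivors:
            for b in survivors:
                u = a | b
                if len(u) == k + 1 and all(
                    frozenset(s) in survivors for s in combinations(u, k)
                ):
                    next_level.add(u)
        level = list(next_level)
    return frequent
```

The pruning step relies on the fact that a superset of an infrequent set cannot be frequent, so each level only evaluates sets whose smaller parts have all survived.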
Data mining, hypergraph transversals, and machine learning
, 1997
Cited by 59 (5 self)
Several data mining problems can be formulated as problems of finding maximally specific sentences that are interesting in a database. We first show that this problem has a close relationship with the hypergraph transversal problem. We then analyze two algorithms that have been previously used in data mining, proving upper bounds on their complexity. The first algorithm is useful when the maximally specific interesting sentences are "small". We show that this algorithm can also be used to efficiently solve a special case of the hypergraph transversal problem, improving on previous results. The second algorithm utilizes a subroutine for hypergraph transversals, and is applicable in more general situations, with complexity close to a lower bound for the problem. We also relate these problems to the model of exact learning in computational learning theory, and use the correspondence to derive some corollaries.
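One way to see a link between maximal interesting sets and hypergraph transversals is the itemset case sketched below. It assumes that "sentences" are itemsets and that every subset of an interesting set is interesting; under that assumption a set fails to fit inside a maximal set M exactly when it contains an item outside M, so the minimal uninteresting sets are the minimal transversals of the complements of the maximal sets. The function names are hypothetical, and the brute-force enumeration is for illustration only; it is not one of the two algorithms the paper analyzes.

```python
from itertools import combinations


def minimal_transversals(vertices, edges):
    """Brute force: all minimal vertex sets that intersect every hyperedge."""
    found = []
    for k in range(len(vertices) + 1):
        for cand in map(frozenset, combinations(sorted(vertices), k)):
            hits_all = all(cand & e for e in edges)
            # Candidates are tried by increasing size, so a transversal is minimal
            # exactly when it contains no previously found transversal.
            if hits_all and not any(t <= cand for t in found):
                found.append(cand)
    return set(found)


def minimal_uninteresting_sets(items, maximal_interesting):
    """Minimal sets that are not contained in any of the given maximal sets."""
    complements = [frozenset(items) - m for m in maximal_interesting]
    return minimal_transversals(items, complements)
```

For items {a, b, c} with maximal interesting sets {a, b} and {a, c}, the complements are {c} and {b}, and the only minimal transversal is {b, c}, which is indeed the smallest set contained in neither maximal set.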
Learning Action Strategies for Planning Domains
- ARTIFICIAL INTELLIGENCE
, 1997
Cited by 58 (2 self)
This paper reports on experiments where techniques of supervised machine learning are applied to the problem of planning. The input to the learning algorithm is composed of a description of a planning domain, planning problems in this domain, and solutions for them. The output is an efficient algorithm (a strategy) for solving problems in that domain. We test the strategy on an independent set of planning problems from the same domain, so that success is measured by its ability to solve complete problems. A system, L2Act, has been developed in order to perform these experiments. We have experimented with the blocks world domain and the logistics domain, using strategies in the form of a generalization of decision lists, where the rules on the list are existentially quantified first-order expressions. The learning algorithm is a variant of Rivest's [39] algorithm, improved with several techniques that reduce its time complexity. As the experiments demonstrate, generalization is a...
Discovering all Most Specific Sentences by Randomized Algorithms (Extended Abstract)
- In Intl. Conf. on Database Theory
, 1997
Cited by 47 (5 self)
Dimitrios Gunopulos (1), Heikki Mannila (2), and Sanjeev Saluja (3)
(1) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. gunopulo@mpi-sb.mpg.de
(2) University of Helsinki, Dept. of Computer Science, FIN-00014 Helsinki, Finland. Heikki.Mannila@cs.helsinki.fi. Work supported by Alexander von Humboldt-Stiftung and the Academy of Finland.
(3) Max-Planck-Institut Informatik, Im Stadtwald, 66123 Saarbrücken, Germany. saluja@mpi-sb.mpg.de
Abstract. Data mining can in many instances be viewed as the task of computing a representation of a theory of a model or a database. In this paper we present a randomized algorithm that can be used to compute the representation of a theory in terms of the most specific sentences of that theory. In addition to randomization, the algorithm uses a generalization of the concept of hypergraph transversal. We apply the general algorithm for discovering maximal frequent sets in 0/1 data, and for computing minimal keys in relations. We prese...
Data Mining: Machine Learning, Statistics, and Databases
- In Proceedings of the 8th International Conference on Scientific and Statistical Database Management
, 1996
Cited by 29 (2 self)
Knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. We give an overview of the area and present some of the research issues, especially from the database angle.
1 Introduction
Knowledge discovery in databases (KDD), often called data mining, aims at the discovery of useful information from large collections of data. The discovered knowledge can be rules describing properties of the data, frequently occurring patterns, clusterings of the objects in the database, etc. Data mining has in the 1990s emerged as a visible research and development area; both in industry and in science there seems to be a lack of methods for efficient analysis of large data sets. Current technology makes it fairly easy to collect data, but data analysis tends to be slow and expensive. There is a suspicion that there might be nuggets of useful information hiding in the masses of unanalyzed or underanalyzed data, and therefore semiautomatic methods fo...
Multi-Relational Decision Tree Induction
- In Proceedings of PKDD '99, Prague, Czech Republic, September
, 1999
Cited by 18 (0 self)
Discovering decision trees is an important set of techniques in KDD, both because of their simple interpretation and the efficiency of their discovery. One of their disadvantages is that they do not take the structure of the mining object into account. By going from the standard single-relation approach to the multi-relational approach as in ILP this disadvantage is removed. However, the straightforward generalization loses the efficiency of the standard algorithms. In this paper we present a framework that allows the efficient discovery of multi-relational decision trees through the exploitation of the domain knowledge encoded in the data model of the database.
Introduction
The induction of decision trees has been getting a lot of attention in the field of Knowledge Discovery in Databases over the past few years. This popularity has been largely due to the efficiency with which decision trees can be induced from large datasets, as well as to the elegant and intuitive representation ...
Efficient Read-Restricted Monotone CNF/DNF Dualization by Learning with Membership Queries
, 1998
Cited by 16 (1 self)
We consider exact learning monotone CNF formulas in which each variable appears at most some constant k times ("read-k" monotone CNF). Let

