Provost, F.J., Kolluri, V.: A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3 (1999) 131--169
.... that use fast induction algorithms, such as C4.5 (shown to be very fast for memory-resident data, as compared to a wide variety of other induction algorithms [Lim et al. 2000]). It also produces suggestions not commonly considered even by researchers studying scaling up inductive algorithms [Provost & Kolluri, 1999]. For example, the enumeration contains plans that use discretization as a preprocess. Research has shown that discretization as a preprocess can produce classifiers with comparable accuracy to induction without the preprocess [Kohavi & Sahami, 1996] but with discretization, many induction ....
.... in question to predict which learning algorithm will yield the lowest error on the entire data set; the technique works remarkably well, although it should be noted that for large data sets one can often achieve high accuracy with a surprisingly small subset of the data (cf. progressive sampling [Provost et al. 1999]). On the other hand, the relative performance of algorithms can change markedly with the amount of data [Perlich et al. 2001]. St. Amant and Cohen [1998] study intelligent, computer-based support for open-ended, statistical exploratory data analysis, which is akin to our approach. While ....
Provost, F. & V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery 3 (2):131-169, 1999.
....The first part is the cost of the learning process, which is repeated periodically. Note that the learning period is 500 generations for the successful tests. Hence, the decision tree generator C4.5 is called once in every 500 generations. Researchers have investigated the time complexity of C4.5. [50] notes that C4.5 produces good classifiers quickly. It is stated that the asymptotic time complexity of C4.5 is O(ea), where e is the number of training set elements and a is the number of attributes. The given complexity is for non-numeric data sets. It is noted that numeric data would ....
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3, 1999.
....location for analysis. Hence, efficient distributed learning algorithms that can operate across multiple autonomous data sources without the need to transmit large amounts of data are needed [Caragea et al. 2001b; Davies and Edwards, 1999; Kargupta et al. 1999; Prodromidis et al. 2000; Provost and Kolluri, 1999]. Data sources of interest are autonomously owned and operated. Consequently, the range of operations that can be performed on the data source (e.g. the types of queries allowed) and the precise mode of allowed interactions can be quite diverse (e.g. PROSITE repository of protein data ....
Foster J. Provost and Venkateswarlu Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131--169, 1999.
....and outline some avenues for future research. 2. BACKGROUND AND RELATED WORK Conventional approaches to analysis of large-scale data focus on probabilistic subsampling and data compression. Data reduction techniques based on probabilistic subsampling have been explored by several researchers [13, 23, 24, 25]. Data compression techniques are generally based on the idea of finding compact representations for data through discovery of dominant patterns or signals. A natural way of compressing data relies on matrix transforms, which have found various applications in large-scale data analysis. Variants of ....
F. J. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.
....on the new dataset. The second problem, that the database is too large to evaluate all (or some) algorithms, is usually addressed via subsampling, i.e. the idea of using only part of the available data for training. However, while it has been frequently applied to scaling up data mining algorithms [21], its suitability for algorithm recommendation has not received much attention in the literature [19]. In this paper, we analyze the combination of subsampling and landmarking in a meta-learning scenario. In particular, we will generalize the idea of landmarking [2, 20] to the use of relative ....
Foster Provost and Venkateswarlu Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131--169, 1999.
....algorithm into a feasible one. A large number of examples introduces potential problems with both time and space complexity. Finally, the goal of the learning (say, classification accuracy) must not be substantially sacrificed by a scaling algorithm. The three main approaches to scaling up include [54] the following: designing a fast algorithm: reducing asymptotic complexity, optimizing the search and representation, finding approximate solutions, or taking advantage of the task's inherent parallelism; partitioning the data: dividing the data into subsets (based on instances or features) ....
F. Provost and V. Kolluri, "A survey of methods for scaling up inductive algorithms," Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 131--169, 1999.
....warehouses [1]. Mining a database of even a few gigabytes is an arduous task for machine learning techniques and requires advanced parallel hardware and algorithms. An approach for dealing with the intractable problem of learning from huge databases is to select a small subset of data for learning [2]. Databases often contain redundant data. It would be convenient if large databases could be replaced by a small subset of representative patterns so that the accuracy of estimates (e.g. of probability density, dependencies, class boundaries) obtained from such a reduced set should be ....
F. Provost and V. Kolluri, "A Survey of Methods for Scaling Up Inductive Algorithms," Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 131-169, 1999.
....model represents the data sets represented by the input models. Implementations for these operators are algorithms from distributed data mining or meta-learning. The idea of these operators is based on the possibility of further adding scalability to mining algorithms by partitioning the input data [23]. Possible algorithms are, for instance, [17, 22]. These algorithms recompute and restructure the input models. There are two cases for combining models: two models with equal or different input attribute sets, respectively. The first case, equal input attribute sets, is supported by the union ....
....to provide an integrated data mining environment [14, 11, 20, 27]. These languages try to create a descriptive interface to data mining algorithms and thus the integration with database management systems. Much work has addressed scaling up data mining algorithms. An overview is given in [23]. One task of adding scalability to DM algorithms is their integration with database management systems [6]. Different kinds of approaches were proposed: the implementation of the DM algorithms in SQL, e.g. the EM algorithm [21], or usage of user-defined functions, e.g. for association rules [24] ....
Foster J. Provost and Venkateswarlu Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 3(2):131--169, June 1999.
....Therefore, the data mining tools need to extract valid knowledge from a large amount of data quickly enough in response to the human demand. Many researchers have investigated methods for fast induction of classifiers from large data sets. The main approaches are classified into the following types [8]. Design a fast algorithm: This approach includes a wide variety of algorithm design techniques for reducing the asymptotic complexity, for optimizing the search and representation, for finding approximate solutions, and so on. Partitioning the data: This approach involves breaking the data ....
Provost, F. and Kolluri, V.: A Survey of Methods for Scaling Up Inductive Algorithms, Knowledge Discovery and Data Mining, 3 (2), pp. 131--169 (1999).
....mined across all available processors, and control parallelism, by distributing the population of individuals across all available processors. 1 INTRODUCTION An important issue in data mining is how a knowledge discovery algorithm scales up with respect to the size of the database being mined [Provost & Kolluri 1999]. Intuitively, parallel processing can be regarded as a natural solution to the problem of scalability in data mining [Freitas & Lavington 1998]. Since genetic algorithms (GAs) tend to be slow, in comparison with most rule induction methods, the design of parallel GAs for data mining is an ....
Provost, F. and Kolluri, V. (1999) A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3(2), 131-169.
....algorithm into a feasible one. A large number of examples introduces potential problems with both time and space complexity. Finally, the goal of the learning (say, classification accuracy) must not be substantially sacrificed by a scaling algorithm. The three main approaches to scaling up include [54]: designing a fast algorithm: reducing asymptotic complexity, optimizing the search and representation, finding approximate solutions, or taking advantage of the task's inherent parallelism; partitioning the data: dividing the data into subsets (based on instances or features) learning ....
F. Provost and V. Kolluri, "A survey of methods for scaling up inductive algorithms," Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 131--169, 1999.
....0 is set near to the OSS. Moreover, as k = log a Sk 1 n0 # log a OSS n0 , if n0 is much less than OSS, then k, the number of samples needed before convergence, will be large. Note that generating a random sample from a single-table database typically requires scanning the entire table once [12]. Thus a large k will result in considerably high disk I/O cost. Therefore setting a good starting sample size can further improve the efficiency of PS by cutting the two kinds of costs. In this paper, we find such a size via a statistical approach. The intuition is that a sample with the OSS ....
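The dependence of k on the starting sample size in the context above can be illustrated with a small sketch. It assumes a geometric schedule in which each sample is a times larger than the previous one, which is consistent with the base-a logarithm in the snippet but is not confirmed by it. The names `geometric_schedule`, `n0`, `a`, and `target` are illustrative only, with `target` standing in for the OSS; this is not the citing paper's algorithm.

```python
def geometric_schedule(n0, a, target):
    """Return the sample sizes n0, a*n0, a^2*n0, ... up to the first size >= target.

    Assumption: a geometric schedule with growth factor a > 1. `target` is a
    hypothetical stand-in for the optimal sample size (OSS).
    """
    if n0 <= 0 or a <= 1 or target <= 0:
        raise ValueError("need n0 > 0, a > 1, target > 0")
    sizes = [n0]
    # Keep growing the sample until it reaches the target size.
    while sizes[-1] < target:
        sizes.append(sizes[-1] * a)
    return sizes


# A small starting size needs many samples, hence many table scans.
far = geometric_schedule(n0=100, a=2, target=1600)
# A starting size close to the target needs few.
near = geometric_schedule(n0=800, a=2, target=1600)
print(len(far), len(near))
```

If each sample costs one scan of the table, as the context states, the length of the returned list is a rough proxy for the disk I/O cost of the schedule.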
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Machine Learning, pages 1--42, 1999.
....all of the observed data, but instead focus in a data-driven manner on local pockets of information. 4 The engineering of scale, namely, the data engineering aspects of scaling traditional algorithms to handle massive data sets. Work in this area involves both computationally driven approaches [ZRL97, ML98, BPR98, GGRL99, PK99] as well as statistically motivated techniques [DVJ99]. Worth mentioning in this context is the fact that researchers in a variety of areas such as speech, natural language modeling, human genome analysis, and so forth, have all developed a variety of practical learning algorithms and tricks of ....
Provost, F. and Kolluri, V. (1999) A survey of methods for scaling up inductive algorithms, Journal of Data Mining and Knowledge Discovery, 3(2), 131--169.
....accuracy. The eight large UCI benchmark data sets [3] used in this study are summarized in Table 2. Note that the number of instances in these data sets ranges from three thousand to eighty thousand. Although they are not very large from the view of many real-world applications of data mining [14], they are fairly large from the view of research. Table 2. Descriptions of 8 UCI data sets Data Set Notion Nominal Numeric Instances # Instances # C4.5 max LOG max Name Attribute Attribute for Training for Testing accuracy accuracy abalone aba 1 7 3000 1177 45.0 25.6 adult adu 8 6 36000 ....
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Machine Learning, pages 1--42, 1999.
....database available from the UCI Machine Learning Repository (Blake & Merz, 1998). This database only has 67557 instances but it contains 42 attributes which are highly correlated. Finally, STUCCO is amenable to speedup methods such as windowing, sampling, and limiting the depth of the search (Provost & Kolluri, 1999). This will speed up MVD accordingly. 4.2. Relation to Other Discretization Approaches Our bottom-up merging process is similar to other discretization algorithms such as ChiMerge (Kerber, 1992) and Chi2 (Liu & Setiono, 1995). They divide the data into intervals and then merge them on the basis ....
Provost F, & Kolluri V (1999) A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3: 131--169.
....of data reflecting the interactions of millions of people around the world. Each source offers the opportunity to infer something about the players involved and the knowledge they possess. Data mining algorithms (typically fast algorithms for extracting knowledge from massive quantities of data [8, 28]) seem particularly well suited for the job. In this work, we employed simple extraction algorithms to obtain probabilistic forecasts of real-world events. Our results can be seen as statistical validation of the underlying quality of data from online games. The games themselves appear to serve as ....
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.
....In this paper we formally define data mining and scaling up. We justify the need for scaling up techniques and provide an overview of the various techniques available. Very often it is very difficult to mine large databases in real time. Several approaches are proposed to scale up data mining [19, 8]. In this paper we discuss algorithm-oriented and data-oriented approaches to scale up data mining. Sampling is discussed in detail and we provide two case studies to illustrate the usefulness of sampling to scale up data mining. 1 2 What is data mining and why data mining Data mining is a ....
....up is the process of handling large amounts of data. Some also consider it as the process of increasing the speed of data mining. Although large data sets are necessary for reliable results, large databases are not necessarily advantageous for the following reasons: • Not all data is informative [14, 19, 15]. • High degree of redundancy in the databases [9, 11]. • Experimental studies on the entire database are expensive [3]. This is the basic problem in genebank collections and in the drug industry. For example, to conduct genetic studies on a single gene we need many resources and it is ....
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 2:1--42, 1999.
....databases has become a challenging problem in data mining research. Fast and accurate classifiers that can scale to large datasets have been the focus of research in recent years. A subset of the entire dataset, obtained via sampling, can be used to construct a classification model of the database [13, 14]. However, it is still desirable to have decision tree classifiers that can handle very large, out-of-core training datasets, which do not fit ....
....[Figure 1. An example decision tree.] in memory, because a large training set often increases the accuracy of the resulting classification model [14, 16]. Because of large memory and computation requirements, parallelization is a viable approach for handling large datasets. Several parallel decision tree algorithms have been proposed for distributed-memory [6, 8, 14, 16, 19, 18, 21] and shared-memory parallel machines [11, 14, 22, 23]. Clusters of ....
[Article contains additional citation context not shown here]
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Knowledge Discovery and Data Mining, 3, 1999.
No context found.
Provost, F.J., Kolluri, V.: A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3 (1999) 131--169
No context found.
F. Provost and V. Kolluri, "A Survey of Methods for Scaling up Inductive Algorithms," Knowledge Discovery and Data Mining, vol. 3, 1999.
No context found.
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131--169, 1999.
No context found.
F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Knowledge Discovery and Data Mining, 3, 1999.
No context found.
Foster J. Provost and Venkateswarlu Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131--169, 1999.
No context found.
#327. Provost, F. and Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Journal of Data Mining and Knowledge Discovery,
No context found.
Provost, F. and V. Kolluri, A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery 3 (1999).