18 citations found.
Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM Press.

This paper is cited in the following contexts:
Generation of Comprehensible Decision Trees - Through Evolution Of

....According to Oates and Jensen [8] the performance of the DT often does not change when the training set size becomes larger than a certain threshold. This is the rationale of the first method. Using this property, Provost et al. have also proposed a method for reducing training complexity [9]. Clearly, if we use the second method listed above, the tree size can be further reduced, and this is what we want to do in this study. In this paper, we just fix the genotype length, because adopting variable length genotype straightforwardly often leads to very large solutions. To measure the ....

F. J. Provost, D. Jensen and T. Oates, "Efficient Progressive Sampling," Knowledge Discovery and Data Mining, 23-32, 1999.


On the Use of Fast Subsampling Estimates for.. - Fürnkranz, Petrak.. (2002)

....is the size of the subsample that has to be drawn in order to guarantee a satisfactory performance at a reasonable time. Approaches to tackle this problem range from statistical estimates for appropriate subsample sizes [15, 29, 24, 9] over attempts to model the shape of the learning curve [13, 22], to active learning and windowing techniques that give the learning algorithm itself control over the subsampling process [8, 10]. However, the main focus of previous works on this topic has been on evaluating the performance of single algorithms on a subsample. The use of subsampling for ....

Foster J. Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD-99), pages 23--32, 1999.


Bagging-Like Effects for Decision Trees and Neural .. - Chawla, Moore.. (2001)

....strategy for combining rules, the accuracy usually did not decrease for a small number of partitions, at least on the datasets that were tested. Our current work is similar to this, but focuses on comparison of bagging-like approaches to simple partitioning of large datasets. Provost et al. [22] found that subsampling the data gave the same accuracy as learning from the entire dataset at much lower computational cost. They analyzed progressive sampling methods, progressively increasing the sample size until the model accuracy was maintained. It was found that adding more training ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23--32, 1999.


Using Data Mining for Crop Genebank Management - Addala, al.

....knowledge can also be investigated (another important issue in data mining). 3.2 Determining the Smallest Sample Size. For each of the above sampling methods, we are applying the progressive sampling method to determine the smallest sample size that maintains the maximum predictive accuracy [17]. Progressive sampling starts with a small sample and uses progressively larger ones until the predictive accuracy no longer improves. A central component of progressive sampling is a sampling schedule $S = \{n_0, n_1, \ldots, n_k\}$ where each $n_i$ is an integer that specifies the size of a sample ....

F. Provost, D. Jensen, and Tim Oates. Efficient progressive sampling. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15-18, 1999.


Sampling: An efficient, simple and robust technique for scaling.. - Addala

....single gene we need many resources and it is impossible to conduct studies on all of the genes. So geneticists carefully sample a relevant subset of genes and conduct lab studies on them and later extend these results to other genes.
* Large databases do not always guarantee desirable results [18, 16].
* Efficiency of knowledge discovery algorithms depends on the size of the data set [17].
* High cost of I/O operations [15].
* Dynamic nature of some large databases will affect discovered knowledge. For example, new data may be added from time to time and as a result, some existing ....

.... with decreased accuracy while any sample greater than $N_{min}$ results in a model with no additional gain in accuracy (when compared to the model generated by $N_{min}$, see Figure 1). 5 Perhaps the best way to determine the smallest $n$ is the progressive sampling method proposed by Foster et al. [18]. Progressive sampling starts with a small sample and uses progressively larger ones until model accuracy no longer improves. A central component of progressive sampling is a sampling schedule $S = \{n_0, n_1, n_2, \ldots, n_k\}$ where each $n_i$ is an integer that specifies the size ....
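
The procedure described in the excerpt above (walk a schedule of increasing sample sizes and stop once accuracy no longer improves) can be sketched roughly as follows. This is an illustrative sketch only, not the cited authors' implementation: the geometric schedule, the `tolerance` threshold, and the `toy_accuracy` learning curve are all assumptions introduced here, and the names are hypothetical.

```python
from typing import Callable, List


def geometric_schedule(n0: int, total: int, factor: int = 2) -> List[int]:
    """Build a schedule n0, n0*factor, n0*factor^2, ... capped at `total`.

    The geometric growth is an assumption of this sketch; the excerpt only
    says the schedule is a sequence of integer sample sizes.
    """
    sizes = []
    n = n0
    while n < total:
        sizes.append(n)
        n *= factor
    sizes.append(total)  # always finish with the full data set
    return sizes


def progressive_sample(
    schedule: List[int],
    accuracy_at: Callable[[int], float],
    tolerance: float = 0.01,
) -> int:
    """Return the last sample size whose accuracy still improved.

    `accuracy_at(n)` stands in for "train on n examples and measure
    accuracy"; `tolerance` is a hypothetical stopping threshold.
    """
    best_n = schedule[0]
    best_acc = accuracy_at(best_n)
    for n in schedule[1:]:
        acc = accuracy_at(n)
        if acc - best_acc <= tolerance:
            return best_n  # the larger sample gave no meaningful gain
        best_n, best_acc = n, acc
    return best_n


def toy_accuracy(n: int) -> float:
    """Made-up learning curve that plateaus at 0.9 (illustration only)."""
    return min(0.9, 0.5 + n / 2000)


if __name__ == "__main__":
    schedule = geometric_schedule(100, 10000)
    print(schedule)
    print(progressive_sample(schedule, toy_accuracy))
```

In practice `accuracy_at` would train and evaluate a real model, so each call is expensive; the point of a growing schedule is that the early, cheap samples cost little compared with the final one.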

[Article contains additional citation context not shown here]

F. Provost, D. Jensen, and Tim Oates. Efficient progressive sampling. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15-18, 1999.


Machine Learning from Imbalanced Data Sets 101 (Extended Abstract) - Provost

....better to skew the training distribution toward the class with the larger proportion of small disjuncts. This makes sense, since much research has shown that small disjuncts are more error prone (Weiss, 2000). If this turns out to be so, it may be possible to design a progressive sampling strategy (Provost, Jensen, and Oates, 1999) that decides how next to sample by analyzing the prevalence of small disjuncts in the different classes. Is understanding these basics really worthwhile? I have tried to provide rationale here for understanding the basics, including: the need for control conditions for empirical analyses, the ....

Provost, F., D. Jensen and T. Oates (1999). "Efficient Progressive Sampling." In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining.


Types of Cost in Inductive Concept Learning - Turney (2000)   (10 citations)

....documents that we have correctly classified. As another example, in activity monitoring, if you issue an alarm twice in succession for the same problem, the benefit of the second alarm is less than the benefit of the first alarm, assuming both alarms are correct classifications (Fawcett and Provost, 1999). This is related to Section 2.2.2. 2.2.4 ERROR COST CONDITIONAL ON FEATURE VALUE The cost of making a classification error with a particular case may depend on the value of one or more features of the case. 3. Cost of Tests Each test (i.e. attribute, measurement, feature) may have an ....

....errors, and (4) the cost of acquiring cases for training data, then we can calculate the combined cost of training (building the model) and operating (using the model) as a function of training set size. We can then optimize the size of the training set to minimize this combined cost (Provost et al. 1999). Alternatively, an adaptive learning system, given (1) the expected number of classifications that the learned model will make when embedded in the operational system, (2) the cost of misclassification errors, and (3) the cost of acquiring cases for training data, could adjust its learning curve ....
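
The idea of choosing a training set size that minimizes the combined cost of acquiring cases and of making errors in operation can be illustrated with a small sketch. Everything here is an assumption made for illustration: the additive cost model, the parameter names, the candidate sizes, and the `toy_error_rate` curve are hypothetical and are not taken from the cited papers.

```python
from typing import Callable, Iterable


def combined_cost(
    n: int,
    error_rate: Callable[[int], float],
    n_classifications: int,
    cost_per_error: float,
    cost_per_case: float,
) -> float:
    """Cost of acquiring n training cases plus expected cost of errors in use.

    Assumes (for this sketch) that the two costs simply add and that
    `error_rate(n)` gives the model's error rate after training on n cases.
    """
    acquisition = n * cost_per_case
    operation = n_classifications * error_rate(n) * cost_per_error
    return acquisition + operation


def best_training_size(candidates: Iterable[int], **cost_params) -> int:
    """Pick the candidate size with the lowest combined cost."""
    return min(candidates, key=lambda n: combined_cost(n, **cost_params))


def toy_error_rate(n: int) -> float:
    """Made-up learning curve: error falls toward 5% as n grows."""
    return 0.05 + 10.0 / n


if __name__ == "__main__":
    size = best_training_size(
        [100, 1000, 10000],
        error_rate=toy_error_rate,
        n_classifications=100_000,
        cost_per_error=1.0,
        cost_per_case=1.0,
    )
    print(size)
```

With these invented numbers the middle candidate wins: too few cases leaves the error cost high, while too many cases costs more to acquire than the remaining errors would.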

Provost, F.J., Jensen, D., and Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, KDD-99.


Heterogeneous Multimedia Database Selection on the Web - Kim, Lee, Lee, Chung (2000)   (1 citation)

....i, q, LT) In order to estimate Diff, PSC(db_i, q, GT) in the formula (11) can be used. But PSC cannot be computed at the metaserver. Instead, we may fetch sample objects from local databases to the metaserver, extract features, and compute SSC. To do it, we use the progressive sampling method [PJO99] in the preprocessing phase. Progressive sampling starts with a small sample and uses progressively larger ones until the accuracy of SSC is not improved any more. In order to show the accuracy of the estimation method using SSC, we have to show that SSC converges to PSC. In order to ....

F. Provost, D. Jensen, T. Oates. Efficient Progressive Sampling. Proceedings of ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 23--32, Aug. 1999.


Uses of Convexity in Numerical Domain Partitioning - Elomaa, Rousu (2000)

....domains is a potential time consumption bottleneck of induction, since in the general case the number of possible partitions is exponential in the number of potential cut points within the domain. Numerical attributes have been observed to slow down, e.g., the C4.5 decision tree learning algorithm [14, 12]. Restricted classes of attribute evaluation functions have efficient optimization algorithms, but only the Training Set Error is known to optimize in linear time [1, 2, 11]. The class of so-called cumulative functions can be optimized in quadratic time in the number of possible cut points [6, ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 23--32, New York, 1999. ACM Press.


Intelligent Assistance for the Data Mining Process: . . . - Bernstein, Hill, Provost (2002)   Self-citation (Provost)

.... in question to predict which learning algorithm will yield the lowest error on the entire data set; the technique works remarkably well although it should be noted that for large data sets often one can achieve high accuracy with a surprisingly small subset of the data (cf. progressive sampling [Provost et al. 1999]). On the other hand, the relative performance of algorithms can change markedly with the amount of data [Perlich et al. 2001]. St. Amant and Cohen [1998] study intelligent, computer-based support for open-ended, statistical exploratory data analysis, which is akin to our approach. While ....

F. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23-32.


An Intelligent Assistant for the Knowledge Discovery Process - Bernstein, Provost (2001)   (6 citations)  Self-citation (Provost)

.... question to predict which learning algorithm will yield the lowest error on the entire data set; the technique works remarkably well although it should be noted that for large data sets often one can achieve maximal accuracy with a surprisingly small subset of the data (cf. progressive sampling [Provost et al. 1999]). The StatLog project 6 [Michie et al. 1994] has investigated what induction algorithms to use given particular circumstances. The knowledge generated from such projects could be of great use to populate the ontology, as well as to inform the construction of more advanced heuristic functions ....

F. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 23-32.


The Effect of Class Distribution on Classifier Learning: An.. - Weiss (2001)   (8 citations)  Self-citation (Provost)

....empirical study of the effect of class distribution on classifier learning. The goal of this article is not how to find the optimal class distribution efficiently. However, if in practice it is cost effective to procure data incrementally, we suggest that a progressive, adaptive, sampling strategy (Provost, Jensen, & Oates, 1999) be developed that incrementally requests new examples based on the improvement in classifier performance due to the recently added minority class and majority class examples. The best class distribution can then be estimated by using cross validation. Our results are based only on C4.5, a ....

Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM Press.


A Survey of Methods for Scaling Up Inductive Algorithms - Provost, Kolluri (1999)   (31 citations)  Self-citation (Provost)

.... In Section 7 we discuss a similar technique for determining the minimum number of training examples sufficient for satisfactory learning, namely, progressively sampling larger subsets until model performance no longer improves (John and Langley 1996; Frey and Fisher 1999; Provost, Jensen, and Oates 1999). 6.2.2. Select a subset of the features. So far, our discussion of data partitioning has focused on selecting a subset of the examples. Let us now turn to the problem of selecting a subset of features. It is important to consider the symmetry with selecting instance subsets: one method selects ....

.... that the run time complexity of inductive algorithms is at best linear in the number of examples, and often worse, relatively inexpensive experiments can be conducted on small samples in order to estimate the number of examples that are actually needed (John and Langley 1996; Frey and Fisher 1999; Provost, Jensen, and Oates 1999). In cases where the number of examples needed is much smaller than the number available, such procedures can provide substantial practical speedups. Subsets of the examples should be sampled, using stratified sampling when one class dominates strongly. Subsets of the features should also be ....

[Article contains additional citation context not shown here]

Provost, F., D. Jensen, and T. Oates (1999). Efficient progressive sampling. Technical report 99-14, Department of Computer Science, University of Massachusetts/Amherst.


The Effect Of Small Disjuncts And - Class Distribution On

No context found.

Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. ACM Press.


Generalization Methods in Bioinformatics - Eschrich, Chawla, Hall (2002)

No context found.

Foster Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), pages 23--32, 1999.


CONQUEST: A Distributed Tool for Constructing Summaries of.. - Chi, Koyuturk, Grama (2004)

No context found.

F. J. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Knowledge Discovery and Data Mining, pages 23--32, 1999.


Efficient Progressive Sampling for Association Rules - Parthasarathy (2002)

No context found.

Foster J. Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In KDD, 1999.


Bagging Is A Small-Data-Set Phenomenon - Nitesh Chawla Thomas (2001)   (2 citations)

No context found.

F.J. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23--32, 1999.

CiteSeer.IST - Copyright Penn State and NEC