8 citations found.
F.J. Provost, D. Jensen, and T. Oates, Efficient Progressive Sampling, in Proc. of the 5th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 23--32, 1999.

This paper is cited in the following contexts:
Sequential Sampling Algorithms: Unified Analysis and Lower.. - Gavalda, Watanabe (2001)   (1 citation)

....But they focused mostly on minimizing the number of experiments for hypothesis testing. Only very recently, some sequential sampling algorithms have been developed that (i) can be applied to current, general, KDD tasks and that (ii) have theoretical guarantees of correctness and performance [5, 6, 20, 21, 22]. In these works, the algorithms have been experimentally tested against those using non-sequential sampling, or no sampling at all, and have systematically outperformed them. Typically, the performance of these algorithms depends on the quality of the data rather than the size of the database; so ....

....this reduces a log t factor in the running time to some log log t factor. In [6], algorithms using this trick are referred to as geometric algorithms, by comparison with arithmetic algorithms that check some stopping condition at every step, or after each constant number of steps as in [13]. In [20], sampling schedules for machine learning problems are considered. Independently, the terms arithmetic and geometric schedules are proposed there to distinguish between essentially the same strategies. Also, geometric schedules are proved to be optimal in some precise sense. 4 Case Study (1) ....

F.J. Provost, D. Jensen, and T. Oates, Efficient Progressive Sampling, in Proc. of the 5th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 23--32, 1999.
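
The arithmetic/geometric distinction drawn in the context above can be illustrated with a minimal sketch. It is not taken from either cited paper; the function names and the parameters `n0`, `step`, `factor`, and `n_max` are assumptions chosen for illustration. An arithmetic schedule grows the sample by a constant amount at each check, while a geometric schedule multiplies it by a constant factor.

```python
def arithmetic_schedule(n0, step, n_max):
    """Sample sizes n0, n0 + step, n0 + 2*step, ..., ending with n_max."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n += step
    sizes.append(n_max)  # always finish with the full data size
    return sizes


def geometric_schedule(n0, factor, n_max):
    """Sample sizes n0, n0*factor, n0*factor**2, ..., ending with n_max."""
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        # max(...) guarantees progress even if factor is close to 1
        n = max(n + 1, int(n * factor))
    sizes.append(n_max)
    return sizes


if __name__ == "__main__":
    # Number of stopping-condition checks needed to reach 100,000 examples.
    print(len(arithmetic_schedule(100, 100, 100_000)))
    print(len(geometric_schedule(100, 2, 100_000)))
```

Under these assumed settings the geometric schedule needs far fewer checks to reach the full data size, which is the intuition behind the reduced logarithmic factor mentioned in the context.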


Mining Complex Models from Arbitrarily Large Databases in.. - Hulten, Domingos (2002)   (2 citations)

....AdaSelect algorithm [2]. Our method goes beyond these in applying to any type of discrete search, providing new formal results, working within pre-specified memory limits, supporting interleaving of search steps, learning from time-changing data, etc. A related approach is progressive sampling [14, 15], where successively larger samples are tried, a learning curve is fit to the results, and this curve is used to decide when to stop. This may lead to stopping earlier than with our method, but stopping can also occur prematurely, due to the difficulty in reliably extrapolating learning curves. ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proc. 5th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 23--32, San Diego, CA, 1999.
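
The progressive sampling loop described in the context above can be sketched roughly as follows. This is a simplified illustration, not the procedure from the cited paper: instead of fitting a learning curve, it uses a plain plateau test (stop when the accuracy gain between consecutive sample sizes drops below a tolerance). The names `progressive_sample`, `evaluate`, `tol`, and the synthetic `toy_accuracy` curve are all assumptions.

```python
import math


def progressive_sample(evaluate, schedule, tol=0.01):
    """Try successively larger samples until accuracy stops improving.

    evaluate: hypothetical callable mapping a sample size to an accuracy,
              e.g. by training and testing a model on that many examples.
    schedule: non-empty list of increasing sample sizes.
    tol:      minimum accuracy gain required to keep going.
    Returns (size, accuracy) at the point where sampling stopped.
    """
    prev_acc = None
    for n in schedule:
        acc = evaluate(n)
        if prev_acc is not None and acc - prev_acc < tol:
            return n, acc  # gain too small: treat as the plateau
        prev_acc = acc
    return schedule[-1], prev_acc  # schedule exhausted without a plateau


def toy_accuracy(n):
    """Synthetic learning curve standing in for real training runs."""
    return 0.9 - 0.9 / math.sqrt(n)


if __name__ == "__main__":
    sizes = [100, 200, 400, 800, 1600, 3200, 6400]
    print(progressive_sample(toy_accuracy, sizes, tol=0.01))
```

A simple local test like this is exposed to the premature-stopping risk noted in the context: a temporarily flat stretch of the curve is indistinguishable from the true plateau.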


Distributed Pasting of Small Votes - Bowyer (2002)   (1 citation)

....Bank [16]. The size of these important datasets poses a challenge for developers of machine learning algorithms and software: how to construct accurate and efficient models. The machine learning community has essentially focused on two directions to deal with massive datasets: data subsampling [14, 17], and the design of parallel or distributed algorithms capable of handling all the data [5, 15, 10, 6]. The latter approaches try to bypass the need for loading the entire dataset into the memory of a single computer by distributing the dataset across a group of computers. Evaluating these two ....

Provost, F., Jensen D., Oates, T.: Efficient progressive sampling. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (1999) 23 -- 32.


Efficiently Determine the Starting Sample Size for.. - Gu, Liu, Hu, Liu

....linearly increases with the size of training data, and additional complexity in the tree results in no significant increase in model accuracy [9]. Another recent work on improving the efficiency of tree building for large data sets is Progressive Sampling (PS for short) proposed by Provost et al. [11]. By means of a learning curve (see an example learning curve in Figure 1) which depicts the relationship between sample size and ....

....(the highest achievable on the entire data) by feeding a learning algorithm with progressively larger samples. Assuming a well-behaved learning curve, it will stop at a size equal to or slightly larger than the optimal sample size (OSS for short) corresponding to the optimal model accuracy. [11] shows that PS is more efficient than using the entire data. It also avoids loading the entire data into memory, and can produce a less complex tree or model (if the real OSS is far less than the total data size). In this paper, we restrict our attention to PS and aim to improve it further by ....

[Article contains additional citation context not shown here]

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of KDD'99. AAAI/MIT Press, 1999.


Bagging-Like Effects for Decision Trees and Neural Nets in.. - Thomas (2001)

....strategy for combining rules, the accuracy usually did not decrease for a small number of partitions, at least on the datasets that were tested. Our current work is similar to this, but focuses on comparison of bagging-like approaches to simple partitioning of large datasets. Provost et al. [22] found that subsampling the data gave the same accuracy as learning from the entire dataset at much lower computational cost. They analyzed progressive sampling methods, progressively increasing the sample size until the model accuracy was maintained. It was found that adding more training ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23--32, 1999.


Modelling Classification Performance for Large Data Sets - An.. - Gu, Hu, Liu (2001)   (1 citation)

....testing data). Theoretical and empirical studies have suggested that learning curves typically have a fast-increasing portion early in the curve, followed by a relatively slow-increasing portion and finally a plateau portion when the learning accuracy no longer increases with more training data [13]. In practice, the learning curve can be modelled by fitting a group of (Size, Acc) points (called learning points hereafter in this paper) in order to avoid the redundant and expensive work of learning every size of a data set [9]. The learning curve, if available, can be very useful: 1) to ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'99), pages 23--32. AAAI/MIT Press, 1999.
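
Fitting a curve to a handful of (Size, Acc) learning points, as the context above describes, might look like the following sketch. The functional form is an assumption, not something stated in the excerpt: it models the error as a power law, error = b * size^(-c), which implies accuracy approaches 1 as the sample grows. Real learning curves usually plateau below 1, so a practical model would need an extra asymptote parameter. The function names are hypothetical.

```python
import math


def fit_power_law(points):
    """Least-squares fit of (1 - acc) = b * size**(-c) on log-log axes.

    points: list of (size, acc) pairs with at least two distinct sizes,
            size > 0 and 0 < acc < 1.
    Returns (b, c).
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(1.0 - acc) for _, acc in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # equals -c under the model
    intercept = mean_y - slope * mean_x  # equals log(b) under the model
    return math.exp(intercept), -slope


def predict_accuracy(b, c, size):
    """Accuracy predicted by the fitted curve at a given sample size."""
    return 1.0 - b * size ** (-c)


if __name__ == "__main__":
    # Learning points generated from b = 2, c = 0.5 for illustration only.
    learning_points = [(100, 0.80), (400, 0.90), (1600, 0.95)]
    b, c = fit_power_law(learning_points)
    print(b, c, predict_accuracy(b, c, 6400))
```

Once fitted, the curve gives an estimate of accuracy at sizes that were never trained on, which is the saving the context refers to.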


Mining High-Speed Data Streams - Domingos, Hulten (2000)   (81 citations)

....to the batch tree were much looser than those derived here for Hoeffding trees, and it was only tested on repeatedly sampled small datasets. Gehrke et al.'s BOAT [5] learned an approximate tree using a fixed-size subsample, and then refined it by scanning the full database. Provost et al. [14] studied different strategies for mining larger and larger subsamples until accuracy (apparently) asymptotes. In contrast to systems that learn in main memory by subsampling, systems like SLIQ [10] and SPRINT [17] use all the data, and concentrate on optimizing access to disk by always reading ....

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23--32, San Diego, CA, 1999. ACM Press.


Intelligent Agents in Electronic Markets for.. - Aron, Sundararajan.. (2001)

No context found.

Provost, F., Jensen, D. and Oates, T. 1999. Efficient Progressive Sampling.

CiteSeer.IST - Copyright Penn State and NEC