by Geoff Hulten, Laurie Spencer, Pedro Domingos
http://www.cs.washington.edu/homes/pedrod/kdd01b.ps.gz

Abstract:

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.
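
The O(1)-versus-O(w) claim can be illustrated with a small sketch. The code below is not CVFDT itself. It only shows, under simplifying assumptions, how count statistics over a moving window of w examples can be kept current by adding the newest example and subtracting the oldest, instead of recounting the whole window on every arrival. The class name `WindowedCounts`, its methods, and the flat (attribute index, value, label) key layout are hypothetical choices made for this illustration.

```python
from collections import Counter, deque


class WindowedCounts:
    """Hypothetical helper: (attribute, value, label) counts over the last w examples."""

    def __init__(self, w):
        self.w = w                # window size
        self.window = deque()     # examples currently inside the window
        self.counts = Counter()   # counts keyed by (attr_index, value, label)

    def add(self, attrs, label):
        # Count the new example, then forget the oldest one if the window is full.
        self._update(attrs, label, +1)
        self.window.append((attrs, label))
        if len(self.window) > self.w:
            old_attrs, old_label = self.window.popleft()
            self._update(old_attrs, old_label, -1)

    def _update(self, attrs, label, delta):
        # Work is proportional to the number of attributes, not to w.
        for i, value in enumerate(attrs):
            key = (i, value, label)
            self.counts[key] += delta
            if self.counts[key] == 0:
                del self.counts[key]  # drop empty entries


wc = WindowedCounts(w=2)
wc.add(("a", "x"), 1)
wc.add(("a", "y"), 0)
wc.add(("b", "y"), 0)  # the first example falls out of the window here
```

Each call to `add` touches only the arriving example and, at most, one departing example, so its cost does not grow with the window size. Recounting the window from scratch on every arrival would instead cost time proportional to w.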
