31 citations found.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998.

This paper is cited in the following contexts:

CMP: A Fast Decision Tree Classifier Using Multivariate.. - Wang, Zaniolo (2000)   (1 citation)  (Correct)

.... [03] read every record from dataset and update histogram matrix in root node; [05] For each record r in database do [10] End For [11] sort records in the buffer to derive a best split point (for the last split) [13] update histogram matrix for records in the buffer [15] For each node do [16] compute gini min and gini along X and Y axis for each histogram matrix; [17] split node into 2 or 3 subnodes depending on the number of alive intervals; [18] if split is on X axis and there is only 1 alive interval, split the subnodes again; [19] Call predictSplit to predict next splitting ....

....is decided by the truth value of Function f : f : age 40) salary commission 100, 000) 8 optimal split current split Figure 11. Choosing a Best Split Position on Linear Combination of 2 Variables Figure 9 shows the decision tree built by algorithms such as Sprint or RainForest [16] for Function f. The decision tree tends to be very big and is hardly comprehensible. On the contrary, an optimal decision tree (Figure 1(b)) will have only 2 levels and be very easy for humans to understand. Deriving splitting line CMP takes advantage of the two dimensional histogram matrices ....

[Article contains additional citation context not shown here]

Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti. "RainForest: A Framework for Fast Decision Tree Construction of Large Datasets." In Proceedings of the 24th VLDB Conference, New York, USA, VLDB 1998.


CPAR: Classification based on Predictive Association Rules - Yin, Han (2003)   (Correct)

....its weight is decreased by multiplying a factor. This weighted version of FOIL produces more rules and each positive example is usually covered more than once. The most time consuming part of FOIL is evaluating every literal when searching for the one with the highest gain. In fact, similar to [4], to calculate the gain, we only need to know the information stored in a data structure called PNArray. Definition 3.1 (PNArray). A PNArray stores the following information corresponding to rule r. 1. P and N : the numbers of positive and negative examples satisfying r's body. 2. P (p) and ....
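The excerpt above says that the gain of a literal can be computed from stored counts alone. A minimal sketch of that idea, assuming the standard FOIL gain formula: `p0` and `n0` are the positive and negative counts covered by the current rule, and `p1` and `n1` are the counts still covered after a candidate literal is added. The function name and parameter names are illustrative and do not come from the cited paper.

```python
import math


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """Standard FOIL gain computed from coverage counts only.

    p0, n0: positives/negatives covered by the current rule (p0 must be > 0).
    p1, n1: positives/negatives covered after adding the candidate literal.
    All names here are hypothetical; only the counts are needed, not the data.
    """
    if p0 <= 0:
        raise ValueError("current rule must cover at least one positive example")
    if p1 == 0:
        # A literal that keeps no positive example contributes no gain.
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)


if __name__ == "__main__":
    # Rule covers 4 positives and 4 negatives; the literal keeps 2 positives only.
    print(foil_gain(4, 4, 2, 0))
```

Because the function touches only four integers, every candidate literal can be scored without rescanning the examples, which is the point the excerpt makes.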

J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. In VLDB'98, pp. 416-427, New York, NY, Aug. 1998.


On Effective Classification of Strings with Wavelets - Aggarwal   (Correct)

....often arises in the context of customer profiling, target marketing, medical diagnosis, and speech recognition. Examples of techniques which are often used for classification in the data mining domain include decision trees, rule based classifiers, nearest neighbor techniques and neural networks [5, 6, 7, 8]. A detailed survey of classification methods may be found in [8]. The string domain provides some interesting applications of the classification problem. An important example is the biological domain in which large amounts of data have become available in the last few years. Applications of DNA ....

J. Gehrke, R. Ramakrishnan, V. Ganti. Rainforest- A Framework for Fast Decision Tree Construction of Large Data Sets. VLDB Conference, 1998.


SQL Database Primitives for Decision Tree Classifiers - Sattler, Dunemann (2001)   (1 citation)  (Correct)

....API like OLE DB for Data Mining [17] or user defined types and methods as proposed for SQL MM Part 6. (3) Finally, a DBMS could provide special operators or primitives, which are generally useful for data mining but not implementing a particular data mining task, e.g. the AVC sets described in [9]. The advantage of approach (3) is the usefulness for a broader range of data mining functions and obviously, both the language and the API approaches could benefit from such primitives. Moreover, if we consider the complexity of the SQL 99 standard and the extent of the features currently ....

....seems to be the most promising approach. In this paper we present results of our work on database primitives for decision tree classifiers. Classification is an important problem in data mining and well studied in the literature. Furthermore, there are proposals for classifier operations, e.g. [9], which form the basis for our work. We extend the idea of computing AVC groups or CC tables respectively to implement a SQL operator for a commercial DBMS. We evaluate the benefit of multi dimensional hashing for speeding up partial match queries, which are typical queries in decision tree ....

[Article contains additional citation context not shown here]

J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. In A. Gupta, O. Shmueli, and J. Widom, editors, Proc. VLDB'98, New York, USA, pages 416--427. Morgan Kaufmann, 1998.


Hierarchical Classification of Documents with Error Control - Cheng, Tang, Fu, King (2001)   (4 citations)  (Correct)

....expensive. As our algorithms can be faster than flat classification at a taxonomy of as low as four levels (Data Four) they represent a good trade off between speed and accuracy for most applications. 5 Related Work and Conclusion Classification has been studied extensively in the last decades [2, 3, 9, 13,15, 14, 17, 21]. However, most of the work on the classification ignores the hierarchical structure of classes. In [1] the authors explore the hierarchical structure of attributes to improve the efficiency, but assume only a single level of classes. The work reported in [4, 12] propose hierarchical ....

J. Gehrke, R. Ramakrishnan and V. Ganti, "Rainforest - a framework for fast decision tree construction of large datasets", Proc. of VLDB, 1998, pp 416-427.


Efficient Algorithms for Constructing Decision Trees with.. - Bell (2000)   (2 citations)  (Correct)

....and real life data sets. Finally, Section 6 concludes the paper. 2 Preliminaries In this section, we present a brief overview of the building and pruning phases of a traditional decision tree classifier. More detailed descriptions of existing decision tree induction algorithms can be found in [6, 12, 23, 27]. 2.1 Tree Building Phase The overall algorithm for growing a decision tree classifier is depicted in Figure 2(a). Basically, the tree is built breadth first by recursively partitioning the data until each partition is pure (i.e. it only contains records belonging to the same class). The ....

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. "RainForest - A Framework for Fast Decision Tree Construction of Large Datasets". In Proceedings of the 24th International Conference on Very Large Data Bases, New York, USA, August 1998.


Clustering Through Decision Tree Construction - Liu, Xia, Yu (2000)   (7 citations)  (Correct)

....When the dataset is too large, techniques from the database community can be used to scale up the algorithm so that the entire dataset is not required in memory. [4] introduces an interval classifier that uses data indices to efficiently retrieve portions of data. SPRINT [27] and RainForest [18] propose two scalable techniques for decision tree building. For example, RainForest only keeps an AVC set (attribute value, classLabel and count) for each attribute in memory. This is sufficient for tree building and gain evaluation. It eliminates the need to have the entire dataset in memory. ....
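The AVC set described in the excerpt above (attribute value, class label, count) can be sketched as nested counters built in one pass over the records. This is an illustrative sketch, not the RainForest implementation: the function names, the dictionary-based record layout, and the choice of a binary "value versus rest" gini split are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_avc_sets(records, attributes, label_key):
    """One pass over the records: avc[attr][value][class_label] -> count."""
    avc = {attr: defaultdict(Counter) for attr in attributes}
    for rec in records:
        for attr in attributes:
            avc[attr][rec[attr]][rec[label_key]] += 1
    return avc


def gini(class_counts):
    """Gini impurity of a class-count table."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())


def split_gini(avc_for_attr, value):
    """Weighted gini of the binary split 'attr == value' versus the rest.

    Only the AVC set is consulted; the original records are not needed.
    """
    left = avc_for_attr.get(value, Counter())
    right = Counter()
    for other_value, counts in avc_for_attr.items():
        if other_value != value:
            right.update(counts)
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    if n == 0:
        return 0.0
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


if __name__ == "__main__":
    # Hypothetical toy records, invented for this example.
    data = [
        {"age": "young", "buys": "no"},
        {"age": "young", "buys": "yes"},
        {"age": "old", "buys": "yes"},
    ]
    avc = build_avc_sets(data, ["age"], "buys")
    print(split_gini(avc["age"], "young"))
```

The memory footprint depends on the number of distinct attribute values and class labels rather than on the number of records, which is why the counts suffice for gain evaluation on datasets that do not fit in memory.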

....8 end 9 end 10 bestCut = d i cut3 (or d i cut2 if there is no d i cut3) of dimension i whose r density i is the minimal among the d dimensions. Figure 10. Determining the best cut in CLTree This algorithm can also be scaled up using the existing decision tree scale up techniques in [18, 27] since the essential computation here is the same as that in decision tree building, i.e. the gain evaluation. Our new criterion simply performs the gain evaluation more than once. 3. User Oriented Pruning of Cluster Trees The recursive partitioning method of building cluster trees will divide ....

J. Gehrke, R. Ramakrishnan, V. Ganti. "RainForest - A framework for fast decision tree construction of large datasets." VLDB-98, 1998.


Clustering Through Decision Tree Construction - Liu, Xia, Yu (2000)   (7 citations)  (Correct)

....tree algorithms: Traditionally, a decision tree algorithm requires the whole data to reside in memory. When the dataset is too large, techniques from the database community can be used to scale up the algorithm so that the entire dataset is not required in memory. SPRINT [28] and RainForest [17] propose two scalable techniques for decision tree building. For example, RainForest only keeps an AVC set (attribute value, classLabel and count) for each attribute in memory. This is sufficient for tree building and gain evaluation. It eliminates the need to have the entire dataset in memory. ....

....the overall best cut is because it is desirable to split at the cut point that may result in a big empty (N) region (e.g. between d 2 cut2 and d 2 cut3) which is more likely to separate clusters. Our algorithm can also be scaled up using the existing decision tree scale up techniques in [17, 18, 28] since the essential computation here is the same as that in decision tree building, i.e. the gain evaluation. Our new criterion simply performs the gain evaluation more than once. See [25] for details on the scale up. 3. USER ORIENTED PRUNING OF CLUSTER TREES The recursive partitioning method ....

J. Gehrke, R. Ramakrishnan, V. Ganti. "RainForest - A framework for fast decision tree construction of large datasets." VLDB-98, 1998.


Constructing Classification Trees with Exception Annotations for.. - Li (1999)   (Correct)

....large datasets, its classification accuracy is not as high as that of the classifier directly built using all of the data at once. Recent studies on classification in data mining contribute more towards the handling of large databases, such as SLIQ [30], SPRINT [45], PUBLIC [40] and RainForest [19]. The SLIQ [30] and SPRINT [45] algorithms handle disk resident datasets that are too large to fit in memory. Both algorithms generate binary classification trees using the gini index for attribute splitting and MDL pruning [31]. Both SLIQ and SPRINT define the use of new data structures to ....

....not require rewriting the lists during a split. Instead, a pointer field of the class list is simply modified. When the class list is too large to fit in memory, SPRINT shows superior performance. The combination of these two algorithms has been suggested for improved scalability [45]. RainForest [19] proposes a framework that separates the scalability aspect of classification tree induction from the splitting and pruning criteria. This generic algorithm is based on the following two facts. First, almost all of the algorithms for classification tree induction in the ....

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416--427, New York, NY, August 1998.


A Framework for Measuring Changes in Data Characteristics - Venkatesh Ganti Johannes (1999)   (23 citations)  Self-citation (Gehrke Ramakrishnan Ganti)   (Correct)

No context found.

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, Proceedings pages 416--427, New York, New York, August 1998. Morgan Kaufmann.


SECRET: A Scalable Linear Regression Tree Algorithm - Dobra, Gehrke (2002)   (194 citations)  Self-citation (Gehrke)   (Correct)

....problem allows us to avoid forming and solving the large number of linear systems of equations required for an exhaustive search method such as the method used by RETIS [9]. Even more, scalable versions of the EM algorithm for Gaussian mixtures [2] and classification tree construction [7] can be used to improve the scalability of the proposed solution. An extra benefit of the method is the fact that good oblique splits can be easily obtained. The rest of the paper is organized as follows. In Section 2 we give short introductions to classification and regression tree construction ....

....two classes, the prediction is only slightly improved. The solution adopted by Chaudhuri et al. is to use Quadratic Discriminant Analysis (QDA) to determine the split point. 4. SECRET For constant regression trees, algorithms for scalable classification trees can be straightforwardly adapted [7]. The main obstacle in doing a similar adaptation for linear regression trees is that the problem of partitioning the domain of a discrete variable in two parts is intractable. Also the amount of sufficient statistics that has to be maintained for the split increases drastically: For constant ....

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest -- a framework for fast decision tree construction of large datasets. In Proceedings of the 24th International Conference on Very Large Databases, pages 416--427. Morgan Kaufmann, August 1998.


A Framework for Measuring Differences in Data.. - Ganti, Ramakrishnan.. (1999)   (2 citations)  Self-citation (Gehrke Ramakrishnan Ganti)   (Correct)

....1 Introduction The goal of data mining is to discover (predictive) models based on the data maintained in the database [FPSSU96]. Several algorithms have been proposed for computing novel models [AGGR98, AIS93, AMS 96, MAR96, NH94], for more efficient model construction [BMUT97, EKX95, GRG98, GKR98, GRS98, PCY95, RS98, SON95, SAM96, ZRL96], and to deal with new data types [GRG 99, GKR98, GRS99, GGR99]. There is, however, no work addressing the important issue of how to measure the difference, or deviation, between two models. As a motivating example, consider the following ....

....selected four functions (Functions F1, F2, F3, and F4) for our performance study. We use NM.Fnum to denote a dataset with N million tuples generated using classification function num. We used a scalable version of the widely studied CART [BFOS84] algorithm implemented in the RainForest framework [GRG98] to construct decision tree models. We used (fa ;gsum ) to compute the deviation between two models. Table 2 shows the significance of the decrease in sample deviations for the dataset 1M.F1 as the sample size is increased. The significance is measured using the Wilcoxon test on sets of 50 ....

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. Rainforest -- a framework for fast decision tree construction of large datasets. In Proceedings of the 24th International Conference on Very Large Databases, pages 416--427. Morgan Kaufmann, August 1998.


CrossMine: Efficient Classification Across Multiple Database .. - Xiaoxin Yin Uiuc (2004)   (1 citation)  (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998.


Mining Data Streams Using Option Trees - Holmes, Kirkby, Pfahringer (2004)   (Correct)

No context found.

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. Rainforest - a framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery, 4(2/3):127--162, 2000.


Trends in Data Mining and Knowledge Discovery - Kurgan (2005)   (1 citation)  (Correct)

No context found.

Gehrke, J., Ramakrishnan, R., and Ganti, V., RainForest - a Framework for Fast Decision Tree Construction of Large Datasets, Proceedings of the 24th International Conference on Very Large Data Bases, San Francisco, pp. 416-427, 1998


Shared Memory Parallelization of Data Mining Algorithms.. - Jin, Yang, Agrawal (2004)   (1 citation)  (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti, "Rainforest---A Framework for Fast Decision Tree Construction of Large Datasets," Proc. Conf. Very Large Databases (VLDB), 1998.


Communication and Memory Efficient Parallel Decision Tree.. - Jin, Agrawal (2003)   (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In VLDB, 1998.


Combi-Operator - Database Support for Data Mining.. - Hinneburg, Habich, Lehner (2003)   (Correct)

No context found.

Johannes Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In VLDB'98, Proceedings of 24th International Conference on Very Large Data Bases, August 24-27, 1998.


Shared Memory Parallelization of Data Mining Algorithms.. - Jin, Agrawal (2002)   (1 citation)  (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In VLDB, 1998.


CrossMine: Efficient Classification Across Multiple.. - Yin, Han, Yang, Yu (2004)   (1 citation)  (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998.


A Novel Evolutionary Data Mining Algorithm - With Applications To   (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest -- A framework for fast decision tree construction of large datasets," in Proc. 24th Int. Conf. Very Large Data Bases, New York, 1998, pp. 416--427.


FARMER: Finding Interesting Rule Groups in Microarray - Cong (2004)   (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98).


Efficient Decision Tree Construction on Streaming Data - Jin, Agrawal (2003)   (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In VLDB, 1998.


Thesis Proposal - Ruoming Jin Department   (Correct)

No context found.

J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest - a framework for fast decision tree construction of large datasets. In VLDB, 1998.


Hierarchical Classification of Documents with Error Control - Cheng, Tang, Fu, King (2001)   (4 citations)  (Correct)

No context found.

J. Gehrke, R. Ramakrishnan and V. Ganti, "Rainforest - a framework for fast decision tree construction of large datasets", Proc. of VLDB, 1998, pp 416-427.

CiteSeer.IST - Copyright Penn State and NEC