Results 1 - 9 of 9
Automating the Construction of Internet Portals with Machine Learning
 Information Retrieval
, 2000
Abstract

Cited by 209 (4 self)
Domain-specific Internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately, these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
Using reinforcement learning to spider the Web efficiently
 Proceedings of the 16th International Conference on Machine Learning (ICML)
, 1999
Abstract

Cited by 128 (4 self)
Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of search engines and Web knowledge bases. This paper argues that the creation of efficient Web spiders is best framed and solved by reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give benefit only in the future. We present an algorithm for learning a value function that maps hyperlinks to future discounted reward using a naive Bayes text classifier. Experiments on two real-world spidering tasks show a threefold improvement in spidering efficiency over traditional breadth-first search, and up to a twofold improvement over reinforcement learning with immediate reward only.
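The value-function idea in this abstract can be sketched concretely: bin the discounted future reward of training links, fit a naive Bayes model on their anchor text, and rank the crawl frontier by expected reward. The class structure, the two-bin setup, and all the toy link texts below are hypothetical illustrations, not the paper's actual data or implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesValueFunction:
    """Maps hyperlink text to an expected (binned) future discounted reward."""

    def __init__(self, bin_values):
        self.bin_values = bin_values              # representative reward per bin
        self.word_counts = defaultdict(Counter)   # per-bin word frequencies
        self.bin_totals = Counter()               # training links per bin

    def train(self, examples):
        # examples: (anchor_text, bin_index) pairs from a labeled off-line crawl
        for text, b in examples:
            self.bin_totals[b] += 1
            self.word_counts[b].update(text.lower().split())

    def expected_reward(self, text):
        words = text.lower().split()
        vocab = {w for c in self.word_counts.values() for w in c}
        n = sum(self.bin_totals.values())
        log_post = {}
        for b in self.bin_totals:
            total = sum(self.word_counts[b].values())
            lp = math.log(self.bin_totals[b] / n)
            for w in words:  # Laplace-smoothed multinomial likelihood
                lp += math.log((self.word_counts[b][w] + 1) / (total + len(vocab)))
            log_post[b] = lp
        m = max(log_post.values())
        post = {b: math.exp(lp - m) for b, lp in log_post.items()}
        z = sum(post.values())
        # expectation of the bin reward under the posterior
        return sum(self.bin_values[b] * p / z for b, p in post.items())

vf = NaiveBayesValueFunction(bin_values={0: 0.0, 1: 1.0})
vf.train([("about us contact", 0), ("site map", 0),
          ("postscript paper download", 1), ("research papers and publications", 1)])
frontier = ["cv page", "technical reports and papers", "contact info"]
frontier.sort(key=vf.expected_reward, reverse=True)   # best-first crawl order
```

The spider then repeatedly pops the highest-valued link, fetches it, and inserts the newly found links back into the ranked frontier.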
A Machine Learning Approach to Building Domain-Specific Search Engines
 Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI)
, 1999
Abstract

Cited by 96 (5 self)
Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com.
Inductive Learning of Tree-based Regression Models
, 1999
Abstract

Cited by 21 (2 self)
This thesis explores different aspects of the induction of tree-based regression models from data. The main goal of this study is to improve the predictive accuracy of regression trees, while retaining as much as possible their comprehensibility and computational efficiency. Our study is divided into three main parts. In the first part we describe in detail two different methods of growing a regression tree: minimising the mean squared error and minimising the mean absolute deviation. Our study is particularly focussed on the computational efficiency of these tasks. We present several new algorithms that lead to significant computational speed-ups. We also describe an experimental comparison of both methods of growing a regression tree, highlighting their different application goals. Pruning is a standard procedure within tree-based models whose goal is to provide a good compromise between simple, comprehensible models and good predictive accuracy. In the second part of our study we describe a series of new techniques for pruning by selection from a series of alternative pruned trees. We carry out an extensive set of experiments comparing different methods of pruning, which show that our proposed techniques are able to significantly outperform the predictive accuracy of current state-of-the-art pruning algorithms in a large set of regression domains. In the final part of our study we present a new type of tree-based models that we refer to as local regression trees. These hybrid models integrate tree-based regression with local modelling techniques. We describe different types of local regression trees and show that these models are able to significantly outperform standard regression trees in terms of predictive accuracy. Through a large set of experiments we demonstrate the competitiveness of local regression trees when compared to existing regression techniques.
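As a concrete illustration of the first growing method mentioned in the abstract (minimising squared error), a least-squares split on a numeric attribute scans sorted values and scores each candidate threshold by the summed squared error of the two resulting leaves, each predicting its mean. This brute-force sketch and its toy data are ours, not the thesis's algorithms.

```python
def sse(ys):
    """Sum of squared errors of predicting the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_ls_split(xs, ys):
    """Return (total SSE, threshold) of the best least-squares split."""
    pairs = sorted(zip(xs, ys))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid threshold between equal attribute values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        err = sse(left) + sse(right)
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        if err < best[0]:
            best = (err, thr)
    return best

# toy data: the target jumps when the attribute crosses the 3-10 gap
err, thr = best_ls_split([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
```

Recomputing each leaf's SSE from scratch makes this quadratic in the worst case, which is exactly the inefficiency the thesis's incremental algorithms address.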
Software Defect Prediction Using Regression via Classification
Abstract

Cited by 4 (0 self)
In this paper we apply a machine learning approach called Regression via Classification (RvC) to the problem of estimating the number of defects. RvC first automatically discretizes the number of defects into a number of fault classes, then learns a model that predicts the fault class of a software system. Finally, RvC transforms the class output of the model back into a numeric prediction. This approach captures uncertainty in the models because, apart from a point estimate of the number of faults, it also outputs an associated interval of values within which this estimate lies, with a certain confidence. To evaluate this approach we perform a comparative experimental study of the effectiveness of several machine learning algorithms on a software dataset. The data was collected by Pekka Forselious and involves applications maintained by a Finnish bank.
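The three RvC steps described above (discretize, classify, map back with an interval) can be sketched as follows. The equal-width binning and nearest-centroid classifier are illustrative stand-ins, not the algorithms compared in the paper, and the toy defect data are invented.

```python
def discretize(y, n_classes):
    """Equal-width binning of a numeric target into fault classes."""
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_classes
    def cls(v):
        return min(int((v - lo) / width), n_classes - 1)
    return [cls(v) for v in y], cls

def fit_centroids(X, labels):
    """Per-class mean feature vector: a minimal stand-in classifier."""
    groups = {}
    for x, c in zip(X, labels):
        groups.setdefault(c, []).append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in groups.items()}

def predict(x, cents, class_stats):
    """Class output mapped back to a number plus its fault-class interval."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    c = min(cents, key=lambda c: dist(x, cents[c]))
    lo, hi, mean = class_stats[c]
    return mean, (lo, hi)

# toy data: features = (size in KLOC, complexity), target = defect count
X = [(1.0, 2.0), (1.2, 2.1), (8.0, 9.0), (8.5, 9.2)]
y = [3, 4, 30, 32]
labels, _ = discretize(y, 2)
stats = {}
for c, v in zip(labels, y):       # per-class min, max, sum, count
    lo, hi, s, n = stats.get(c, (v, v, 0, 0))
    stats[c] = (min(lo, v), max(hi, v), s + v, n + 1)
class_stats = {c: (lo, hi, s / n) for c, (lo, hi, s, n) in stats.items()}
cents = fit_centroids(X, labels)
est, interval = predict((8.2, 9.1), cents, class_stats)
```

Any off-the-shelf classifier can replace the centroid step; the numeric estimate here is the class mean and the interval is the class's observed defect range.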
NAÏVE BAYESIAN CLASSIFIER FOR ONLINE REMAINING USEFUL LIFE PREDICTION OF DEGRADING BEARINGS
Abstract
In this paper, the estimation of the Residual Useful Life (RUL) of degraded thrust ball bearings is performed using a data-driven stochastic approach that relies on an iterative Naïve Bayesian Classifier (NBC) for a regression task. NBC is a simple stochastic classifier based on applying Bayes' theorem for posterior estimate updating. The implemented iterative procedure allows the RUL estimate to be updated as new information is collected by sensors located on the degrading bearing, and is suitable for online monitoring of the component's health status. The feasibility of the approach is shown on real-world vibration-based degradation data.
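The iterative Bayesian update described above can be sketched with a toy degradation model: maintain a posterior over candidate degradation rates and apply Bayes' theorem at each new sensor reading; the RUL then follows from the posterior-mean rate. The linear degradation model, Gaussian noise, and every number below are our assumptions for illustration, not the paper's setup.

```python
import math

rates = [0.5, 1.0, 2.0]                           # candidate rates (units/hour)
posterior = {r: 1 / len(rates) for r in rates}    # uniform prior
failure_level, sigma = 100.0, 2.0                 # assumed threshold and noise

def update(posterior, t, reading):
    """One Bayes'-theorem step with a Gaussian likelihood around r * t."""
    lik = {r: math.exp(-((reading - r * t) ** 2) / (2 * sigma ** 2))
           for r in posterior}
    z = sum(posterior[r] * lik[r] for r in posterior)
    return {r: posterior[r] * lik[r] / z for r in posterior}

# simulated vibration-feature stream: readings close to the rate-2 trajectory
for t, reading in [(10, 19.0), (20, 41.0), (30, 61.5)]:
    posterior = update(posterior, t, reading)

rate = sum(r * p for r, p in posterior.items())   # posterior-mean rate
t_now = 30
rul = failure_level / rate - t_now                # hours until the threshold
```

Each new reading sharpens the posterior, which is what makes the estimate suitable for online monitoring: the RUL prediction tightens as evidence accumulates.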
Chapter 6 Conclusions
Abstract
Least squares (LS) regression trees generated with this simplification are very efficient in computational terms. These techniques can easily handle data sets with hundreds of thousands of cases in a few seconds. In effect, our simulation studies confirmed a clearly linear dependence of the computation time on the number of cases. This can be regarded as a crucial property when facing large regression problems, which was the main motivation behind our study of LS trees. With respect to least absolute deviation (LAD) trees we have presented a theoretical analysis of this methodology, leading to a series of algorithms that ensure high computational efficiency in the task of finding the best split for each tree node. We have also attempted to prove that a theorem by Breiman et al. (1984) concerning the issue of finding the best split for discrete variables was also applicable to LAD trees. Although we were not able to obtain a proof of its validity we have encountered a co
Chapter 6 Conclusions
, 199
Abstract
Least squares (LS) regression trees generated with this simplification are very efficient in computational terms. These techniques can easily handle data sets with hundreds of thousands of cases in a few seconds. In effect, our simulation studies confirmed a clearly linear dependence of the computation time on the number of cases. This can be regarded as a crucial property when facing large regression problems, which was the main motivation behind our study of LS trees. With respect to least absolute deviation (LAD) trees we have presented a theoretical analysis of this methodology, leading to a series of algorithms that ensure high computational efficiency in the task of finding the best split for each tree node. We have also attempted to prove that a theorem by Breiman et al. (1984) concerning the issue of finding the best split for discrete variables was also applicable to LAD trees. Although we were not able to obtain a proof of its validity we have encountered a counterexam
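The claimed linear dependence on the number of cases follows from an incremental formulation of the split search: using the identity SSE = sum(y^2) - (sum(y))^2 / n, one sweep over the sorted cases evaluates every candidate threshold with O(1) running-sum updates. A minimal sketch of this idea (the function and variable names are ours):

```python
def best_split_linear(xs, ys):
    """Best least-squares split in one pass over the sorted cases."""
    pairs = sorted(zip(xs, ys))              # sorting is done once up front
    n = len(pairs)
    tot_s = sum(y for _, y in pairs)         # total sum of targets
    tot_q = sum(y * y for _, y in pairs)     # total sum of squared targets
    s = q = 0.0
    best_err, best_thr = float("inf"), None
    for i in range(1, n):
        y = pairs[i - 1][1]
        s += y                               # running left-side sum
        q += y * y                           # running left-side sum of squares
        if pairs[i][0] == pairs[i - 1][0]:
            continue                         # no threshold between equal values
        left = q - s * s / i                 # SSE of left leaf, O(1)
        right = (tot_q - q) - (tot_s - s) ** 2 / (n - i)
        if left + right < best_err:
            best_err = left + right
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_err, best_thr

err, thr = best_split_linear([1, 2, 3, 10, 11, 12], [5, 6, 5, 20, 21, 20])
```

Because each candidate costs constant time, the whole node evaluation scales with the number of cases, matching the linear computation-time dependence reported above.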
Solving Regression Problems with Rule-based Ensemble Classifiers
Abstract
We describe a lightweight learning method that induces an ensemble of decision-rule solutions for regression problems. Instead of direct prediction of a continuous output variable, the method discretizes the variable by k-means clustering and solves the resultant classification problem. Predictions on new examples are made by averaging the mean values of classes with votes that are close in number to the most likely class. We provide experimental evidence that this indirect approach can often yield strong results for many applications, generally outperforming direct approaches such as regression trees and rivaling bagged regression trees.
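A sketch of this indirect recipe: cluster the continuous target with one-dimensional k-means, train any classifier (here represented only by its vote counts) on the resulting labels, and predict by averaging the class means of all classes whose votes come close to the winner's. The quantile initialization, the hypothetical vote table, and the 0.8 closeness margin are our illustrative choices, not the paper's.

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means; initialized at evenly spaced quantiles (k >= 2)."""
    vals = sorted(values)
    centers = [vals[round(i * (len(vals) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for v in values:
            i = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in clusters.items()]
    return centers

def soft_prediction(votes, class_means, margin=0.8):
    """Average the means of all classes voted within `margin` of the winner."""
    top = max(votes.values())
    close = [c for c, v in votes.items() if v >= margin * top]
    return sum(class_means[c] for c in close) / len(close)

# toy continuous targets with three natural groups
targets = [1.0, 1.2, 0.9, 5.0, 5.2, 9.8, 10.1]
centers = sorted(kmeans_1d(targets, 3))   # class means, low to high

# hypothetical vote counts from the rule ensemble for one new example:
# classes 1 and 2 receive nearly equal support, so both contribute
votes = {0: 2, 1: 9, 2: 8}
pred = soft_prediction(votes, centers)
```

Averaging over near-tied classes is what softens the discretization error: a hard argmax would commit to one class mean, while close votes pull the prediction between neighbouring clusters.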