Compression-Based Discretization of Continuous Attributes
Proceedings of the 12th International Conference on Machine Learning, 1995
Cited by 38 (0 self)
Discretization of continuous attributes into ordered discrete attributes can be beneficial even for propositional induction algorithms that are capable of handling continuous attributes directly. Benefits include possibly large improvements in induction time, smaller sizes of induced trees or rule sets, and even improved predictive accuracy. We define a global evaluation measure for discretizations based on the so-called Minimum Description Length (MDL) principle from information theory. Furthermore, we describe the efficient algorithmic usage of this measure in the MDL-Disc algorithm. The new method solves some problems of alternative local measures used for discretization. Empirical results in a few natural domains and extensive experiments in an artificial domain show that MDL-Disc scales up well to large learning problems involving noise.
1 Financial support for the Austrian Research Institute for Artificial Intelligence is provided by the Austrian Federal Ministry of Science and...
Zeta: A Global Method for Discretization of Continuous Variables
1997
Cited by 19 (2 self)
Discretization of continuous variables so they may be used in conjunction with machine learning or statistical techniques that require nominal data is an important problem to be solved in developing generally applicable methods for data mining. This paper introduces a new technique for discretization of such variables based on zeta, a measure of strength of association between nominal variables developed for this purpose. Following a review of existing techniques for discretization we define zeta, a measure based on minimisation of the error rate when each value of an independent variable must predict a different value of a dependent variable. We then describe both how a continuous variable may be dichotomised by searching for a maximum value of zeta, and how a heuristic extension of this method can partition a continuous variable into more than two categories. A series of experimental evaluations of zeta-discretization, including comparisons with other published methods, show that zet...
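The dichotomisation step described in this abstract can be illustrated with a small sketch. It assumes a two-class problem and reads zeta as the percentage of cases classified correctly when the two intervals are forced to predict different classes; this is one interpretation of the abstract, not the paper's own code, and the function names `zeta_binary` and `best_dichotomy` are hypothetical.

```python
def zeta_binary(labels_low, labels_high, classes):
    """Percentage accuracy when the low and high intervals must predict
    different classes (assumed two-class reading of zeta)."""
    a, _ = classes
    n = len(labels_low) + len(labels_high)
    low_a = labels_low.count(a)
    low_b = len(labels_low) - low_a
    high_a = labels_high.count(a)
    high_b = len(labels_high) - high_a
    # Two possible pairings: low->a, high->b  or  low->b, high->a.
    best = max(low_a + high_b, low_b + high_a)
    return 100.0 * best / n


def best_dichotomy(values, labels):
    """Search all midpoints between distinct sorted values for the cut
    point with the largest zeta. Returns (cut, zeta)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    pairs = sorted(zip(values, labels))
    best_cut, best_zeta = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        low = [c for _, c in pairs[:i]]
        high = [c for _, c in pairs[i:]]
        z = zeta_binary(low, high, classes)
        if z > best_zeta:
            best_cut, best_zeta = cut, z
    return best_cut, best_zeta
```

For perfectly separable data such as values `[1, 2, 3, 10, 11, 12]` with labels `n, n, n, y, y, y`, the search selects the midpoint between 3 and 10. The heuristic extension to more than two categories mentioned in the abstract is not covered here.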
A Comparative Study of Discretization Methods for Naive-Bayes Classifiers
In Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop, 2002
Cited by 13 (0 self)
Discretization is a popular approach to handling numeric attributes in machine learning. We argue that the requirements for effective discretization differ between naive-Bayes learning and many other learning algorithms. We evaluate the effectiveness with naive-Bayes classifiers of nine discretization methods: equal width discretization (EWD), equal frequency discretization (EFD), fuzzy discretization (FD), entropy minimization discretization (EMD), iterative discretization (ID), proportional k-interval discretization (PKID), lazy discretization (LD), nondisjoint discretization (NDD) and weighted proportional k-interval discretization (WPKID). It is found that, in general, naive-Bayes classifiers trained on data preprocessed by LD, NDD or WPKID achieve lower classification error than those trained on data preprocessed by the other discretization methods. However, LD cannot scale to large data. This study leads to a new discretization method, weighted non-disjoint discretization (WNDD), that combines the advantages of WPKID and NDD. Our experiments show that among all the rival discretization methods, WNDD best helps naive-Bayes classifiers reduce average classification error.
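The two simplest methods named in this abstract, EWD and EFD, can be sketched as follows. This is a minimal illustration of their textbook definitions, not code from the paper; the function names and the choice of k are assumptions made for the example.

```python
from bisect import bisect_right


def equal_width_cuts(values, k):
    """EWD: split the range [min, max] into k intervals of equal width.
    Returns the k - 1 interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """EFD: choose cut points so each interval holds roughly the same
    number of training values. Returns the k - 1 interior cut points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, cuts):
    """Map each value to the index of the interval it falls in
    (a value equal to a cut point goes to the upper interval)."""
    return [bisect_right(cuts, v) for v in values]
```

With the values `[1, 2, 3, 4, 5, 6, 7, 8, 100]` and k = 3, the single outlier stretches the equal-width intervals so that eight of the nine values share one bin, whereas equal-frequency cuts place three values in each bin.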
Intelligent Data Analysis in Medicine
2000
Cited by 7 (0 self)
Extensive amounts of knowledge and data stored in medical databases require the development of specialized tools for storing and accessing data, for data analysis, and for effective use of stored knowledge and data. This paper focuses on methods and tools for intelligent data analysis, aimed at narrowing the increasing gap between data gathering and data comprehension. The paper sketches the history of research that led to the development of current intelligent data analysis techniques, discusses the need for intelligent data analysis in medicine, and proposes a classification of intelligent data analysis methods. The scope of the paper covers temporal data abstraction methods and data mining methods. A selection of methods is presented and illustrated in medical problem domains. Presently, data abstraction and data mining are attracting considerable research interest. However, the two technologies, in spite of the fact that they share their central objective, namely the intelligen...
Induction of decision trees and Bayesian classification applied to diagnosis of sport injuries
1997
Cited by 6 (1 self)
Machine learning techniques can be used to extract knowledge from data stored in medical databases. In our application, various machine learning algorithms were used to extract diagnostic knowledge to support the diagnosis of sport injuries. The applied methods include variants of the Assistant algorithm for top-down induction of decision trees, and variants of the Bayesian classifier. The available dataset was insufficient for reliable diagnosis of all sport injuries considered by the system. Consequently, expert-defined diagnostic rules were added and used as pre-classifiers or as generators of additional training instances for injuries with few training examples. Experimental results show that the classification accuracy and the explanation capability of the naive Bayesian classifier with fuzzy discretization of numerical attributes were superior to those of the other methods, and that this classifier was judged the most appropriate for practical use.
1 Introduction
Machine learning technology is well suited...
Concurrent Discretization of Multiple Attributes
In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, 1998
Cited by 4 (0 self)
Better decision trees can be learnt by merging continuous values into intervals. Merging of values, however, could introduce inconsistencies to the data, or information loss. When it is desired to maintain a certain consistency, interval mergings in one attribute could disable those in another attribute. This interaction raises the issue of determining the order of mergings. We consider a globally greedy heuristic that selects the "best" merging from all continuous attributes at each step. We present an implementation of the heuristic in which the best merging is determined in a time independent of the number of possible mergings. Experiments show that intervals produced by the heuristic lead to improved decision trees.
1 Introduction
1.1 Motivation
Continuous values, mainly reals and integers, are linearly ordered. Unlike discrete values, there could be many continuous values and each appears only a few times in the data. Directly applying induction algorithms designed ...
A global optimal algorithm for class-dependent discretization of continuous data
Intelligent Data Analysis 8, 2004
Cited by 3 (1 self)
This paper presents a new method to convert continuous variables into discrete variables for inductive machine learning. The method can be applied to pattern classification problems in machine learning and data mining. The discretization process is formulated as an optimization problem. We first use the normalized mutual information that measures the interdependence between the class labels and the variable to be discretized as the objective function, and then use fractional programming (iterative dynamic programming) to find its optimum. Unlike the majority of class-dependent discretization methods in the literature, which only find a local optimum of the objective function, the proposed method, OCDD, or Optimal Class-Dependent Discretization, finds the global optimum. The experimental results demonstrate that this algorithm is very effective in classification when coupled with popular learning systems such as C4.5 decision trees and the Naive-Bayes classifier. It can be used to discretize continuous variables for many existing inductive learning systems.
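The objective function mentioned in this abstract can be illustrated with a short sketch that scores a given discretization by the mutual information between bin labels and class labels. The abstract does not state how the measure is normalized; dividing by the joint entropy is an assumption made here, and the function names are hypothetical. The fractional programming search for the global optimum is not reproduced.

```python
from collections import Counter
from math import log2


def entropy(counts, n):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def normalized_mutual_information(bins, labels):
    """I(bins; labels) divided by the joint entropy H(bins, labels).
    The choice of normalizer is an assumption, not taken from the paper."""
    n = len(labels)
    h_x = entropy(Counter(bins).values(), n)
    h_c = entropy(Counter(labels).values(), n)
    h_xc = entropy(Counter(zip(bins, labels)).values(), n)
    mutual_information = h_x + h_c - h_xc
    return mutual_information / h_xc if h_xc > 0 else 0.0
```

Under this normalization, a discretization whose bins determine the class exactly (and vice versa) scores 1, while bins that are statistically independent of the class score 0, so candidate cut-point sets can be compared on a common scale.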
ADHOC: A tool for performing effective feature selection
In Proceedings of the International Conference on Tools with Artificial Intelligence, 1996
Cited by 3 (0 self)
This paper introduces ADHOC, a tool that integrates statistical methods and machine learning techniques to perform effective feature selection. Feature selection plays a central role in the data analysis process since redundant and irrelevant features often degrade the performance of induction algorithms, both in speed and predictive accuracy. ADHOC combines the advantages of both filter and feedback approaches to feature selection to enhance the understanding of the given data and increase the efficiency of the feature selection process. We report results of extensive experiments on real-world data which demonstrate the effectiveness of ADHOC as a data reduction technique as well as a feature selection method. ADHOC has been employed in the analysis of several corporate databases. In particular, it is currently used to support the difficult task of estimating the cost of software projects at an early stage.
KDCOM: A Knowledge Discovery Component Framework
1998
Cited by 2 (0 self)
In Section 5.3 we indicate that the rule interest measures proposed to date are not accurate and introduce a necessary condition for interestingness. This work has also been accepted for presentation at the Fifteenth National Conference on Artificial Intelligence (AAAI-98) Student Abstract and Poster Session. In Section 5.4 we introduce the idea of fuzzy metaquery and justify that it can be useful as an integration tool for knowledge discovery systems. This work has been presented at the Sixth IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-97). Finally, in Section 5.5 we explain two components that we are currently developing.
5.1 Distance-Based Discretization
Discretization is a process that transforms continuous attributes into discrete ones. By performing this preliminary step, we can apply discrete classification methods to datasets containing continuous values. We have developed a discretization method based on the idea of distance between partitions. In this section...

