Results 1 - 10 of 537
Discovering Models of Software Processes from Event-Based Data
- ACM Transactions on Software Engineering and Methodology, 1998. Cited by 321 (8 self).
"... this article we describe a Markov method that we developed specifically for process discovery, as well as describe two additional methods that we adopted from other domains and augmented for our purposes. The three methods range from the purely algorithmic to the purely statistical. We compare the m ..."
Abstract
-
Cited by 321 (8 self)
- Add to MetaCart
In this article we describe a Markov method that we developed specifically for process discovery, as well as two additional methods that we adopted from other domains and augmented for our purposes. The three methods range from the purely algorithmic to the purely statistical. We compare the methods and discuss their application in an industrial case study.
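The abstract only names the Markov method; as a rough illustration (not the authors' actual algorithm), a first-order Markov / directly-follows model can be estimated from an event log roughly as follows. The event names and the noise threshold are chosen purely for illustration.

```python
from collections import defaultdict

def discover_process(traces, min_prob=0.1):
    """Infer a first-order Markov model (directly-follows graph) from traces.

    traces: list of event sequences, e.g. [["edit", "compile", "test"], ...]
    Returns {state: {next_state: probability}}, keeping only edges whose
    estimated transition probability exceeds min_prob (simple noise filtering).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1

    model = {}
    for state, nexts in counts.items():
        total = sum(nexts.values())
        kept = {n: c / total for n, c in nexts.items() if c / total >= min_prob}
        if kept:
            model[state] = kept
    return model

if __name__ == "__main__":
    log = [["checkout", "edit", "compile", "test", "commit"],
           ["checkout", "edit", "compile", "edit", "compile", "test", "commit"]]
    for state, nexts in discover_process(log).items():
        print(state, "->", nexts)
```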
STING: A statistical information grid approach to spatial data mining
1997. Cited by 290 (10 self).
"... Spatial data mining, i.e., discovery of interesting characteristics and patterns that may implicitly exist in spatial databases, is a challenging task due to the huge amounts of spatial data and to the new conceptual nature of the problems which must account for spatial distance. Clustering and regi ..."
Abstract
-
Cited by 290 (10 self)
- Add to MetaCart
Spatial data mining, i.e., discovery of interesting characteristics and patterns that may implicitly exist in spatial databases, is a challenging task due to the huge amounts of spatial data and to the new conceptual nature of the problems which must account for spatial distance. Clustering and region oriented queries are common problems in this domain. Several approaches have been presented in recent years, all of which require at least one scan of all individual objects (points). Consequently, the computational complexity is at least linearly proportional to the number of objects to answer each query. In this paper, we propose a hierarchical statistical information grid based approach for spatial data mining to reduce the cost further. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries and clustering problems can be answered without recourse to the individual objects. In theory, and confirmed by empirical studies, this approach outperforms the best previous method by at least an order of magnitude, especially when the data set is very large.
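To make the grid idea concrete, here is a heavily simplified, single-level Python sketch of the general approach: scan the points once to fill per-cell summaries, then answer region queries from the summaries alone. The cell size, the statistics kept, and the missing hierarchy are all simplifications; this is not the STING algorithm itself.

```python
import math
from collections import defaultdict

class StatsGrid:
    """A (much simplified) statistical grid in the spirit of STING: points are
    scanned once to fill per-cell summaries; later range-count queries are
    answered from the summaries alone, without touching individual objects."""

    def __init__(self, points, cell=10.0):
        self.cell = cell
        self.cells = defaultdict(lambda: {"n": 0, "sum_x": 0.0, "sum_y": 0.0})
        for x, y in points:                      # the only pass over raw data
            key = (math.floor(x / cell), math.floor(y / cell))
            c = self.cells[key]
            c["n"] += 1
            c["sum_x"] += x
            c["sum_y"] += y

    def approx_count(self, x0, y0, x1, y1):
        """Approximate number of points in [x0,x1] x [y0,y1] using only the
        cell summaries (boundary cells are counted fully, so this is an
        upper-bound style estimate)."""
        i0, i1 = math.floor(x0 / self.cell), math.floor(x1 / self.cell)
        j0, j1 = math.floor(y0 / self.cell), math.floor(y1 / self.cell)
        return sum(c["n"] for (i, j), c in self.cells.items()
                   if i0 <= i <= i1 and j0 <= j <= j1)

if __name__ == "__main__":
    pts = [(1.0, 2.0), (3.5, 4.0), (15.0, 22.0), (42.0, 8.0)]
    grid = StatsGrid(pts, cell=10.0)
    print(grid.approx_count(0, 0, 20, 25))   # counts cells overlapping the window
```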
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
- Data Mining and Knowledge Discovery, 1998. Cited by 250 (0 self).
"... The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and ..."
Abstract
-
Cited by 250 (0 self)
- Add to MetaCart
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining results of individual passes using transitive c...
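A minimal sketch of the multi-pass sorted-neighborhood idea the abstract describes: sort on a different key in each pass, compare only near neighbours in sorted order, and combine the passes transitively via union-find. The window size, sort keys, and matching predicate below are placeholders for the domain-dependent equational theory.

```python
def multipass_merge(records, key_funcs, window=3, match=None):
    """Multi-pass sorted-neighborhood sketch: for each sort key, sort the
    records, compare each record only with its near neighbours, and merge
    matching pairs via union-find so matches become transitive across passes."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1: pos + window]:
                if match(records[i], records[j]):
                    union(i, j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())

if __name__ == "__main__":
    recs = [{"name": "Jon Smith", "zip": "53706"},
            {"name": "John Smith", "zip": "53706"},
            {"name": "A. Jones", "zip": "10001"}]
    same = lambda a, b: a["zip"] == b["zip"] and a["name"][0] == b["name"][0]
    keys = [lambda r: r["zip"], lambda r: r["name"]]
    print(multipass_merge(recs, keys, match=same))
```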
The Dimensional Fact Model: A Conceptual Model For Data Warehouses
- International Journal of Cooperative Information Systems, 1998. Cited by 158 (21 self).
"... this paper we<E-382> formalize a graphical conceptual model for data warehouses, called Dimensional Fact model, and<E-380> propose a semi-automated methodology to build it from the pre-existing (conceptual or logical)<E-366> schemes describing the enterprise relational database. Th ..."
Abstract
-
Cited by 158 (21 self)
- Add to MetaCart
In this paper we formalize a graphical conceptual model for data warehouses, called Dimensional Fact Model, and propose a semi-automated methodology to build it from the pre-existing (conceptual or logical) schemes describing the enterprise relational database. The representation of reality built using our conceptual model consists of a set of fact schemes whose basic elements are facts, measures, attributes, dimensions and hierarchies; other features which may be represented on fact schemes are the additivity of fact attributes along dimensions, the optionality of dimension attributes and the existence of non-dimension attributes. Compatible fact schemes may be overlapped in order to relate and compare data for drill-across queries. Fact schemes should be integrated with information on the conjectured workload, to be used as the input of logical and physical design phases; to this end, we propose a simple language to denote data warehouse queries in terms of sets of fact instances.
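As a rough illustration of what a fact scheme carries (not the paper's notation or its case study), a fact with measures, dimension hierarchies, optional attributes, and per-measure additivity might be encoded like this; all names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Dimension:
    """A dimension with an attribute hierarchy, finest to coarsest,
    plus optional (possibly missing) attributes."""
    name: str
    hierarchy: list
    optional_attributes: list = field(default_factory=list)

@dataclass
class FactScheme:
    """A fact scheme: a fact with measures, dimensions and, per measure,
    the dimensions along which it is additive."""
    fact: str
    measures: list
    dimensions: list
    additivity: dict = field(default_factory=dict)   # measure -> additive dims

sale = FactScheme(
    fact="SALE",
    measures=["qty_sold", "revenue", "unit_price"],
    dimensions=[
        Dimension("date", ["day", "month", "year"]),
        Dimension("product", ["product", "type", "category"]),
        Dimension("store", ["store", "city", "country"], ["phone"]),
    ],
    # unit_price is not additive along any dimension; the others are additive
    # along all three (illustrative choice)
    additivity={"qty_sold": ["date", "product", "store"],
                "revenue": ["date", "product", "store"],
                "unit_price": []},
)
print(sale.fact, [d.name for d in sale.dimensions])
```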
Overview of record linkage and current research directions
- Bureau of the Census, 2006. Cited by 139 (1 self).
"... This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists business lists. It also gives some areas of current research. ..."
Abstract
-
Cited by 139 (1 self)
- Add to MetaCart
This paper provides background on record linkage methods that can be used in combining data from a variety of sources, such as person lists and business lists. It also gives some areas of current research.
Data Mining in Soft Computing Framework: A Survey
- IEEE Transactions on Neural Networks, 2001. Cited by 109 (3 self).
"... The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the mode ..."
Abstract
-
Cited by 109 (3 self)
- Add to MetaCart
(Show Context)
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
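To ground the point about fuzzy sets yielding interpretable, approximate patterns, here is a toy linguistic rule over trapezoidal membership functions; the attributes, ranges, and rule are invented for illustration and are not taken from the survey.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a,b], flat on [b,c], falls on [c,d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Linguistic terms (the ranges are illustrative assumptions)
def income_high(x):  return trapezoid(x, 60, 80, 200, 250)
def age_middle(x):   return trapezoid(x, 30, 40, 55, 65)

def rule_likely_buyer(income, age):
    """Fuzzy rule 'IF income is high AND age is middle THEN likely buyer',
    using min for AND; returns a degree in [0, 1] rather than a hard label."""
    return min(income_high(income), age_middle(age))

print(rule_likely_buyer(income=75, age=50))   # prints 0.75
```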
Using general impressions to analyze discovered classification rules
- Proc. 3rd Intl. Conf. on Knowledge Discovery &amp; Data Mining (KDD-97), 1997. Cited by 106 (14 self).
"... One of the important problems in data mining is the evaluation of subjective interestingness of the discovered rules. Past research has found that in many real-life applications it is easy to generate a large number of rules from the database, but most of the rules are not useful or interesting to t ..."
Abstract
-
Cited by 106 (14 self)
- Add to MetaCart
One of the important problems in data mining is the evaluation of subjective interestingness of the discovered rules. Past research has found that in many real-life applications it is easy to generate a large number of rules from the database, but most of the rules are not useful or interesting to the user. Due to the large number of rules, it is difficult for the user to analyze them manually in order to identify those interesting ones. Whether a rule is of interest to a user depends on his/her existing knowledge of the domain, and his/her interests. In this paper, we propose a technique that analyzes the discovered rules against a specific type of existing knowledge, which we call general impressions, to help the user identify interesting rules. We first propose a representation language to allow general impressions to be specified. We then present some algorithms to analyze the discovered classification rules against a set of general impressions. The results of the analysis tell us which rules conform to the general impressions and which rules are unexpected. Unexpected rules are by definition interesting.
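A toy stand-in (not the paper's actual specification language or algorithms) for checking discovered rules against general impressions and flagging the unexpected ones:

```python
def conforms(rule, impressions):
    """Check a discovered rule against general impressions.

    A rule is {'conditions': {attr: value}, 'class': label}; a general
    impression here is a simplified (attr, value, expected_class) triple.
    Returns True (conforming), False (unexpected), or None (no impression applies)."""
    for attr, value, expected in impressions:
        if rule["conditions"].get(attr) == value:
            return rule["class"] == expected
    return None

impressions = [("saving", "high", "approve"),     # illustrative domain knowledge
               ("job", "none", "reject")]

discovered = [{"conditions": {"saving": "high"}, "class": "approve"},
              {"conditions": {"job": "none"}, "class": "approve"}]

for r in discovered:
    status = conforms(r, impressions)
    label = {True: "conforming", False: "UNEXPECTED", None: "uncovered"}[status]
    print(r, "->", label)
```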
TAILOR: A Record Linkage Toolbox
2002. Cited by 90 (9 self).
"... Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the re ..."
Abstract
-
Cited by 90 (9 self)
- Add to MetaCart
(Show Context)
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive Record Linkage Toolbox named TAILOR. Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house developed and public domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. Results show that the proposed machine learning record linkage models outperform the existing ones both in accuracy and in performance.
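As a sketch of the machine-learning view of record linkage (not TAILOR's actual models or interfaces): build a comparison vector per candidate record pair and let an off-the-shelf learner decide. The field names, training pairs, and choice of a decision tree are illustrative.

```python
from difflib import SequenceMatcher
from sklearn.tree import DecisionTreeClassifier   # any off-the-shelf learner would do

def pair_features(a, b):
    """Comparison vector for a record pair: per-field string similarities
    plus an exact-match flag. Field names are illustrative."""
    sim = lambda x, y: SequenceMatcher(None, x, y).ratio()
    return [sim(a["name"], b["name"]), sim(a["addr"], b["addr"]),
            1.0 if a["zip"] == b["zip"] else 0.0]

# A few labelled pairs (1 = same entity, 0 = different); purely illustrative.
train_pairs = [
    (({"name": "Jon Smith", "addr": "12 Oak St", "zip": "53706"},
      {"name": "John Smith", "addr": "12 Oak Street", "zip": "53706"}), 1),
    (({"name": "Ann Lee", "addr": "5 Elm Rd", "zip": "10001"},
      {"name": "Bob Kahn", "addr": "99 Pine Av", "zip": "60601"}), 0),
]
X = [pair_features(a, b) for (a, b), _ in train_pairs]
y = [label for _, label in train_pairs]
clf = DecisionTreeClassifier().fit(X, y)

candidate = ({"name": "J. Smith", "addr": "12 Oak St.", "zip": "53706"},
             {"name": "John Smith", "addr": "12 Oak Street", "zip": "53706"})
print("duplicate" if clf.predict([pair_features(*candidate)])[0] else "distinct")
```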
Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings
- IEEE Transactions on Software Engineering. Cited by 86 (1 self).
"... Abstract—Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due ..."
Abstract
-
Cited by 86 (1 self)
- Add to MetaCart
Software defect prediction strives to improve software quality and testing efficiency by constructing predictive classification models from code attributes to enable a timely identification of fault-prone modules. Several classification models have been evaluated for this task. However, due to inconsistent findings regarding the superiority of one classifier over another and the usefulness of metric-based classification in general, more research is needed to improve convergence across studies and further advance confidence in experimental results. We consider three potential sources for bias: comparing classifiers over one or a small number of proprietary data sets, relying on accuracy indicators that are conceptually inappropriate for software defect prediction and cross-study comparisons, and, finally, limited use of statistical testing procedures to secure empirical findings. To remedy these problems, a framework for comparative software defect prediction experiments is proposed and applied in a large-scale empirical comparison of 22 classifiers over 10 public domain data sets from the NASA Metrics Data repository. Overall, an appealing degree of predictive accuracy is observed, which supports the view that metric-based classification is useful. However, our results indicate that the importance of the particular classification algorithm may be less than previously assumed since no significant performance differences could be detected among the top 17 classifiers. Index Terms—Complexity measures, data mining, formal methods, statistical methods, software defect prediction.
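A minimal sketch of the comparison protocol the abstract argues for, i.e. AUC-style indicators over several data sets plus a statistical test on the results, using synthetic data and three scikit-learn classifiers as stand-ins for the study's 22 classifiers and the NASA data sets.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

# Synthetic, imbalanced stand-ins for the public-domain defect data sets.
datasets = [make_classification(n_samples=300, n_features=20,
                                weights=[0.85, 0.15], random_state=i)
            for i in range(4)]

classifiers = {"logreg": LogisticRegression(max_iter=1000),
               "tree": DecisionTreeClassifier(random_state=0),
               "nb": GaussianNB()}

# AUC (rather than plain accuracy) per classifier per data set.
scores = {name: [cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
                 for X, y in datasets]
          for name, clf in classifiers.items()}
for name, s in scores.items():
    print(f"{name:7s} mean AUC over data sets: {np.mean(s):.3f}")

# Friedman test across data sets: do the classifiers' AUC rankings
# differ significantly at all?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.2f}, p={p:.3f}")
```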
Extracting Comprehensible Models from Trained Neural Networks
1996. Cited by 84 (3 self).
"... To Mom, Dad, and Susan, for their support and encouragement. ..."
Abstract
-
Cited by 84 (3 self)
- Add to MetaCart
(Show Context)
To Mom, Dad, and Susan, for their support and encouragement.