Results 1 - 9 of 9
M.J.: Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. In: Proceedings of the 11th International Semantic Web Conference (ISWC), 2012
Cited by 6 (2 self)
Abstract. In this paper, we describe a mechanism for ontology alignment using instance-based matching of types (or classes). Instance-based matching is known to be a useful technique for matching ontologies that have different names and different structures. A key problem in instance matching of types, however, is scaling the matching algorithm to (a) handle types with a large number of instances, and (b) efficiently match a large number of type pairs. We propose the use of state-of-the-art locality-sensitive hashing (LSH) techniques to vastly improve the scalability of instance matching across multiple types. We show the feasibility of our approach with DBpedia and Freebase, two different type systems with hundreds and thousands of types, respectively. We describe how these techniques can be used to estimate containment or equivalence relations between two type systems, and we compare two different LSH techniques for computing instance similarity.
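The LSH idea in this abstract can be illustrated with a minimal MinHash sketch (a hypothetical illustration under simplified assumptions, not the paper's implementation): each type's instance set is reduced to a short signature, and the fraction of agreeing signature positions estimates the Jaccard similarity of the underlying sets, so candidate type pairs can be compared without scanning all instances.

```python
import hashlib
import random

def minhash_signature(instances, num_hashes=64, seed=42):
    """MinHash signature of a set of instance identifiers: one minimum
    hash value per salted hash function."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [
        min(int(hashlib.md5(f"{salt}:{x}".encode()).hexdigest(), 16)
            for x in instances)
        for salt in salts
    ]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical instance sets for a 'City' type in two ontologies.
type_a = {"Berlin", "Paris", "Tokyo", "Lagos", "Lima"}
type_b = {"Berlin", "Paris", "Tokyo", "Lagos", "Cairo"}
sim = estimate_jaccard(minhash_signature(type_a), minhash_signature(type_b))
```

Given the Jaccard estimate and the (sampled) set sizes, one can also derive a containment estimate, which is what distinguishes a subtype relation from an equivalence relation between two types.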
Information Theory For Data Management
Cited by 6 (3 self)
We are awash in data. The explosion in computing power and computing infrastructure allows us to generate multitudes of data, in differing formats, at different scales, and in inter-related areas. Data management is fundamentally …
Sampling Dirty Data for Matching Attributes, 2010
Cited by 3 (1 self)
We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a pre-condition for typical joins over such attributes. However, real-world data sets are often 'dirty', especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that for dirty data our measures are more accurate for measuring value overlap than existing sample-based methods, but we also observe that there is a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples.
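The idea of softening set overlap with string-instance similarity can be sketched as follows (a simplified stand-in using difflib's ratio, not the paper's measures or its distributed sampling algorithms): a sampled value of one attribute counts as matched if the other attribute's sample contains an exact or sufficiently similar value, so typos no longer destroy the overlap signal.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Character-level similarity in [0, 1]; difflib's ratio is used here
    as a simple stand-in for an edit-distance-based measure."""
    return SequenceMatcher(None, a, b).ratio()

def soft_overlap(sample_a, sample_b, threshold=0.8):
    """Fraction of sampled values of A that have an exact or approximately
    matching value in the sample of B."""
    set_b = set(sample_b)
    if not sample_a:
        return 0.0
    matched = sum(
        1 for a in sample_a
        if a in set_b or any(string_sim(a, b) >= threshold for b in set_b)
    )
    return matched / len(sample_a)
```

A plain sample-based overlap would score the dirty pair ("John Smith", "Jonh Smith") as zero, whereas the soft measure counts it as a match; the quadratic inner loop is also where the accuracy/speed tradeoff noted in the abstract shows up.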
Type-Based Categorization of Relational Attributes
Cited by 2 (2 self)
In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and index optimization/compression. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft-clustering that exploit correlations between attributes to improve classification accuracy.
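A minimal version of the q-gram signature idea (illustrative only; the paper's signatures, classifiers, and soft-clustering techniques are more elaborate): each attribute sample is reduced to a normalized q-gram frequency distribution, and two attributes are compared by the cosine similarity of their distributions, so attributes holding the same semantic type score high even when their schemas say nothing.

```python
import math
from collections import Counter

def qgram_distribution(values, q=2):
    """Normalized q-gram frequency distribution over a sample of values."""
    counts = Counter()
    for v in values:
        s = f"^{v}$"                      # mark string boundaries
        for i in range(len(s) - q + 1):
            counts[s[i:i + q]] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cosine(d1, d2):
    """Cosine similarity between two sparse frequency distributions."""
    dot = sum(p * d2.get(g, 0.0) for g, p in d1.items())
    n1 = math.sqrt(sum(p * p for p in d1.values()))
    n2 = math.sqrt(sum(p * p for p in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical attribute samples of two different semantic types.
emails = ["ann@example.com", "bob@example.org"]
ids = ["1042", "99371", "5150"]
```

The email and numeric-identifier samples share no bigrams at all, so their distributions are orthogonal; this is also where the small-sample problem the abstract mentions bites, since a sparse sample yields an unreliable distribution.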
Schema-As-You-Go: On Probabilistic Tagging and Querying of Wide Tables
Cited by 1 (0 self)
The emergence of Web 2.0 has resulted in a huge amount of heterogeneous data that are contributed by a large number of users, engendering new challenges for data management and query processing. Given that the data are unified from various sources and accessed by numerous users, providing users with a unified mediated schema, as in traditional data integration, is insufficient. On one hand, a deterministic mediated schema restricts users' freedom to express queries in their preferred vocabulary; on the other hand, it is not realistic for users to remember the numerous attribute names that arise from integrating various data sources. As such, a user-oriented data management and query interface is required. In this paper, we propose an out-of-the-box approach that separates users' actions from database operations. This separating layer deals with the challenges from a semantic perspective. It interprets the semantics of each data value through tags that are provided by users, and then inserts the value into the database together with these tags. When querying the database, this layer also serves as a platform for retrieving data by interpreting the semantics of the queried tags from the users. Experiments are conducted to illustrate both the effectiveness and efficiency of our approach.
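The tag-mediated layer described above might be sketched, in greatly simplified form, as a store that records user-supplied tags alongside each value and answers queries by scoring values against the queried tag (hypothetical API and scoring; the paper's probabilistic tagging model is richer than this frequency ratio):

```python
from collections import Counter, defaultdict

class TagStore:
    """Minimal sketch of a tag-mediated storage layer: values are inserted
    together with user-supplied tags, and queries rank values by how often
    users applied the queried tag to them."""

    def __init__(self):
        self._tags = defaultdict(Counter)   # value -> tag frequencies

    def insert(self, value, tags):
        for tag in tags:
            self._tags[value][tag] += 1

    def query(self, tag):
        """Return (value, score) pairs, where the score is the fraction of
        a value's tag annotations that match the queried tag."""
        results = []
        for value, counts in self._tags.items():
            total = sum(counts.values())
            if counts[tag]:
                results.append((value, counts[tag] / total))
        return sorted(results, key=lambda r: -r[1])
```

For example, if two users tag the value "NYC" with "city" and one of them also adds "location", a query for "city" returns "NYC" with score 2/3; users never touch attribute names, only tags in their own vocabulary.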
Content-based ontology matching for GIS datasets. In: Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2008
Cited by 1 (0 self)
The alignment of separate ontologies by matching related concepts continues to attract great attention within the database and artificial intelligence communities, especially since semantic heterogeneity across data sources remains a widespread and relevant problem. In particular, the Geographic Information System (GIS) domain presents unique forms of semantic heterogeneity that require a variety of matching approaches. Our approach considers content-based techniques for aligning GIS ontologies. We examine the associated instance data of the compared concepts and apply a content-matching strategy that measures similarity based on the value types of the N-grams present in the data. We focus special attention on a method applying the concepts of mutual information and N-grams, developing two separate variations and testing them over GIS datasets spanning multiple jurisdictions. To align concepts, we first find the appropriate columns. For this, we exploit the mutual information between two columns, computed over the type distributions of their content. Intuitively, if two columns are semantically the same, their type distributions should be very similar. We justify the conceptual validity of our ontology alignment technique with a series of experimental results that demonstrate the efficacy and utility of our algorithms on a wide variety of authentic GIS data.
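One way to realize the type-distribution intuition (an illustrative construction, not necessarily the paper's formulation): map each value to a coarse type token, then compute the mutual information between a column indicator and the type token. The score is 0 bits when the two columns have identical type distributions and grows as the distributions diverge.

```python
import math
from collections import Counter

def type_token(value):
    """Map a value to a coarse type pattern, e.g. '12.5' -> 'DD.D'."""
    return "".join(
        "D" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in value
    )

def column_type_mi(col_a, col_b):
    """Mutual information (in bits) between the column indicator and the
    type token of a value: 0 when both columns have identical type
    distributions, larger when the distributions diverge."""
    n = len(col_a) + len(col_b)
    joint = Counter()
    for v in col_a:
        joint[(0, type_token(v))] += 1
    for v in col_b:
        joint[(1, type_token(v))] += 1
    p_col = {0: len(col_a) / n, 1: len(col_b) / n}
    p_type = Counter()
    for (c, t), cnt in joint.items():
        p_type[t] += cnt / n
    return sum(
        (cnt / n) * math.log2((cnt / n) / (p_col[c] * p_type[t]))
        for (c, t), cnt in joint.items()
    )
```

Under this construction, two columns of numeric parcel identifiers score near 0 bits (good alignment candidates), while a numeric column paired with an alphabetic one scores a full bit.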
MatchBench: Benchmarking Schema Matching Algorithms for Schematic Correspondences
Schema matching algorithms aim to identify relationships between database schemas, which underpin a wide range of model management operations, such as merge, compose and difference, as required for tasks such as data integration, exchange or schema evolution. However, the results of matching algorithms are typically expressed as low-level, 1-to-1 associations between pairs of attributes or entity types, rather than as the higher-level characterisations of relationships required by many implementations of model management operations. This paper presents a benchmark for evaluating schema matching algorithms that is based on the well-established classification of schematic heterogeneities of Kim et al., which explores the extent to which the matching algorithms are effective at diagnosing schematic heterogeneities. The paper contributes: (i) a wide range of scenarios that manifest different types of schematic heterogeneities; (ii) a collection of experiments over the scenarios that can be used to investigate the performance of different matching algorithms; and (iii) an application of the experiments for the evaluation of matchers from three well-known and publicly available schema matching platforms, namely COMA++, Rondo and OpenII.
An Automatic Domain Independent Schema Matching in Integrating Schemas of Heterogeneous Relational Databases
Schema matching is one of the key challenges in the process of integrating heterogeneous databases; it identifies the correspondences among different elements of the databases' schemas. There are several semi-automatic schema matching algorithms; however, these algorithms do not exploit most of the available schema-related information during the matching process, which affects the accuracy of the matching result. In this paper, we propose a domain-independent schema matching approach that utilizes both structural and semantic information during matching and offers database integration without user intervention. To ensure the correctness of our approach, we prove that it maintains the properties of the initial input schemas as well as the characteristics of the relational model. In comparison with previous approaches, our approach produces better global schemas during the integration process.