Results 1 - 10 of 124
Detecting Changes in XML Documents
- In ICDE, 2001
"... We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volume of XML data. Because of the context, our algorithm has to be very efficient in terms of s ..."
Abstract - Cited by 159 (3 self)
We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of this context, our algorithm has to be very efficient in terms of speed and memory space, even at the cost of some loss of "quality". Besides the insertions, deletions, and updates standard in diffs, it also considers a move operation on subtrees, which is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then propagated, where possible, to ancestors and descendants to obtain more matchings. It also uses XML-specific information such as ID attributes. We provide a performance analysis of the algorithm and show that it runs on average in linear time, versus quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP-hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) showing that the output of our algorithm is reasonably close to the "optimal" in terms of quality. Finally, we present experiments on a small sample of XML pages found on the Web.
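As a rough illustration of the signature idea described in this abstract (a toy sketch, not the Xyleme/XyDiff implementation; the SHA-1 hashing scheme and the helper names signature/index are assumptions), one can hash every subtree of the old and new documents bottom-up and treat subtrees whose signatures coincide as unchanged:

    # Toy illustration of signature-based subtree matching (not the XyDiff code).
    # Subtrees with identical signatures in the old and new document are treated
    # as unchanged; everything else is left for finer-grained diffing.
    import hashlib
    import xml.etree.ElementTree as ET

    def signature(elem):
        """Bottom-up hash over tag, attributes, text, and child signatures."""
        h = hashlib.sha1()
        h.update(elem.tag.encode())
        for k in sorted(elem.attrib):
            h.update(f"{k}={elem.attrib[k]}".encode())
        h.update((elem.text or "").strip().encode())
        for child in elem:
            h.update(signature(child).encode())
        return h.hexdigest()

    def index(root):
        """Map signature -> list of elements carrying that signature."""
        table = {}
        def walk(e):
            table.setdefault(signature(e), []).append(e)
            for c in e:
                walk(c)
        walk(root)
        return table

    old = ET.fromstring("<doc><a id='1'>x</a><b>y</b></doc>")
    new = ET.fromstring("<doc><b>y</b><a id='1'>x</a><c>z</c></doc>")
    matches = index(old).keys() & index(new).keys()
    print(len(matches), "shared subtree signatures")  # matched subtrees are taken as unchanged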
X-Diff: An Effective Change Detection Algorithm for XML Documents
"... XML has become the de facto standard format for web publishing and data transportation. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work in change det ..."
Abstract - Cited by 132 (0 self)
XML has become the de facto standard format for web publishing and data transportation. Since online information changes frequently, being able to quickly detect changes in XML documents is important to Internet query systems, search engines, and continuous query systems. Previous work on change detection in XML, or other hierarchically structured documents, used an ordered tree model, in which the left-to-right order among siblings is significant and can affect the change result. This paper argues that an unordered model (in which only ancestor relationships are significant) is more suitable for most database applications. Using an unordered model, change detection is substantially harder than with the ordered model, but the change result it generates is more accurate. This paper proposes X-Diff, an effective algorithm that integrates key XML structure characteristics with standard tree-to-tree correction techniques. The algorithm is analyzed and compared with XyDiff [CAM02], a published XML diff algorithm. An experimental evaluation of both algorithms is provided.
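The ordered-versus-unordered distinction can be made concrete with a small sketch (illustrative only; X-Diff's actual matching algorithm is more involved, and the function name sig and the ordered flag are invented here): an ordered signature hashes children in document order, while an unordered one sorts the child hashes first, so swapping siblings no longer registers as a change.

    # Toy contrast between ordered and unordered subtree signatures.
    import hashlib
    import xml.etree.ElementTree as ET

    def sig(elem, ordered=True):
        child_sigs = [sig(c, ordered) for c in elem]
        if not ordered:
            child_sigs.sort()          # sibling order becomes irrelevant
        payload = elem.tag + (elem.text or "").strip() + "".join(child_sigs)
        return hashlib.sha1(payload.encode()).hexdigest()

    v1 = ET.fromstring("<r><a>1</a><b>2</b></r>")
    v2 = ET.fromstring("<r><b>2</b><a>1</a></r>")   # same content, siblings swapped

    print(sig(v1, ordered=True)  == sig(v2, ordered=True))   # False: ordered model reports a change
    print(sig(v1, ordered=False) == sig(v2, ordered=False))  # True: unordered model reports no change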
Identifying Syntactic Differences Between Two Programs
- Software - Practice and Experience, 1991
"... this paper is organized into five sections, as follows. The internal form of a program, which is a variant of a parse tree, is discussed in the next section. Then the tree-matching algorithm and the synchronous pretty-printing technique are described. Experience with the comparator for the C languag ..."
Abstract - Cited by 119 (0 self)
This paper is organized into five sections, as follows. The internal form of a program, which is a variant of a parse tree, is discussed in the next section. Then the tree-matching algorithm and the synchronous pretty-printing technique are described. Experience with the comparator for the C language and some performance measurements are also presented. The last section discusses related work and concludes the paper.
Identifying the Semantic and Textual Differences Between Two Versions of a Program
- Proceedings of the ACM SIGPLAN 90 Conference on Programming Language Design and Implementation, 1990
"... Text-based file comparators (e.g., the Unix utility diff), are very general tools that can be applied to arbitrary files. However, using such tools to compare programs can be unsatisfactory because their only notion of change is based on program text rather than program behavior. This paper describe ..."
Abstract - Cited by 109 (5 self)
Text-based file comparators (e.g., the Unix utility diff) are very general tools that can be applied to arbitrary files. However, using such tools to compare programs can be unsatisfactory because their only notion of change is based on program text rather than program behavior. This paper describes a technique for comparing two versions of a program, determining which program components represent changes, and classifying each changed component as representing either a semantic or a textual change.
Evaluating Structural Similarity in XML Documents
2002
"... XML documents on the web are often found without DTDs, particularly when these documents have been created from legacy HTML. Yet having knowledge of the DTD can be valuable in querying and manipulating such documents. Recent work (cf. [10]) has given us a means to (re-)construct a DTD to describe th ..."
Abstract - Cited by 104 (0 self)
XML documents on the web are often found without DTDs, particularly when these documents have been created from legacy HTML. Yet having knowledge of the DTD can be valuable in querying and manipulating such documents. Recent work (cf. [10]) has given us a means to (re-)construct a DTD to describe the structure common to a given set of document instances. However, given a collection of documents with unknown DTDs, it may not be appropriate to construct a single DTD to describe every document in the collection. Instead, we would wish to partition the collection into smaller sets of "similar" documents, and then induce a separate DTD for each such set. It is this partitioning problem that we address in this paper. Given two
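One very simple way to make "structural similarity" concrete, purely for illustration (this is not the distance measure the paper defines; the tag_paths and jaccard helpers are assumptions), is to compare the sets of root-to-element tag paths of two documents. Documents whose path sets overlap heavily are plausible candidates for sharing an induced DTD:

    # Deliberately simple structural-similarity measure for illustration only.
    import xml.etree.ElementTree as ET

    def tag_paths(root):
        """Set of root-to-element tag paths occurring in the document."""
        paths = set()
        def walk(e, prefix):
            path = prefix + "/" + e.tag
            paths.add(path)
            for c in e:
                walk(c, path)
        walk(root, "")
        return paths

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    d1 = ET.fromstring("<book><title/><author/><chapter><title/></chapter></book>")
    d2 = ET.fromstring("<book><title/><author/><author/><chapter><title/><para/></chapter></book>")
    d3 = ET.fromstring("<order><item/><item/><total/></order>")

    print(jaccard(tag_paths(d1), tag_paths(d2)))  # high: similar structure, could share a DTD
    print(jaccard(tag_paths(d1), tag_paths(d3)))  # low: very different structure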
On the Editing Distance between Undirected Acyclic Graphs
1995
"... We consider the problem of comparing CUAL graphs (Connected, Undirected, Acyclic graphs with nodes being Labeled). This problem is motivated by the study of information retrieval for bio-chemical and molecular databases. Suppose we define the distance between two CUAL graphs G1 and G2 to be the weig ..."
Abstract - Cited by 88 (7 self)
We consider the problem of comparing CUAL graphs (Connected, Undirected, Acyclic graphs with Labeled nodes). This problem is motivated by the study of information retrieval for biochemical and molecular databases. Suppose we define the distance between two CUAL graphs G_1 and G_2 to be the weighted number of edit operations (insert node, delete node, and relabel node) needed to transform G_1 into G_2. By a reduction from exact cover by 3-sets, one can show that finding the distance between two CUAL graphs is NP-complete. In view of the hardness of the problem, we propose a constrained distance metric, called the degree-2 distance, obtained by requiring that any node to be inserted (or deleted) have no more than 2 neighbors. With this metric, we present an efficient algorithm to solve the problem. The algorithm runs in time O(N_1 N_2 D²) for edit operations with general weights and in time O(N_1 N_2 D √D log D) for edit operations with integral weights, where N_i, i = 1, 2, is the number of nodes in G_i, D = min{d_1, d_2}, and d_i is the maximum degree of G_i.
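A minimal sketch of the edit-operation model described above (illustrative assumptions only: the adjacency-map representation, the example weights, and the reconnection rule are not taken from the paper) shows the weighted operations and the degree-2 restriction on deletions; the editing distance itself is the minimum total weight over all valid edit scripts, which is what the paper's algorithm computes:

    # Illustration of the weighted edit-operation cost model and the degree-2
    # restriction from the abstract (not the paper's algorithm).
    WEIGHTS = {"insert": 1.0, "delete": 1.0, "relabel": 0.5}

    def degree(graph, v):
        return len(graph[v])

    def can_delete(graph, v):
        # Degree-2 constraint: only nodes with at most 2 neighbors may be deleted (or inserted).
        return degree(graph, v) <= 2

    def delete_node(graph, v):
        # Remove v from the adjacency map; if it had exactly 2 neighbors,
        # reconnect them so the graph stays connected and acyclic.
        nbrs = list(graph.pop(v))
        for u in nbrs:
            graph[u].discard(v)
        if len(nbrs) == 2:
            a, b = nbrs
            graph[a].add(b)
            graph[b].add(a)
        return WEIGHTS["delete"]

    # A small CUAL graph as an adjacency map (node labels omitted for brevity).
    g = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B"}, "D": {"B"}}
    print(can_delete(g, "B"))      # False: B has 3 neighbors, so deleting it is not allowed
    print(delete_node(g, "C"), g)  # C has 1 neighbor; cost 1.0, and the graph stays connected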
K2/Kleisli and GUS: experiments in integrated access to genomic data sources
- IBM Systems Journal, 2001
"... The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, "on-the- fly " integration through views, and integration through warehousing. In this paper we report on our experi ..."
Abstract - Cited by 74 (7 self)
The integration of heterogeneous data sources and software systems is a major issue in the biomedical community, and several approaches have been explored: linking databases, "on-the-fly" integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS, which downloads, cleans, integrates, and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear "winner". Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available in order to choose the best strategy for a particular application.
Approximate Tree Pattern Matching.
- In Apostolico and Galil (eds.), Pattern Matching Algorithms. Pp., 1997
"... ..."
Comparing hierarchical data in external memory
- In 25th Very Large Data Base Conference (VLDB), 1999
"... We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn+7m+5n, 6S, andO(MN+(M+N)S1:5), whereMandNare the input tree sizes,Sis the block size,m=M=S, andn=N=S. This ..."
Abstract - Cited by 65 (2 self)
We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn + 7m + 5n, 6S, and O(MN + (M+N)S^1.5), where M and N are the input tree sizes, S is the block size, m = M/S, and n = N/S. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs.
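The sequence-level mapping mentioned in the last sentence is the classic edit graph: vertex (i, j) represents the prefix pair (a[:i], b[:j]), horizontal and vertical edges are insertions and deletions, diagonal edges are matches or substitutions, and a minimum-cost edit script corresponds to a shortest path from (0, 0) to (m, n). A small in-memory sketch of the sequence case follows (the paper's contribution is the external-memory extension of this idea to trees; the function name and unit costs are illustrative):

    # Sequence comparison as a shortest path in an edit graph (in-memory sketch).
    # Vertex (i, j) = prefixes a[:i], b[:j]; edges: delete a[i-1] (vertical),
    # insert b[j-1] (horizontal), match/substitute (diagonal). A DP over this DAG
    # computes the shortest-path cost.
    def edit_distance(a, b, ins=1, dele=1, sub=1):
        m, n = len(a), len(b)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dist[i][0] = dist[i - 1][0] + dele             # delete a[i-1]
        for j in range(1, n + 1):
            dist[0][j] = dist[0][j - 1] + ins              # insert b[j-1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                diag = 0 if a[i - 1] == b[j - 1] else sub
                dist[i][j] = min(dist[i - 1][j] + dele,    # vertical edge
                                 dist[i][j - 1] + ins,     # horizontal edge
                                 dist[i - 1][j - 1] + diag)  # diagonal edge
        return dist[m][n]   # cost of the shortest (0,0) -> (m,n) path

    print(edit_distance("kitten", "sitting"))  # 3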
A Survey of Paraphrasing and Textual Entailment Methods
2010
"... Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads ( ..."
Abstract - Cited by 57 (3 self)
Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions, such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.