A methodology for clustering XML documents by structure
Information Systems, 2006
Cited by 50 (0 self)
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the usage of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the usage of structural summaries for trees to improve the performance of the distance calculation and at the same time to maintain or even improve its quality. Our approach is tested using a prototype testbed.
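The idea of summarizing each document's structure and then clustering on a structural distance can be illustrated with a minimal sketch. This is not the paper's method: as a simple stand-in for its tree distance and structural summaries, the sketch assumes each document is reduced to its set of root-to-node tag paths (so repeated siblings collapse), compares documents by Jaccard distance, and merges clusters by single-link agglomeration. The names `label_paths`, `path_distance`, and `single_link_clusters`, and the `threshold` parameter, are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def label_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document.

    Using a set collapses repeated siblings, a crude structural summary
    (an assumption of this sketch, not the paper's summary).
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def path_distance(a, b):
    """Jaccard distance between two path sets (0.0 means identical sets)."""
    return 1.0 - len(a & b) / len(a | b)


def single_link_clusters(docs, threshold):
    """Agglomerate documents until the closest clusters exceed threshold."""
    summaries = [label_paths(d) for d in docs]
    clusters = [{i} for i in range(len(docs))]

    def link(c1, c2):
        # Single link: distance of the closest pair across two clusters.
        return min(path_distance(summaries[i], summaries[j])
                   for i in c1 for j in c2)

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[x], clusters[y]) > threshold:
            break
        clusters[x] |= clusters[y]
        del clusters[y]  # safe because x < y
    return sorted(sorted(c) for c in clusters)


docs = ["<a><b/><b/><c/></a>", "<a><b/><c/></a>", "<x><y/></x>"]
print(single_link_clusters(docs, threshold=0.5))
```

The first two documents differ only in a repeated `b` element, so their path sets coincide and they end up in one cluster, while the third document shares no paths with them.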
A bit-vector algorithm for computing Levenshtein and Damerau edit distances
Nordic Journal of Computing, 2003
Cited by 11 (4 self)
The edit distance between strings A and B is defined as the minimum number of edit operations needed to convert A into B or vice versa. The Levenshtein edit distance allows three types of operations: insertion, deletion, or substitution of a character. The Damerau edit distance allows these three plus a transposition of two adjacent characters. To the best of our knowledge, the best current practical algorithms for computing these edit distances run in time O(dm) and O(σ + ⌈m/w⌉n), where d is the edit distance between the two strings, m and n are their lengths (m ≤ n), w is the computer word size, and σ is the size of the alphabet. In this paper we present an algorithm that runs in time O(σ + ⌈d/w⌉m). The structure of the algorithm is such that in practice it is mostly suitable for testing whether the edit distance between two strings is within some predetermined error threshold. We also present some initial test results with thresholded edit distance computation, in which our algorithm runs faster than the original algorithm of Myers.
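The two distances defined above can be made concrete with the classical dynamic-programming recurrence. This is a minimal sketch, not the bit-vector algorithm the abstract describes: it runs in O(mn) time, and it assumes the restricted form of the Damerau distance (often called optimal string alignment), in which a transposed pair is not edited again. The function names and the `transpositions` flag are hypothetical.

```python
def edit_distance(a, b, transpositions=False):
    """Classical O(mn) edit distance.

    transpositions=False gives the Levenshtein distance;
    transpositions=True additionally allows swapping two adjacent
    characters (restricted Damerau variant, an assumption of this sketch).
    """
    m, n = len(a), len(b)
    # d[i][j] is the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]


def within_threshold(a, b, k, transpositions=False):
    """Test whether the edit distance is at most k."""
    if abs(len(a) - len(b)) > k:
        return False  # the length difference alone already exceeds k
    return edit_distance(a, b, transpositions) <= k


print(edit_distance("kitten", "sitting"))            # Levenshtein
print(edit_distance("ab", "ba"))                     # two substitutions
print(edit_distance("ab", "ba", transpositions=True))  # one transposition
```

The `within_threshold` helper mirrors the thresholded use case mentioned in the abstract, although a threshold-aware algorithm would avoid filling the whole table.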
Approximate Multiple String Searching by Clustering
We are given a finite set S of text strings and a pattern P over some fixed alphabet Σ. The topic of this paper is the design of a data structure D(S) which supports approximate multiple string searching queries efficiently. For a given upper bound k ∈ Z⁺ on the allowable distance, P = p_1 ⋯ p_m is said to appear approximately in a text T = t_1 ⋯ t_n, with m, n ∈ Z⁺, if there exist positions u, v in T such that the edit distance between P and t_u ⋯ t_v is at most k. Let N denote the sum of the lengths of all strings in S. We present an algorithm that constructs the data structure D(S) in O(N) time and space. Afterwards, an approximate multiple string search query can be answered in O(N) expected time if the allowable distance k is bounded above by O(m / log m). The method can be used to search large nucleotide and amino acid sequence databases for similar sequences.
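The definition of an approximate occurrence can be checked directly with a column-wise dynamic program in the style of Sellers' algorithm. This is a minimal sketch of the matching condition only, not the data structure D(S) from the abstract: it scans a single text in O(mn) time and reports every end position v (1-based) for which some substring ending at v is within edit distance k of the pattern. The function name `approx_end_positions` is hypothetical.

```python
def approx_end_positions(pattern, text, k):
    """Return 1-based end positions v in text such that some substring
    ending at v has edit distance at most k to pattern."""
    m = len(pattern)
    # col[i] is the smallest distance between pattern[:i] and any substring
    # of the text ending at the current position; col[0] stays 0 because a
    # match may start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for v, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value from the previous column, row i - 1
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(col[i] + 1,        # text character left unmatched
                      col[i - 1] + 1,    # pattern character left unmatched
                      prev_diag + cost)  # match or substitution
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(v)
    return ends


print(approx_end_positions("abc", "xxabcxx", 0))
print(approx_end_positions("abc", "xxabcxx", 1))
```

Running this scan over every string in S costs time proportional to mN, which is the baseline that an index such as D(S) is meant to improve on.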