A methodology for clustering XML documents by structure (2006)
Venue: | Information Systems |
Citations: | 50 - 0 self |
Citations
10602 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2009
Citation Context ... since it has been shown to be theoretically sound, under a certain number of reasonable conditions [30]. 5.1.1 Single Link: We implemented a single link clustering algorithm using Prim’s algorithm =-=[31]-=- for computing the minimum spanning tree (MST) of a graph. Given a graph G with a set of weighted edges E and a set of vertices V, an MST is an acyclic subset T ⊆ E that links all the vertices and who... |
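The single link step quoted above can be illustrated with a minimal sketch. It assumes a symmetric matrix of pairwise distances as input and assumes that clusters are obtained by dropping MST edges heavier than a chosen level; the function names `prim_mst` and `single_link_clusters` and the `level` parameter are hypothetical and do not come from the cited paper.

```python
def prim_mst(dist):
    """Prim's algorithm on a dense, symmetric distance matrix.

    Returns the MST as a list of (u, v, weight) edges.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge linking each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def single_link_clusters(dist, level):
    """Assumed cut rule: keep MST edges with weight <= level, then
    report the connected components as clusters (lists of indices)."""
    n = len(dist)
    label = list(range(n))

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]  # path halving
            x = label[x]
        return x

    for u, v, w in prim_mst(dist):
        if w <= level:
            label[find(u)] = find(v)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with four documents where 0 and 1 are close, 2 and 3 are close, and the two pairs are far apart, cutting at a level between the small and large distances yields two clusters.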
852 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context ...sts time since all pairwise distances should be calculated again. Classification algorithms can assign new data to clusters already present. k-NN classification is a simple yet quite effective method =-=[38]-=-. A set of M training XML documents is ... [Table residue: clustering results without and with structural summaries; columns: Cluster No, a, b, c; in both cases rows 1 (DTD 1): 70, 0, 0 and 2 (DTD 2): 70, 0, 0; 3...] |
832 | The string-to-string correction problem
- Wagner, Fischer
- 1974
Citation Context .... 2.3.5 Discussion All of the algorithms for calculating the edit distance for two ordered labeled trees are based on dynamic programming techniques related to the string-to-string correction problem =-=[19]-=-. The key issue of these techniques is the detection of the set of tree edit operations which transforms a tree to another one with the minimum cost (assuming a cost model to assign costs for every tre... |
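As background for the dynamic programming idea mentioned above, here is a minimal sketch of the classic string edit distance recurrence. It works on strings, not trees, and the unit costs and the name `edit_distance` are assumptions chosen for illustration, not the cost model of the cited paper.

```python
def edit_distance(a, b, c_ins=1, c_del=1, c_rep=1):
    """Minimum total cost of insertions, deletions and replacements
    that turn sequence a into sequence b (dynamic programming)."""
    m, n = len(a), len(b)
    # D[i][j] = cost of transforming a[:i] into b[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + c_del      # delete every symbol of a[:i]
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + c_ins      # insert every symbol of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if a[i - 1] == b[j - 1] else c_rep
            D[i][j] = min(
                D[i - 1][j] + c_del,       # delete a[i-1]
                D[i][j - 1] + c_ins,       # insert b[j-1]
                D[i - 1][j - 1] + rep,     # keep or replace
            )
    return D[m][n]
```

The tree algorithms discussed in the quoted passage fill a similar table, but over subtrees instead of string prefixes.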
663 | An evaluation of statistical approaches to text categorization.
- Yang
- 1999
Citation Context ...uch case, clustering quality metrics will be affected (see next paragraphs). To evaluate the clustering results, we used two metrics quite popular in information retrieval: precision P R and recall R =-=[30, 35, 36]-=-. For an extracted cluster Ci that corresponds to a DTD Di let: 1. ai be the number of the XML documents in Ci that were indeed members of that cluster (correctly clustered), 2. bi be the number of th... |
573 | DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
- Goldman, Widom
- 2002
Citation Context .... Structural summaries have minimal processing requirements to extract and use instead of the original XML documents in the clustering procedure. Structural summaries resemble the dataguide summaries =-=[22]-=-. However, a dataguide is a summary of the structure of semistructured data described by the OEM model, while structural summaries are based on the XML data model (see Section 2.1 for the differences ... |
549 | An examination of procedures for determining the number of clusters in a data set
- Milligan
- 1985
Citation Context ...tection and single link clustering at level 0.6. A stopping rule is necessary to determine the most appropriate clustering level for the single link hierarchies. Milligan et al. present 30 such rules =-=[33]-=-. Among these rules, C−index [34] exhibits excellent performance (found in the top 3 stopping rules). We next present the way we adopt the C−index in a hierarchical clustering procedure. 5.1.2 C−index... |
530 | Querying semi-structured data
- Abiteboul
- 1997
Citation Context ...es t1, t2 | D[i,j]. Root of t1 = C, Root of t2 = K: D[0][0]=1, D[0][1]=3. Root of t1 = B, Root of t2 = K: D[0][0]=1, D[0][1]=3, D[1][0]=2, D[1][1]=3. Root of t1 = R, Root of t2 = R: D[0][0]=0, D[0][1]=1, D[0][2]=2, D[0][3]=5, D[0][4]=6, D[1][0]=1, D[1][1]=0, D[1][2]=1, D[1][3]=4, D[1][4]=5, D[2][0]=3, D[2][1]=2, D[2][2]=2, D[2][3]=4, D[2][4]=5, D[3][0]=4, D[3][1]=3, D[3][2]=3, D[3][3]=5, D[3][4]=5 (Distance ... |
509 | Object Exchange Across Heterogeneous Information Sources
- Papakonstantinou, García-Molina, et al.
- 1995
Citation Context ...le models which capture schemaless, self-describing and irregular data. The object exchange model (OEM) is a graph representation of a collection of objects. OEM was introduced in the TSIMMIS project =-=[12, 13]-=-. Every OEM object has an identifier and a value, atomic or complex. An atomic value is an integer, real, string or any other data, while a complex value is a set of oids, each linked to the parent no... |
480 | Reexamining the cluster hypothesis: Scatter/gather on retrieval results.
- Hearst, Pedersen
- 1996
Citation Context ...with time requirements of O(n^2), if n documents need to be clustered. However, hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval =-=[25, 26, 27]-=-. For a wide ranging overview of clustering methods one can refer to [28, 29]. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are bas... |
478 | Relational databases for querying xml documents: Limitations and opportunities
- Shanmugasundaram, Tufte, et al.
- 1999
Citation Context ... it. A DTD, besides enabling exchange of documents through common vocabulary and standards, can generate relational schemas to efficiently store and query XML documents in relational database systems =-=[5]-=-. However, many XML documents are constructed massively from data sources like RDBMSs, flat files, etc, without DTDs. XTRACT [6, 7] and IBM AlphaWorks DDbE 2 are DTD discovery tools that automatically... |
472 | Time warps, string edits, and macromolecules: the theory and practice of sequence comparison
- Sankoff, Kruskal, et al.
- 1983
Citation Context ...documents, is a useful task in bioinformatics. The detection of homologous protein structures encoded as XML documents (i.e. sets of protein structures sharing a similar structure) is such an example =-=[9]-=-. Other XML encodings for life sciences are presented in [10]. 1.2 Contribution: The contribution of this paper is a methodology for clustering XML documents by structure, exploiting algorithms to ca... |
417 | The TSIMMIS approach to mediation: Data models and languages.
- Garcia-Molina, Papakonstantinou, et al.
- 1997
Citation Context ...le models which capture schemaless, self-describing and irregular data. The object exchange model (OEM) is a graph representation of a collection of objects. OEM was introduced in the TSIMMIS project =-=[12, 13]-=-. Every OEM object has an identifier and a value, atomic or complex. An atomic value is an integer, real, string or any other data, while a complex value is a set of oids, each linked to the parent no... |
405 | Simple fast algorithms for the editing distance between trees and related problems
- Zhang, Shasha
- 1989
Citation Context ... [Figure 5: An example of an edit sequence to transform T1 to T2.] There are different approaches =-=[14, 15, 16, 17]-=- to determine tree edit sequences and tree edit distances. All utilize similar tree edit operations with minor variations. Before we discuss each algorithm in detail, we present a general form... |
305 | Information Retrieval
- Van Rijsbergen
- 1979
Citation Context ...e single link to be the basic clustering algorithm for the core part of the experiments for our work since it has been shown to be theoretically sound, under a certain number of reasonable conditions =-=[30]-=-. 5.1.1 Single Link: We implemented a single link clustering algorithm using Prim’s algorithm [31] for computing the minimum spanning tree (MST) of a graph. Given a graph G with a set of weighted ed... |
280 | The Tree-to-Tree Correction Problem.
- Tai
- 1979
Citation Context ...istance metric. As a result, we do not consider such methods. The first work that defined the tree edit distance and provided algorithms to compute it, permitting operations anywhere in the tree, was =-=[21]-=-. Selkow’s algorithm [14] allows insertion and deletion only at leaf nodes, and relabel at every node. Its main recursion leads to increased complexity. Chawathe’s (II) algorithm [17] allows insertion... |
261 | Hierarchical clustering algorithms for document datasets
- Zhao, Karypis
Citation Context ...n input, both single link and complete link performed 100% correctly. Having NumOfClusters as an input, the results were similar to ours. Non hierarchical methods, like repeated bisections algorithms =-=[37]-=-, showed similar results. We also performed the single link clustering task using IBM’s TreeDiff 9 , a set of Java beans that enable efficient differentiation and updating of DOM trees, providing its ... |
190 | XIRQL: A query language for information retrieval in XML documents.
- Fuhr, Großjohann
- 2001
Citation Context ...mal tree edit sequence (see also the discussion in Section 2.3.5). Other methods, like in [40], concentrate on unordered trees. Research has also been conducted in the Information Retrieval Community =-=[41, 42, 43]-=- to evaluate similarity by content in a document-centric approach of XML data. Other works that exploit structural distances are [24, 44]. In [24], the set of tree edit operations include two new ones... |
189 | Expert network: Effective and efficient learning from human decisions in text categorization and retrieval.
- Yang
- 1994
Citation Context ... the structural summaries of these trees. Then the k top-ranked documents are used to decide the winning cluster(s) by adding the distances for the training documents which represent the same cluster =-=[39, 38]-=-: $y(x, c_j) = \sum_{d_i \in kNN} S(x, d_i) \times y(d_i, c_j)$ (5) where: 1. x is an incoming document, d_i is a training document, c_j is a category, 2. y(d_i, c_j) = 1 if d_i belongs to c_j or 0 otherwise, 3. S(x, d_i) is the... |
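The scoring formula quoted above can be sketched as follows. Because the definition of S is truncated in the snippet, S is treated here as a caller-supplied function; ranking the training documents by ascending S to obtain the k top-ranked ones is an assumption, and the name `knn_scores` is hypothetical. The sketch only accumulates the per-category sums and leaves the choice of the winning cluster to the caller.

```python
from collections import defaultdict


def knn_scores(x, training, S, k):
    """Compute y(x, c_j) = sum over the k nearest d_i of S(x, d_i) * y(d_i, c_j).

    training: list of (document, category) pairs.
    S: caller-supplied function S(x, d) returning a number.
    Assumption: the k top-ranked documents are those with the smallest S.
    """
    ranked = sorted(training, key=lambda pair: S(x, pair[0]))[:k]
    scores = defaultdict(float)
    for d, c in ranked:
        # y(d_i, c_j) is 1 only for d_i's own category, so each neighbour
        # contributes S(x, d_i) to exactly one category.
        scores[c] += S(x, d)
    return dict(scores)
```

With numbers standing in for documents and `S = lambda x, d: abs(x - d)`, the call `knn_scores(0.0, [(1.0, "a"), (2.0, "a"), (10.0, "b")], S, 2)` sums only the two nearest neighbours, both in category "a".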
183 | Change detection in hierarchically structured information
- Chawathe, Rajaraman, et al.
- 1996
Citation Context ... [Figure 5: An example of an edit sequence to transform T1 to T2.] There are different approaches =-=[14, 15, 16, 17]-=- to determine tree edit sequences and tree edit distances. All utilize similar tree edit operations with minor variations. Before we discuss each algorithm in detail, we present a general form... |
174 | A graph distance metric based on the maximal common subgraph
- Bunke, Shearer
- 1998
Citation Context ... are stored in tables of relational database systems. Such a grouping decreases the number of join operations needed between tables during the query evaluation. The metric (originally suggested in =-=[45]-=-) is applied on graphs representing XML data, and it is based on the number of the common edges between graphs. The approach does not take into account the position of the edges in the graphs. In our ... |
159 | Detecting changes in XML documents
- Cobena, Abiteboul, et al.
- 2002
Citation Context ...of the set of tree edit operations which transforms a tree to another one with the minimum cost (assuming a cost model to assign costs for every tree edit operation). Methods for change detection (see =-=[20]-=- for a comparative study) can detect sets of edit operations with cost close to the minimal with significantly reduced computation time. However, minimality is important for the quality of any measure... |
158 | The use of hierarchical clustering in information retrieval
- Jardine, van Rijsbergen
- 1971
Citation Context ...with time requirements of O(n^2), if n documents need to be clustered. However, hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval =-=[25, 26, 27]-=-. For a wide ranging overview of clustering methods one can refer to [28, 29]. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are bas... |
154 | Clustering Algorithms, In
- Rasmussen
- 1992
Citation Context ...hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval [25, 26, 27]. For a wide ranging overview of clustering methods one can refer to =-=[28, 29]-=-. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are based on a similar idea: 1. Each element of the data set to be clustered is cons... |
132 | Minimum spanning trees and single linkage cluster analysis
- Gower, Ross
- 1969
Citation Context ...ges E and a set of vertices V , a MST is an acyclic subset T ⊆ E that links all the vertices and whose total weight W (T ) (the sum of the weights for the edges in T ) is minimized. It has been shown =-=[32]-=- that a MST contains all the information needed in order to perform single link clustering. Given n structural summaries of rooted labeled trees that represent XML documents, we form a fully connected... |
132 | X-Diff: An effective change detection algorithm for XML documents
- Wang, DeWitt, et al.
- 2003
Citation Context ...ng such data. Methods for file change detection [20] are related to our work, but they do not compute the minimal tree edit sequence (see also the discussion in Section 2.3.5). Other methods, like in =-=[40]-=-, concentrate on unordered trees. Research has also been conducted in the Information Retrieval Community [41, 42, 43] to evaluate similarity by content in a document-centric approach of XML data. Oth... |
125 | XTRACT: A System for Extracting Document Type Descriptors from XML Documents. Bell Labs Tech. Memorandum
- Garofalakis, Gionis, et al.
- 1999
Citation Context ...fficiently store and query XML documents in relational database systems [5]. However, many XML documents are constructed massively from data sources like RDBMSs, flat files, etc, without DTDs. XTRACT =-=[6, 7]-=- and IBM AlphaWorks DDbE 2 are DTD discovery tools that automatically extract DTDs from XML documents. Such tools fail to discover meaningful DTDs in case of diverse XML document collections [7]. Cons... |
124 | The Tree-to-Tree Editing Problem.
- Selkow
- 1977
Citation Context ... [Figure 5: An example of an edit sequence to transform T1 to T2.] There are different approaches =-=[14, 15, 16, 17]-=- to determine tree edit sequences and tree edit distances. All utilize similar tree edit operations with minor variations. Before we discuss each algorithm in detail, we present a general form... |
104 | Evaluating structural similarity in XML documents.
- Nierman, Jagadish
- 2002
Citation Context ...at we use does not need the costly edit graph calculation of the latter (see the timing analysis in Section 6.4). A similar recurrence but for a different set of tree edit operations has been used in =-=[24]-=- (see Section 7). An insert node operation is permitted only if the new node becomes a leaf. A delete node operation is permitted only at leaf nodes. Any node can be updated using the replace node ope... |
82 | An O(ND) difference algorithm and its variations. Algorithmica 1(2) - Myers - 1986 |
81 | Searching XML documents via XML fragments
- Carmel, Maarek, et al.
- 2003
Citation Context ...mal tree edit sequence (see also the discussion in Section 2.3.5). Other methods, like in [40], concentrate on unordered trees. Research has also been conducted in the Information Retrieval Community =-=[41, 42, 43]-=- to evaluate similarity by content in a document-centric approach of XML data. Other works that exploit structural distances are [24, 44]. In [24], the set of tree edit operations include two new ones... |
65 | Comparing hierarchical data in external memory
- Chawathe
- 1999
Citation Context ... [Figure 5: An example of an edit sequence to transform T1 to T2.] There are different approaches =-=[14, 15, 16, 17]-=- to determine tree edit sequences and tree edit distances. All utilize similar tree edit operations with minor variations. Before we discuss each algorithm in detail, we present a general form... |
59 | Statistical synopses for graph-structured XML databases
- Polyzotis, Garofalakis
- 2002
Citation Context ...le structural summaries are based on the XML data model (see Section 2.1 for the differences between these two models). Summaries in the form of synopses for XML databases have also been exploited in =-=[23]-=-. Such synopses approximate the path and branching distribution of the structure of XML data. They are used to support optimization for queries posed on XML data, and especially to enable accurate sel... |
49 | An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. In
- Lian, Cheung, et al.
- 2004
Citation Context ...also been conducted in the Information Retrieval Community [41, 42, 43] to evaluate similarity by content in a document-centric approach of XML data. Other works that exploit structural distances are =-=[24, 44]-=-. In [24], the set of tree edit operations include two new ones which refer to whole trees (insert tree and delete tree operations) rather than nodes. Trees are pre-processed for checking whether a su... |
47 | The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval
- Voorhees
- 1986
Citation Context ...with time requirements of O(n^2), if n documents need to be clustered. However, hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval =-=[25, 26, 27]-=-. For a wide ranging overview of clustering methods one can refer to [28, 29]. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are bas... |
28 | XTRACT: Learning Document Type Descriptors from XML Document Collections
- Garofalakis, Gionis, et al.
- 2003
Citation Context ...fficiently store and query XML documents in relational database systems [5]. However, many XML documents are constructed massively from data sources like RDBMSs, flat files, etc, without DTDs. XTRACT =-=[6, 7]-=- and IBM AlphaWorks DDbE 2 are DTD discovery tools that automatically extract DTDs from XML documents. Such tools fail to discover meaningful DTDs in case of diverse XML document collections [7]. Cons... |
28 | Clustering algorithms and validity measures
- Halkidi, Batistakis, et al.
Citation Context ...hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval [25, 26, 27]. For a wide ranging overview of clustering methods one can refer to =-=[28, 29]-=-. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are based on a similar idea: 1. Each element of the data set to be clustered is cons... |
16 | Configurable Indexing and Ranking for XML Information Retrieval
- Liu, Zou, et al.
- 2004
Citation Context ...mal tree edit sequence (see also the discussion in Section 2.3.5). Other methods, like in [40], concentrate on unordered trees. Research has also been conducted in the Information Retrieval Community =-=[41, 42, 43]-=- to evaluate similarity by content in a document-centric approach of XML data. Other works that exploit structural distances are [24, 44]. In [24], the set of tree edit operations include two new ones... |
13 | Expert network: Effective and efficient learning from human decisions in text categorization and retrieval
- Yang
- 1994
Citation Context ... the structural summaries of these trees. Then the k top-ranked documents are used to decide the winning cluster(s) by adding the distances for the training documents which represent the same cluster =-=[33, 32]-=-: $y(x, c_j) = \sum_{d_i \in kNN} S(x, d_i) \times y(d_i, c_j)$ (6) where: 1. x is an incoming document, d_i is a training document, c_j is a category, 2. y(d_i, c_j) = 1 if d_i belongs to c_j or 0 otherwise, 3. S(x, d_i) is the st... |
6 | Specifying transformations for structured documents
- Tang, Tompa
Citation Context ...nd t2 with only its first 3 subtrees. The algorithm spends cr = 1 to replace B with K, cr = 1 to replace D with C, ci = 1 to insert P under C and ci = 1 to insert D under R: a cost of 4 units. 2. D[3][4] = 5: D[3][4] keeps the distance between t1 with its first 3 subtrees and t2 with its first 4 subtrees. Actually, this is the distance between T1 and T2. The algorithm spends cr = 1 to replac... |
5 | A general statistical framework for assessing categorical clustering in free recall
- Hubert, Levin
- 1976
Citation Context ...g at level 0.6. A stopping rule is necessary to determine the most appropriate clustering level for the single link hierarchies. Milligan et al. present 30 such rules [33]. Among these rules, C−index =-=[34]-=- exhibits excellent performance (found in the top 3 stopping rules). We next present the way we adopt the C−index in a hierarchical clustering procedure. 5.1.2 C−index for Hierarchical Clustering C−in... |
4 | An O(ND) difference algorithm and its variations, Algorithmica 1 - Myers - 1986 |
3 | Geographical Data Interchange Using XML-Enabled Technology within the GIDB(TM) System, invited in edited manuscript: A. B. Chaudhri (ed), XML Data Management
- Wilson, Cobb, et al.
- 2003
Citation Context ...nce D2 only misses the river element. On the other hand, area encoded by D3 is organized in a different way than D1 and D2. Examples on using XML representation for geographical data are presented in =-=[8]-=-. (D1) --- <?xml version="1.0"?> <area type="rectangle" x1="100" y1="200"..."> <forest type="rectangle" x1="20" x2="20"..."> <lake type="circle" x1="5" y1="10" r1="5">> The lake </lake> <farm type="re... |
3 | Detecting similarities between XML documents
- Flesca, Manco, et al.
- 2002
Citation Context ... these properties hold (see for example Figure 27 for the symmetry) but formal study is needed to confirm it. Also, we will study how to employ vector-based representation of tree structures (like in =-=[46, 47]-=-) to further explore the problem of clustering by structure. Other interesting issues involve (a) the application of the framework in collections where the repetition of nodes has a certain meaning, s... |
3 | The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval
- Voorhees
- 1986
Citation Context ...ive, with time requirements of O(n^2), if n documents need to be clustered. However, hierarchical methods have been used extensively as a means of increasing the effectiveness and efficiency of retrieval =-=[19, 20, 21]-=-. For a wide ranging overview of clustering methods one can refer to [22, 23]. Single link, complete link and group average link are known as hierarchical clustering methods. All these methods are bas... |
2 | Knowledge management in bioinformatics, in
- Direen, Jones
Citation Context ... of homologous protein structures encoded as XML documents (i.e. sets of protein structures sharing a similar structure) is such an example [9]. Other XML encodings for life sciences are presented in =-=[10]-=-. 1.2 Contribution: The contribution of this paper is a methodology for clustering XML documents by structure, exploiting algorithms to calculate the minimum cost (known as tree edit distance) to tra... |
1 | Using a structural distance metric to cluster xml documents by structure
- Dalamagas, Cheng, et al.
- 2004
Citation Context ...represent XML documents instead of the original trees improves further the performance of the structural distance calculation without affecting its quality. Preliminary work has also appeared in =-=[11]-=-. 1.3 Outline: The paper is organised as follows. Section 2 presents background information for the representation of XML data as rooted ordered labeled trees or graphs and analyzes various algorithms ... |
1 | Evaluating text categorization, in
- Lewis
- 1991
Citation Context ...uch case, clustering quality metrics will be affected (see next paragraphs). To evaluate the clustering results, we used two metrics quite popular in information retrieval: precision P R and recall R =-=[30, 35, 36]-=-. For an extracted cluster Ci that corresponds to a DTD Di let: 1. ai be the number of the XML documents in Ci that were indeed members of that cluster (correctly clustered), 2. bi be the number of th... |
1 | Bitmap indexing-based clustering and retrieval of xml documents
- Yoon, Raghavan
- 2001
Citation Context ... these properties hold (see for example Figure 27 for the symmetry) but formal study is needed to confirm it. Also, we will study how to employ vector-based representation of tree structures (like in =-=[46, 47]-=-) to further explore the problem of clustering by structure. Other interesting issues involve (a) the application of the framework in collections where the repetition of nodes has a certain meaning, s... |