Results 1 - 10 of 50
PageTailor: Reusable End-User Customization for the Mobile Web
2007
"... Most pages on the Web are designed for the desktop environment and render poorly on the small screens available on handheld devices. Existing techniques to support small screens require that either providers handcraft device-specific versions or automatically adapt Web pages ignoring the interests o ..."
Cited by 19 (2 self)
Abstract:
Most pages on the Web are designed for the desktop environment and render poorly on the small screens available on handheld devices. Existing techniques to support small screens either require providers to handcraft device-specific versions or adapt Web pages automatically, ignoring the interests of the user. We introduce Reusable End-User Customization, a technique for adapting Web pages to the limited screens of handheld devices. We present PageTailor, an adaptive Web browser that implements this technique and lets users adapt the layout of Web pages. We show that users can utilize PageTailor to adapt sophisticated Web sites, such as Amazon, BBC and MSN, for browsing on a PDA. We also show that user customizations remain effective for up to a year, even as the content of pages is updated, and that customizations can be reused across similar pages, limiting the customization effort required to browse an entire site.
The pq-Gram Distance between Ordered Labeled Trees
ACM Transactions on Database Systems (TODS), 2010
"... When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the ..."
Cited by 13 (5 self)
Abstract:
When integrating data from autonomous sources, exact matches of data items that represent the same real world object often fail due to a lack of common keys. Yet in many cases structural information is available and can be used to match such data. Typically the matching must be approximate since the representations in the sources differ. We propose pq-grams to approximately match hierarchical data from autonomous sources and define the pq-gram distance between ordered labeled trees as an effective and efficient approximation of the fanout weighted tree edit distance. We prove that the pq-gram distance is a lower bound of the fanout weighted tree edit distance and give a normalization of the pq-gram distance for which the triangle inequality holds. Experiments on synthetic and real world data (residential addresses and XML) confirm the scalability of our approach and show the effectiveness of pq-grams.
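As an illustration of the pq-gram idea described above, the following Python sketch builds the bag of pq-grams of an ordered labeled tree (each gram is an anchor node with p-1 ancestors and q consecutive children, dummy nodes marked '*') and computes the normalized distance 1 - 2|P1 ∩ P2| / (|P1| + |P2|). The tuple-based tree representation and the defaults p=2, q=3 are choices made for this sketch, not taken from the paper's implementation.

```python
from collections import Counter

def pq_gram_profile(tree, p=2, q=3):
    """Bag of pq-grams of an ordered labeled tree.

    A tree is a (label, children) pair, e.g. ("a", [("b", []), ("c", [])]).
    Each pq-gram is the label tuple of a stem (anchor node plus p-1 ancestors)
    and a base (q consecutive children); '*' marks dummy nodes.
    """
    profile = Counter()

    def visit(node, anc):
        label, children = node
        anc = (anc + (label,))[-p:]              # anchor and up to p-1 ancestors
        stem = ('*',) * (p - len(anc)) + anc     # pad with dummy ancestors
        base = ('*',) * q
        if not children:
            profile[stem + base] += 1            # leaf: base is all dummies
        else:
            for child in children:
                base = base[1:] + (child[0],)    # slide the sibling window
                profile[stem + base] += 1
                visit(child, anc)
            for _ in range(q - 1):               # trailing dummy windows
                base = base[1:] + ('*',)
                profile[stem + base] += 1

    visit(tree, ())
    return profile

def pq_gram_distance(t1, t2, p=2, q=3):
    """1 - 2*|bag intersection| / (|P1| + |P2|), a value in [0, 1]."""
    p1, p2 = pq_gram_profile(t1, p, q), pq_gram_profile(t2, p, q)
    shared = sum((p1 & p2).values())
    return 1.0 - 2.0 * shared / (sum(p1.values()) + sum(p2.values()))
```

Identical trees have distance 0; trees whose profiles share no grams have distance 1.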
Approximate joins for data-centric XML
In Proceedings of the International Conference on Data Engineering (ICDE), IEEE Computer Society, 2008
"... Abstract- In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly different in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matc ..."
Cited by 10 (4 self)
Abstract:
In data integration applications, a join matches elements that are common to two data sources. Often, however, elements are represented slightly differently in each source, so an approximate join must be used. For XML data, most approximate join strategies are based on some ordered tree matching technique. But in data-centric XML the order is irrelevant: two elements should match even if their subelement order varies. In this paper we give a solution for the approximate join of unordered trees. Our solution is based on windowed pq-grams. We develop an efficient technique to systematically generate windowed pq-grams in a three-step process: sorting the unordered tree, extending the sorted tree with dummy nodes, and computing the windowed pq-grams on the extended tree. The windowed pq-gram distance between two sorted trees approximates the tree edit distance between the respective unordered trees. The approximate join algorithm based on windowed pq-grams is implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Our experiments with synthetic and real world data confirm the analytic results and suggest that our technique is both useful and scalable.
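The last point, implementing the approximate join as an equality join on serialized grams rather than as a pairwise distance computation, can be sketched as follows. The windowed pq-gram generation itself (sorting, dummy nodes, windows) is not reproduced; the sketch assumes each tree has already been turned into a bag of gram strings, and the names profiles and tau are assumptions made here.

```python
from collections import Counter, defaultdict
from itertools import combinations

def approximate_join(profiles, tau=0.5):
    """Report tree pairs whose gram-based distance is at most tau.

    profiles maps a tree id to a Counter (bag) of serialized grams, e.g.
    {"t1": Counter({"*|a|*|*|b": 1, ...}), "t2": ...}.  Rather than comparing
    every pair of trees, grams are joined on equality: an inverted index
    groups the trees that share a gram, and only those pairs accumulate an
    intersection count.
    """
    index = defaultdict(list)                       # gram -> [(tree_id, count)]
    for tid, prof in profiles.items():
        for gram, cnt in prof.items():
            index[gram].append((tid, cnt))

    shared = Counter()                              # (tid1, tid2) -> bag-intersection size
    for owners in index.values():
        for (t1, c1), (t2, c2) in combinations(owners, 2):
            pair = (t1, t2) if t1 < t2 else (t2, t1)
            shared[pair] += min(c1, c2)

    sizes = {tid: sum(prof.values()) for tid, prof in profiles.items()}
    joined = []
    for (t1, t2), common in shared.items():
        dist = 1.0 - 2.0 * common / (sizes[t1] + sizes[t2])
        if dist <= tau:
            joined.append((t1, t2, dist))
    return joined
```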
An overview of similarity measures for clustering XML documents
Chapter in Athena Vakali and George Pallis (eds.), 2006
"... The large amount and heterogeneity of XML documents on the Web require the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and links inside and among documents. For instance, grouping together do ..."
Cited by 9 (2 self)
Abstract:
The large amount and heterogeneity of XML documents on the Web require the development of clustering techniques to group together similar documents. Documents can be grouped together according to their content, their structure, and links inside and among documents. For instance, grouping together documents with similar structures has interesting applications in the context of information extraction, heterogeneous data integration, personalized content delivery, access control definition, Web site structural analysis, and comparison of RNA secondary structures. Many approaches have been proposed for evaluating the structural and content similarity between tree-based and vector-based representations of XML documents. Link-based similarity approaches developed for Web data clustering have been adapted for XML documents. This chapter discusses and compares the most relevant similarity measures and their employment for XML document clustering.
A flexible structured-based representation for XML document mining
In Proceedings of the 4th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX'05, Schloss Dagstuhl, 2006
"... Abstract. This paper reports on the INRIA group’s approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that allows taking into account the structure only or both the structure and content. Our approach consists of representin ..."
Cited by 7 (1 self)
Abstract:
This paper reports on the INRIA group’s approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that allows taking into account either the structure only or both the structure and the content. Our approach consists of representing XML documents by a set of their sub-paths, defined according to some criteria (length, root beginning, leaf ending). By considering those sub-paths as words, we can use standard methods for vocabulary reduction, and simple clustering methods such as k-means. We use an implementation of the clustering algorithm known as dynamic clouds that can work with distinct groups of independent modalities put in separate variables. This is useful in our model since embedded sub-paths are not independent: we split potentially dependent paths into separate variables, resulting in each of them containing independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach could not scale well for large structure-and-content collections.
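A minimal sketch of the structure-only part of this representation: each document is reduced to its downward tag sub-paths up to a maximum length (the root-beginning and leaf-ending criteria are omitted), the sub-paths are counted like words, and the resulting vectors are clustered. Plain k-means from scikit-learn stands in for the dynamic clouds algorithm used in the paper, and the parameter names (max_len, k) are assumptions.

```python
import numpy as np
import xml.etree.ElementTree as ET
from sklearn.cluster import KMeans

def sub_paths(xml_string, max_len=3):
    """All downward tag sub-paths of length <= max_len, as '/'-joined strings."""
    root = ET.fromstring(xml_string)
    paths = []

    def walk(node, chain):
        chain = chain + [node.tag]
        for l in range(1, min(max_len, len(chain)) + 1):
            paths.append("/".join(chain[-l:]))     # every sub-path ending at this node
        for child in node:
            walk(child, chain)

    walk(root, [])
    return paths

def cluster_by_structure(docs, k=2, max_len=3):
    """k-means over 'bag of sub-paths' vectors, one row per XML document string."""
    bags = [sub_paths(d, max_len) for d in docs]
    vocab = sorted({p for bag in bags for p in bag})
    col = {p: i for i, p in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, bag in enumerate(bags):
        for p in bag:
            X[row, col[p]] += 1
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```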
Highly Efficient Algorithms for Structural Clustering of Large Websites
"... In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that, using only the URLs of the webpages and simple content features, it is possible to cluster webpages effectively and efficiently. At the heart of our techniques is a principled fra ..."
Cited by 6 (0 self)
Abstract:
In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that, using only the URLs of the webpages and simple content features, it is possible to cluster webpages effectively and efficiently. At the heart of our techniques is a principled framework, based on the principles of information theory, that allows us to effectively leverage the URLs, and combine them with content and structural properties. Using an extensive evaluation over several large full websites, we demonstrate the effectiveness of our techniques, at a scale unattainable by previous techniques.
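The abstract does not spell out the algorithm, so the following is only a toy illustration of the observation that URLs alone carry a lot of structural signal, not the paper's information-theoretic framework: URLs are collapsed into coarse path patterns (numeric and id-like segments become wildcards, query values are dropped), and pages sharing a pattern are grouped together. All names and heuristics here are assumptions.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

def url_pattern(url):
    """Collapse a URL into a coarse structural pattern (illustrative heuristic)."""
    parts = urlparse(url)
    segments = []
    for seg in parts.path.strip("/").split("/"):
        if not seg:
            continue
        # numeric or long hex-like segments are treated as page ids
        if re.fullmatch(r"\d+", seg) or re.fullmatch(r"[0-9a-f]{8,}", seg):
            segments.append("*")
        else:
            segments.append(seg)
    params = sorted({kv.split("=")[0] for kv in parts.query.split("&") if kv})
    return parts.netloc + "/" + "/".join(segments) + ("?" + "&".join(params) if params else "")

def group_by_url_pattern(urls):
    """Pages whose URLs collapse to the same pattern land in the same group."""
    groups = defaultdict(list)
    for u in urls:
        groups[url_pattern(u)].append(u)
    return list(groups.values())
```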
XEdge: Clustering homogeneous and heterogeneous XML documents using edge summaries
2008
"... In this paper we propose a unified clustering algorithm for both homogeneous and heterogeneous XML documents. Depending on the type of the XML documents, the proposed algorithm modifies its distance metric in order to properly adapt to the special structural characteristics of homogeneous and hetero ..."
Cited by 6 (2 self)
Abstract:
In this paper we propose a unified clustering algorithm for both homogeneous and heterogeneous XML documents. Depending on the type of the XML documents, the proposed algorithm modifies its distance metric in order to properly adapt to the special structural characteristics of homogeneous and heterogeneous XML documents. We compare the quality of the formed clusters with those of one of the latest XML clustering algorithms and show that our algorithm outperforms it in the case of both homogeneous and heterogeneous XML documents.
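As a rough stand-in for the edge-summary idea (the paper's actual summaries and its adaptive distance metric are not reproduced here), each document can be summarized by the bag of its parent-to-child tag edges and documents compared by bag overlap:

```python
from collections import Counter
import xml.etree.ElementTree as ET

def edge_summary(xml_string):
    """Bag of parent->child tag edges: a crude structural summary of one document."""
    root = ET.fromstring(xml_string)
    edges = Counter()

    def walk(node):
        for child in node:
            edges[(node.tag, child.tag)] += 1
            walk(child)

    walk(root)
    return edges

def edge_distance(s1, s2):
    """1 - 2*|bag intersection| / (|s1| + |s2|), a value in [0, 1]."""
    shared = sum((s1 & s2).values())
    total = sum(s1.values()) + sum(s2.values())
    return 1.0 - 2.0 * shared / total if total else 0.0
```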
Structural Similarity Evaluation between XML Documents and DTDs
"... Abstract. The automatic processing and management of XML-based data are ever more popular research issues due to the increasing abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the pr ..."
Cited by 5 (3 self)
Abstract:
The automatic processing and management of XML-based data are ever more popular research issues due to the increasingly abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the process of matching XML documents and XML grammars, useful in various applications such as document classification, retrieval, and selective dissemination of information. In this paper, we propose an algorithm for measuring the structural similarity between an XML document and a Document Type Definition (DTD), considered the simplest way of specifying structural constraints on XML documents. We consider the various DTD operators that designate constraints on the existence, repeatability and alternativeness of XML elements/attributes. Our approach is based on the concept of tree edit distance, as an effective and efficient means for comparing tree structures, XML documents and DTDs being modeled as ordered labeled trees. It is of polynomial complexity, in comparison with existing exponential algorithms. Classification experiments, conducted on large sets of real and synthetic XML documents, underline our approach’s effectiveness, as well as its applicability to large XML repositories and databases. Keywords: Semi-structured XML-based data, XML grammar, DTD, structural similarity, tree edit distance.
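The tree edit distance that this approach (and several of the papers above) builds on can be stated compactly. The sketch below is the classic recurrence for ordered labeled trees with unit costs, written as plain memoized recursion for small examples; it is not the paper's polynomial document-vs-DTD algorithm and does not handle the DTD operators (?, *, +, |).

```python
from functools import lru_cache

# A tree is (label, (child, child, ...)); a forest is a tuple of trees.
def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1])

def tree_edit_distance(t1, t2):
    """Edit distance between two ordered labeled trees with unit costs for
    node insertion, deletion and relabeling.  Plain memoized recursion on
    forests; fine for small trees, not tuned for large documents."""

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:                                   # insert all remaining nodes
            return sum(tree_size(t) for t in f2)
        if not f2:                                   # delete all remaining nodes
            return sum(tree_size(t) for t in f1)
        (l1, c1), rest1 = f1[-1], f1[:-1]            # rightmost root of each forest
        (l2, c2), rest2 = f2[-1], f2[:-1]
        return min(
            dist(rest1 + c1, f2) + 1,                # delete the rightmost root of f1
            dist(f1, rest2 + c2) + 1,                # insert the rightmost root of f2
            dist(c1, c2) + dist(rest1, rest2)        # match the two roots,
                + (0 if l1 == l2 else 1),            # relabeling if labels differ
        )

    return dist((t1,), (t2,))
```

For example, tree_edit_distance(("a", (("b", ()), ("c", ()))), ("a", (("b", ()),))) returns 1: delete the c leaf.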
Efficient XML Structural Similarity Detection using Sub-tree Commonalities
In SBBD and SIGMOD DiSC, 2007
"... Abstract. Developing efficient techniques for comparing XML-based documents becomes essential in the database and information retrieval communities. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. Most of them make use of tec ..."
Cited by 5 (2 self)
Abstract:
Developing efficient techniques for comparing XML-based documents has become essential in the database and information retrieval communities. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being modeled as ordered labeled trees. Nevertheless, a thorough investigation of current approaches led us to identify several unaddressed structural similarities, i.e. sub-tree related similarities, when comparing XML documents. In this paper, we provide an improved comparison method to deal with such resemblances. Our approach is based on the concept of tree edit distance, introducing the notion of commonality between sub-trees. Experiments demonstrate that our approach yields better similarity results with respect to alternative methods, while maintaining quadratic time complexity.
A Fine-Grained XML Structural Comparison Approach
"... Abstract. As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, so as to improve the efficiency of similarity clustering, information retrieval and data management applications. Various algorithms for comparing ..."
Cited by 4 (3 self)
Abstract:
As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, so as to improve the efficiency of similarity clustering, information retrieval and data management applications. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being modeled as ordered labeled trees. Nevertheless, a thorough investigation of current approaches led us to identify several structural similarity aspects, i.e. sub-tree related similarities, which are not sufficiently addressed when comparing XML documents. In this paper, we provide an improved comparison method to deal with fine-grained sub-trees and leaf node repetitions, without increasing overall complexity with respect to current XML comparison methods. Our approach consists of two main algorithms for discovering the structural commonality between sub-trees and computing tree-based edit operation costs. A prototype has been developed to evaluate the optimality and performance of our method. Experimental results, on both real and synthetic XML data, demonstrate better performance with respect to alternative XML comparison methods. Keywords: XML, Semi-structured data, Structural similarity, Tree edit distance.