Results 1 - 10 of 279
An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
- In Proceedings of the 30th ACM SIGMOD International Conference on Management of Data
, 2004
"... An increasing number of data sources now become avail-able on the Web, but often their contents are only acces-sible through query interfaces. For a domain of interest, there often exist many such sources with varied coverage or querying capabilities. As an important step to the in-tegration of thes ..."
Abstract
-
Cited by 112 (16 self)
- Add to MetaCart
(Show Context)
An increasing number of data sources are now available on the Web, but often their contents are only accessible through query interfaces. For a domain of interest, there often exist many such sources with varied coverage or querying capabilities. As an important step toward the integration of these sources, we consider the integration of their query interfaces. More specifically, we focus on the crucial step of the integration: accurately matching the interfaces. While the integration of query interfaces has received more attention recently, current approaches are not sufficiently general: (a) they all model interfaces with flat schemas; (b) most of them only consider 1:1 mappings of fields over the interfaces; (c) they all perform the integration in a black-box-like fashion and the whole process has to be restarted from scratch if anything goes wrong; and (d) they often require laborious parameter tuning. In this paper, we propose an interactive, clustering-based approach to matching query interfaces. The hierarchical nature of interfaces is captured with ordered trees. Varied types of complex mappings of fields are examined and several approaches are proposed to effectively identify these mappings. We put the human integrator back in the loop and propose several novel approaches to the interactive learning of parameters and the resolution of uncertain mappings. Extensive experiments are conducted and the results show that our approach is highly effective.
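To make the clustering idea concrete, here is a minimal sketch: field labels harvested from several query interfaces are grouped by average-link agglomerative clustering over a simple string similarity, so labels falling into the same cluster become candidate matches. The similarity measure, threshold, and example labels are illustrative assumptions, not the paper's actual algorithm, which works on ordered trees and supports complex mappings and interactive learning.

```python
# Illustrative sketch (not the paper's algorithm): cluster query-interface
# fields from different sources by label similarity, so that fields landing
# in the same cluster become candidate matches.
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Crude string similarity between two field labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_fields(labels, threshold=0.6):
    """Greedy agglomerative clustering: repeatedly merge the two most similar
    clusters (average-link) until no pair exceeds the similarity threshold."""
    clusters = [[lab] for lab in labels]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(label_similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                sim /= len(clusters[i]) * len(clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i].extend(clusters.pop(j))

# Hypothetical field labels from three book-search interfaces:
labels = ["Title", "Book Title", "Author", "Author Name", "ISBN", "Publisher"]
print(cluster_fields(labels))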
Web application security assessment by fault injection and behavior monitoring
- In WWW’03: Proceedings of the 12th international conference on World Wide Web
"... As a large and complex application platform, the World Wide Web is capable of delivering a broad range of sophisticated applications. However, many Web applications go through rapid development phases with extremely short turnaround time, making it difficult to eliminate vulnerabilities. Here we ana ..."
Abstract
-
Cited by 106 (3 self)
- Add to MetaCart
(Show Context)
As a large and complex application platform, the World Wide Web is capable of delivering a broad range of sophisticated applications. However, many Web applications go through rapid development phases with extremely short turnaround time, making it difficult to eliminate vulnerabilities. Here we analyze the design of Web application security assessment mechanisms in order to identify poor coding practices that render Web applications vulnerable to attacks such as SQL injection and cross-site scripting. We describe the use of a number of software-testing techniques (including dynamic analysis, black-box testing, fault injection, and behavior monitoring), and suggest mechanisms for applying these techniques to Web applications. Real-world situations are used to test a tool we named the Web Application Vulnerability and Error Scanner (WAVES, an open-source project available at
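As a rough illustration of the fault-injection step described above, the sketch below posts a few malicious-looking payloads into each text field of a form and flags responses containing common database-error signatures. The payloads, error signatures, target URL, and field names are hypothetical, and this is only a toy probe, not the WAVES scanner (which also crawls the application, drives a real browser, and performs behavior monitoring).

```python
# Illustrative sketch of form fault injection (not the WAVES implementation):
# post a few malicious-looking payloads into a form's text fields and flag
# responses whose body contains typical database-error signatures.
import requests  # assumed third-party HTTP client

PAYLOADS = ["'", "' OR '1'='1", "<script>alert(1)</script>"]
ERROR_SIGNATURES = ["SQL syntax", "ODBC", "unterminated string", "Warning: mysql"]

def probe_form(action_url: str, fields: dict[str, str]) -> list[tuple[str, str]]:
    """Submit each payload in each text field and report suspicious responses."""
    findings = []
    for name in fields:
        for payload in PAYLOADS:
            data = dict(fields, **{name: payload})
            body = requests.post(action_url, data=data, timeout=10).text
            if any(sig.lower() in body.lower() for sig in ERROR_SIGNATURES):
                findings.append((name, payload))
    return findings

# Hypothetical target form:
print(probe_form("http://example.com/login", {"user": "test", "password": "test"}))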
WISE-Integrator: An automatic integrator of web search interfaces for e-commerce
- In VLDB
, 2003
"... More and more databases are becoming Web accessible through form-based search interfaces, and many of these sources are E-commerce sites. Providing a unified access to multiple Ecommerce search engines selling similar products is of great importance in allowing users to search and compare products f ..."
Abstract
-
Cited by 106 (16 self)
- Add to MetaCart
More and more databases are becoming Web-accessible through form-based search interfaces, and many of these sources are E-commerce sites. Providing unified access to multiple E-commerce search engines selling similar products is of great importance in allowing users to search and compare products from multiple sites with ease. One key task for providing such a capability is to integrate the Web interfaces of these E-commerce search engines so that user queries can be submitted against the integrated interface. Currently, integrating such search interfaces is carried out either manually or semi-automatically, which is inefficient and difficult to maintain. In this paper, we present WISE-Integrator, a tool that performs automatic integration of Web Interfaces of Search Engines. WISE-Integrator employs sophisticated techniques to identify matching attributes from different search interfaces for integration. It also resolves domain differences of matching attributes. Our experimental results, based on 20 and 50 interfaces in two different domains, indicate that WISE-Integrator can achieve high attribute-matching accuracy and can produce high-quality integrated search interfaces without human interaction.
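A minimal sketch of the attribute-matching and domain-merging idea, under the assumption that each attribute exposes a label and (for selection lists) a set of values: two attributes match when a weighted combination of label similarity and value-domain overlap clears a threshold, and matched attributes are merged by taking the union of their domains. The weights, threshold, and example attributes are illustrative; WISE-Integrator's actual clustering and domain-resolution techniques are more elaborate.

```python
# Illustrative sketch (not WISE-Integrator's actual matcher): decide whether two
# interface attributes match by combining label similarity with the overlap of
# their value domains, then merge the domains for the unified interface.
from difflib import SequenceMatcher

def attrs_match(a, b, name_w=0.5, value_w=0.5, threshold=0.55):
    name_sim = SequenceMatcher(None, a["label"].lower(), b["label"].lower()).ratio()
    va, vb = set(map(str.lower, a["values"])), set(map(str.lower, b["values"]))
    value_sim = len(va & vb) / len(va | vb) if (va or vb) else 0.0
    return name_w * name_sim + value_w * value_sim >= threshold

def merge(a, b):
    """Unified attribute: keep one label, take the union of the value domains."""
    return {"label": a["label"], "values": sorted(set(a["values"]) | set(b["values"]))}

# Two hypothetical e-commerce interfaces describing the same concept:
fmt1 = {"label": "Book Format", "values": ["Hardcover", "Paperback", "Audio CD"]}
fmt2 = {"label": "Format", "values": ["Paperback", "Audio CD"]}
if attrs_match(fmt1, fmt2):
    print(merge(fmt1, fmt2))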
Understanding web query interfaces: Best-effort parsing with hidden syntax
- In SIGMOD Conference
, 2004
"... Recently, the Web has been rapidly “deepened ” by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says – or what query capabilities a source supports. Such automa ..."
Abstract
-
Cited by 90 (15 self)
- Add to MetaCart
(Show Context)
Recently, the Web has been rapidly “deepened” by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some “concerted structure,” by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar, and thus turns their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.
Structured databases on the web: Observations and implications
- SIGMOD Record
, 2004
"... The Web has been rapidly “deepened ” by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this “deep Web ” of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored fronti ..."
Abstract
-
Cited by 86 (25 self)
- Add to MetaCart
(Show Context)
The Web has been rapidly “deepened” by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this “deep Web” of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our “macro” study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our “micro” study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How “hidden” are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
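A toy version of the “macro” IP-sampling idea: draw random IPv4 addresses, fetch the root page of any host that answers, count hosts that serve an HTML form, and extrapolate to the full address space. The use of the requests library, the sample size, and the crude form detector are assumptions for illustration; the actual survey crawled responding servers more thoroughly and distinguished genuine query interfaces from ordinary forms.

```python
# Toy version of the "macro" IP-sampling idea (not the paper's crawler): draw
# random IPv4 addresses, fetch the root page of any that answer on port 80,
# count how many serve an HTML form, then extrapolate to the address space.
import random
import requests  # assumed third-party HTTP client
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.found = False
    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.found = True

def has_form(html: str) -> bool:
    parser = FormFinder()
    parser.feed(html)
    return parser.found

def sample(n=1000):
    hits = 0
    for _ in range(n):
        ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        try:
            resp = requests.get(f"http://{ip}/", timeout=3)
            if has_form(resp.text):
                hits += 1
        except requests.RequestException:
            pass  # most random IPs run no web server at all
    print(f"{hits}/{n} sampled IPs served a page with a form")
    print("extrapolated form-serving hosts:", hits / n * 2**32)

sample()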
Downloading textual hidden web content through keyword queries
- In JCDL
, 2005
"... An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there ar ..."
Abstract
-
Cited by 80 (4 self)
- Add to MetaCart
(Show Context)
An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only “entry point” to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.
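The iterative policy can be sketched with a simplified selection rule: after each query, the next keyword is simply the most frequent term in the downloaded documents that has not been issued yet. The `download_results` hook, the seed query, and the term filter are assumptions; the paper's adaptive policies instead estimate how many new documents each candidate query is likely to return.

```python
# Simplified sketch of an iterative query-generation policy (not the paper's
# exact policy, which weighs estimated new coverage against query cost): issue
# a query, add the returned documents to the corpus, then pick the most
# frequent term we have not issued yet as the next query.
import re
from collections import Counter

def crawl_hidden_site(download_results, seed="data", max_queries=100):
    """`download_results(query)` is a site-specific hook returning the text of
    the documents matched by `query` -- an assumed hook, not a real library call."""
    issued, corpus = {seed}, []
    query = seed
    for _ in range(max_queries):
        docs = download_results(query)
        if not docs:
            break
        corpus.extend(docs)
        term_counts = Counter(
            w for doc in corpus for w in re.findall(r"[a-z]{3,}", doc.lower()))
        candidates = [(c, t) for t, c in term_counts.items() if t not in issued]
        if not candidates:
            break
        _, query = max(candidates)  # most frequent not-yet-issued term
        issued.add(query)
    return corpus, issued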
Google’s Deep-Web Crawl
, 2008
"... The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper ..."
Abstract
-
Cited by 80 (6 self)
- Add to MetaCart
The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
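The Cartesian-product concern can be illustrated with a small sketch: instead of enumerating every combination of inputs, only low-dimensional query templates are tried, and a template is kept only if its submissions yield enough distinct result pages. The `fetch` hook, the dimension cap, and the distinctness ratio are illustrative stand-ins for the paper's informativeness test and input-type algorithms.

```python
# Rough sketch of the "don't enumerate the full Cartesian product" idea (the
# paper's informativeness test and input-type detection are far more involved):
# try small query templates, and keep a template only if its submissions
# produce enough distinct result pages.
from itertools import combinations, product

def informative_templates(inputs, fetch, max_dim=2, min_distinct_ratio=0.25):
    """`inputs` maps an input name to its candidate values; `fetch(assignment)`
    returns the result page for one form submission (hypothetical hook)."""
    kept = []
    for dim in range(1, max_dim + 1):
        for names in combinations(inputs, dim):
            submissions = list(product(*(inputs[n] for n in names)))
            pages = {hash(fetch(dict(zip(names, values)))) for values in submissions}
            if len(pages) >= min_distinct_ratio * len(submissions):
                kept.append(names)   # distinct enough: worth surfacing
    return kept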
Siphoning hidden-web data through keyword-based interfaces
- In SBBD
, 2004
"... In this paper, we study the problem of automating the retrieval of data hidden behind simple search interfaces that accept keyword-based queries. Our goal is to automatically retrieve all available results (or, as many as possible). We propose a new approach to siphon hidden data that automatically ..."
Abstract
-
Cited by 73 (13 self)
- Add to MetaCart
(Show Context)
In this paper, we study the problem of automating the retrieval of data hidden behind simple search interfaces that accept keyword-based queries. Our goal is to automatically retrieve all available results (or as many as possible). We propose a new approach to siphon hidden data that automatically generates a small set of representative keywords and builds queries which lead to high coverage. We evaluate our algorithms over several real Web sites. Preliminary results indicate our approach is effective: coverage of over 90% is obtained for most of the sites considered.
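The keyword-selection step has a natural greedy reading, sketched below: given a sample of result documents, repeatedly pick the term that covers the most still-uncovered documents. The sample corpus and the term filter are assumptions made for illustration; the paper derives its candidate keywords from pages retrieved through the interface itself.

```python
# Sketch of a greedy (set-cover style) keyword selection, illustrative only:
# repeatedly pick the term that covers the most still-uncovered sample docs.
import re

def representative_keywords(sample_docs, max_keywords=10):
    docs = [set(re.findall(r"[a-z]{3,}", d.lower())) for d in sample_docs]
    uncovered = set(range(len(docs)))
    keywords = []
    while uncovered and len(keywords) < max_keywords:
        vocab = set().union(*(docs[i] for i in uncovered))
        if not vocab:
            break
        best = max(vocab, key=lambda t: sum(1 for i in uncovered if t in docs[i]))
        keywords.append(best)
        uncovered -= {i for i in uncovered if best in docs[i]}
    return keywords

# Hypothetical sampled result snippets:
print(representative_keywords([
    "deep web databases hide data behind forms",
    "keyword queries can retrieve hidden data",
    "forms accept keyword queries for databases",
]))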
QProber: A system for automatic classification of hidden-web databases
- ACM TOIS
, 2003
"... The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers. ” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. ..."
Abstract
-
Cited by 63 (13 self)
- Add to MetaCart
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web “crawlers.” Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
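In sketch form, the probe-counting idea looks like the following: each category has a handful of probe queries, the database reports how many documents match each probe, and the category with the strongest support wins. The probe lists, the `count_matches` hook, and the single-score decision are illustrative; QProber's probes are extracted from trained document classifiers and its decisions use coverage and specificity thresholds over a topic hierarchy.

```python
# Minimal sketch of probe-based classification (QProber's probes come from
# trained document classifiers and it uses coverage/specificity thresholds;
# the probes and the single-score decision below are illustrative only).
def classify(count_matches, probes_by_category, min_matches=50):
    """`count_matches(query)` returns the number of hits the database reports
    for `query` (hypothetical hook, e.g. scraped from the result page)."""
    scores = {
        category: sum(count_matches(q) for q in queries)
        for category, queries in probes_by_category.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_matches else "Root"  # too generic to place

probes = {
    "Sports":  ["basketball", "nba playoffs", "soccer league"],
    "Health":  ["cancer treatment", "diabetes symptoms", "cardiology"],
    "Finance": ["stock market", "mutual funds", "interest rates"],
}
# classify(some_site_counter, probes) would return the best-supported category.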
Crawling Ajax by Inferring User Interface State Changes
- Proc. Eighth Int’l Conf. Web Eng
, 2008
"... AJAX is a very promising approach for improving rich interactivity and responsiveness of web applications. At the same time, AJAX techniques shatter the metaphor of a web ‘page ’ upon which general search crawlers are based. This paper describes a novel technique for crawling AJAX ap-plications thro ..."
Abstract
-
Cited by 61 (10 self)
- Add to MetaCart
(Show Context)
AJAX is a very promising approach for improving rich interactivity and responsiveness of web applications. At the same time, AJAX techniques shatter the metaphor of a web ‘page’ upon which general search crawlers are based. This paper describes a novel technique for crawling AJAX applications through dynamic analysis and reconstruction of user interface state changes. Our method dynamically infers a ‘state-flow graph’ modeling the various navigation paths and states within an AJAX application. This reconstructed model can be used to generate linked static pages. These pages could be used to expose AJAX sites to general search engines. Moreover, we believe that the crawling techniques that are part of our solution have other applications, such as within general search engines, accessibility ...
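The state-flow-graph construction can be sketched abstractly: starting from the initial DOM, fire every candidate clickable in each discovered state, fingerprint the resulting DOM, and add an edge whenever the fingerprint changes. The `clickables` and `click` hooks are hypothetical stand-ins for the browser layer, and the hash-based state comparison is a simplification of the richer DOM comparison used in the paper.

```python
# Abstract sketch of state-flow-graph inference (the paper drives a real
# browser and compares DOMs more cleverly; `initial_dom`, `clickables`, and
# `click` are hypothetical hooks standing in for the browser layer).
import hashlib

def fingerprint(dom: str) -> str:
    return hashlib.sha1(dom.encode()).hexdigest()[:12]

def infer_state_flow_graph(initial_dom, clickables, click, max_states=100):
    """Breadth-first exploration: for each discovered state, fire every
    candidate clickable and add an edge when the DOM fingerprint changes."""
    start = fingerprint(initial_dom)
    doms = {start: initial_dom}        # state id -> representative DOM
    edges = set()                      # (from_state, element, to_state)
    frontier = [start]
    while frontier and len(doms) < max_states:
        state = frontier.pop(0)
        for element in clickables(doms[state]):
            new_dom = click(doms[state], element)  # replay path + click in a browser
            target = fingerprint(new_dom)
            if target != state:
                edges.add((state, element, target))
                if target not in doms:
                    doms[target] = new_dom
                    frontier.append(target)
    return doms, edges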