
QProber: A System for Automatic Classification of Hidden-Web Databases

by L Gravano, P G Ipeirotis, M Sahami

Results 21 - 30 of 35

Sampling, Information Extraction and Summarisation of Hidden Web Databases

by Yih-Ling Hedley, Muhammad Younas, Anne James, Mark Sanderson
Cited by 1 (0 self)
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users’ queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from databases. The proposed system, 2PS, is based on a two-phase framework for the sampling, extraction and summarisation of Hidden Web documents. In the first phase, 2PS queries databases with random terms selected from those contained in their search interface pages and the subsequently retrieved documents; this phase retrieves a pre-determined number of sampled documents. In the second phase, it detects Web page templates from the sampled documents in order to extract information relevant to the respective queries, from which a content summary is generated. 2PS is validated through the implementation of a prototype system. Its evaluation is performed through experiments on a number of real-world Hidden Web databases. The experimental results demonstrate that 2PS effectively eliminates irrelevant information contained in Web page templates and generates terms and frequencies with improved accuracy.
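
The first phase described in this abstract (probing a database with random terms drawn from its search interface page and from the documents retrieved so far, until a fixed number of documents has been sampled) can be illustrated with a minimal Python sketch. This is not the 2PS implementation: the `search` callable, the `top_k` cut-off, the `max_queries` limit, the tokenizer, and the toy database are all assumptions made for illustration.

```python
import random
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def sample_documents(search, seed_terms, target_docs, top_k=4, max_queries=200, seed=0):
    """Sample up to `target_docs` documents by probing with random terms.

    `search(term)` is a hypothetical callable returning a list of
    (doc_id, text) pairs; `seed_terms` stands in for the terms found on
    the database's search interface page.
    """
    rng = random.Random(seed)
    vocabulary = sorted(set(seed_terms))  # terms available for probing
    known = set(vocabulary)
    sampled = {}
    for _ in range(max_queries):
        if len(sampled) >= target_docs or not vocabulary:
            break
        term = rng.choice(vocabulary)
        for doc_id, text in search(term)[:top_k]:
            if doc_id in sampled:
                continue
            sampled[doc_id] = text
            # Terms from newly retrieved documents become future probes.
            for word in tokenize(text):
                if word not in known:
                    known.add(word)
                    vocabulary.append(word)
            if len(sampled) >= target_docs:
                break
    return sampled


# Toy stand-in for a Hidden Web database (illustrative data only).
TOY_DB = {
    1: "heart disease treatment",
    2: "heart surgery recovery",
    3: "diabetes treatment options",
    4: "surgery recovery diet",
}


def toy_search(term):
    """Return every toy document that contains the query term."""
    return [(i, t) for i, t in sorted(TOY_DB.items()) if term in tokenize(t)]


sample = sample_documents(toy_search, ["heart"], target_docs=3)
```

The second phase (template detection and summary generation) is not covered by this sketch.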

Discovering and Ranking Data Intensive Web Services: A Source-Biased Approach

by James Caverlee et al., 2003
Cited by 1 (1 self)
This paper presents a novel source-biased approach to automatically discover and rank relevant data intensive web services. It supports a service-centric view of the Web through source-biased probing and source-biased relevance detection and ranking metrics. Concretely, our approach is capable of answering source-centric queries by focusing on the nature and degree of the topical relevance of one service to others. This source-biased probing allows us to determine in very few interactions whether a target service is relevant to the source by probing the target with very precise probes and then ranking the relevant services discovered based on a set of metrics we define. Our metrics allow us to determine the nature and degree of the relevance of one service to another. We also introduce a performance enhancement to our basic approach called source-biased probing with focal terms. We also extend the basic probing framework to a more generalized service neighborhood graph model. We discuss the semantics of the neighborhood graph, how we may reason about the relationships among multiple services, and how we rank services based on the service neighborhood graph model. We also report initial experiments to show the effectiveness of our approach.

Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

by James Caverlee, Ling Liu, David Buttler, 2004
Cited by 1 (0 self)
In this paper, we introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the Deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets.

Crawling the Content Hidden Behind Web Forms

by Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, O Bellas, Víctor Carneiro
Cited by 1 (0 self)
The crawler engines of today cannot reach most of the information contained in the Web. A great amount of valuable information is “hidden” behind the query forms of online databases, and/or is dynamically generated by technologies such as JavaScript. This portion of the web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each one describing a specific data-collecting task, and automatically identifies and learns to execute queries on the forms relevant to them. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it with several real-world data collection tasks.

Querying Large Text Databases for Efficient Information Extraction

by Eugene Agichtein, Luis Gravano
A wealth of data is hidden within unstructured text. This data is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database. This exhaustive approach is not practical, or sometimes even feasible, for large databases. In this paper, we develop an efficient query-based technique to identify documents that are potentially useful for the extraction of a target relation. We start by sampling the database to characterize the documents from which an information extraction system manages to extract relevant tuples. Then, we apply machine learning and information retrieval techniques to derive queries likely to match additional useful documents in the database. Finally, we issue these queries to the database to retrieve documents from which the information extraction system can extract the final relation. Our technique requires that databases support only a minimal boolean query interface, and is independent of the choice of the underlying information extraction system. We report a thorough experimental evaluation over more than one million documents that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents. Our proposed technique could be used to query a standard web search engine, hence providing a building block for efficient information extraction over the web at large.

Distributed Information Retrieval using Keyword Auctions

by Djoerd Hiemstra
This report motivates the need for large-scale distributed approaches to information retrieval, and proposes ...

Efficient Web-Based Linkage of Short to Long Forms

by Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee
Abbreviations, acronyms, initialisms, and shortenings frequently occur in many texts found on the Web, such as publication metadata, stock ticker codes, and biological articles. To connect these disparate forms together for knowledge discovery, short forms must be properly linked to their canonical long forms. In this paper, we demonstrate how a search engine can be efficiently utilized in mining the required contextual information, so that short forms can be effectively linked to long forms. We show that a count-based method consistently outperforms other methods, and that using the snippets is better than using the full web pages. We also consider adaptively combining a query probing algorithm together with our count-based method. This reduces running time and network bandwidth, while maintaining the strong linkage performance.
Keywords: abbreviation matching, web as information resource, query probing, record linkage

Efficient Algorithms for Clustering and Classifying High Dimensional Text and Discretized Data using Interesting Patterns

by Hassan H. Malik, 2008
Recent advances in data mining allow for exploiting patterns as the primary means for clustering and classifying large collections of data. In this thesis, we present three advances in pattern-based clustering technology, an advance in semi-supervised pattern-based classification, and a related advance in pattern frequency counting. In our first contribution, we analyze numerous deficiencies with traditional pattern significance measures such as support and confidence, and propose a web image clustering algorithm that uses an objective interestingness measure to identify significant patterns, yielding measurably better clustering quality. In our second contribution, we introduce the notion of closed interesting itemsets, and show that these itemsets provide significant dimensionality reduction over frequent and closed frequent itemsets. We propose GPHC, a sub-linearly scalable global pattern-based hierarchical clustering algorithm that uses closed interesting itemsets, and show that this algorithm achieves up to 11% better F-scores and up to 5 times better entropies as compared to state-of-the-art agglomerative, partitioning-based, and pattern-based hierarchical clustering algorithms on 9 common datasets.

Quality of language models for distributed information retrieval

by Paul Thomas, 2009
Collections used in distributed information retrieval (DIR) are often described by unigram language models, composed of simple term-probability statistics. In most cases, this information is not directly available from constituent collections and must be estimated by the DIR tool itself from a sample of documents. Factors affecting the quality of such estimates are not well understood, nor is the impact of estimate quality. Several measures of quality for unigram language models have been described, and three are used here to investigate how the quality of a model changes given document samples of differing size or quality. I show that although all models improve given larger samples, those built with more biased samples are of significantly lower quality; and that one of the three measures, Kullback-Leibler divergence, best describes model quality. Finally, it is shown that model quality has an impact on the effectiveness of standard server selection algorithms.
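
As a rough illustration of the Kullback-Leibler divergence measure named in this abstract, the Python sketch below compares a unigram language model built from a whole collection with models estimated from document samples. It is not the procedure used in the report: the whitespace tokenizer, the additive smoothing constant `epsilon`, and the toy collection are assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_model(documents):
    """Estimate term probabilities from a list of document strings."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def kl_divergence(reference, estimate, epsilon=1e-6):
    """Return D_KL(reference || estimate) in bits.

    The estimate is additively smoothed over the joint vocabulary
    (an assumption of this sketch) so that terms missing from the
    sample do not produce a division by zero.
    """
    vocabulary = set(reference) | set(estimate)
    norm = sum(estimate.get(w, 0.0) for w in vocabulary) + epsilon * len(vocabulary)
    total = 0.0
    for word, p in reference.items():
        if p > 0.0:
            q = (estimate.get(word, 0.0) + epsilon) / norm
            total += p * math.log2(p / q)
    return total


# Toy collection (illustrative data only).
collection = ["apple banana apple", "banana cherry", "cherry apple date"]

full = unigram_model(collection)        # model of the whole collection
small = unigram_model(collection[:1])   # model from a one-document sample
larger = unigram_model(collection[:2])  # model from a two-document sample

kl_small = kl_divergence(full, small)
kl_larger = kl_divergence(full, larger)
```

On this toy data the two-document sample yields a lower divergence than the one-document sample, in line with the abstract's observation that models improve with larger samples.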

and

by Weifeng Su, Frederick H. Lochovsky
Online databases respond to a user query with result records encoded in HTML files. Data extraction, which is important for many applications, extracts the records from the HTML files automatically. We present a novel data extraction method, ODE (Ontology-assisted Data Extraction), which automatically extracts the query result records from the HTML pages. ODE first constructs an ontology for a domain according to information matching between the query interfaces and query result pages from different web sites within the same domain. Then, the constructed domain ontology is used during data extraction to identify the query result section in a query result page and to align and label the data values in the extracted records. The ontology-assisted data extraction method is fully automatic and overcomes many of the deficiencies of current automatic data extraction methods. Experimental results show that ODE is extremely accurate for identifying the query result section in an HTML page, segmenting the query result section into query result records, and aligning and labeling the data values in the query result records.

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University