This paper presents the novel SphereSearch Engine, which provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure conditions, text content conditions, and relevance ranking based on IR statistics and statistically quantified ontological relationships. Web pages in HTML or PDF are automatically converted into XML format, with the option of generating semantic tags by means of linguistic annotation tools. For Web data, the XML-oriented query engine is leveraged to provide rich search options that cannot be expressed in traditional Web search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch Engine are demonstrated by experiments with a large, richly tagged but non-schematic open encyclopedia extended with external documents.