Results 1  10
of
47
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
 CoRR
"... Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text coll ..."
Abstract

Cited by 14 (9 self)
 Add to MetaCart
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language ” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
SpaceEfficient Topk Document Retrieval
"... Supporting topk document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usag ..."
Abstract

Cited by 13 (8 self)
 Add to MetaCart
(Show Context)
Supporting topk document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reducedspace structures that support topk retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff.
Succinct representations of binary trees for range minimum queries, in:
 of Lecture Notes in Computer Science,
, 2012
"... Abstract. We provide two succinct representations of binary trees that can be used to represent the Cartesian tree of an array A of size n. Both the representations take the optimal 2n + o(n) bits of space in the worst case and support range minimum queries (RMQs) in O(1) time. The first one is a m ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
(Show Context)
Abstract. We provide two succinct representations of binary trees that can be used to represent the Cartesian tree of an array A of size n. Both the representations take the optimal 2n + o(n) bits of space in the worst case and support range minimum queries (RMQs) in O(1) time. The first one is a modification of the representation of Farzan and Munro (SWAT 2008); a consequence of this result is that we can represent the Cartesian tree of a random permutation in 1.92n + o(n) bits in expectation. The second one uses a wellknown transformation between binary trees and ordinal trees, and ordinal tree operations to effect operations on the Cartesian tree. This provides an alternative, and more natural, way to view the 2DMinHeap of Fischer and Huen (SICOMP 2011). Furthermore, we show that the preprocessing needed to output the data structure can be performed in linear time using o(n) bits of extra working space, improving the result of Fischer and Heun who use n + o(n) bits working space.
Encoding 2D range maximum queries
 In Proceedings of the International Symposium on Algorithms and Computation (ISAAC), volume 7074 of Lecture Notes in Computer Science
, 2011
"... ar ..."
(Show Context)
Spaceefficient data structures for topk completion
 IN: PROCEEDINGS OF THE 22ST WORLD WIDE WEB CONFERENCE (WWW) (2013
"... Virtually every modern search application, either desktop, web, or mobile, features some kind of query autocompletion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores acco ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
(Show Context)
Virtually every modern search application, either desktop, web, or mobile, features some kind of query autocompletion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different triebased data structures to address this problem, each one with different space/time/ complexity tradeoffs. Experiments on largescale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip’ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
Efficient inmemory topk document retrieval
, 2012
"... For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many eff ..."
Abstract

Cited by 9 (2 self)
 Add to MetaCart
(Show Context)
For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiencybound search tasks can now easily be supported entirely inmemory as a result of recent hardware advances. In this paper we present a hybrid algorithmic framework for inmemory bagofwords ranked document retrieval using a selfindex derived from the FMIndex, wavelet tree, and the compressed suffix tree data structures, and evaluate the various algorithmic tradeoffs for performing efficient queries entirely inmemory. We compare our approach with two classic approaches to bagofwords queries using inverted indexes, termatatime (TAAT) and documentatatime (DAAT) query processing. We show that our framework is competitive with stateoftheart indexing structures, and describe new capabilities provided by our algorithms that can be leveraged by future systems to improve effectiveness and efficiency for a variety of fundamental search operations.
Faster Compact Topk Document Retrieval
"... An optimal index solving topk document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n– 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is u ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
(Show Context)
An optimal index solving topk document retrieval [Navarro and Nekrich, SODA’12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n– 3n bytes, with O(m+(k+log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5 % more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Encodings for Range Selection and Topk Queries
"... Abstract. We study the problem of encoding the positions the topk elements of an array A[1..n] for a given parameter 1 ≤ k ≤ n. Specifically, for any i and j, we wish create a data structure that reports the positions of the largest k elements in A[i..j] in decreasing order, without accessing A at ..."
Abstract

Cited by 8 (6 self)
 Add to MetaCart
(Show Context)
Abstract. We study the problem of encoding the positions the topk elements of an array A[1..n] for a given parameter 1 ≤ k ≤ n. Specifically, for any i and j, we wish create a data structure that reports the positions of the largest k elements in A[i..j] in decreasing order, without accessing A at query time. This is a natural extension of the wellknown encoding rangemaxima query problem, where only the position of the maximum in A[i..j] is sought, and finds applications in document retrieval and ranking. We give (sometimes tight) upper and lower bounds for this problem and some variants thereof. 1
Inducing suffix and LCP arrays in external memory
 In Proc. ALENEX
, 2013
"... We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
(Show Context)
We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/Ovolume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments). 1
Making queries tractable on big data with preprocessing
 In VLDB
, 2013
"... A query class is traditionally considered tractable if there exists a polynomialtime (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
(Show Context)
A query class is traditionally considered tractable if there exists a polynomialtime (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data offline, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Πtractable queries, denoted by ΠT0Q, to characterize classes of queries that can be answered in parallel polylogarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Πtractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Πtractable queries by refactorizing its data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠT0Q ⊂ P unless P = NC, i.e., the set ΠT0Q of all Πtractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Πtractable via proper refactorizations. This work is a step towards understanding the tractability of queries in the context of big data. 1.