Results 1-10 of 34
A new succinct representation of RMQ-information and improvements in the enhanced suffix array
Proc. ESCAPE, LNCS, 2007
Cited by 52 (15 self)
The Range-Minimum-Query-Problem is to preprocess an array of length n in O(n) time such that all subsequent queries asking for the position of a minimal element between two specified indices can be answered in constant time. This problem was first solved by Berkman and Vishkin [1], and Sadakane [2] gave the first succinct data structure that uses 4n + o(n) bits of additional space. In practice, this method has several drawbacks: it needs O(n log n) bits of intermediate space when constructing the data structure, and it builds on previous results on succinct data structures. We overcome these problems by giving the first algorithm that never uses more than 2n + o(n) bits, and does not rely on rank- and select-queries or other succinct data structures. We stress the importance of this result by simplifying and reducing the space consumption of the Enhanced Suffix Array [3], while retaining its capability of simulating top-down traversals of the suffix tree, used, e.g., to locate all occ positions of a pattern p in a text in optimal O(|p| + occ) time (assuming constant alphabet size). We further prove a lower bound of 2n − o(n) bits, which makes our algorithm asymptotically optimal.
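To make the query semantics in the abstract above concrete, here is a minimal sketch of range minimum queries using the classic sparse-table technique. This is not the succinct 2n + o(n)-bit structure the paper describes: the table below takes O(n log n) words and O(n log n) preprocessing time, and it only illustrates what a constant-time query returns. The function names `build_sparse_table` and `rmq` are illustrative assumptions, not taken from the paper.

```python
def build_sparse_table(a):
    """Return table where table[k][i] is the index of the leftmost
    minimum of a[i : i + 2**k]. Uses O(n log n) words (not succinct)."""
    n = len(a)
    table = [list(range(n))]  # level 0: every window of length 1
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            # On ties, prefer the left index so the result is leftmost.
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def rmq(a, table, i, j):
    """Index of the leftmost minimum in a[i..j] (0-based, inclusive, i <= j).
    Combines two overlapping power-of-two windows that cover the range."""
    k = (j - i + 1).bit_length() - 1
    left = table[k][i]
    right = table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right
```

For example, with `a = [3, 1, 4, 1, 5, 9, 2, 6]` and `t = build_sparse_table(a)`, the call `rmq(a, t, 2, 4)` returns the position of the minimum of the subarray `[4, 1, 5]`.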
Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
2009
Cited by 47 (3 self)
Given a static array of n totally ordered objects, the range minimum query problem is to build an additional data structure that allows answering subsequent online queries of the form “what is the position of a minimum element in the subarray ranging from i to j?” efficiently. We focus on two settings, where (1) the input array is available at query time, and (2) the input array is only available at construction time. In setting (1), we show new data structures (a) of (n / c(n)) (2 + o(1)) bits and query time O(c(n)), or (b) with O(nH_k) + o(n) bits and O(1) query time, where H_k denotes the empirical entropy of k-th order of the input array. In setting (2), we give a data structure of optimal size 2n + o(n) bits and query time O(1). All data structures can be constructed in linear time and almost in-place.
Faster Entropy-Bounded Compressed Suffix Trees
2009
Cited by 31 (15 self)
Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas like bioinformatics. Their main problem is space usage, which has triggered much research striving for compressed representations that are still functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, which is the first achieving at the same time sublogarithmic complexity for the operations, and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, totally getting rid of the suffix tree topology, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
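The next/previous smaller value primitive named in the abstract above can be illustrated with a short sketch. The paper is about compressed structures that answer such queries; the code below instead computes all answers offline with a monotonic stack in O(n) time, purely to show what the primitive returns. The function names and the sentinel conventions (`-1` for "no previous smaller value", `len(a)` for "no next smaller value") are assumptions made for this example.

```python
def previous_smaller_values(a):
    """psv[i] = largest j < i with a[j] < a[i], or -1 if none exists."""
    psv = [-1] * len(a)
    stack = []  # indices whose values are strictly increasing
    for i, x in enumerate(a):
        while stack and a[stack[-1]] >= x:
            stack.pop()
        if stack:
            psv[i] = stack[-1]
        stack.append(i)
    return psv


def next_smaller_values(a):
    """nsv[i] = smallest j > i with a[j] < a[i], or len(a) if none exists."""
    n = len(a)
    nsv = [n] * n
    stack = []  # indices to the right whose values are strictly increasing
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        if stack:
            nsv[i] = stack[-1]
        stack.append(i)
    return nsv
```

Applied to an array of longest-common-prefix values, the pair (psv[i], nsv[i]) delimits the maximal interval around i whose values are all at least a[i], which is the kind of interval that such suffix tree simulations navigate.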
Random Access to Grammar-Compressed Strings
2011
Cited by 31 (3 self)
Let S be a string of length N compressed into a context-free grammar 𝒮 of size n. We present two representations of 𝒮 achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the k-th row of Ackermann’s function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k^4 + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two “biased” weighted ancestor data structures, and a compact representation of heavy paths in grammars.
Linear-Space Data Structures for Range Mode Query in Arrays
Cited by 18 (8 self)
A mode of a multiset S is an element a ∈ S of maximum multiplicity; that is, a occurs at least as frequently as any other element in S. Given an array A[1:n] of n elements, we consider a basic problem: constructing a static data structure that efficiently answers range mode queries on A. Each query consists of an input pair of indices (i, j) for which a mode of A[i:j] must be returned. The best previous data structure with linear space, by Krizanc, Morin, and Smid (ISAAC 2003), requires O(√n log log n) query time. We improve their result and present an O(n)-space data structure that supports range mode queries in O(√(n / log n)) worst-case time. Furthermore, we present strong evidence that a query time significantly below √n cannot be achieved by purely combinatorial techniques; we show that boolean matrix multiplication of two √n × √n matrices reduces to n range mode queries in an array of size O(n). Additionally, we give linear-space data structures for orthogonal range mode in higher dimensions (queries in near O(n^{1−1/(2d)}) time) and for half-space range mode in higher dimensions (queries in O(n^{1−1/d^2}) time).
Modeling and Querying Possible Repairs in Duplicate Detection
Cited by 17 (2 self)
One of the most prominent data quality problems is the existence of duplicate records. Current duplicate elimination procedures usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. Furthermore, replacing the input dirty data with one possible clean instance may result in unrecoverable errors, for example, identification and merging of possible duplicate records in health care systems. In this paper, we treat duplicate detection procedures as data processing tasks with uncertain outcomes. We concentrate on a family of duplicate detection algorithms that are based on parameterized clustering. We propose a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. We show how to efficiently support relational queries under our model, and to allow new types of queries on the set of possible repairs. We give an experimental study illustrating the scalability and the efficiency of our techniques in different configurations.
On Space Efficient Two Dimensional Range Minimum Data Structures
Cited by 17 (4 self)
The two dimensional range minimum query problem is to preprocess a static two dimensional m by n array A of size N = m · n, such that subsequent queries, asking for the position of the minimum element in a rectangular range within A, can be answered efficiently. We study the trade-off between the space and query time of the problem. We show that every algorithm enabled to access A during the query and using O(N/c) bits additional space requires Ω(c) query time, for any c where 1 ≤ c ≤ N. This lower bound holds for any dimension. In particular, for the one dimensional version of the problem, the lower bound is tight up to a constant factor. In two dimensions, we complement the lower bound with an indexing data structure of size O(N/c) bits additional space which can be preprocessed in O(N) time and achieves O(c log^2 c) query time. For c = O(1), this is the first O(1) query time algorithm using optimal O(N) bits additional space. For the case where queries cannot probe A, we give a data structure of size O(N · min{m, log n}) bits with O(1) query time, assuming m ≤ n. This leaves a gap to the lower bound of Ω(N log m) bits for this version of the problem.
Optimal string mining under frequency constraints
PKDD, 2006
Cited by 13 (4 self)
We propose a new algorithmic framework that solves frequency-related data mining queries on databases of strings in optimal time, i.e., in time linear in the input and the output size. The additional space is linear in the input size. Our framework can be used to mine frequent strings, emerging strings and strings that pass other statistical tests, e.g., the χ²-test. In contrast to the presented result for strings, no optimal algorithms are known for other pattern domains such as itemsets. Key to our approach are several recent results on index structures for strings, among them suffix- and lcp-arrays, and a new preprocessing scheme for range minimum queries. The advantages of array-based data structures (compared with dynamic data structures such as trees) are good locality behavior and extensibility to secondary memory. We test our algorithm on real-world data from computational biology and demonstrate that the approach also works well in practice.
Two-Dimensional Range Minimum Queries
2007
Cited by 13 (2 self)
We consider the two-dimensional Range Minimum Query problem: for a static (m × n)-matrix of size N = mn which may be preprocessed, answer online queries of the form “where is the position of a minimum element in an axis-parallel rectangle?”. Unlike the one-dimensional version of this problem which can be solved in provably optimal time and space, the higher-dimensional case has received much less attention. The only result we are aware of is due to Gabow, Bentley and Tarjan [1], who solve the problem in O(N log N) preprocessing time and space and O(log N) query time. We present a class of algorithms which can solve the 2-dimensional RMQ-problem with O(kN) additional space, O(N log^{[k+1]} N) preprocessing time and O(1) query time for any k > 1, where log^{[k+1]} denotes the iterated application of k + 1 logarithms. The solution converges towards an algorithm with O(N log* N) preprocessing time and space and O(1) query time. All these algorithms are significant improvements over the previous results: query time is optimal, preprocessing time is quasi-linear in the input size, and space is linear. While this paper is of theoretical nature, we believe that our algorithms will turn out to have applications in different fields of computer science, e.g., in computational biology.
Lempel–Ziv Factorization Using Less Time & Space
2008
Cited by 11 (2 self)
For 30 years the Lempel–Ziv factorization LZ_x of a string x = x[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally the standard method for computing LZ_x was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree ST_x of x. Recently Abouelhoda et al. proposed an efficient Lempel–Ziv factorization algorithm based on an “enhanced” suffix array, that is, a suffix array SA_x together with supporting data structures, principally an “interval tree”. In this paper we introduce a collection of fast space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family out of this collection achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether. Mathematics Subject Classification (2000). 68W05. Keywords. Lempel–Ziv factorization, suffix array, suffix tree, LZ factorization.
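As a point of reference for what an LZ factorization is, the following is a deliberately naive sketch. It assumes one common variant of the definition (each factor is either a single letter that has not been matched earlier, or the longest prefix of the remaining text that also starts at an earlier position, with overlaps allowed); the paper may use a slightly different convention. It runs in quadratic time or worse and uses none of the suffix-array machinery the abstract describes. The name `lz_factorize` is a hypothetical choice for this example.

```python
def lz_factorize(x):
    """Naive LZ factorization of string x (illustrative, not linear time).

    Each factor is the longest prefix of x[i:] that also occurs starting
    at some earlier position j < i (the two occurrences may overlap),
    or a single character if no such non-empty prefix exists.
    """
    factors = []
    i, n = 0, len(x)
    while i < n:
        best = 0
        for j in range(i):  # try every earlier starting position
            length = 0
            # j + length < i + length < n, so both indices stay in range.
            while i + length < n and x[j + length] == x[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh letter forms a factor of length 1
        factors.append(x[i:i + step])
        i += step
    return factors
```

Concatenating the returned factors always reproduces the input, which is a convenient sanity check when comparing against a faster implementation.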