Results 1-9 of 9
RCSI: Scalable similarity search in thousand(s) of genomes
Cited by 7 (1 self)
Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k of a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some k-approximate occurrences. We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.
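The referential compression idea in this abstract can be illustrated with a small sketch. It is not the RCSI implementation: the entry format, the function names, and the greedy k-mer-seeded matching (with seed length `k`) are assumptions chosen for illustration. A genome is encoded as matches against the reference plus literal characters for the individual differences.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def compress(target, reference, k=4):
    """Greedily encode target as ('M', ref_pos, length) and ('L', char) entries."""
    index = build_kmer_index(reference, k)
    entries, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer (hypothetical seeding).
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            entries.append(("M", best_pos, best_len))  # copy from the reference
            i += best_len
        else:
            entries.append(("L", target[i]))           # individual difference
            i += 1
    return entries

def decompress(entries, reference):
    """Rebuild the original string from the entries and the reference."""
    parts = []
    for entry in entries:
        if entry[0] == "M":
            parts.append(reference[entry[1]:entry[1] + entry[2]])
        else:
            parts.append(entry[1])
    return "".join(parts)

reference = "ACGTACGTTTGACCA"
target = "ACGTACGATTGACCA"   # differs from the reference by one substitution
entries = compress(target, reference)
print(entries)               # [('M', 0, 7), ('L', 'A'), ('M', 8, 7)]
```

With highly similar genomes, most of the target collapses into a few long match entries, which is why searching "the reference and all genome-specific individual differences" can be far cheaper than searching each genome separately.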
Grammar Compressed Sequences with Rank/Select Support
Cited by 5 (3 self)
Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need to represent highly repetitive sequences arises, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
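For readers unfamiliar with the operations named here, the following sketch shows only their semantics on an uncompressed sequence. It is not the grammar-based representation from the paper; the class name `RankSelect` and the per-symbol position lists are assumptions made for illustration, and positions are 0-based.

```python
from bisect import bisect_left

class RankSelect:
    """Plain (uncompressed) sequence with access, rank and select."""

    def __init__(self, seq):
        self.seq = seq
        self.positions = {}                      # symbol -> sorted positions
        for i, c in enumerate(seq):
            self.positions.setdefault(c, []).append(i)

    def access(self, i):
        """Return the symbol at position i."""
        return self.seq[i]

    def rank(self, c, i):
        """Number of occurrences of c strictly before position i."""
        return bisect_left(self.positions.get(c, []), i)

    def select(self, c, j):
        """Position of the j-th occurrence of c (j starts at 1)."""
        return self.positions[c][j - 1]

rs = RankSelect("abracadabra")
print(rs.access(4), rs.rank("a", 4), rs.select("a", 3))   # c 2 5
```

The position lists take space proportional to the sequence length, which is exactly what compressed representations try to avoid on repetitive inputs.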
Lightweight Lempel-Ziv parsing
, 2013
Cited by 4 (2 self)
We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d ≥ 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
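As a reminder of what an LZ77 factorization is, here is a naive quadratic-time sketch. It is not the lightweight algorithm of the paper (it makes no attempt at O(n/d) working space or O(dn) time), and the factor encoding, `(source_position, length)` for copies and `(character, 0)` for new symbols, is an assumption chosen for illustration.

```python
def lz77_factorize(text):
    """Return LZ77 factors: (source_pos, length) or (char, 0) for a new symbol."""
    factors, i, n = [], 0, len(text)
    while i < n:
        best_pos, best_len = -1, 0
        for j in range(i):                       # every earlier start position
            length = 0
            # Matches may overlap the current position (self-referential copies).
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        if best_len == 0:
            factors.append((text[i], 0))         # symbol not seen before
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors

def lz77_decode(factors):
    """Invert lz77_factorize, copying symbol by symbol so overlaps work."""
    out = []
    for first, length in factors:
        if length == 0:
            out.append(first)
        else:
            for t in range(length):
                out.append(out[first + t])
    return "".join(out)

print(lz77_factorize("abababab"))   # [('a', 0), ('b', 0), (0, 6)]
```

On highly repetitive data the number of factors stays small relative to the text length, which is the property the experiments in the abstract refer to.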
Suffix tree of alignment: An efficient index for similar data
 In Proc. International Workshop on Combinatorial Algorithms (IWOCA). LNCS 8288
, 2013
Document Retrieval on Repetitive Collections
Cited by 3 (2 self)
Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space tradeoffs, particularly on repetitive collections.
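A brute-force baseline of the kind mentioned here might look like the sketch below. It assumes that "importance" means term frequency (the number of occurrences of the pattern in a document), which the abstract does not specify; the function names are hypothetical and this is not one of the methods evaluated in the paper.

```python
import heapq

def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Ties are broken in favour of the smaller document id.
    """
    scored = [(count_occurrences(d, pattern), i) for i, d in enumerate(docs)]
    scored = [(f, i) for f, i in scored if f > 0]   # keep matching documents only
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [(i, f) for f, i in best]

docs = ["banana", "bandana", "cabana", "apple"]
print(top_k_documents(docs, "ana", 2))   # [(0, 2), (1, 1)]
```

This scans every document for every query, so its cost grows with the collection size; the tailored indexes discussed in the abstract aim to avoid that scan.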
Suffix Array of Alignment: A Practical Index for Similar Data
Cited by 2 (2 self)
The suffix tree of alignment is an index data structure for similar strings. Given an alignment of similar strings, it stores all suffixes of the alignment, called alignment-suffixes. An alignment-suffix represents one suffix of a string or suffixes of multiple strings starting at the same position in the alignment. In theory, the suffix tree of alignment makes good use of the similarity between strings. However, suffix trees are not widely used in biological applications because of their huge space requirements, and instead suffix arrays are used in practice. In this paper we propose a space-economical version of the suffix tree of alignment, named the suffix array of alignment (SAA). Given an alignment of similar strings, its SAA is a lexicographically sorted list of all the alignment-suffixes of that alignment. The SAA supports pattern search as efficiently as the generalized suffix array. Our experiments show that our index uses only 14% of the space used by the generalized suffix array to index 11 human genome sequences. The space efficiency of our index increases as the number of the genome sequences increases. We also present an efficient algorithm for constructing the SAA.
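For context, the sketch below shows pattern search on an ordinary suffix array, the kind of baseline the SAA is compared against. It is not the SAA itself: alignment-suffixes are not modelled, the construction simply sorts all suffixes (far too slow for genomes), and the function names are assumptions made for illustration.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

All suffixes that start with the pattern form one contiguous range of the array, which is why two binary searches suffice.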
QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings
Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least an index structure which needs roughly N times the storage of a single-string index structure. For highly-similar strings, redundancies can be identified, which do not need to be stored repeatedly; for instance two human genomes have more than 99 percent similarity. In this work, we propose QGramProjector, a new way of indexing many highly-similar strings. In order to remove the redundancies caused by similarities, our proposal is to 1) create all q-grams for a fixed reference, 2) referentially compress all strings in the collection with respect to the reference, and then 3) project all q-grams from the reference to the compressed strings. Experiments show that a complete index can be relatively small compared to the collection of highly-similar strings. For a collection of 1092 human genomes (raw data size is 3 TB), a 16-gram index structure, which can be used for instance as a basis for multi-genome read alignment, only needs 100.5 GB (compression ratio of 31:1). We think that our work is an important step towards analysis of large sets of highly-similar genomes on commodity hardware.
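The three steps listed in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the QGramProjector data structure: compressed strings are assumed to be lists of `('M', ref_pos, length)` match entries and `('L', literal)` entries, q-grams that cross a block boundary or touch a literal are indexed explicitly in a small side index, and all function names are hypothetical.

```python
from collections import defaultdict

def qgram_index(text, q):
    """Step 1: positional q-gram index (q-gram -> sorted start positions)."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index

def decompress(entries, reference):
    """Expand ('M', ref_pos, length) / ('L', literal) entries into a string."""
    return "".join(reference[e[1]:e[1] + e[2]] if e[0] == "M" else e[1]
                   for e in entries)

def build_projection(entries, reference, q):
    """Step 3 setup: match blocks plus an explicit index of uncovered q-grams."""
    text = decompress(entries, reference)         # needed once, at build time
    blocks, covered, off = [], set(), 0
    for e in entries:
        if e[0] == "M":
            blocks.append((off, e[1], e[2]))      # (string offset, ref pos, length)
            covered.update(range(off, off + e[2] - q + 1))
            off += e[2]
        else:
            off += len(e[1])
    extra = defaultdict(list)
    for i in range(len(text) - q + 1):
        if i not in covered:                      # boundary or literal q-gram
            extra[text[i:i + q]].append(i)
    return blocks, extra

def lookup(gram, ref_index, blocks, extra):
    """Positions of gram in the compressed string, projected from the reference."""
    q = len(gram)
    hits = set(extra.get(gram, []))
    for off, r, length in blocks:
        for p in ref_index.get(gram, []):
            if r <= p and p + q <= r + length:    # occurrence lies inside the block
                hits.add(off + p - r)
    return sorted(hits)

reference = "ACGTACGTTTGACCA"
entries = [("M", 0, 7), ("L", "A"), ("M", 8, 7)]  # step 2 output (assumed given)
ref_index = qgram_index(reference, 3)
blocks, extra = build_projection(entries, reference, 3)
print(lookup("ACG", ref_index, blocks, extra))    # [0, 4]
```

Only the reference index and the small side index are stored per collection and per string respectively, so the per-string cost depends on the number of differences rather than on the string length.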
Faster Compressed Suffix Trees for Repetitive Collections
Recent compressed suffix trees targeted to highly repetitive sequence collections reach excellent compression performance, but operation times are very high. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations orders of magnitude faster. Our suffix tree is still orders of magnitude slower than general-purpose compressed suffix trees, but these use several times more space when the collection is repetitive. Our main novelty is a practical grammar-compressed tree representation with full navigation functionality, which is useful in all applications where large trees with repetitive topology must be represented.
Rank, select and access in grammar-compressed strings
, 2014
Given a string S of length N on a fixed alphabet of σ symbols, a grammar compressor produces a context-free grammar G of size n that generates S and only S. In this paper we describe data structures to support the following operations on a grammar-compressed string: rank_c(S, i) (return the number of occurrences of symbol c before position i in S); select_c(S, i) (return the position of the i-th occurrence of c in S); and access(S, i, j) (return substring S[i, j]). For rank and select we describe data structures of size O(nσ log N) bits that support the two operations in O(log N) time. We propose another structure that uses O(nσ log(N/n) (log N)^(1+ε)) bits and that supports the two queries in O(log N / log log N) time, where ε > 0 is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires O(n log N) bits of space and O(log N + m / log_σ N) time to extract m = j − i + 1 consecutive symbols from S. Alternatively, we can achieve O(log N / log log N + m / log_σ N) query time using O(n log(N/n) (log N)^(1+ε)) bits of space. This matches a lower bound stated by Verbin and Yu for strings where N is polynomially related to n.
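To make the setting concrete, the sketch below answers access and rank queries directly on a small grammar without expanding it. It is a simplified illustration, not the data structures of the paper: the grammar is assumed to be a straight-line program whose rules are either a single character or a pair of earlier rules, the class name `SLP` is hypothetical, positions are 0-based, and query time is proportional to the grammar height rather than the O(log N) bound quoted above.

```python
class SLP:
    """Straight-line program with expansion lengths and per-symbol counts."""

    def __init__(self, rules, start):
        # rules: name -> single character, or (left_name, right_name);
        # rules must be listed bottom-up (dicts keep insertion order).
        self.rules, self.start = rules, start
        self.length, self.counts = {}, {}
        for name, rhs in rules.items():
            if isinstance(rhs, str):
                self.length[name] = 1
                self.counts[name] = {rhs: 1}
            else:
                left, right = rhs
                self.length[name] = self.length[left] + self.length[right]
                merged = dict(self.counts[left])
                for c, k in self.counts[right].items():
                    merged[c] = merged.get(c, 0) + k
                self.counts[name] = merged

    def access(self, i):
        """Return S[i] by descending from the start rule."""
        node = self.start
        while not isinstance(self.rules[node], str):
            left, right = self.rules[node]
            if i < self.length[left]:
                node = left
            else:
                i -= self.length[left]
                node = right
        return self.rules[node]

    def rank(self, c, i):
        """Number of occurrences of c in S before position i."""
        node, total = self.start, 0
        while i > 0:
            rhs = self.rules[node]
            if isinstance(rhs, str):              # single symbol, so i == 1 here
                total += 1 if rhs == c else 0
                break
            left, right = rhs
            if i < self.length[left]:
                node = left
            else:
                total += self.counts[left].get(c, 0)   # whole left part is counted
                i -= self.length[left]
                node = right
        return total

# Grammar for S = "abababab": Z -> YY, Y -> XX, X -> AB
rules = {"A": "a", "B": "b", "X": ("A", "B"), "Y": ("X", "X"), "Z": ("Y", "Y")}
slp = SLP(rules, "Z")
print(slp.access(3), slp.rank("a", 5))   # b 3
```

Storing one counter per rule and symbol is what makes the space grow with nσ, mirroring the nσ factor in the bounds quoted in the abstract.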