Grammar Compressed Sequences with Rank/Select Support
Abstract. Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. In several recent applications, the need arises to represent highly repetitive sequences, where statistical compression is ineffective. We introduce grammar-based representations for repetitive sequences, which use up to 10% of the space needed by representations based on statistical compression, and support direct access and rank/select operations within tens of microseconds.
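To make the three operations concrete, the following is a minimal sketch of access, rank, and select on a straight-line grammar (one binary rule per nonterminal). It is an illustration only, not the representation proposed in the paper: the class name `SLP`, its layout, and the choice to store a full symbol counter per nonterminal are assumptions made for clarity. That choice costs space proportional to the alphabet size per rule, which a practical structure would have to reduce.

```python
from collections import Counter


class SLP:
    """Hypothetical straight-line grammar with naive rank/select summaries."""

    def __init__(self, rules, start):
        # rules: nonterminal -> (left, right); any symbol absent from
        # rules is treated as a terminal. Names must not collide.
        self.rules = rules
        self.start = start
        self.length = {}  # symbol -> length of its expansion
        self.counts = {}  # symbol -> Counter of terminals in its expansion
        self._summarize(start)

    def _summarize(self, sym):
        if sym in self.length:
            return
        if sym not in self.rules:
            self.length[sym] = 1
            self.counts[sym] = Counter([sym])
            return
        left, right = self.rules[sym]
        self._summarize(left)
        self._summarize(right)
        self.length[sym] = self.length[left] + self.length[right]
        self.counts[sym] = self.counts[left] + self.counts[right]

    def access(self, i):
        """Symbol at 0-based position i of the expanded sequence."""
        if not 0 <= i < self.length[self.start]:
            raise IndexError(i)
        sym = self.start
        while sym in self.rules:
            left, right = self.rules[sym]
            if i < self.length[left]:
                sym = left
            else:
                i -= self.length[left]
                sym = right
        return sym

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        i = max(0, min(i, self.length[self.start]))
        sym, total = self.start, 0
        while sym in self.rules and i > 0:
            left, right = self.rules[sym]
            if i < self.length[left]:
                sym = left
            else:
                # The whole left child lies inside the prefix.
                total += self.counts[left][c]
                i -= self.length[left]
                sym = right
        if sym not in self.rules and i > 0 and sym == c:
            total += 1
        return total

    def select(self, c, j):
        """0-based position of the j-th occurrence of c (j >= 1)."""
        if not 1 <= j <= self.counts[self.start][c]:
            raise ValueError("fewer than j occurrences of c")
        sym, pos = self.start, 0
        while sym in self.rules:
            left, right = self.rules[sym]
            if j <= self.counts[left][c]:
                sym = left
            else:
                j -= self.counts[left][c]
                pos += self.length[left]
                sym = right
        return pos


# Grammar for "ababab": X -> ab, Y -> XX, Z -> YX
slp = SLP({"X": ("a", "b"), "Y": ("X", "X"), "Z": ("Y", "X")}, "Z")
```

Each operation walks a single root-to-leaf path, so its cost in this sketch is proportional to the height of the grammar rather than to the length of the expanded sequence.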
Faster compressed suffix trees for repetitive text collections
, 2014
Abstract. Recent compressed suffix trees targeted to highly repetitive text collections reach excellent compression performance, but their operation times are in the order of milliseconds. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations within microseconds. This puts the data structure at the same performance level as compressed suffix trees designed for standard text collections, which on repetitive collections use many times more space than our new structure.
Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data
A grammatical inference algorithm tries to find as small a grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence, possibly a singleton set. The restricted problem, called grammar compression, is to find the smallest CFG that generates exactly the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We review the history of grammar compression and its wide applications, together with important future work. The study of grammar compression began with bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear-time algorithm exist within an optimal working space?) Recent results produced by the research community answer these questions affirmatively. We introduce several important results and typical applications to a huge text collection. On the other hand, the data explosion erodes the advantage of grammar compression, since there is no working space to store all the data supplied by a data stream. The last question is: How can we handle stream data? For this question, we propose the framework of stream grammar compression for the next generation and its attractive application to fast data transmission.
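As an illustration of what a CFG that generates exactly the input looks like, here is a minimal sketch of greedy pair replacement in the style of the well-known Re-Pair heuristic. It is not one of the algorithms surveyed in the abstract, it does not find the smallest grammar (that problem is NP-hard, as noted above), and this naive version runs in quadratic time. The function names `repair` and `expand` and the tuple encoding of nonterminals are assumptions made for the example.

```python
from collections import Counter


def repair(text):
    """Greedy pair replacement; returns (start sequence, rules)."""
    seq = list(text)
    rules = {}  # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Counts adjacent pairs, including overlapping ones such as
        # the two (a, a) pairs in "aaa"; the result stays correct,
        # it is only less compact in that case.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        # Tuples cannot collide with the characters of the input.
        nonterminal = ("N", next_id)
        next_id += 1
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(sym, rules):
    """Recursively expand a symbol back into the string it derives."""
    if sym not in rules:
        return sym
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)


text = "abababab"
seq, rules = repair(text)
restored = "".join(expand(s, rules) for s in seq)
```

The grammar is lossless by construction: expanding the start sequence with the recorded rules reproduces the input, and on repetitive inputs the start sequence is much shorter than the original text.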
Online pattern matching for string edit distance with moves
Faster Compressed Suffix Trees for Repetitive Collections
Recent compressed suffix trees targeted to highly repetitive sequence collections reach excellent compression performance, but operation times are very high. We design a new suffix tree representation for this scenario that still achieves very low space usage, only slightly larger than the best previous one, but supports the operations orders of magnitude faster. Our suffix tree is still orders of magnitude slower than general-purpose compressed suffix trees, but these use several times more space when the collection is repetitive. Our main novelty is a practical grammar-compressed tree representation with full navigation functionality, which is useful in all applications where large trees with repetitive topology must be represented.