Practical Compressed String Dictionaries
Cited by 1 (0 self)
The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are
Grammar Compression: Grammatical Inference by Compression and Its Application to Real Data
Cited by 1 (0 self)
A grammatical inference algorithm tries to find as small a grammar as possible representing a potentially infinite sequence of strings. Here, let us consider a simple restriction: the input is a finite sequence, or it might be a singleton set. The restricted problem is then called grammar compression: find the smallest CFG generating just the input. In the last decade many researchers have tackled this problem because of its scalable applications, e.g., expansion of data storage capacity, speeding up information retrieval, DNA sequencing, frequent pattern mining, and similarity search. We review the history of grammar compression and its wide applications, together with important future work. The study of grammar compression began with bad news: the smallest CFG problem is NP-hard. Hence, the first question is: Can we get a near-optimal solution in polynomial time? (Is there a reasonable approximation algorithm?) And the next question is: Can we minimize the costs of time and space? (Does a linear-time algorithm exist within an optimal working space?) Recent results produced by the research community answer these questions affirmatively. We introduce several important results and typical applications to a huge text collection. On the other hand, the data explosion erodes the advantage of grammar compression, since there is no working space for storing the whole data supplied from a data stream. The last question is: How can we handle stream data? For this question, we propose a framework of stream grammar compression for the next generation and its attractive application to fast data transmission.
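
To make the idea of a grammar that generates just the input concrete, here is a minimal sketch of a pair-replacement heuristic in the style of Re-Pair. This is an assumption chosen for illustration: the abstract does not name a specific algorithm, and the function names `repair`, `expand`, and `decompress` are hypothetical. The sketch is a naive quadratic-time version, not one of the near-optimal or linear-time methods the abstract refers to, and it does not guarantee a smallest grammar.

```python
from collections import Counter


def repair(text):
    """Naive Re-Pair-style compression: repeatedly replace the most
    frequent adjacent pair of symbols with a fresh nonterminal."""
    seq = list(text)   # current sequence of terminals and nonterminals
    rules = {}         # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair is counted twice, so stop
        # Nonterminals are tuples, so they never collide with characters.
        nonterminal = ("N", next_id)
        next_id += 1
        rules[nonterminal] = pair
        # Rewrite the sequence left to right, replacing each occurrence.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into the string it derives."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def decompress(seq, rules):
    """Rebuild the original text from the start sequence and the rules."""
    return "".join(expand(symbol, rules) for symbol in seq)
```

Calling `repair("abababab")` yields a two-symbol start sequence and two rules, and `decompress` restores the original string. The grammar is a straight-line CFG: each nonterminal has exactly one rule, so it derives exactly one string.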