Compressed full-text indexes
ACM Computing Surveys, 2007
Cited by 263 (94 self)
Full-text indexes provide fast substring search over large text collections. A serious problem of these indexes has traditionally been their space consumption. A recent trend is to develop indexes that exploit the compressibility of the text, so that their size is a function of the compressed text length. This concept has evolved into self-indexes, which in addition contain enough information to reproduce any text portion, so they replace the text. The exciting possibility of an index that takes space close to that of the compressed text, replaces it, and in addition provides fast search over it, has triggered a wealth of activity and produced surprising results in a very short time, and radically changed the status of this area in less than five years. The most successful indexes nowadays are able to obtain almost optimal space and search time simultaneously. In this paper we present the main concepts underlying self-indexes. We explain the relationship between text entropy and regularities that show up in index structures and permit compressing them. Then we cover the most relevant self-indexes up to date, focusing on the essential aspects of how they exploit the text compressibility and how they efficiently solve various search problems. We aim at giving the theoretical background to understand and follow the developments in this area.
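To make the idea of searching without consulting the text more concrete, here is a minimal sketch of backward search over the Burrows-Wheeler transform, the mechanism behind the FM-index family of self-indexes. It is an illustration only, not a structure taken from the survey: the names `build_bwt`, `build_tables`, and `count_occurrences` are hypothetical, the tables are stored uncompressed, and the construction is deliberately naive.

```python
def build_bwt(text: str) -> str:
    """Naive BWT via sorted rotations. Assumes '\0' does not occur in text."""
    s = text + "\0"  # unique sentinel, smaller than every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_tables(bwt: str):
    """Return (C, occ): C[c] counts characters smaller than c,
    occ[c][i] counts occurrences of c in bwt[:i]."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern: str, C, occ, n: int) -> int:
    """Count occurrences of a non-empty pattern using only C and occ.
    The original text is never read; n is the BWT length."""
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("abracadabra")
C, occ = build_tables(bwt)
print(count_occurrences("abra", C, occ, len(bwt)))
```

Real self-indexes replace the explicit `occ` arrays with compressed rank structures, which is where the space savings discussed in the abstract come from.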

Spamming botnets: signatures and characteristics
In SIGCOMM, 2008
Cited by 120 (14 self)
In this paper, we focus on characterizing spamming botnets by leveraging both spam payload and spam server traffic properties. Towards this goal, we developed a spam signature generation framework called AutoRE to detect botnet-based spam emails and botnet membership. AutoRE does not require pre-classified training data or white lists. Moreover, it outputs high quality regular expression signatures that can detect botnet spam with a low false positive rate. Using a three-month sample of emails from Hotmail, AutoRE successfully identified 7,721 botnet-based spam campaigns together with 340,050 unique botnet host IP addresses. Our in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic. We believe these observations are useful information in the design of botnet detection schemes.

Hamsa: Fast signature generation for zero-day polymorphic worms with provable attack resilience
In S&P, 2006

Transcriptome sequencing to detect gene fusions in cancer
Nature, 2009
Cited by 90 (4 self)
Recurrent gene fusions, typically associated with hematological malignancies and rare bone and soft tissue tumors [1], have been recently described in common solid tumors [2–9]. Here we employ an integrative analysis of high-throughput long and short read transcriptome sequencing of cancer cells to discover novel gene fusions. As a proof of concept we successfully utilized integrative transcriptome sequencing to “rediscover” the BCR-ABL1 [10] gene fusion in a chronic myelogenous leukemia cell line and the TMPRSS2-ERG [2,3] gene fusion in a prostate cancer cell line and tissues. Additionally, we nominated, and experimentally validated, novel gene fusions resulting in chimeric transcripts in cancer cell lines and tumors. Taken together, this study establishes a robust pipeline for the discovery of novel gene chimeras using high-throughput sequencing, opening up an important class of cancer-related mutations for comprehensive characterization.

CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats
Nucleic Acids Res, 2007

Hierarchical phrase-based translation with suffix arrays
In Proc. of EMNLP-CoNLL, 2007
Cited by 56 (5 self)
A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly look up and extract rules on the fly. Hierarchical phrase-based translation introduces the added wrinkle of source phrases with gaps. Lookup algorithms used for contiguous phrases no longer apply and the best approximate pattern matching algorithms are much too slow, taking several minutes per sentence. We describe new lookup algorithms for hierarchical phrase-based translation that reduce the empirical computation time by nearly two orders of magnitude, making on-the-fly lookup feasible for source phrases with gaps.
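The baseline this abstract starts from, contiguous phrase lookup with a suffix array over an in-memory corpus, can be sketched as follows. This is an assumption-laden illustration: the function names and the toy corpus are hypothetical, the suffix array is built naively, and the gapped-phrase lookup that the paper contributes is not shown.

```python
def build_suffix_array(tokens):
    """Naive construction: sort start positions by the token suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens, suffix_array, phrase):
    """Return sorted start positions of a contiguous phrase (a list of tokens)."""
    m = len(phrase)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + m]

    # Lower bound: first suffix whose m-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


corpus = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(corpus)
print(find_phrase(corpus, sa, ["the", "cat"]))
```

Each lookup costs two binary searches over the suffix array. A phrase with a gap does not correspond to a single contiguous range of suffixes, which is why this method does not carry over directly.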

A new succinct representation of RMQ-information and improvements in the enhanced suffix array
Proc. ESCAPE, LNCS, 2007
Cited by 52 (15 self)
The Range-Minimum-Query-Problem is to preprocess an array of length n in O(n) time such that all subsequent queries asking for the position of a minimal element between two specified indices can be obtained in constant time. This problem was first solved by Berkman and Vishkin [1], and Sadakane [2] gave the first succinct data structure that uses 4n + o(n) bits of additional space. In practice, this method has several drawbacks: it needs O(n log n) bits of intermediate space when constructing the data structure, and it builds on previous results on succinct data structures. We overcome these problems by giving the first algorithm that never uses more than 2n + o(n) bits, and does not rely on rank- and select-queries or other succinct data structures. We stress the importance of this result by simplifying and reducing the space consumption of the Enhanced Suffix Array [3], while retaining its capability of simulating top-down-traversals of the suffix tree, used, e.g., to locate all occ positions of a pattern p in a text in optimal O(|p| + occ) time (assuming constant alphabet size). We further prove a lower bound of 2n − o(n) bits, which makes our algorithm asymptotically optimal.
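For readers unfamiliar with the query itself, the following sketch answers range minimum queries in constant time using a sparse table. It is a swapped-in textbook technique, not the succinct scheme of the paper: it needs O(n log n) words rather than 2n + o(n) bits, and the class name `SparseTableRMQ` is hypothetical.

```python
class SparseTableRMQ:
    """Sparse-table RMQ: O(n log n) preprocessing, O(1) query.
    Returns the position of the leftmost minimum in an inclusive range."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[k][i] = position of the leftmost minimum in values[i : i + 2**k]
        self.table = [list(range(n))]
        k = 1
        while (1 << k) <= n:
            prev = self.table[k - 1]
            half = 1 << (k - 1)
            row = []
            for i in range(n - (1 << k) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if self.values[a] <= self.values[b] else b)
            self.table.append(row)
            k += 1

    def query(self, i, j):
        """Position of the leftmost minimum in values[i..j], both ends inclusive."""
        if not 0 <= i <= j < len(self.values):
            raise IndexError("query range out of bounds")
        k = (j - i + 1).bit_length() - 1  # largest power of two fitting in the range
        a = self.table[k][i]
        b = self.table[k][j - (1 << k) + 1]
        return a if self.values[a] <= self.values[b] else b


rmq = SparseTableRMQ([3, 1, 4, 1, 5, 9, 2, 6])
print(rmq.query(2, 4), rmq.query(4, 7))
```

The two overlapping power-of-two blocks cover the whole range, so comparing their minima is enough. The succinct structures discussed in these abstracts obtain the same query time while storing far less.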

Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays
2009
Cited by 47 (3 self)
Given a static array of n totally ordered objects, the range minimum query problem is to build an additional data structure that allows one to answer subsequent online queries of the form “what is the position of a minimum element in the subarray ranging from i to j?” efficiently. We focus on two settings, where (1) the input array is available at query time, and (2) the input array is only available at construction time. In setting (1), we show new data structures (a) of n/c(n) · (2 + o(1)) bits and query time O(c(n)), or (b) with O(n H_k) + o(n) bits and O(1) query time, where H_k denotes the empirical entropy of k-th order of the input array. In setting (2), we give a data structure of optimal size 2n + o(n) bits and query time O(1). All data structures can be constructed in linear time and almost in-place.

Optimal Succinctness for Range Minimum Queries
Cited by 35 (4 self)
For an array A of n objects from a totally ordered universe, a range minimum query rmq_A(i, j) asks for the position of the minimum element in the subarray A[i, j]. We focus on the setting where the array A is static and known in advance, and can hence be preprocessed into a scheme in order to answer future queries faster. We make the further assumption that the input array A cannot be used at query time. Under this assumption, a natural lower bound of 2n − Θ(log n) bits for RMQ-schemes exists. We give the first truly succinct preprocessing scheme for O(1)-RMQs. Its final space consumption is 2n + o(n) bits, thus being asymptotically optimal. We also give a simple linear-time construction algorithm for this scheme that needs only n + o(n) bits of space in addition to the 2n + o(n) bits needed for the final data structure, thereby lowering the peak space consumption of previous schemes from O(n log n) to O(n) bits. We also improve on LCA-computation in BPS- and DFUDS-encoded trees.
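The lower bound of roughly 2n bits can be motivated with a small experiment. Two arrays answer every range minimum query identically exactly when their Cartesian trees have the same shape, and the number of such shapes on n nodes is the Catalan number. The sketch below is an illustration under that standard argument, not the paper's encoding: the function names are hypothetical, and ties are broken toward the leftmost minimum.

```python
from itertools import permutations
from math import comb, log2


def cartesian_shape(values):
    """Encode the Cartesian tree shape of values as 2n parentheses.
    '(' records a push and ')' a pop in the usual stack-based construction."""
    out, stack = [], []
    for v in values:
        while stack and stack[-1] > v:  # strict: equal keys keep the leftmost on top
            stack.pop()
            out.append(")")
        stack.append(v)
        out.append("(")
    out.extend(")" * len(stack))
    return "".join(out)


def all_rmq_answers(values):
    """Brute-force table of leftmost-minimum positions for every range."""
    n = len(values)
    return [min(range(i, j + 1), key=values.__getitem__)
            for i in range(n) for j in range(i, n)]


def catalan(n):
    return comb(2 * n, n) // (n + 1)


def count_shapes(n):
    """Number of distinct shapes over all permutations of length n."""
    return len({cartesian_shape(p) for p in permutations(range(n))})


a, b = [3, 1, 4, 1, 5], [30, 10, 40, 10, 50]
print(cartesian_shape(a) == cartesian_shape(b), all_rmq_answers(a) == all_rmq_answers(b))
print(count_shapes(4), catalan(4))
print(2 * 100 - log2(catalan(100)))  # gap between 2n and log2(C_n) for n = 100
```

Since a scheme that cannot read the array at query time must distinguish all Catalan-many shapes, it needs at least log2 of that count in bits, which falls short of 2n only by a logarithmic term.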

Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE
Proc. CPM, volume 4009 of LNCS, 2006
Cited by 34 (9 self)
The Range-Minimum-Query-Problem is to preprocess an array such that the position of the minimum element between two specified indices can be obtained efficiently. We present a direct algorithm for the general RMQ-problem with linear preprocessing time and constant query time, without making use of any dynamic data structure. It consumes less than half of the space that is needed by the method by Berkman and Vishkin. We use our new algorithm for RMQ to improve on LCA-computation for binary trees, and further give a constant-time LCE-algorithm solely based on arrays. Both LCA and LCE have important applications, e.g., in computational biology. Experimental studies show that our new method is almost twice as fast in practice as previous approaches, and asymptotically slower variants of the constant-time algorithms perform even better for today’s common problem sizes.
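The connection between longest common extension (LCE) queries and range minima can be sketched with three arrays: a suffix array, its inverse, and the LCP array. The code below is a minimal illustration under simplifying assumptions: all constructions are naive, the range minimum is computed by a plain scan where a constant-time RMQ structure would be plugged in, and the names are hypothetical rather than taken from the paper.

```python
def suffix_array(s):
    """Naive suffix array: start positions sorted by suffix."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    def common(i, j):
        length = 0
        while i + length < len(s) and j + length < len(s) and s[i + length] == s[j + length]:
            length += 1
        return length
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def make_lce(s):
    """Return a function lce(i, j) giving the longest common prefix length
    of the suffixes of s starting at positions i and j."""
    sa = suffix_array(s)
    rank = [0] * len(s)
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = lcp_array(s, sa)

    def lce(i, j):
        if i == j:
            return len(s) - i
        lo, hi = sorted((rank[i], rank[j]))
        # Range minimum over lcp[lo+1 .. hi]; an O(1) RMQ structure would replace this scan.
        return min(lcp[lo + 1:hi + 1])

    return lce


lce = make_lce("abracadabra")
print(lce(0, 7), lce(1, 8), lce(0, 3))
```

With the scan replaced by a constant-time RMQ over the LCP array, every LCE query becomes constant time using arrays only, which is the setting the abstract describes.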