Textual data compression in computational biology: a synopsis
 Bioinformatics
, 2009
Cited by 24 (1 self)
Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. Availability: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The supplementary material (see next) provides pointers to software and benchmark datasets for a range of applications of broad interest. Contact:
A UNIFIED ALGORITHM FOR ACCELERATING EDIT-DISTANCE COMPUTATION via . . .
, 2009
Cited by 13 (1 self)
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform
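For orientation, the quadratic baseline that the abstract above refers to can be sketched as follows. This is only the textbook dynamic program with unit-cost insertions, deletions, and substitutions (an assumption about the cost model); it is not the compressed-string algorithm of the paper, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook edit distance with unit costs, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1      # drop ca from a
            insertion = curr[j - 1] + 1  # insert cb into a
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Keeping only two rows reduces the memory to O(len(b)), but every one of the len(a) × len(b) cells is still filled, which is the cost that compression-based methods try to avoid.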
Algorithmics on SLP-compressed strings: a survey
 Groups Complex. Cryptol.
, 2012
Cited by 10 (3 self)
Results on algorithmic problems on strings that are given in a compressed form via straight-line programs are surveyed. A straight-line program is a context-free grammar that generates exactly one string. In this way, exponential compression rates can be achieved. Among others, we study pattern matching for compressed strings, membership problems for compressed strings in various kinds of formal languages, and the problem of querying compressed strings. Applications in combinatorial group theory and computational topology and to the solution of word equations are discussed as well. Finally, extensions to compressed trees and pictures are considered.
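A small sketch may make the definition in the abstract above concrete. It assumes one common normal form for straight-line programs, in which every nonterminal has exactly one rule that is either a single terminal character or a pair of nonterminals; the dictionary encoding and the names `slp_lengths` and `char_at` are illustrative choices, not taken from the survey.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: a rule is one terminal character or a pair of nonterminal names.
Rule = Union[str, Tuple[str, str]]


def slp_lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Length of the string derived by each nonterminal, without expanding it."""
    lengths: Dict[str, int] = {}

    def length(name: str) -> int:
        if name not in lengths:
            rule = rules[name]
            if isinstance(rule, str):
                lengths[name] = 1
            else:
                lengths[name] = length(rule[0]) + length(rule[1])
        return lengths[name]

    for name in rules:
        length(name)
    return lengths


def char_at(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, index: int) -> str:
    """Character at position index of the derived string, by descending the grammar."""
    if not 0 <= index < lengths[start]:
        raise IndexError(index)
    name = start
    while True:
        rule = rules[name]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if index < lengths[left]:
            name = left
        else:
            index -= lengths[left]
            name = right


if __name__ == "__main__":
    # X0 -> "ab", and each X(k) doubles X(k-1): 33 rules derive a string of length 2**31.
    rules: Dict[str, Rule] = {"A": "a", "B": "b", "X0": ("A", "B")}
    for k in range(1, 31):
        rules[f"X{k}"] = (f"X{k - 1}", f"X{k - 1}")
    lengths = slp_lengths(rules)
    print(len(rules), lengths["X30"], char_at(rules, lengths, "X30", 12345))
```

The doubling chain illustrates the exponential compression mentioned in the abstract: the grammar grows by one rule per level while the derived string doubles in length, and a single character can still be read without decompressing.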
Efficient Staggered Decoding for Sequence Labeling
 Proceedings of ACL
, 2010
Cited by 4 (0 self)
The Viterbi algorithm is the conventional decoding algorithm most widely adopted for sequence labeling. Viterbi decoding is, however, prohibitively slow when the label set is large, because its time complexity is quadratic in the number of labels. This paper proposes an exact decoding algorithm that overcomes this problem. A novel property of our algorithm is that it efficiently reduces the labels to be decoded, while still allowing us to check the optimality of the solution. Experiments on three tasks (POS tagging, joint POS tagging and chunking, and supertagging) show that the new algorithm is several orders of magnitude faster than the basic Viterbi and a state-of-the-art algorithm,
Grammar-Based Compression in a Streaming Model
 LATA 2010. LNCS
, 2010
Cited by 4 (2 self)
We show that, given a string s of length n, with constant memory and logarithmic passes over a constant number of streams we can build a context-free grammar that generates s and only s and whose size is within an O(min(g log g, √(n / log n)))-factor of the minimum g. This stands in contrast to our previous result that, with polylogarithmic memory and polylogarithmic passes over a single stream, we cannot build such a grammar whose size is within any polynomial of g.
CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
Cited by 3 (1 self)
The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution while requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; at the same time, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement.
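The complexity quoted in the abstract above (linear in the number of time events, quadratic in the number of labels) can be read off a plain Viterbi decoder. The sketch below is the standard algorithm in log space, not CarpeDiem; the dictionary-based parameter layout and the function name `viterbi` are assumptions made for illustration, and the observation sequence is assumed to be non-empty.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: LogTable,
    log_emit: LogTable,
) -> Tuple[List[str], float]:
    """Most probable state path and its log probability, O(T * K^2) time."""
    # score[s]: best log probability of any path ending in state s so far.
    score = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers: List[Dict[str, str]] = []
    for o in obs[1:]:  # T - 1 time steps
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:  # K target labels
            # K candidate predecessors: this inner maximisation is the quadratic part.
            best = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[best] + log_trans[best][s] + log_emit[s][o]
            pointers[s] = best
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


if __name__ == "__main__":
    log = math.log
    states = ["Healthy", "Fever"]
    start = {"Healthy": log(0.6), "Fever": log(0.4)}
    trans = {
        "Healthy": {"Healthy": log(0.7), "Fever": log(0.3)},
        "Fever": {"Healthy": log(0.4), "Fever": log(0.6)},
    }
    emit = {
        "Healthy": {"normal": log(0.5), "cold": log(0.4), "dizzy": log(0.1)},
        "Fever": {"normal": log(0.1), "cold": log(0.3), "dizzy": log(0.6)},
    }
    print(viterbi(["normal", "cold", "dizzy"], states, start, trans, emit))
```

The two nested loops over `states` inside the loop over observations are where the K² factor comes from; algorithms such as the ones listed here aim to avoid evaluating most of those predecessor candidates.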
Unified Compression-Based Acceleration of Edit-Distance Computation
Cited by 2 (0 self)
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an
Speeding up Bayesian HMM by the four Russians method
 In: Proceedings of the 11th International Conference on Algorithms in Bioinformatics
, 2011
Cited by 2 (1 self)
Bayesian computations with Hidden Markov Models (HMMs) are often avoided in practice. Instead, due to reduced running time, point estimates – maximum likelihood (ML) or maximum a posteriori (MAP) – are obtained and observation sequences are segmented based on the Viterbi path, even though the lack of accuracy and dependency on starting points of the local optimization are well known. We propose a method to speed up Bayesian computations which addresses this problem for regular and time-dependent HMMs with discrete observations. In particular, we show that by exploiting sequence repetitions, using the four Russians method, and the conditional dependency structure, it is possible to achieve a Θ(log T) speedup, where T is the length of the observation sequence. Our experimental results on identification of segments of homogeneous nucleic acid composition, known as the DNA segmentation problem, show that the speedup is also observed in practice.
Accelerating Dynamic Programming
, 2009
Cited by 1 (0 self)
Dynamic Programming (DP) is a fundamental problem-solving technique that has been widely used for solving a broad range of search and optimization problems. While DP can be invoked when more specialized methods fail, this generality often incurs a cost in efficiency. We explore a unifying toolkit for speeding up DP, and algorithms that use DP as subroutines. Our methods and results can be summarized as follows.
– Acceleration via Compression. Compression is traditionally used to efficiently store data. We use compression in order to identify repeats in the table that imply a redundant computation. Utilizing these repeats requires a new DP, and often different DPs for different compression schemes. We present the first provable speedup of the celebrated Viterbi algorithm (1967) that is used for the decoding and training of Hidden Markov Models (HMMs). Our speedup relies on the compression of the HMM’s observable sequence.
– Totally Monotone Matrices. It is well known that a wide variety of DPs can be reduced to the problem of finding row minima in totally monotone matrices. We introduce this scheme in the context of planar graph problems. In particular, we show that planar graph problems
Data and text mining Textual data compression in computational biology: a synopsis
Motivation: Textual data compression, and the associated techniques coming from information theory, are often perceived as being of interest for data communication and storage. However, they are also deeply related to classification and data mining and analysis. In recent years, a substantial effort has been made for the application of textual data compression techniques to various computational biology tasks, ranging from storage and indexing of large datasets to comparison and reverse engineering of biological networks. Results: The main focus of this review is on a systematic presentation of the key areas of bioinformatics and computational biology where compression has been used. When possible, a unifying organization of the main ideas and techniques is also provided. Availability: It goes without saying that most of the research results reviewed here offer software prototypes to the bioinformatics community. The Supplementary Material provides pointers to software and benchmark datasets for a range of applications of broad interest. In addition to providing references to software, the Supplementary Material also gives a brief presentation of some fundamental results and techniques related to this paper. It is at: