Results 1 - 10 of 22
A novel method for multiple alignment of sequences with repeated and shuffled elements
, 2004
Fragment assembly with short reads
 Bioinformatics
, 2004
Abstract

Cited by 43 (4 self)
Motivation: Current DNA sequencing technology produces reads of about 500-750 base pairs (bp) with typical coverage under 10X. New sequencing technologies are emerging that produce shorter reads (length 80-200 bp) but allow one to generate significantly higher coverage (30X and higher) at low cost. Modern assembly programs and error correction routines have been tuned to work well with current read technology, but were not designed for assembly of short reads. Results: We analyze the limitations of assembling reads generated by these new technologies, and present a routine for base-calling in reads prior to their assembly. We demonstrate that while it is feasible to assemble such short reads, the resulting contigs will require significant (if not prohibitive) finishing efforts.
Searching for Jumbled Patterns in Strings
, 2009
Abstract

Cited by 10 (0 self)
The Parikh vector of a string s over a finite ordered alphabet Σ = {a_1, ..., a_σ} is defined as the vector of multiplicities of the characters, i.e. p(s) = (p_1, ..., p_σ), where p_i = |{j | s_j = a_i}|. A Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and optimally with a sliding window approach in O(n) time. We present two new algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm finds all occurrences of a given Parikh vector in a text (over a fixed alphabet of size σ ≥ 2) and appears to have a sublinear expected time complexity. The second algorithm only decides whether a given Parikh vector appears in a binary text; it iteratively constructs a linear size data structure which then allows answering queries in constant time, for many queries even during the construction phase.
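The sliding window baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `jumbled_occurrences` and the choice to pass the query as a character-to-count mapping are assumptions made here. Comparing the window to the query costs O(σ) per step, which is O(n) overall for a fixed alphabet.

```python
from collections import Counter

def jumbled_occurrences(text, query):
    """Return all start indices i such that text[i:i+m] has Parikh vector `query`.

    `query` maps each character to its required multiplicity (hypothetical
    interface chosen for this sketch); m is the sum of the multiplicities.
    """
    target = Counter({c: k for c, k in query.items() if k > 0})
    m = sum(target.values())
    if m == 0 or m > len(text):
        return []
    window = Counter(text[:m])          # Parikh vector of the first window
    hits = [0] if window == target else []
    for i in range(m, len(text)):
        window[text[i]] += 1            # character entering on the right
        left = text[i - m]              # character leaving on the left
        window[left] -= 1
        if window[left] == 0:
            del window[left]            # keep zero counts out of the comparison
        if window == target:
            hits.append(i - m + 1)
    return hits
```

For example, `jumbled_occurrences("abcab", {"a": 1, "b": 1})` reports the windows "ab" at positions 0 and 3.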
Algorithms for jumbled pattern matching in strings
 IJFCS
, 2011
Abstract

Cited by 7 (0 self)
The Parikh vector p(s) of a string s over a finite ordered alphabet Σ = {a_1, ..., a_σ} is defined as the vector of multiplicities of the characters, p(s) = (p_1, ..., p_σ), where p_i = |{j | s_j = a_i}|. A Parikh vector q occurs in s if s has a substring t with p(t) = q. The problem of searching for a query q in a text s of length n can be solved simply and worst-case optimally with a sliding window approach in O(n) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in O(1) time. The preprocessing can be done trivially in Θ(n^2) time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size σ ≥ 2 and has sublinear expected time complexity. More precisely, we present two variants of the algorithm, both using an O(n) size data structure, each of which can be constructed in O(n) time. The first solution is very simple and easy to implement and leads to an expected query time of O(n (σ / log σ)^(1/2) · (log m) / √m), where m = Σ_i q_i is the length of a string with Parikh vector q. The second uses wavelet trees and improves the expected runtime to O(n (σ / log σ ...
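A linear size index with O(1) queries for binary texts can be sketched as below. The sketch assumes the well-known interval property of binary strings, which the abstract itself does not state: when a window of fixed length slides by one position, its number of 1s changes by at most one, so the counts attained for each length form a contiguous range. The names `build_binary_index` and `occurs` are hypothetical, and the construction is the trivial quadratic one.

```python
def build_binary_index(text):
    """For every window length m, store the min and max number of '1's
    over all substrings of length m. Takes Theta(n^2) time, O(n) space."""
    n = len(text)
    prefix = [0] * (n + 1)              # prefix[i] = number of '1's in text[:i]
    for i, ch in enumerate(text):
        prefix[i + 1] = prefix[i] + (ch == "1")
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for m in range(1, n + 1):
        counts = [prefix[i + m] - prefix[i] for i in range(n - m + 1)]
        lo[m], hi[m] = min(counts), max(counts)
    return lo, hi

def occurs(index, zeros, ones):
    """Decide in O(1) whether some substring has exactly `zeros` 0s and `ones` 1s."""
    lo, hi = index
    m = zeros + ones
    if m == 0:
        return True                     # the empty substring always occurs
    if m >= len(lo):
        return False                    # query longer than the text
    return lo[m] <= ones <= hi[m]       # interval property (assumed, see lead-in)
```

With `index = build_binary_index("0110")`, the query `occurs(index, 0, 2)` succeeds because of the substring "11", while `occurs(index, 2, 0)` fails because "00" never appears.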
Efficient Mass Decomposition
, 2005
Abstract

Cited by 6 (1 self)
We study the problem of decomposing a positive integer M over a (fixed and finite) weighted alphabet Σ: We want to find non-negative integers c_i such that M = c_1 a_1 + ... + c_k a_k, where the a_i are the positive integer weights of the individual characters and |Σ| = k. We refer to the vector (c_1, ..., c_k) as a witness (of M over Σ), and denote by γ(M) the number of distinct witnesses of M. We present a data structure of size O(k a_1) that allows finding all witnesses of any query M in time O(k a_1 · γ(M)). To the best of our knowledge, this is the first algorithm for the problem with runtime independent of the size of the query M. Construction of the data structure requires O(k a_1) time and constant additional space, and is very easy to implement. The problem is motivated by mass spectrometry experiments, where peaks need to be mapped to sample molecules whose mass they could represent. Our simulations show that the algorithm presented performs well on relevant applications.
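To make the notion of a witness concrete, here is a small sketch that enumerates all witnesses of M by dynamic programming and backtracking. It is deliberately not the O(k a_1) data structure described in the abstract: the table below has size O(k · M) and therefore depends on the query, which is exactly the dependence the paper avoids. The name `decompositions` is hypothetical.

```python
def decompositions(mass, weights):
    """Return all witnesses (c_1, ..., c_k) with sum(c_i * weights[i]) == mass.

    Naive baseline: reachable[i][m] is True if m can be decomposed using
    only the first i weights. Table size grows with `mass`.
    """
    if mass < 0:
        return []
    k = len(weights)
    reachable = [[False] * (mass + 1) for _ in range(k + 1)]
    reachable[0][0] = True
    for i in range(1, k + 1):
        w = weights[i - 1]
        for m in range(mass + 1):
            reachable[i][m] = reachable[i - 1][m] or (m >= w and reachable[i][m - w])

    witnesses = []

    def backtrack(i, m, counts):
        if i == 0:                      # only reached with m == 0
            witnesses.append(tuple(counts))
            return
        w = weights[i - 1]
        c = 0
        while c * w <= m:
            if reachable[i - 1][m - c * w]:   # prune dead branches early
                counts[i - 1] = c
                backtrack(i - 1, m - c * w, counts)
            c += 1
        counts[i - 1] = 0

    if reachable[k][mass]:
        backtrack(k, mass, [0] * k)
    return sorted(witnesses)
```

For weights (2, 3, 5) and M = 10 this yields four witnesses, so γ(10) = 4 in the abstract's notation.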
Mass Spectra Alignments and Their Significance
 In Combinatorial Pattern Matching Volume 3537. Edited by: Apostolico A, Crochemore M, Park K
Abstract

Cited by 4 (2 self)
Mass Spectrometry has become one of the most popular analysis techniques in Genomics and Systems Biology. We investigate a general framework that allows the alignment (or matching) of any two mass spectra. In particular, we examine the alignment of a reference mass spectrum generated in silico from a database, with a measured sample mass spectrum. In this context, we assess the significance of alignment scores for character-specific cleavage experiments, such as tryptic digestion of amino acids. We present an efficient approach to estimate this significance, with runtime linear in the number of detected peaks. In this context, we investigate the probability that a random string over a weighted alphabet contains a substring of some given weight.
SEQUENCING FROM COMPOMERS: THE PUZZLE
Abstract

Cited by 1 (1 self)
The board game Fragmind™ poses the following question: The player has to reconstruct an (unknown) string s over the alphabet Σ. To this end, the game reports the following information to the player, for every character x ∈ Σ: First, the string s is cleaved wherever the character x is found in s. Second, every resulting fragment y is scrambled by a random permutation so that the only information left is how many times y contains each character σ ∈ Σ. These scrambled fragments (or compomers) are then reported to the player. Clearly, distinct strings can show identical cleavage patterns for all cleavage characters. In fact, even short strings of length 30+ usually have non-unique cleavage patterns. To this end, we introduce a generalization of the game setup called Sequencing From Compomers: We also generate those fragments of s that contain up to k uncleaved characters x, for some small and fixed threshold k. Surprisingly, this modification dramatically increases the length of strings that can be uniquely reconstructed. We show that it is NP-hard to decide whether there exists some string compatible with the input data, but we also present a branch-and-bound runtime heuristic to find all such strings. Here, the input data is transformed into subgraphs of the de Bruijn graph, and we search for walks in these subgraphs simultaneously. The above problem directly stems from the analysis of Mass Spectrometry data from base-specific cleavage of DNA sequences, and gives rise to a completely new approach to DNA de novo sequencing.
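The data reported to the player can be illustrated with a short sketch. Two points are assumptions made here, since the abstract leaves them open: the cleavage character x is dropped from the fragments at cut positions (as with `str.split`), and a fragment with j uncleaved characters is formed by rejoining j + 1 consecutive pieces with x. The name `compomers` and the tuple encoding of a compomer are hypothetical.

```python
from collections import Counter

def compomers(s, x, k=0):
    """Multiset of compomers obtained by cleaving s at character x,
    allowing up to k uncleaved occurrences of x per fragment.

    Each compomer is encoded as a sorted tuple of (character, count) pairs,
    so that the order of characters inside a fragment is forgotten.
    """
    pieces = s.split(x)                 # assumption: x is removed at cut sites
    result = Counter()
    for j in range(k + 1):              # j = number of uncleaved x in the fragment
        for start in range(len(pieces) - j):
            fragment = x.join(pieces[start:start + j + 1])
            if fragment:                # skip empty fragments between adjacent cuts
                result[tuple(sorted(Counter(fragment).items()))] += 1
    return result
```

For s = "ACGTAC" cleaved at "A" with k = 0, the fragments are "CGT" and "C"; raising k to 1 adds the compomers of "ACGT" and "CGTAC", which is the extra information the generalized setup relies on.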
Melodic String Matching via Interval Consolidation and Fragmentation
Abstract

Cited by 1 (1 self)
In this paper, we address the problem of melodic string matching that enables identification of varied (ornamented) instances of a given melodic pattern. To this aim, a new set of edit distance operations adequate for pitch interval strings is introduced. Insertion, deletion and replacement operations are abolished as irrelevant. Consolidation and fragmentation are retained, but adapted to the pitch interval domain, i.e., two or more intervals of one string may be matched to an interval from a second string through consolidation or fragmentation. The melodic interval string matching problem consists of finding all occurrences of a given pattern in a melodic sequence, taking into account exact matches, consolidations and fragmentations of intervals in both the sequence and the pattern. We show some properties of the problem and propose an algorithm that solves it.
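One way to read the matching rule is sketched below; it is an interpretation of the abstract, not the algorithm proposed in the paper. Intervals are integers (for instance semitone steps), a pattern interval may match the sum of several consecutive sequence intervals, and a sequence interval may match the sum of several consecutive pattern intervals. The function name `match_ends` and the bound `max_group` on how many intervals may be merged are assumptions introduced here.

```python
from functools import lru_cache

def match_ends(pattern, seq, start, max_group=3):
    """Return the set of positions e such that `pattern` matches seq[start:e]
    using exact matches, consolidations and fragmentations of intervals."""
    pattern, seq = tuple(pattern), tuple(seq)

    @lru_cache(maxsize=None)
    def ends(i, j):
        if i == len(pattern):
            return frozenset([j])       # whole pattern consumed
        found = set()
        # One pattern interval vs. g sequence intervals (g == 1 is an exact match).
        for g in range(1, max_group + 1):
            if j + g <= len(seq) and sum(seq[j:j + g]) == pattern[i]:
                found |= ends(i + 1, j + g)
        # One sequence interval vs. g >= 2 pattern intervals.
        for g in range(2, max_group + 1):
            if i + g <= len(pattern) and j < len(seq) and sum(pattern[i:i + g]) == seq[j]:
                found |= ends(i + g, j + 1)
        return frozenset(found)

    return ends(0, start)
```

For instance, the pattern (2, 2, -4) matches the sequence (1, 1, 2, -4, 7) from position 0 up to position 4, because the two steps 1, 1 add up to the first pattern interval 2.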
Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry
Abstract
Vol. 23 ECCB 2006, pages e5–e11, doi:10.1093/bioinformatics/btl291. Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry