Seed optimization is no easier than optimal Golomb ruler design
- in Proceedings of the 6th Asia Pacific Bioinformatics Conference (APBC)
, 2008
Cited by 6 (1 self)
A spaced seed is a filter method invented to efficiently identify regions of interest in similarity searches. It is now well known that certain spaced seeds hit (detect) a randomly sampled similarity region with higher probability than others. Assume each position of the similarity region is an identity (a match) with probability p, independently. The seed optimization problem seeks the seed that achieves the highest hit probability for a given length and weight. Although the problem was previously shown not to be NP-hard, in practice it seems difficult to solve. The only algorithm known to compute the optimal seed is still exhaustive search in exponential time. In this article we give some insight into the hardness of the seed design problem by demonstrating the relation between the seed optimization problem and the optimal Golomb ruler design problem, which is a well-known difficult problem in combinatorial design.
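As a rough illustration of the hit probability described in this abstract, the sketch below enumerates every 0/1 similarity region of length n and sums the probability of those that a seed hits. It is a brute-force sketch for very small n only, not the paper's method; the seed encoding ("1" for a required match, "0" for a don't-care position) and the function names seed_hits and hit_probability are assumptions made for this example.

```python
from itertools import product


def seed_hits(seed: str, region: tuple) -> bool:
    """Return True if the seed matches the 0/1 region at some offset.

    Assumed encoding: '1' in the seed requires a match (1) in the region,
    '0' is a don't-care position.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(region) - len(seed) + 1):
        if all(region[start + i] == 1 for i in required):
            return True
    return False


def hit_probability(seed: str, n: int, p: float) -> float:
    """Probability that the seed hits an i.i.d. region of length n.

    Each position is a match with probability p, independently.
    Enumerates all 2**n regions, so it is only usable for small n.
    """
    total = 0.0
    for region in product((0, 1), repeat=n):
        if seed_hits(seed, region):
            ones = sum(region)
            total += p ** ones * (1 - p) ** (n - ones)
    return total
```

For instance, hit_probability("101", 3, 0.5) sums the two regions 101 and 111. The exponential enumeration here is only a way to make the definition concrete; it says nothing about how optimal seeds are searched for in practice.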
Amino Acid Classification and Hash Seeds for Homology Search
- BICOB
, 2009
Cited by 2 (1 self)
Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds substantially increased the homology search sensitivity. It is then a natural question to ask whether there is a better hash function (called hash seed) that provides better sensitivity than the spaced seed. We study this question in the paper. We propose a strategy to classify amino acids, which leads to a better hash seed. Our results raise a new question about how to design the best hash seed.
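The view of a spaced seed as a hash function on k-mers can be sketched as follows. This is a minimal illustration of the definition quoted above, not the hash seed the paper proposes; the function name spaced_seed_hash and the "1"/"0" seed encoding are assumptions made for this example.

```python
def spaced_seed_hash(kmer: str, seed: str) -> str:
    """Hash a k-mer by keeping only the positions the seed designates.

    Assumed encoding: seed is a string of length k over '1' (position is
    compared) and '0' (position is ignored). Two k-mers get the same hash
    value if and only if they agree at all w positions marked '1'.
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(ch for ch, s in zip(kmer, seed) if s == "1")
```

With the seed "1101" (k = 4, w = 3), the k-mers "ACGT" and "ACTT" collide because they differ only at the ignored third position, while "ACGT" and "AGGT" do not.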
Masking Patterns in Sequences: A New Class of Motif Discovery with Don’t Cares
, 2009
In this paper, we introduce a new notion of motifs, called masks, that succinctly represent the repeated patterns for an input sequence T of n symbols drawn from an alphabet Σ. We show how to build the set of all maximal masks of length L and quorum q, in O(2^L n) time and space in the worst case. We analytically show that our algorithms perform better than enumerating and checking, in constant time each, all the potential (|Σ| + 1)^L candidate patterns in T after a polynomial-time preprocessing of T. Our algorithms are also cache-friendly, attaining O(2^L sort(n)) block transfers, where sort(n) is the cache-oblivious complexity of sorting n items. Key words: Motif inference, motifs with don’t care, motif partial order, motifs with masks.
Seed optimization for i.i.d. similarities is no easier than optimal Golomb ruler design
- INFORMATION PROCESSING LETTERS
, 2009
The spaced seed is a filtration method to efficiently identify the regions of interest in string similarity searches. It is important to find the optimal spaced seed that achieves the highest search sensitivity. For some simple distributions of the similarities, the seed optimization problem was proved not to be NP-hard. On the other hand, no polynomial-time algorithm has been found despite extensive research in the literature. In this article we examine the hardness of the seed optimization problem by a polynomial-time reduction from the optimal Golomb ruler design problem, which is a well-known difficult (but not NP-hard) problem in combinatorial design.
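For readers unfamiliar with the problem named in this abstract: a Golomb ruler is a set of integer marks whose pairwise differences are all distinct, and an optimal ruler is a shortest one for a given number of marks. The sketch below checks the property and finds an optimal ruler by exhaustive search for small mark counts. It illustrates the problem definition only and is not the reduction from the paper; the function names are assumptions made for this example.

```python
from itertools import combinations


def is_golomb_ruler(marks) -> bool:
    """True if the marks are distinct and all pairwise differences differ."""
    marks = sorted(marks)
    if len(set(marks)) != len(marks):
        return False
    diffs = [b - a for a, b in combinations(marks, 2)]
    return len(diffs) == len(set(diffs))


def optimal_golomb_ruler(m: int) -> tuple:
    """Shortest Golomb ruler with m marks, by exhaustive search.

    Tries increasing lengths and every placement of the inner marks, so
    the running time grows very quickly with m (small m only).
    """
    if m < 2:
        raise ValueError("need at least two marks")
    length = m - 1
    while True:
        for inner in combinations(range(1, length), m - 2):
            marks = (0, *inner, length)
            if is_golomb_ruler(marks):
                return marks
        length += 1
```

For example, (0, 1, 4, 6) passes the check because its six pairwise differences are 1, 2, 3, 4, 5 and 6, whereas (0, 1, 2) fails because the difference 1 occurs twice.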
Project-Team sequoia - Algorithms for large-scale sequence analysis for molecular biology - INRIA Activity Report
, 2008
The main goal of the SEQUOIA project-team is to define appropriate combinatorial models and efficient algorithms for large-scale sequence analysis in molecular biology. Emphasis is placed on the annotation of non-coding regions in genomes – RNA genes and regulatory sequences – via comparative genomics methods. This task involves several complementary issues such as sequence comparison, prediction, analysis and manipulation of RNA secondary structures, and identification and processing of regulatory sequences. Efficient algorithms and parallelism on high-performance computing architectures make it possible to handle large-scale instances of these problems. Our aim is to tackle all those issues in an integrated fashion and to put together the developed software tools into a common platform for annotation of non-coding regions. We also explore complementary problems of protein sequence analysis. Those include new approaches to protein sequence comparison on the one hand, and a system for storing and manipulating nonribosomal peptides on the other hand. Special attention is given to the development of robust software, its validation on biological data, and its availability from the software platform of the team and by other means. Most research projects are carried out in collaboration with biologists.
Project-Team sequoia - Algorithms for large-scale sequence analysis for molecular biology - INRIA Activity Report
, 2007
The main goal of the SEQUOIA project-team is to define appropriate combinatorial models and efficient algorithms for large-scale sequence analysis in molecular biology. Emphasis is placed on the annotation of non-coding regions in genomes – RNA genes and regulatory sequences – via comparative genomics methods. This task involves several complementary issues such as large-scale sequence comparison, prediction, analysis and manipulation of RNA secondary structures, and identification and processing of regulatory sequences. Our aim is to tackle all those issues in an integrated fashion and to put together the developed software tools into a common platform for annotation of non-coding regions. We also explore complementary problems of protein sequence analysis. Those include new approaches to protein sequence comparison on the one hand, and a system for storing and manipulating nonribosomal peptides on the other hand. Special attention is given to the development of robust software, its validation on biological data, and its availability from the software platform of the team and by other means. Most research projects are carried out in collaboration with biologists.
Examining Board:
The primary goal of bioinformatics is to increase understanding of the biology of organisms. Computational, statistical, and mathematical theories and techniques have been developed for formal and practical problems that help to achieve this primary goal. For the past three decades, the primary application of bioinformatics has been biological data analysis. DNA or protein sequence similarity search is perhaps the most common, yet vitally important, task in analyzing biological data. Sequence similarity search is the process of finding optimal sequence alignments. On the theoretical level, the problem of sequence similarity search is complex. On the practical level, searching a biological database for similar sequences is one of the most basic tasks today. Using traditional solutions with quadratic time complexity becomes a challenge due to the size of the database. Seeding-based (or filtration-based) approaches, which trade sensitivity for speed, are a popular choice among those available. Two main phases usually exist in a seeding-based approach. The first phase is referred to as the hit generation, and the second phase is referred

