Results 1-10 of 15
Gene Structure Prediction by Linguistic Methods
Genomics, 1994
Cited by 86 (2 self)
The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser for eukaryotic protein-encoding genes, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries. Parameters on the grammar rules are optimized for several different species, and mixing experiments performed to determine the degree of species specificity and the relative importance of compositional, signal-based, and syntactic components in gene prediction. Introduction Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars [10]. Grammars are an exceptionally well-studied methodology, fami...
String Variable Grammar: A Logic Grammar Formalism For The Biological Language Of DNA
1993
Cited by 51 (2 self)
this paper, we present a generalized form of SVG, which supports additional biologically relevant operations by going beyond homomorphisms, instead uniformly applying substitutions in either a forward or reverse direction (see Definition 2.1) to bindings of logic variables. We give a constructive proof of our conjecture [26] that the languages describable by SVG are contained in the indexed languages, and furthermore show that the containment is proper, thus refining the position of an important class of biological sequences in the hierarchy of languages. We also describe a simple grammar translator, give a number of examples of mathematical and biological languages, discuss the distinctions between SVG, DG, TAG, and RPDAs, and suggest extensions well-suited to the overlapping languages of genes. Finally, we describe a large-scale implementation of a domain-specific parser called GenLang which incorporates a practical version of these ideas, and which has been successful in parsing several types of genes from DNA sequence data [9, 30], in a form of pattern-matching search termed syntactic pattern recognition [10].
Algorithms and Complexity for Annotated Sequence Analysis
1999
Cited by 48 (1 self)
Molecular biologists use algorithms that compare and otherwise analyze sequences that represent genetic and protein molecules. Most of these algorithms, however, operate on the basic sequence and do not incorporate the additional information that is often known about the molecule and its pieces. This research describes schemes to combinatorially annotate this information onto sequences so that it can be analyzed in tandem with the sequence; the overall result would thus reflect both types of information about the sequence. These annotation schemes include adding colours and arcs to the sequence. Colouring a sequence would produce a same-length sequence of colours or other symbols that highlight or label parts of the sequence. Arcs can be used to link sequence symbols (or coloured substrings) to indicate molecular bonds or other relationships. Adding these annotations to sequence analysis problems such as sequence alignment or finding the longest common subsequence can make the problem more complex, often depending on the complexity of the annotation scheme. This research examines the different annotation schemes and the corresponding problems of verifying annotations, creating annotations, and finding the longest common subsequence of pairs of sequences with annotations. This work involves both the conventional complexity framework and parameterized complexity, and includes algorithms and hardness results for both frameworks. Automata and transducers are created for some annotation verification and creation problems. Different restrictions on layered substring and arc annotation are considered to determine what properties an annotation scheme must have to make its incorporation feasible. Extensions to the algorithms that use weighting schemes are explored. Examin...
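For orientation, the unannotated baseline that the abstract above builds on, the longest common subsequence of two plain sequences, can be sketched with the textbook dynamic program. This is only an illustrative sketch: the function name `lcs` is an assumption of this example, and none of the colour or arc annotation schemes studied in the work are modelled here.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (plain, unannotated)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The table takes time and space proportional to the product of the two lengths. According to the abstract, adding annotations to this problem can make it more complex, depending on the annotation scheme.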
RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search
Pacific Symposium on Biocomputing, 1995
Cited by 35 (2 self)
A model based on intersections of stochastic context free grammars is presented to allow for the modeling of RNA pseudoknot structures. The model runs relatively fast, having the same order running time as stochastic context free grammar parsers. The model is shown to be able to perform database searches and find RNA sequences which resemble RNA pseudoknots which bind biotin. The problem domain of RNA biotin binders has significance in the support of the RNA world model of early life on earth.
Recent Methods for RNA Modeling Using Stochastic Context-Free Grammars
1994
Cited by 26 (1 self)
Stochastic context-free grammars (SCFGs) can be applied to the problems of folding, aligning and modeling families of homologous RNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. This paper discusses our new algorithm, Tree-Grammar EM, for deducing SCFG parameters automatically from unaligned, unfolded training sequences. Tree-Grammar EM, a generalization of the HMM forward-backward algorithm, is based on tree grammars and is faster than the previously proposed inside-outside SCFG training algorithm. Independently, Sean Eddy and Richard Durbin have introduced a trainable "covariance model" (CM) to perform similar tasks. We compare and contrast our methods with theirs.
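To make the idea of an SCFG over RNA concrete, the following minimal sketch computes the inside probability of a sequence, i.e. the total probability of all derivations of that sequence, under a toy one-nonterminal grammar. Everything here is an assumption made for illustration: the grammar (S -> x S y for complementary pairs, S -> x S for an unpaired base, S -> empty), the probabilities, and the names `inside_probability`, `P_PAIR`, `P_SINGLE` and `P_END`. It is not the grammar from the paper and it does not implement Tree-Grammar EM.

```python
from functools import lru_cache

# Hypothetical toy SCFG with a single nonterminal S:
#   S -> x S y   for each complementary pair (x, y), probability P_PAIR each
#   S -> x S     for each base x (left-unpaired),    probability P_SINGLE each
#   S -> (empty)                                     probability P_END
# The made-up probabilities sum to 1: 4 * 0.15 + 4 * 0.075 + 0.10 = 1.0
PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}
P_PAIR = 0.15
P_SINGLE = 0.075
P_END = 0.10


def inside_probability(seq: str) -> float:
    """Sum the probabilities of all derivations of seq from S."""
    seq = seq.lower()
    if any(base not in "acgu" for base in seq):
        raise ValueError("sequence must contain only a, c, g, u")

    @lru_cache(maxsize=None)
    def inside(i: int, j: int) -> float:
        # Probability that S derives the substring seq[i:j].
        if i == j:
            return P_END
        # Leftmost base emitted unpaired: S -> x S.
        total = P_SINGLE * inside(i + 1, j)
        # Outermost bases emitted as a base pair: S -> x S y.
        if j - i >= 2 and (seq[i], seq[j - 1]) in PAIRS:
            total += P_PAIR * inside(i + 1, j - 1)
        return total

    return inside(0, len(seq))
```

For example, `inside_probability("au")` adds the paired derivation (0.15 × 0.10) to the unpaired one (0.075 × 0.075 × 0.10). The recursion is kept simple for readability, so very long sequences would need an iterative table instead.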
Formal Language Theory and Biological Macromolecules
Series in Discrete Mathematics and Theoretical Computer Science, 1999
Cited by 15 (0 self)
Biological macromolecules can be viewed, at one level, as strings of symbols. Collections of such molecules can thus be considered to be sets of strings, i.e. formal languages. This article reviews language-theoretic approaches to describing intramolecular and intermolecular structural interactions within these molecules, and evolutionary relationships between them. 1 Introduction The author has for some time been investigating the application of formal language theory to biological macromolecules, primarily nucleic acids because of the relative simplicity of the biochemical structures and interactions. After introducing the very simple mathematical foundations for these investigations, this article will review three major lines of research. These can largely be found in more fully developed form in referenced publications, though some new material is also included in each case. The sections below will deal with the use of formal grammars to describe intramolecular interactions [17, 18...
The Application of Stochastic Context-Free Grammars to Folding, Aligning and Modeling Homologous RNA Sequences
1993
Cited by 15 (2 self)
Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of homologous RNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. The novel aspect of this work is that SCFG parameters are learned automatically from unaligned, unfolded training sequences. A generalization of the HMM forward-backward algorithm is introduced to do this. The new algorithm, Tree-Grammar EM, based on tree grammars and faster than the previously proposed SCFG inside-outside training algorithm, produced a model that we tested on the transfer RNA (tRNA) family. Results show that after having been trained on as few as 20 tRNA sequences from only two tRNA subfamilies (mitochondrial and cytoplasmic), the model can discern general tRNA from similar-length RNA sequences of other kinds, can find secondary structure of new tRNA sequences, and c...
A Constraint Based Structure Description Language for Biosequences
1997
Cited by 10 (2 self)
We report an investigation into how constraint solving techniques can be used to search for patterns in sequences (or strings) of symbols over a finite alphabet. We define a constraint-based structure description language for biosequences, and give the definition of an algorithm to solve the structure searching problem as a CSP. The methodology which we have developed is able to describe two-dimensional structure of biosequences, such as tandem repeats, stem loops, palindromes and pseudoknots. We also report on an implementation of the language in the constraint logic programming language clp(FD), with test results of a simple searching algorithm, and results from a preliminary implementation in C++ using consistency checking techniques from solving CSP. Keywords: constraints, biostructures, description language, searching. 1 Introduction The aim of the work described in this paper is to use constraint solving techniques to search for structural patterns in sequences (or st...
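As a rough illustration of what a stem-loop search has to satisfy, the sketch below enumerates candidate positions in a DNA string and keeps those where a left stem is followed, after a loop of bounded length, by its reverse complement. It is a plain brute-force enumeration written for this listing, not the constraint-based description language or the clp(FD) solver from the paper. The function names and parameters (`find_stem_loops`, `stem_len`, `min_loop`, `max_loop`) are assumptions of this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(segment: str) -> str:
    """Return the reverse complement of a DNA segment over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def find_stem_loops(seq: str, stem_len: int, min_loop: int, max_loop: int):
    """Return (start, loop_len) pairs where an exact stem loop begins.

    The constraints checked are: the left stem has length stem_len, the loop
    length lies in [min_loop, max_loop], and the right stem equals the
    reverse complement of the left stem.
    """
    hits = []
    n = len(seq)
    for start in range(n - 2 * stem_len - min_loop + 1):
        left = seq[start:start + stem_len]
        target = reverse_complement(left)
        for loop_len in range(min_loop, max_loop + 1):
            right_start = start + stem_len + loop_len
            right_end = right_start + stem_len
            if right_end > n:
                break  # longer loops would run past the end of the sequence
            if seq[right_start:right_end] == target:
                hits.append((start, loop_len))
    return hits
```

For instance, in `"AAGGACTTTGTCCAA"` the stem `GGAC` at position 2 is followed by the loop `TTT` and the reverse complement `GTCC`, so a search with `stem_len=4` and loops of 3 to 5 bases reports `(2, 3)`. A constraint solver would instead prune the start and loop-length domains before enumerating them.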
A genetic algorithm to obtain the optimal recurrent neural network
International Journal of Approximate Reasoning, 2000
Cited by 9 (4 self)
Selecting the optimal topology of a neural network for a particular application is a difficult task. In the case of recurrent neural networks, most methods only induce topologies in which their neurons are fully connected. In this paper, we present a genetic algorithm capable of obtaining not only the optimal topology of a recurrent neural network but also the least number of connections necessary. Finally, this genetic algorithm is applied to a problem of grammatical inference using neural networks, with
StructWeb: Biosequence structure searching on the Web using clp(FD)
1997
Cited by 4 (3 self)
We describe an implementation in a finite domain constraint logic programming language of a web-based biosequence structure searching program. We have used the clp(FD) language for the implementation of our search engine and have ported the PiLLoW libraries to clp(FD). Our program is based on CBSDL, a constraint based structure description language for biosequences, and uses constrained descriptions to search for the secondary structure of biosequences, such as tandem repeats, stem loops, palindromes and pseudoknots. Keywords: constraints, biostructures, description language, searching, WWW interface. 1 Introduction In this paper we report an implementation of a web-based biosequence structure searching program. This implementation has been constructed using the finite domain constraint logic programming language clp(FD) [9] together with our own port of the PiLLoW library [7] to clp(FD). Our search engine is based on a CBSDL, a constraint based structure description language for bio...