Results 1 - 10 of 39
Alignment-free sequence comparison - a review
 Bioinformatics
, 2003
Cited by 102 (8 self)
Motivation: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed. Results: The overwhelming majority of work on alignment-free sequence comparison has taken place in the past two decades, with most reports published in the past 5 years. Two main categories of methods have been proposed: methods based on word (oligomer) frequency, and methods that do not require resolving the sequence with fixed word length segments. The first category is based on the statistics of word frequency, on the distances defined in a Cartesian space defined by the frequency vectors, and on the information content of frequency distribution. The second category includes the use of Kolmogorov complexity and Chaos Theory. Despite their low visibility, alignment-free metrics are in fact already widely used as pre-selection filters for alignment-based querying of large applications. Recent work is furthering their usage as a scale-independent methodology that is capable of recognizing homology when loss of contiguity is beyond the possibility of alignment. Availability: Most of the alignment-free algorithms reviewed were implemented in MATLAB code and are available
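The first category described in this abstract (distances between word-frequency vectors) can be illustrated with a minimal Python sketch. It is not the reviewed MATLAB code: the function names, the default word length `k=2`, the DNA alphabet, and the choice of plain Euclidean distance are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def word_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all overlapping k-letter words, in a fixed word order.

    Assumes seq contains only letters from `alphabet` (illustrative restriction).
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    words = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean_word_distance(seq_a, seq_b, k=2):
    """Euclidean distance between the k-word frequency vectors of two sequences."""
    fa = word_frequencies(seq_a, k)
    fb = word_frequencies(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

For example, `euclidean_word_distance("AAAACCCC", "CCCCAAAA")` stays small because swapping the two halves changes only the words that span the junction, which is the kind of shuffling an alignment would penalize heavily.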
SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation
 PLoS Comput Biol
Cited by 49 (12 self)
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
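The two score distributions named in this abstract can be written down in a few lines. The sketch below assumes the standard Gumbel survival function and a pure exponential tail; the location parameters `mu` and `tau` and the database size `n_targets` are hypothetical names that would have to be fitted or supplied, and nothing here reproduces the paper's own software.

```python
import math

LAMBDA = math.log(2)  # slope conjectured in the abstract for bit scores


def viterbi_pvalue(score, mu, lam=LAMBDA):
    """Gumbel survival function: P(S > score) = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))


def forward_tail_pvalue(score, tau, lam=LAMBDA):
    """Exponential tail approximation exp(-lam * (score - tau)), capped at 1.

    Only meaningful for high scores (score well above the assumed offset tau).
    """
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of hits at least this good among n_targets comparisons."""
    return pvalue * n_targets
```

With λ = log 2, each additional bit of score halves the tail probability, so `forward_tail_pvalue(tau + 10, tau)` is 2^-10.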
Replication stress links structural and numerical cancer chromosomal instability. Nature 494
, 2013
Cited by 12 (0 self)
Cancer chromosomal instability (CIN) results in an elevated rate of change of chromosome number and structure and generates intratumour heterogeneity [1,2]. CIN is observed in the majority of solid tumours and is associated with both poor prognosis and drug resistance [3,4]. Therefore, understanding a mechanistic basis for CIN is paramount. Here we find evidence for impaired replication fork progression and elevated DNA replication stress in CIN+ colorectal cancer (CRC) cells relative to CIN− CRC cells, with structural chromosome abnormalities precipitating chromosome missegregation in mitosis. We identify three novel CIN-suppressor genes (PIGN (MCD4), RKHD2 (MEX3C) and ZNF516 (KIAA0222)) encoded on chromosome 18q, which is
Scan statistics with weighted observations
 Journal of the American Statistical Association
, 2007
Cited by 9 (3 self)
We examine scan statistics for one-dimensional marked Poisson processes. Such statistics tabulate the maximum weighted count of event occurrences within a window of predetermined width over all windows within an observed interval. We derive analytical formulas and also give an importance sampling method for approximating the tail probabilities of scan statistics. Because high-throughput genomic sequencing has led to the availability of massive amounts of biomolecular sequence data, it is often of interest to search long DNA or protein sequences for local regions that are enriched for a certain characteristic. Thus scan statistics have become a useful tool in modern computational biology. We illustrate the application of our p-value approximations with such examples.
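The statistic itself (the maximum weighted count in a window of fixed width) can be computed with a two-pointer sweep, sketched below in Python. The function name and the `(position, weight)` input format are assumptions, the sketch is restricted to non-negative weights, and it does not attempt the analytical or importance-sampling tail approximations the abstract refers to.

```python
def weighted_scan_statistic(events, width):
    """Maximum total weight of events inside any closed window [t, t + width].

    `events` is an iterable of (position, weight) pairs with weight >= 0.
    With non-negative weights the maximum is attained by a window whose
    left edge sits on an event, so only those windows are examined.
    """
    pts = sorted(events)
    best = 0.0
    window_sum = 0.0
    right = 0
    for start, weight in pts:
        # Grow the window to take in every event at or before start + width.
        while right < len(pts) and pts[right][0] <= start + width:
            window_sum += pts[right][1]
            right += 1
        best = max(best, window_sum)
        # Drop the current leftmost event before sliding to the next start.
        window_sum -= weight
    return best
```

After sorting, the sweep is linear in the number of events, which matters when the events are marked positions along a long DNA or protein sequence.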
Local alignment of Markov chains
, 2006
Cited by 8 (1 self)
We consider local alignments without gaps of two independent Markov chains from a finite alphabet, and we derive sufficient conditions for the number of essentially different local alignments with a score exceeding a high threshold to be asymptotically Poisson distributed. From the Poisson approximation a Gumbel approximation of the maximal local alignment score is obtained. The results extend those obtained by Dembo, Karlin and Zeitouni [Ann. Probab. 22 (1994) 2022–2039] for independent sequences of i.i.d. variables.
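The quantity whose distribution is being approximated, the maximal score of a local alignment without gaps, can be computed directly by scanning every diagonal. The Python sketch below is illustrative only: the function name and the `score(x, y)` callback signature are assumptions, and it computes the score, not the Poisson or Gumbel approximation.

```python
def max_gapless_local_score(seq_a, seq_b, score):
    """Best score over all ungapped local alignments of seq_a and seq_b.

    `score(x, y)` returns the score for pairing letter x with letter y.
    Each diagonal is scanned with a running sum that is reset to zero
    whenever it would go negative.
    """
    best = 0
    n, m = len(seq_a), len(seq_b)
    for offset in range(-(m - 1), n):
        # Starting cell of this diagonal in the n-by-m comparison grid.
        i, j = (offset, 0) if offset >= 0 else (0, -offset)
        running = 0
        while i < n and j < m:
            running = max(0, running + score(seq_a[i], seq_b[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best
```

With a match/mismatch score of +1/-1, `max_gapless_local_score("GATTACA", "CATTAG", ...)` finds the shared segment `ATTA` on the main diagonal.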
Local sequence alignments statistics: Deviations from Gumbel statistics in the rare-event tail
, 2007
ESTIMATING THE GUMBEL SCALE PARAMETER FOR LOCAL ALIGNMENT OF RANDOM SEQUENCES BY IMPORTANCE SAMPLING WITH STOPPING TIMES
, 909
Cited by 6 (5 self)
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
THE MAXIMUM OF A RANDOM WALK REFLECTED AT A GENERAL BARRIER
, 2006
Cited by 3 (1 self)
We define the reflection of a random walk at a general barrier and derive, in case the increments are light tailed and have negative mean, a necessary and sufficient criterion for the global maximum of the reflected process to be finite a.s. If it is finite a.s., we show that the tail of the distribution of the global maximum decays exponentially fast and derive the precise rate of decay. Finally, we discuss an example from structural biology that motivated the interest in the reflection at a general barrier.
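One way to picture a reflected walk and its global maximum is the Lindley-type recursion W_n = max(W_{n-1} + X_n, b_n), sketched below in Python. This recursion is an assumption made for illustration and is not necessarily the definition used in the paper; the function name and the `barrier(n)` callback are likewise hypothetical.

```python
def reflected_walk_maximum(increments, barrier):
    """Follow W_n = max(W_{n-1} + X_n, barrier(n)), starting from W_0 = barrier(0).

    Returns the whole path and its maximum. The recursion is an assumed,
    illustrative form of reflection at a barrier sequence.
    """
    w = barrier(0)
    path = [w]
    for n, x in enumerate(increments, start=1):
        w = max(w + x, barrier(n))
        path.append(w)
    return path, max(path)
```

With `barrier = lambda n: 0` this reduces to the familiar walk reflected at zero, the same running sum that underlies ungapped local alignment scores.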
Analysis of gene expression in ceca of Helicobacter hepaticus-infected A/JCr mice before and after development of typhlitis. Infect Immun 71
, 2003
Cited by 3 (0 self)
The inflammatory bowel diseases, Crohn's disease and ulcerative colitis, are chronic inflammatory disorders of the gastrointestinal tract. The causes of these diseases remain unknown; however, prevailing theories suggest that chronic intestinal inflammation results from a dysregulated immune response to ubiquitous bacterial antigens. While a substantial body of data has been amassed describing the role of the adaptive immune system in perpetuating and sustaining inflammation, very little is known about the early signals, prior to the development of inflammation, that initiate and direct the abnormal immune response. To this end, we characterized the gene expression profile of A/JCr mice with Helicobacter hepaticus-induced typhlitis at month 1 of infection, prior to the onset of histologic disease, and month 3 of infection, after chronic inflammation is fully established. Analysis of the gene expression in ceca of H. hepaticus-infected mice revealed 25 upregulated and 3 downregulated genes in the month-1 postinoculation group and 31 upregulated and 2 downregulated genes in the month-3 postinoculation group. Among these was a subset of immune-related genes, including interferon-inducible protein 10, monokine induced by gamma interferon, macrophage-induced protein 1 alpha, and serum amyloid A1. Semiquantitative real-time reverse transcriptase PCR confirmed the increased expression levels of these genes, as well as elevated expression of gamma interferon. To our knowledge, this is the first report profiling cecal gene expression in H. hepaticus-infected A/JCr mice. The findings of altered gene ex
Towards a theoretical understanding of false positives in DNA motif finding
, 2010
Cited by 2 (0 self)
Detection of false-positive motifs is one of the main causes of low performance in motif finding methods. It is generally assumed that false positives are mostly due to algorithmic weakness of motif finders [1–3]. Here, however, we derive the theoretical dependence of false positives on dataset size and find that false positives can arise as a result of large dataset size, irrespective of the algorithm used. Interestingly, the false-positive strength depends more on the number of sequences in the dataset than it does on the sequence length. As expected, false positives can be reduced by decreasing the sequence length or by adding more sequences to the dataset. The dependence on number of sequences, however, diminishes and reaches a plateau after which adding more sequences to the dataset does not reduce the false-positive rate significantly. Based on the theoretical results presented here, we provide a number of intuitive rules of thumb that may be used to enhance motif-finding results in practice.