| Karlin S, Altschul S.F., "Applications and statistics for multiple high-scoring segments in molecular sequences" Proc. Natl. Acad. Sci. USA, 90:5873--5877, June 1993. |
....exceptional patterns in nucleotidic sequences, using various tools to assess the significance of such rare events. Large deviation is a mathematical area that deals with rare events; to our knowledge, it has not been used in computational biology, although the extremal statistics on alignments [6] can be viewed as large deviation results. Nevertheless, our recent results in [3] that extend preliminary results in [17] show it may be a very powerful method to assess statistical significance of very rare events. The first problem we address is the following. One considers a candidate, e.g. a ....
S. Karlin and S.F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U.S.A., 90:5873-5877, 1993.
....v is the frequency vector of s, and B is the MBR that covers the frequency vectors of the strings in These conditions must be satisfied for any valid scoring scheme. If the first condition is not met, then the best matching substring of two random strings would always tend to be the whole sequence [28]. The second condition implies that there are at least two letters with a positive score match. Let $f(\lambda) = \sum_{i,j} p_i p_j e^{\lambda s(a_i, a_j)}$. BLAST uses the unique positive solution, $\lambda$, to the equation $f(\lambda) = 1$ in the statistical computation [27]. Karlin and Altschul [27] show that the expected value for the ....
....$g(\lambda) = f(\lambda) - 1$, and use the Newton-Raphson method [20] to find the positive root of $g(\lambda) = 0$. In our experiments, this method converged in a few iterations. 5.2 GIS: Interactivity using BLAST statistics. Here, we develop our second interactive search technique based on BLAST statistics [27, 28]. We call this technique Global statistics based Interactive Search (GIS). Similar to LIS, GIS partitions the database strings into overlapping blocks of length $w + c - 1$ with an overlap of $w - 1$ letters between consecutive blocks, where w is the window length and c is the box capacity of each ....
S. Karlin and S. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci., 90:5873-5877, June 1993.
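The root-finding step quoted in the excerpt above (solve sum over i,j of p_i p_j exp(lambda * s(a_i, a_j)) = 1 for the positive lambda) can be illustrated with a minimal sketch. This is not the cited authors' implementation; the function name, the doubling step used to pick a starting point, and the uniform DNA example with +1/-1 scores are assumptions made here for illustration only.

```python
import math

def solve_lambda(probs, score, tol=1e-12, max_iter=100):
    """Find the positive root of g(lam) = sum_ij p_i p_j exp(lam*s_ij) - 1.

    probs: dict mapping letter -> background probability (hypothetical input).
    score: function (a, b) -> substitution score (hypothetical input).
    Assumes a negative expected score and at least one positive score,
    so that a unique positive root exists.
    """
    letters = list(probs)

    def g_and_derivative(lam):
        g, dg = -1.0, 0.0
        for a in letters:
            for b in letters:
                s = score(a, b)
                w = probs[a] * probs[b] * math.exp(lam * s)
                g += w
                dg += s * w
        return g, dg

    # g is convex with g(0) = 0, so any lam > 0 with g(lam) > 0 lies to the
    # right of the positive root; Newton steps from there decrease monotonically.
    lam = 1.0
    while g_and_derivative(lam)[0] <= 0.0:
        lam *= 2.0

    for _ in range(max_iter):
        g, dg = g_and_derivative(lam)
        step = g / dg
        lam -= step
        if abs(step) < tol:
            return lam
    raise RuntimeError("Newton-Raphson did not converge")

# Illustrative example: uniform DNA, match = +1, mismatch = -1.
dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
lam = solve_lambda(dna, lambda a, b: 1 if a == b else -1)
```

For this toy scoring scheme the equation reduces to 0.25 e^lambda + 0.75 e^(-lambda) = 1, whose positive solution is ln 3, which gives a quick sanity check on the sketch.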
....in BLAST [1] and FASTA [3], the two most popular algorithms. WU BLAST and NCBI BLAST are both implementations of the BLAST algorithm, differing in the way score statistics are generated, as well as some heuristics. For example, WU BLAST implements and reports Karlin and Altschul sum statistics [16, 17] by default. Alignments are generated using a scoring scheme that includes a substitution matrix and gap parameters. Substitution matrices for protein sequence alignments are 20 × 20 matrices that give scaled, log-odds scores for the pairing of any two aligned amino acid residues in an ....
S. Karlin and S. F. Altschul, "Applications and statistics for multiple high-scoring segments in molecular sequences," in Proc. Nat. Acad. Sci. USA, vol. 90, 1993, pp. 5873--5877.
....or human intervention by a skilled biologist at some point in the process. For this reason, it is common to use simpler techniques to screen large databases to identify a modest number of sequences most likely to be worthy of detailed examination. BLAST (Basic Local Alignment Search Tool) [1-3, 9, 10] is by far the most widely used application for rapid screening of large sequence databases. The inputs to BLAST are a set of input query sequences and a number of DNA or protein databases. For each input query sequence, BLAST determines a group of sequences in the databases that have high scoring ....
....similarity searching. For a basic discussion of bioinformatics and sequence similarity searching, see [4] and [7]. One of the earliest algorithms for performing sequence similarity searching using pairwise alignment was implemented in the FASTA program [11, 12]. 2.2 The BLAST Algorithm. BLAST [1-3, 9, 10] is certainly the most popular algorithm for sequence similarity searching. The approach used by the BLAST algorithm is to first identify short segments with high scoring alignments without gaps, and then to extend each such local alignment as far as possible in both directions, with or without ....
Karlin, S. & Altschul, S.F., "Applications and statistics for multiple high-scoring segments in molecular sequences." Proc. Natl. Acad. Sci. USA 90:5873-5877 (1993).
....exceptional patterns in nucleotidic sequences, using various tools to assess the significance of such rare events. Large deviation is a mathematical area that deals with rare events; to our knowledge, it has not been used in computational biology, although the extremal statistics on alignments [6] can be viewed as large deviation results. Nevertheless, our recent results in [3] that extend preliminary results in [17] show it may be a very powerful method to assess statistical significance of very rare events. The first problem we address is the following. One considers a candidate, e.g. ....
S. Karlin and S.F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U.S.A., 90:5873--5877, 1993.
....to be of evolutionary origin. Rigorous results on such background statistics are known only for the gapless alignment, whose score distribution follows the so-called Gumbel form (Gumbel, 1958), $\mathrm{pdf}(S) = \lambda KMN \exp\left(-\lambda S - KMN e^{-\lambda S}\right)$ (1), for long sequence lengths M and N (Arratia et al., 1988; Karlin & Altschul, 1990, 1993; Karlin & Dembo, 1992). Explicit formulae relating the hundreds of alignment parameters to the two Gumbel parameters $\lambda$ and K are available (Karlin & Altschul, 1990). For gapped sequence alignment with large enough gap cost, the score distribution is also empirically known to obey Gumbel ....
Karlin, S., and Altschul, S.F., 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-5877.
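The Gumbel form quoted in the excerpt above, pdf(S) = lambda K M N exp(-lambda S - K M N e^(-lambda S)), can be turned into a tail probability by integrating from a score threshold upward, which gives 1 - exp(-K M N e^(-lambda s)). The following is a rough sketch of both quantities under that form; the function names and all numeric parameter values are hypothetical and chosen only to exercise the formulas.

```python
import math

def gumbel_pdf(s, lam, K, m, n):
    """Density of the optimal gapless score under the Gumbel form above."""
    kmn = K * m * n
    return lam * kmn * math.exp(-lam * s - kmn * math.exp(-lam * s))

def gumbel_pvalue(s, lam, K, m, n):
    """Probability that the optimal score is at least s.

    Computed as 1 - exp(-K*m*n*exp(-lam*s)); expm1 keeps precision
    when the expected count in the exponent is very small.
    """
    expected = K * m * n * math.exp(-lam * s)
    return -math.expm1(-expected)

# Hypothetical parameters: lambda = 0.5, K = 0.02, sequences of length 100.
p_low = gumbel_pvalue(20.0, 0.5, 0.02, 100, 100)
p_high = gumbel_pvalue(40.0, 0.5, 0.02, 100, 100)
```

Higher scores yield smaller tail probabilities, and the density peaks at s = ln(K M N) / lambda, where it equals lambda / e.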
....of unusual subsequences is an important task, since such features may be biologically significant. A common approach is to assign a score to each residue, and then look for contiguous subsequences with high total scores. This natural approach was analyzed by Altschul and Erickson (1986a; 1986b), Karlin and Altschul (1990; 1993), (Dembo & Karlin 1991; Karlin & Dembo 1992) and (Karlin, Dembo, & Kawabata 1990) and applied to a variety of protein analyses such as the identification of transmembrane regions, DNA binding domains, and regions of high charge (Brendel et al. 1992; Karlin & Brendel 1992; Karlin et al. 1991) ....
....form the optimal scores, assuming that the target and background frequencies are accurate. Returning to the human β2-adrenergic receptor, Karlin and Brendel observed that the highest scoring subsequences were similar to the ones obtained with the hydropathy scores, but were more pronounced. Karlin and Altschul (1993) applied the same scoring function to identify transmembrane domains in the Drosophila virilis sevenless protein, and in the human serotonin receptor. The authors' emphasis in that paper was on finding multiple disjoint high-scoring subsequences corresponding, for instance, to multiple ....
Karlin, S., and Altschul, S. F. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences USA 90:5873--5877.
....sums of random variables. In general, P values give similar results to more conventional scores, such as percent identity, but they have been shown to be better calibrated and more sensitive for marginal similarities, taking into account compositional biases of the databank and the query sequence [94, 132, 133]. In particular, Brenner et al. tested the applicability of probabilistic scores to the detection of structural relationships [67, 139, 140]. They found that the FASTA e-value closely tracked the error rate against a test set of known structural relationships. That is, with regard to the number of ....
Karlin, S & Altschul, S F (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences of the United States of America 90, 5873-5877.
....genes and on the reconstruction of phylogenic trees (Doolittle, 1996). There are two types of sequence alignment algorithms used. Gapless alignment such as BLAST (Altschul et al., 1990) or FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988) is well understood (Arratia et al., 1988; Karlin and Altschul, 1990, 1993; Dembo and Karlin, 1991) and is extensively used as a first cut in large scale database searches. However, due to the occurrence of insertions and deletions in the evolution of biological sequences, weak sequence homologies can only be detected by alignment algorithms which allow for gaps, e.g. ....
Karlin, S., and Altschul, S.F. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90, 5873--5877.
....negative containing degenerated motif not detectable by signature), P (false negative lacking motif region, mostly fragmentary) and F (false positive containing signature). The BLAST search was performed using the improved version (version 1.4, October 1994) that adopted Sum statistics [Karlin & Altschul, 1993]. The program was obtained from the NCBI FTP server (ncbi.nlm.nih.gov) and implemented on our DEC Alpha workstation running on OSF/1 operating system. The same training set (containing both positive and negative sequences) and prediction set used in MOTIFIND (Table 1) were used as BLAST database ....
Karlin, S. & Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90, 5873-5877.
....implemented. The problem can be stated in the following simple statistical terms: given a local similarity between two sequences, each with a given per-residue frequency, evaluate the probability that this similarity belongs to the set of by-chance similarities. Analytical expressions can be found in [2, 6, 7, 15], but these approaches, based on the best pairwise sequence alignment (the fragment with the greatest score), become inconsistent when applied to all fragments. To assess the statistical significance of a fragment we compute the distribution of random similarities in the following way: two random sequences ....
Karlin S. and Altschul, S.F. (1993), "Applications and statistics for multiple high-scoring segments in molecular sequences", Proc. Natl. Acad. Sci. USA, 90, 5873-5877
....the probability for obtaining a relatively large score just by chance increases dramatically. In order to be able to quote a p-value the distribution of optimal alignment scores for alignments of random sequences has to be known. In the case of gapless alignment it has been worked out rigorously [17, 18, 19] that this distribution is a Gumbel or extreme value distribution [14]. It is characterized by two parameters which depend on the scoring system used and on the amino acid frequencies with which the random sequences are generated. For gapless alignment also this dependence of the two Gumbel ....
....it is necessary to know the distribution of $\Sigma$ for the gapless alignment of two random sequences, whose elements $a_k$'s are generated independently from the same frequencies $p_a$ as the query sequences, and scored with the same matrix $s_{a,b}$. This distribution of $\Sigma$ has been worked out rigorously [18, 19]. For suitable scoring parameters, it is a Gumbel or extreme value distribution given by $\Pr\{\Sigma < S\} = \exp(-\kappa e^{-\lambda S})$ (4). This distribution is characterized by the two parameters $\lambda$ and $\kappa$, with $\lambda$ giving the tail of the distribution and $\lambda^{-1} \log \kappa$ describing the mean. For gapless alignment, these ....
Karlin, S., and Altschul, S.F. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U.S.A. 90, 5873-5877.
....a search over the ever expanding sequence databases. It is therefore imperative to understand quantitatively the statistics of these rare, high-scoring events, in order to estimate the statistical significance of a high scoring alignment. In the case of gapless alignment, it is known rigorously (Karlin and Altschul 1990, 1993; Karlin and Dembo 1992) that the distribution of alignment scores of random sequences is the Gumbel or extreme value distribution (Gumbel 1958), which has a much broader (i.e. exponential) tail than that of the Gaussian distribution. The Gumbel distribution is specified completely by two ....
Karlin, S., and Altschul, S.F. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873--5877.
....of all locally optimal (non-overlapping) segment pairs. These results are reported in a format similar to that generated by the BLAST family of programs. The overall significance of the similarity between query and database sequences is estimated by the sum statistics of Karlin and Altschul (Karlin and Altschul, 1993). Singh, et al. BioSCAN: A Network Sharable Computational Resource. 3 Implementation. The BioSCAN server can be accessed by either electronic mail (e-mail) or the World Wide Web (WWW). The queries submitted via e-mail are processed on a first-come, first-served basis. This method is most suited ....
.... information on sequence name, description, the best sum statistic (P(n)) and the number of alignments (n). The best sum statistic, P(n), estimates the probability of a set of n locally best alignments for two sequences occurring by chance, given the size and composition of the search space (Karlin and Altschul, 1993). The smaller the value of P(n), the more significant is the alignment. Note that the value of P(n) is dependent on the search space and a comparison of best sum statistic scores generated by searching different databases is not ....
Karlin, S. and Altschul, S. F. (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences, 90:5873--5877.
....a score of at least x, would appear by chance in one pairwise comparison. This approach has the disadvantage of being dependent on the lowest score among the k highest scores. Another alternative is to calculate the sum $S_k$ of the highest k scores. The distribution of such sums has been derived [Karlin & Altschul 1993] and the probability of a given sum is calculated (numerically) by a double integral on the tail of the distribution. In either case, the HSPs should first satisfy a consistency test before the joint assessment is made. Local alignment with gaps. Though local alignments without gaps may detect ....
Karlin, S. & Altschul, S. F. (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA 90, 5873-5877.
....sequences to this set of known energy-related sequences from other species. Those that exhibit strong homology would be the energy-related Arabidopsis sequences. In order to carry out this procedure smoothly, one has to integrate Entrez with a reliable homology search system such as WU BLAST 2.0 [1, 17]. The procedure as described above can identify energy-related Arabidopsis sequences. However, we are still a little far from being able to extract their Kozak sequences. To do that we need to first locate the start codon of each sequence. To locate the start codon we need to inspect the so-called ....
....c.#p, 1) A then modify #a of c to 1 c.#a else if string substring (s, c.#p, 1) C then modify #c of c to 1 c.#c else if string substring (s, c.#p, 1) G then modify #g of c to 1 c.#g else modify #t of c to 1 c. #t c S ] #p: p, #a: 0, #c: 0, #g: 0, #t: 0) p [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27 ]] nr in [ #position: p, #consensus: mk consensus (C) #a pc: a pc, #c pc: c pc, #g pc: g pc, #t pc: t pc) q hist, a pc = mk pc (q.#a) c pc = mk pc (q.#c) g pc = mk pc (q.#g) t pc = mk pc (q.#t) C = #l: a , #c: a pc) #l: c , #c: c pc) #l: g , #c: g pc) #l: ....
S. Karlin and S. F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90:5873--5877, June 1993.
.... these algorithms are available, two of the most widely used being SSEARCH [PL88] and BLAST [AG96]. A considerable amount of research has been directed towards the problem of assigning statistical significance to the scores obtained by these methods, most notably by Karlin and Altschul [KA90, KA93, AG96]. In parallel with pairwise alignment, many algorithms for simultaneous or progressive alignment of multiple sequences were also developed [SK83, WP84, CL88, FD87, Tay87, BS87, HS89]. Without going into excessive detail, these algorithms attempt to find shortcuts to calculating the full ....
S. Karlin and S. F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences of the USA, 90:5873--5877, 1993.
....and end points of each locally optimal (non-overlapping) segment pair. These results are reported in a format similar to that generated by the BLAST program. The overall significance of the similarity between query and entry sequences is estimated by the sum statistics of Karlin and Altschul (Karlin and Altschul, 1993). 4 ALGORITHM. Algorithms for detecting and measuring similarities between biosequences have continuously evolved over the last three decades. One of the earliest exhaustive sequence comparison methods is that of Needleman and Wunsch (Needleman and Wunsch, 1970). Subsequently, many algorithmic ....
Karlin, S. and Altschul, S. F. (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences, 90:5873--5877.
No context found.
Karlin S, Altschul S.F., "Applications and statistics for multiple high-scoring segments in molecular sequences" Proc. Natl. Acad. Sci. USA, 90:5873--5877, June 1993.
No context found.
S. Karlin and S. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA, 90(12):5873--7, June 15 1993.
No context found.
S. Karlin and S. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci USA, 90(12):5873--7, June 15 1993.
No context found.
S. Karlin and S. F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90:5873-5877, 1993.
No context found.
KARLIN, S. and S. F. ALTSCHUL. 1993. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90: 5873--5877.
No context found.
Karlin, S., Altschul, S.F. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873--5877, 1993.