Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19, 6565--6572.
....ancestors with previously studied models. Karplus presents an information theoretic framework for measuring the effectiveness of a method. Different methods for probability estimation are compared by independently estimating probabilities of columns of multiple alignments from the BLOCKS database [20, 18]. Using the notation from [23], we denote the total count for each amino acid $i$ in a column $t$ by $F_t(i)$. If we use sequence weights such as those presented in [21], the $F_t(i)$ are not necessarily integers. The total count for each column is denoted $|F_t| = \sum_i F_t(i)$. Using an entire column, we ....
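The weighted column counts described in the excerpt above can be illustrated with a minimal sketch. The function name, the toy column, and the weights are hypothetical; the excerpt does not say how the weights of [21] are computed, so they are simply passed in here.

```python
from collections import defaultdict


def weighted_counts(column, weights=None):
    """Return F_t(i): the (possibly non-integer) count of each residue i in one column.

    `column` is the string of residues found in a single alignment column;
    `weights` holds one weight per sequence (assumed given, default 1.0 each).
    """
    if weights is None:
        weights = [1.0] * len(column)
    counts = defaultdict(float)
    for residue, weight in zip(column, weights):
        counts[residue] += weight
    return dict(counts)


# Toy column with four sequences and illustrative (made-up) weights.
column = "AAVL"
weights = [0.5, 0.25, 1.0, 0.25]
F_t = weighted_counts(column, weights)
total = sum(F_t.values())  # |F_t|, the sum over i of F_t(i)
```

With unit weights the counts are plain integers; with sequence weights they generally are not, which is the point the excerpt makes.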
....the entropy given in equation (20). By comparing encoding costs between two methods, we can compare the performance of two methods. We perform our experiments over the same data set that was used in [23]. The data consists of sets of observed counts of amino acids taken from the BLOCKS database [20, 18]. The counts are weighted using a position specific weighting scheme described in [21] with slight variations presented in [28]. The data set was split into disjoint training and test subsets. For each experiment, we compute equation (20) for a different size of the sample $k$ where $0 \le k \le 5$. For ....
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19(23):6565--6572, 1991.
....ancestors with previously studied models. Karplus presents an information theoretic framework for measuring the effectiveness of a method. Different methods for probability estimation are compared by independently estimating probabilities of columns of multiple alignments from the BLOCKS database [19, 17]. Using the notation from [22], we denote the total count for each amino acid $i$ in a column $t$ by $F_t(i)$. If we use sequence weights such as those presented in [20], the $F_t(i)$ are not necessarily integers. The total count for each column is denoted $|F_t| = \sum_i F_t(i)$. Using an entire column, we ....
....the entropy given in equation (20). By comparing encoding costs between two methods, we can compare the performance of two methods. We perform our experiments over the same data set that was used in [22]. The data consists of sets of observed counts of amino acids taken from the BLOCKS database [19, 17]. The counts are weighted using a position specific weighting scheme described in [20] with slight variations presented in [27]. The data set was split into disjoint training and test subsets. For each experiment, we compute equation (20) for a different size of the sample $k$ where $0 \le k \le 5$. For ....
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19(23):6565--6572, 1991.
....3.4 Extending to alphabet specific scores. Genome search tools for proteins (e.g. BLASTP) usually employ an alphabet specific scoring model. In this model, each character pair is assigned a score. This is represented using a score matrix. Some of the popular score matrices are PAM and BLOSUM [20, 21, 22]. The algorithm in Figure 6 can be easily extended to find an upper bound to the best score under this model. In order to do this, once the amount of increments and decrements for each character are determined in Step 2, the algorithm iteratively matches an increasing character and a decreasing ....
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucl. Acids Res., pages 6565--72, 1991.
....protein sequence PVKTNVK can be [footnote 1: The total number of possible patterns from 2-gram encoding is $n^2$, where $n$ is the number of different letters, namely 20, in the protein alphabet; footnote 2: Both PAM and BLOSUM [18] are amino acid substitution matrices; the latter is derived from the BLOCKS database [17].] represented as $e_4 e_5 e_1 e_4 e_2 e_5 e_1$. The 2-gram exchange group encoding for this sequence is: 1 for $e_4 e_5$, 2 for $e_5 e_1$, 1 for $e_1 e_4$, 1 for $e_4 e_2$, and 1 for $e_2 e_5$. For each protein sequence, we apply both the 2-gram amino acid encoding and the 2-gram exchange group ....
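The 2-gram counting in the excerpt above can be sketched as follows. The residue-to-group mapping is only the part that can be read off the PVKTNVK example (P and T to e4, V to e5, K to e1, N to e2); the full exchange-group table is not given in the excerpt, and the function names are hypothetical.

```python
from collections import Counter

# Partial mapping inferred from the PVKTNVK example only (an assumption);
# the excerpt does not list the groups for the remaining amino acids.
EXCHANGE_GROUP = {"P": "e4", "V": "e5", "K": "e1", "T": "e4", "N": "e2"}


def two_gram_counts(symbols):
    """Count every pair of adjacent symbols in a sequence."""
    return Counter(zip(symbols, symbols[1:]))


def exchange_group_encoding(sequence, mapping=EXCHANGE_GROUP):
    """Map residues to exchange groups, then count adjacent group pairs."""
    groups = [mapping[residue] for residue in sequence]
    return two_gram_counts(groups)


counts = exchange_group_encoding("PVKTNVK")
for (first, second), n in counts.items():
    print(first, second, n)
```

Calling `two_gram_counts("PVKTNVK")` directly gives the 2-gram amino acid encoding of the same sequence, so both encodings mentioned in the excerpt share one counting routine.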
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research 19, 6565--6572, 1991.
....often new methods are needed. Much of the recent interest in computational biology has focused on group analyses, such as the classification of families and super families of protein sequences and structures, and the compilation of protein family databases, such as PROSITE [Bairoch 1991], BLOCKS [Henikoff and Henikoff 1991], and HSSP [Sander and Schneider 1991]. In light of these advances in computational biology, we present in this paper an empirical analysis of amino acid substitution using a group perspective. We introduce a novel method for identifying groups of amino acids that substitute for one another with ....
....databases of protein families or multiple sequence alignments. Two of the largest and most widely used protein family databases are the BLOCKS and HSSP databases. Although these databases have distinct characteristics, they can still be viewed as collections of aligned positions. The BLOCKS database [Henikoff and Henikoff 1991] contains short, highly conserved regions of protein families, represented by ungapped multiple alignments called blocks. Blocks are generated from a set of related protein sequences. Conserved regions are then found within these sequences using a motif finding program [Smith HO, Annau, and ....
Henikoff, S. and Henikoff, J. G. 1991. Automated assembly of protein blocks for database searching. Nucl Acids Res 19:6565--6572.
....inserting, or deleting an amino acid residue. Such errors may obscure an underlying specific motif. Because of these complications in the data, researchers have developed compensatory methods for analyzing training sets. One way to handle incoherent data is to resort to probabilistic motifs [Henikoff and Henikoff 1991] or profiles [Gribskov, Luthy, and Eisenberg 1990]. Profiles have been used to generate training sets and identify motifs simultaneously [Tatusov, Altschul, and Koonin 1994]. However, probabilistic representations give poor insight into the structure or function of a protein. Moreover, probabilistic ....
Henikoff, S. and Henikoff, J. G. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Research 19(23):6565--6572.
....interactions, size constraints and the hydrophobic effect to list the most important. Sequence analysis techniques including database search (Wilbur and Lipman, 1983), sequence classification (Klein and DeLisi, 1986; Klein, Kanehisa and DeLisi, 1984) and analysis of motifs (Bairoch and Boeckmann, 1991; Henikoff and Henikoff, 1991), among others, almost always assume conditional independence of residues in a sequence for computational efficiency. In other words, the observation of an amino acid at a specific position in a sequence has no effect on any other amino acid position. Clearly this simplifying assumption is ....
Henikoff, S. and Henikoff, J. G. 1991. Automated assembly of protein blocks for database searching. Nucl.
....the motif, position by position, by giving the frequency at which each residue occurs in each of the motif positions. Profiles are useful in describing and searching for families of sequences (Schneider et al. 1986; Gribskov et al. 1987; Gribskov et al. 1990; Schneider and Stephens, 1990; Henikoff and Henikoff, 1991; Tatusov et al. 1994; Pietrokovski et al. 1996; Gribskov and Veretnik, 1996). Implementations of the Gibbs Sampler run in N-linear time. The approximate running time is TNLW, where T is the number of iterations through the sampler (typically around 100), N is the number of input sequences, ....
Henikoff, S., Henikoff, J.G., (1991) "Automated assembly of protein blocks for database searching." Nucleic Acids Research, 19(23):6565-6572.
....structure allows one to define a protein module (or shared part) in both a more precise and more general sense. It is possible (and quite productive) to define modules purely in terms of conserved blocks in sequence alignments or small, but distinctive, motifs shared by many related proteins [45 58]. However, functioning protein modules fundamentally consist of units of 3D structure. In fact, it is usually believed that these structural units form physically interacting folding domains, and attempts have been made to see how well they correspond to exon boundaries and other linear sequence ....
Henikoff, S & Henikoff, J G (1993) Automated assembly of protein blocks for database searching. Proc. Natl. Acad. Sci. 19, 6565-6572. 21
....This is evidenced by the growing efforts in recent years for building second generation (or secondary value added) databases that contain domains, motifs or patterns. Some examples include the SBASE protein domain library [Pongor et al. 1994], the BLOCKS database of aligned sequence segments [Henikoff and Henikoff, 1991], the PRINTS database of protein motif fingerprints [Attwood et al. 1994] and the ProDom protein domain database [Sonnhammer and Kahn, 1994]. While several domain motif databases are being compiled, it is important to develop database search methods that fully utilize the conserved structural and ....
....The full length sequences were directly taken from the SwissProt database. The motif sequences used to compute the n-gram weight factors were compiled by using our own string pattern matching program to search for PROSITE signatures (Table 1) and retrieve substrings in the BLOCKS format [Henikoff and Henikoff, 1991]. MOTIFIND Evaluation Mechanism. The system performance was evaluated based on speed (CPU time) and predictive accuracy. Accuracy was measured in terms of both sensitivity (ability to detect true positives) and specificity (ability to avoid false positives) at different ....
Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nuc. Acids Res., 19,6565-6572.
....al. [8] such as weighted average or logarithmic weighting. Two main applications are derived from profiles: PROFILESEARCH searches for similarities of a multialignment with a set of sequences; PROFILESCAN searches for similarities of a sequence with a set of profiles. The protein blocks method [10] descends from profiles: a block is a short ungapped profile. To a family of homologous proteins corresponds an ordered set of blocks separated by unaligned regions. From all the families of proteins, Steven and Jorja Henikoff have built a database of blocks; the order relation induces a ....
Henikoff, S., and Henikoff, J. G. Automated assembly of proteins blocks for database searching. Nucleic Acids research 19, 23 (1991), 6565--6572.
....discovery of motifs has been a challenging theme in order to expand motif libraries. Two steps are required to construct a motif library: 1) grouping sequences and 2) detecting common patterns or conserved regions in these sequence groups. There are several methods automating the second step. BLOCKS [2] is a well known example where conserved segments were automatically extracted from unaligned sequence sets. However, because these blocks were extracted only from sequence groups cataloged in PROSITE, it could not expand the PROSITE catalog. It is important to develop a procedure to group sequences. ....
S. Henikoff, J. G. Henikoff, "Automated assembly of protein blocks for database searching" Nucleic Acids Res., Vol. 19, pp. 19-23, 1991.
....groups used in the patterns. It may be natural to assign probabilities proportional to the amino acid frequencies to characters from Sigma. For the group characters, theoretically substantiated probabilities can possibly be calculated from substitution matrices such as PAM [Day78] or Blosum [HH91]. Alternatively we can use some heuristics giving preferences to the groups that we want to use. An even simpler heuristic is to assign the probability 0 to the groups we do not use, and to distribute the probabilities uniformly among all other characters. The probability of the pattern generated ....
Steven Henikoff and Jorja G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19(23):6565--6572, 1991.
....as blocks of locally aligned sequence segments. Each block can be considered as a special type of pattern for the protein family. If a query sequence belongs to a family with multiple blocks, then at least a subset of these blocks should score highly when matching the query with the blocks [30, 33]. In this chapter, we present several algorithms to discover blocks for protein families. We focus on the 768 groups of related proteins documented in the PROSITE catalog v. 12.0 [7] (which can be obtained from the ftp site ncbi.nlm.nih.gov under the directory repository prosite) keyed to the ....
....is associated with a set of blocks; each block is obtained from ungapped aligned regions extracted from the sequences in the group. A best set of blocks is then selected using a program called PROTOMAT developed by Henikoff and Henikoff of the Howard Hughes Medical Institute, Seattle, Washington [30, 33] (which can be obtained from the ftp site ncbi.nlm.nih.gov under the directory repository blocks unix protomat). All the selected blocks are then calibrated and concatenated into the BLOCKS database [30, 33] which can be obtained from the ftp site ncbi.nlm.nih.gov under the directory ....
[Article contains additional citation context not shown here]
S. Henikoff and J. G. Henikoff, "Automated assembly of protein blocks for database searching," Nucleic Acids Research, vol. 19, no. 23, pp. 6565--6572, 1991.
....databases has also begun to reduce the utility of these pairwise methods. Specifically, after adjusting for the large number of multiple comparisons, the comparison scores obtained by chance from random sequences are creeping into the range of the comparison scores for truly related sequences (Henikoff and Henikoff 1991; Claverie 1996). Two statistical models for multiple alignment have recently been developed: the block-motif model that describes conserved regions in protein or DNA sequences as ungapped blocks (Lawrence and Reilly 1990; Lawrence et al. 1993; Liu 1994; Liu et al. 1995; Neuwald, Liu and Lawrence ....
Henikoff, S., and Henikoff, J.G. (1991), "Automated Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, 19, 6565-6572.
....to the existing protein families. These attempts are extremely important in view of the large number of newly sequenced proteins which await analysis. Generally, the existing approaches can be divided into those based on short conserved motifs [e.g. Bairoch et al. 1997, Attwood et al. 1998, Henikoff and Henikoff 1991] and those which are based on whole domains [e.g. Sonnhammer and Kahn 1994, Sonnhammer et al. 1997]. The manually defined patterns in PROSITE have served as an excellent seed for several such works. The methods used to represent these motifs and domains vary, and among the most popular forms are ....
....have served as an excellent seed for several such works. The methods used to represent these motifs and domains vary, and among the most popular forms are the consensus patterns [Bairoch et al. 1997, Attwood et al. 1998], the position specific scoring matrices (profiles) [Gribskov et al. 1987, Henikoff and Henikoff 1991] and the HMMs [Krogh et al. 1996]. These forms differ in their mathematical complexity, as well as in their sensitivity/selectivity. To model a motif, a domain or a protein family, many approaches start by building a multiple alignment. The prerequisite of a valid alignment of the input sequences ....
[Article contains additional citation context not shown here]
Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19, 6565-6572.
....using trusted alignments, we have fairly high confidence that each amino acid distribution we see is for amino acids from a single biological context, and not just an artifact of a particular search or alignment algorithm. Throughout this paper, the trusted alignments used are the BLOCKS database [HH91] with the sequence weighting scheme mentioned in Section 5. 2.1 Encoding cost. The encoding cost (sometimes called conditional entropy) is a good measure of the residual variation among sequences of the multiple alignment. Since entropy is additive, the encoding cost for independent columns can ....
....can do for samples of size $k$: $$H_{\min,k} = -\frac{1}{T} \sum_{s,|s|=k} |T_s| \sum_i \frac{T_s(i)}{|T_s|} \log_2 \frac{T_s(i)}{|T_s|} = -\frac{1}{T} \sum_{s,|s|=k} \sum_i T_s(i) \log_2 \frac{T_s(i)}{|T_s|}.$$ Table 2.1 shows this lower bound on average encoding cost of the columns of the Blocks multiple alignment ([HH91]; see Section 5 for details on how the database is used in this paper) given that we have sampled $|s|$ amino acids from each column. A large encoding cost means that there is a lot of variation in which amino acids occur, while a small encoding cost means that a few amino acids have very high ....
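As a rough illustration of the lower bound in the excerpt above, the sketch below evaluates the second form of the sum for a toy input. Several things are assumptions, because the excerpt does not define them: each sample s is taken to map to a count vector T_s(i) accumulated over the columns that produced that sample, T is taken to be the sum of all |T_s|, and the function name and toy data are hypothetical.

```python
import math


def min_encoding_cost(sample_totals):
    """Evaluate -(1/T) * sum_s sum_i T_s(i) * log2(T_s(i) / |T_s|).

    `sample_totals` maps each sample s (all of one size k) to a dict
    {amino acid i: T_s(i)}. T is assumed to be the sum of all |T_s|.
    """
    total = sum(sum(t_s.values()) for t_s in sample_totals.values())
    cost = 0.0
    for t_s in sample_totals.values():
        size = sum(t_s.values())  # |T_s|
        for count in t_s.values():
            if count > 0:  # a zero count contributes nothing to the sum
                cost -= count * math.log2(count / size)
    return cost / total


# Toy data for samples of size k = 1 (made-up counts).
toy = {
    "A": {"A": 3, "V": 1},
    "L": {"L": 2, "I": 2},
}
print(min_encoding_cost(toy))
```

A sample whose columns always contain the same amino acid contributes zero cost, which matches the excerpt's remark that a small encoding cost corresponds to a few amino acids having very high probability.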
[Article contains additional citation context not shown here]
Steven Henikoff and Jorja G. Henikoff. Automated assembly of protein blocks for database searching. NAR, 19(23):6565--6572, 1991.
....3 Large scale analyses of protein sequences; 3.1 Motif and domain based analyses; 3.1.1 The PROSITE dictionary [Bairoch 1991]; 3.1.2 The BLOCKS database [Henikoff Henikoff 1991]; 3.1.3 The ProDom database [Sonnhammer Kahn 1994]; 3.1.4 The Pfam database [Sonnhammer et al. 1997]; 3.1.5 The DOMO database [Gracy Argos 1998]; 3.2 ....
....from a single matrix PAM 1, the BLOSUM series of matrices was constructed by direct observation of sequence alignments of related proteins, at different levels of sequence divergence. The matrices are based on blocks, a collection of multiple alignments of similar segments without gaps [Henikoff and Henikoff 1991], each block representing a conserved region of a protein family. These blocks provide a list of (accepted) substitutions, and a log odds scoring matrix can be defined: $s_{ab} = \log \frac{q_{ab}}{e_{ab}}$, where $q_{ab}$ is the observed relative frequency of the pair $a$, $b$, given by $q_{ab} = f_{ab} / \sum_a \sum_b f_{ab}$ ....
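The log odds definition in the excerpt above can be made concrete with a small sketch. The excerpt breaks off before defining the expected frequency e_ab; the code assumes the usual BLOSUM-style construction, in which the marginal p_a is derived from the observed pair frequencies and e_ab is p_a * p_b for identical residues and 2 * p_a * p_b otherwise. The base-2 logarithm, the function name, and the toy counts are also assumptions.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts):
    """Compute s_ab = log2(q_ab / e_ab) from unordered pair counts f_ab.

    `pair_counts` maps an unordered residue pair (a, b), stored once,
    to its observed count f_ab.
    """
    total = sum(pair_counts.values())
    q = {pair: f / total for pair, f in pair_counts.items()}

    # Marginal frequency of each residue (assumed BLOSUM-style definition):
    # a mixed pair contributes half of its frequency to each member.
    p = defaultdict(float)
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    scores = {}
    for (a, b), q_ab in q.items():
        if q_ab == 0:
            continue  # log of zero is undefined; unobserved pairs are skipped
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(q_ab / e_ab)
    return scores


# Toy pair counts (made up): identical pairs dominate, so A-S scores negative.
toy_counts = {("A", "A"): 6, ("A", "S"): 2, ("S", "S"): 2}
scores = log_odds_scores(toy_counts)
print(scores)
```

Pairs observed more often than chance predicts get positive scores and pairs observed less often get negative ones, which is the behaviour a substitution matrix is meant to capture.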
[Article contains additional citation context not shown here]
Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19, 6565-6572.
....of a profile is a position dependent scoring matrix, giving one score to each amino acid for each position in a segment to match the profile). The profile is iteratively refined by realigning the sequences to the profile, throwing away non significant matches, and recalculating the profile. In (Henikoff and Henikoff, 1991), a combined PD and SD algorithm is developed for finding frequent blocks in protein databases. The first stage simply uses the algorithm of (Smith et al. 1990), thus finding patterns and the respective blocks in a PD manner, and then extends them (see the beginning of this subsection). The ....
Henikoff, S., and Henikoff, J. G. 1991. Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19(23):6565--6572.
....to all the methods that are used to access large data sets, so that the distinctions between data and methods, interface and algorithm, data structure and object model become increasingly blurred. An example of the new class of database is the set of protein family databases (such as BLOCKS [HH91], PRINTS [AB94] and Pfam [SEB 98]), which at one level are simply a clustering of the protein databases, but actually provide considerable added value in the form of annotation, links to other databases and to literature and, crucially, algorithms to make use of the contained information ....
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19:6565--6572, 1991.
....(spacers) between the subwords. Note that in principle this method could also be used the same way for the conservation problem, i.e. if only positive examples are given. However the negative examples allow the use of shorter substrings $p_{i,j}$ as a starting point for the alignment. In Henikoff et al. [HH91] a combined BU and TD algorithm is developed for finding frequent blocks in protein databases. The first stage is simply using the algorithm of [SAC90], thus finding patterns and the respective blocks in a BU manner, and then extending them (see the beginning of this subsection). The positions of ....
....is constructed, where nodes represent patterns, and an arc extends from node $b_1$ to $b_2$ if pattern $b_1$ precedes pattern $b_2$ and does not overlap in at least the critical number of sequences. After this all paths are searched and some scores are calculated (for details of the scoring scheme see [HH91]). Wu et al. [WB95] use the TD approach to extract patterns from (presumably) correctly prealigned examples, in combination with a BU heuristic search for correctly grouping the examples in subclasses, and excluding the noise. In some sense this may be seen as an attempt to deal with the noise ....
S. Henikoff and J. G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19(23):6565--6572, 1991.
....Each of the regions of high similarity describes a motif for which a model can be constructed [footnote 1: Computer algorithms such as MEME [Bailey and Elkan, 1995], the Gibbs sampler [Lawrence et al. 1993] and Protomat [Henikoff and Henikoff, 1991] exist to assist in the automatic construction of motif models]. The multiple motifs present in a family of sequences can be viewed as a pattern that defines the family. The presence or absence of each motif in a target sequence is evidence for or against its membership in the family. Since each motif gives an independent measure of membership in the family, ....
Steven Henikoff and Jorja G. Henikoff. Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19:6565--6572, 1991.
....a larger scale, tools have been developed for comparisons that involve a small number of sequences [Gribskov et al. 1987, Taylor 1990]. Only a few computational studies considered all, or many, of the known sequences. These studies focus on (i) searching for motifs, signature sequences and domains [Henikoff and Henikoff 1991, Sheridan and Venkataraghavan 1992, Harris et al. 1992, Sonnhammer and Kahn 1994, Han and Baker 1995, Hanke et al. 1996], (ii) improving mutation matrices [Gonnet et al. 1992, Henikoff and Henikoff 1992], (iii) automatic classification of protein sequences into families [Wu et al. 1992, Ferran et al. ....
Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19, 6565-6572.
....the comparison itself, whether it is performed globally or locally. In this paper we adopt a local approach that is also called peptide matching or block searching, where a block is an array of aligned individual sequence segments that are usually, but not necessarily, all of the same length [12] [13] [20]. In this case, the definition of multiple rests on the comparison of the segments of a block. This comparison can be seen once again as an extension of a pairwise comparison and the same functions of similarity given above can be used locally to define the score of a block. In particular, ....
S. Henikoff and J.G. Henikoff. Automated assembly of protein blocks for database searching. Nucl. Acids Res., 19:6565--6572, 1991.
....number of sequences simultaneously (typically a hundred or more) using a different approach. The method adopted is that of peptide matching or block searching, where a block is defined as an array of local sequence segments, not necessarily of equal lengths, that are similar in a certain way [8] [9] [13]. This method has been used by various authors previously with different definitions of similarity and for other purposes. Schuler's program [20], Macaw, is based on a SP (Sum of Pairs) approach to the scoring of a block of segments and has to resort to heuristics to reduce its search space ....
S. Henikoff and J.G. Henikoff. Automated assembly of protein blocks for database searching. Nucl. Acids Res., 19:6565--6572, 1991.
....CIIGKGRSYKGTVSITKSGIKCQPWSS; UROK_MOUSE CYHGNGDSYRGKANTDTKGRPCLAWNA; PLMN_MOUSE CYQSDGQSYRGTSSTTITGKKCQSWAA; UROK_CHICK TNSICYSGNGEDYRGMAEDPGCLYWDH. Table 3: The conserved region in a set of kringle domain proteins which constitutes block BL00021A in Version 7.01 (1993) of the Blocks database (Henikoff and Henikoff 1991). The Swissprot identifiers are shown to the left. [...] where the gaps are not counted as letters. In an HMM the transition probabilities, that determine the gap penalties, are also estimated from the alignment. If the number of times the transition from state i to j is used in the alignment is called ....
Henikoff, S., and Henikoff, J. 1991. Automated assembly of protein blocks for database searching. Nucleic Acids Research 19(23):6565--6572.
....amino acids, the formula employed (equation 15 in Section 3.2) gives those components which are most likely to have generated the actual amino acids observed the greatest impact on the estimation. For example, in Tables 1 and 2 we give a nine component mixture estimated on the Blocks database (Henikoff and Henikoff, 1991). In this mixture, isoleucine is seen in several contexts. Component 9 gives high probability to all conserved distributions (i.e. distributions where a single residue is preferred over all others). Component 6 represents distributions preferring isoleucine and valine, but allowing leucine and ....
.... parameter estimation, multiple alignments, and database searches, see (Krogh et al. 1994). [Footnote 3: In more recent work, they have used 18 different distributions (Bowie et al. 1991).] 2 Interpreting Dirichlet Mixtures. We include in this paper a 9 component mixture estimated on the Blocks database (Henikoff and Henikoff, 1991) which has given some of the best results of any mixture estimated using the techniques described here [footnote 4]. Table 1 gives the parameters of this mixture. Since a Dirichlet mixture describes the expected distributions of amino acids in the data used to estimate the mixture, it is useful to look in ....
[Article contains additional citation context not shown here]
Henikoff, Steven and Henikoff, Jorja G. 1991. Automated assembly of protein blocks for database searching. NAR 19(23):6565--6572.
....Smith, 1994; Berger, 1985; Santner and Duffy, 1989) over amino acid distributions, and to combine this prior information with the observed amino acids to form more effective estimates of the expected distributions. Multiple alignments used in these experiments were taken from the Blocks database (Henikoff and Henikoff, 1991). We use Maximum Likelihood (Duda and Hart, 1973; Nowlan, 1990; Dempster et al. 1977) to estimate these mixtures; that is, we seek to find a mixture that maximizes the probability of the observed data. Often, these densities capture some prototypical distributions. Taken as an ensemble, they ....
....the typical distributions of amino acids in the data used to estimate the mixture, it is useful to look in some detail at each individual component of the mixture to see what distributions of amino acids it favors. We include in this paper a 9 component mixture estimated on the Blocks database (Henikoff and Henikoff, 1991), a close variant of which has been used in experiments elsewhere (Tatusov et al. 1994; Henikoff and Henikoff, 1996). A couple of comments about how we estimated this mixture density are in order. First, the decision to use nine components was somewhat arbitrary. As in any statistical model, a ....
[Article contains additional citation context not shown here]
Henikoff, Steven and Henikoff, Jorja G. 1991. Automated assembly of protein blocks for database searching. NAR 19(23):6565--6572.
No context found.
Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein blocks for database searching. Nucl. Acids Res. 19, 6565--6572.
No context found.
Henikoff, S., Henikoff, J.G. Automated assembly of protein blocks for database searching. Proc. Natl. Acad. Sci. USA 19:6565--6572, 1993.
No context found.
Steven Henikoff and Jorja G. Henikoff. Automated assembly of protein blocks for database searching. NAR, 19(23):6565--6572, 1991.
No context found.
Steven Henikoff and Jorja G. Henikoff. Automated assembly of protein blocks for database searching. NAR, 19(23):6565--6572, 1991.
CiteSeer.IST - Copyright Penn State and NEC