BLOSUM

From Wikipedia, the free encyclopedia

The BLOSUM62 matrix

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix^[1]) is a substitution matrix used for sequence alignment of proteins. BLOSUM are used to score alignments between evolutionarily divergent protein sequences, and they are based on local alignments. BLOSUM was first introduced in a paper by Henikoff and Henikoff.^[2] The authors scanned the BLOCKS database for highly conserved regions of protein families (regions with no gaps in the sequence alignment), then counted the relative frequencies of amino acids and their substitution probabilities. From these counts they calculated a log-odds score for each of the 210 possible pairings of the 20 standard amino acids (190 substitutions plus 20 identities). All BLOSUM are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
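The counting step can be illustrated with a minimal Python sketch. It tallies unordered residue pairs column by column in a tiny ungapped block and turns the counts into relative frequencies. The block contents and the function name `count_pairs` are hypothetical, chosen only for illustration, and the sketch leaves out any weighting or clustering of sequences.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def count_pairs(block):
    """Count unordered residue pairs in each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):  # iterate over alignment columns
        for a, b in combinations(column, 2):  # every pair of rows in the column
            counts[tuple(sorted((a, b)))] += 1  # (A, S) and (S, A) are the same pair
    return counts


# Hypothetical toy block: three aligned sequences, four columns, no gaps.
toy_block = ["ACDA", "ACEA", "SCDA"]

pair_counts = count_pairs(toy_block)
total = sum(pair_counts.values())
pair_freqs = {pair: n / total for pair, n in pair_counts.items()}

# Distinct unordered pairs of 20 residues, identities included: 20 * 21 / 2.
n_pairs = len(AMINO_ACIDS) * (len(AMINO_ACIDS) + 1) // 2
```

Here `n_pairs` evaluates to 210, matching the number of scores in the matrix, and the values in `pair_freqs` sum to one.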

Several sets of BLOSUM exist, built from different alignment databases and distinguished by number. BLOSUM with high numbers are designed for comparing closely related sequences, while BLOSUM with low numbers are designed for comparing distantly related sequences. For example, BLOSUM80 is used for less divergent alignments, and BLOSUM45 is used for more divergent alignments. Scores within a BLOSUM are log-odds scores: each is the logarithm of the ratio between the likelihood that two amino acids are aligned for a biological reason and the likelihood that they are aligned by chance.^[3] The number in each name reflects the percentage identity threshold applied to the aligned protein sequences used in calculating the matrix.^[3] Every possible identity or substitution is assigned a score based on its observed frequency in the alignment of related proteins.^[4] More likely substitutions receive a positive score, and less likely substitutions receive a negative score.
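As a rough illustration of how such scores are used, the Python sketch below sums the per-column scores of a short ungapped alignment. The function name `alignment_score` and the handful of scores in `toy_matrix` are assumptions made for the example. They do not form a complete or authoritative matrix, and gap penalties are ignored.

```python
def alignment_score(seq_a, seq_b, matrix):
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is symmetric, so look the pair up in either order.
        key = (a, b) if (a, b) in matrix else (b, a)
        total += matrix[key]
    return total


# Illustrative scores only: identities and a likely substitution are positive,
# an unlikely substitution is negative.
toy_matrix = {
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("A", "W"): -3,
}

score = alignment_score("ALW", "AIW", toy_matrix)
```

Because log-odds scores are additive, the alignment score is the logarithm of the product of the per-column odds ratios.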

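BLOSUM62 is calculated from blocks in which sequences sharing at least 62% identity are clustered and weighted as a single sequence, so the observed substitutions come from comparisons between sequences that are less than 62% identical. It has become a standard for alignment software.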

BLOSUM has proved better at scoring distantly related sequences than the once widely used Point Accepted Mutation (PAM) matrices. Each entry of a BLOSUM is calculated with the following equation: $S_{ij} = \frac{1}{\lambda} \log\left( \frac{p_{ij}}{q_i q_j} \right)$

Here, $p_{ij}$ is the probability of two amino acids $i$ and $j$ replacing each other in a homologous sequence, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence at random. The factor $\lambda$ is a scaling factor,^[5] chosen so that the matrix contains easily readable integer values.
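The score formula can be tried out with a short Python sketch. The probabilities below are hypothetical, the function name `log_odds_score` is made up for the example, and the scaling $\lambda = \ln 2 / 2$ (scores in half-bit units) is an assumption of this sketch rather than something stated above.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam):
    """Return the rounded score S_ij = (1 / lam) * ln(p_ij / (q_i * q_j))."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Assumed scaling: half-bit units, equivalent to S = 2 * log2(odds ratio).
HALF_BIT_LAMBDA = math.log(2) / 2

# Hypothetical probabilities, for illustration only.
likely = log_odds_score(0.04, 0.1, 0.1, HALF_BIT_LAMBDA)      # odds ratio 4
neutral = log_odds_score(0.01, 0.1, 0.1, HALF_BIT_LAMBDA)     # odds ratio 1
unlikely = log_odds_score(0.0025, 0.1, 0.1, HALF_BIT_LAMBDA)  # odds ratio 0.25
```

A pair observed more often than chance predicts gets a positive score, a pair observed exactly as often as chance predicts gets zero, and a pair observed less often gets a negative score.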

References

^ Note that in the acronym BLOSUM the last 'M' stands for 'matrix'; it is therefore redundant to write 'BLOSUM matrix' (see RAS syndrome).
^ Henikoff, S.; Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". PNAS 89 (22): 10915–10919. doi:10.1073/pnas.89.22.10915. PMID 1438297.
^ ^a ^b Albert Y. Zomaya (2006). Handbook of Nature-Inspired And Innovative Computing. ISBN 0387405321. page 673
^ NIH "Scoring Systems"
^ The scaling factor $\lambda$ is defined (see above) by $\lambda S_{ij} = \ln\left( \frac{p_{ij}}{q_i q_j} \right)$ and is calculated by solving the equation $\sum_{i=1}^{n} \sum_{j=1}^{i} q_i q_j e^{\lambda S_{ij}} = 1$ for $\lambda$.
This holds because, by definition, $\sum_{i=1}^{n} \sum_{j=1}^{i} q_i q_j e^{\lambda S_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{i} p_{ij}$, and the probabilities sum to one: $\sum_{i=1}^{n} \sum_{j=1}^{i} p_{ij} = 1$.