Pseudo amino acid composition
Pseudo amino acid composition, or PseAA composition, was originally introduced by Professor Kuo-Chen Chou [1] in 2001 to represent protein samples for statistical prediction. The conventional amino acid (AA) composition contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein. The PseAA composition, in contrast, contains more than 20 discrete factors: the first 20 represent the components of the conventional AA composition, while the additional factors incorporate some sequence-order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain, but they can be any combination of other factors so long as they reflect some sequence-order effect. The essence of PseAA composition is therefore that it covers the AA composition while also carrying information beyond it, and hence can better reflect the features of a protein sequence through a discrete model.
Since the concept was introduced, PseAA composition has been widely used to predict various protein attributes, such as protein subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, protein structural class, and protein secondary structural content, among many others (see the references cited in [2] and [3]). Various modes of formulating the PseAA composition have also been developed [2].
Background
Methods for predicting the subcellular localization of proteins and their other attributes have generally used two kinds of models to represent protein samples: (1) the sequential model, and (2) the non-sequential or discrete model. The most typical sequential representation of a protein sample is its entire amino acid sequence, which contains the most complete information about it. This is an obvious advantage of the sequential model. Prediction with this model is usually carried out with tools based on sequence-similarity search. However, this kind of approach fails when a query protein has no significant homology to proteins whose attributes are known. Thus, various discrete models were proposed.
The simplest discrete model uses the AA (amino acid) composition to represent protein samples, formulated as follows. Given a protein sequence P with L amino acid residues, i.e.,
$$\mbox{P} = \mbox{R}_1 \mbox{R}_2 \mbox{R}_3 \cdots \mbox{R}_L \qquad \mbox{(1)}$$
where R1 represents the 1st residue of the protein P, R2 the 2nd residue, and so forth, according to the AA composition model, the protein P of Eq.1 can be expressed by
$$\mbox{P} = \left[ f_1 \; f_2 \; \cdots \; f_{20} \right]^{\mbox{T}} \qquad \mbox{(2)}$$
where $f_u \; (u = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in P, and T the transposing operator. Owing to its simplicity, the AA composition model was widely used in many earlier statistical methods for predicting protein attributes. Its main shortcoming is that all sequence-order information is lost when a protein is represented by its AA composition alone. To avoid completely losing the sequence-order information, the concept of PseAA (pseudo amino acid) composition was proposed by Professor Kuo-Chen Chou [1]. According to the PseAA composition model, the protein P of Eq.1 can be formulated as
$$\mbox{P} = \left[ p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda} \right]^{\mbox{T}}, \quad (\lambda < L) \qquad \mbox{(3)}$$

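where the 20 + λ components are given by
$$p_u = \begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex]
\dfrac{w \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda)
\end{cases} \qquad \mbox{(4)}$$
where w is the weight factor, and τk the k-th tier correlation factor that reflects the sequence-order correlation between all the k-th most contiguous residues (Fig.1), as formulated by
$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} \mbox{J}_{i, i+k}, \quad (k < L) \qquad \mbox{(5)}$$
with
$$\mbox{J}_{i, i+k} = \frac{1}{\Gamma} \sum_{\xi=1}^{\Gamma} \left[ \Phi_{\xi}\left(\mbox{R}_{i+k}\right) - \Phi_{\xi}\left(\mbox{R}_{i}\right) \right]^2 \qquad \mbox{(6)}$$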
where $\Phi_{\xi}(\mbox{R}_i)$ is the ξ-th function of the amino acid $\mbox{R}_i$, and Γ the total number of the functions considered. For example, in the original paper by Professor Kuo-Chen Chou [1], $\Phi_1(\mbox{R}_i)$, $\Phi_2(\mbox{R}_i)$ and $\Phi_3(\mbox{R}_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $\mbox{R}_i$, while $\Phi_1(\mbox{R}_{i+k})$, $\Phi_2(\mbox{R}_{i+k})$ and $\Phi_3(\mbox{R}_{i+k})$ are the corresponding values for the amino acid $\mbox{R}_{i+k}$. Therefore, the total number of functions considered there is Γ = 3. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional AA composition of protein P, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, …, and the λ-th tier sequence-order correlation patterns (Fig.1). It is through these additional λ factors that some important sequence-order effects are incorporated.
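The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in [1] or by any published tool: the function names, the alphabet ordering, the default weight `w`, and in particular the single-property table `toy_props` are assumptions made here for demonstration. A real application would supply measured physicochemical values (for example hydrophobicity, hydrophilicity and side chain mass) for each amino acid.

```python
from collections import Counter

# Ordering of the 20 native amino acids is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Normalized occurrence frequencies f_1..f_20 (the conventional AA composition)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def coupling(r_i, r_j, props):
    """Coupling factor J: mean squared difference over the Gamma property functions."""
    a, b = props[r_i], props[r_j]
    return sum((y - x) ** 2 for x, y in zip(a, b)) / len(a)


def tau(seq, k, props):
    """k-th tier correlation factor: average coupling over all residue pairs k apart."""
    n_pairs = len(seq) - k
    return sum(coupling(seq[i], seq[i + k], props) for i in range(n_pairs)) / n_pairs


def pseaa_composition(seq, props, lam, w=0.05):
    """Return the 20 + lam PseAA components; w=0.05 is only an illustrative default."""
    if not 0 <= lam < len(seq):
        raise ValueError("lambda must satisfy 0 <= lambda < L")
    f = aa_composition(seq)
    taus = [tau(seq, k, props) for k in range(1, lam + 1)]
    denom = sum(f) + w * sum(taus)
    return [x / denom for x in f] + [w * t / denom for t in taus]


# Placeholder property table (NOT real physicochemical data): one property per
# residue, equal to its index in AMINO_ACIDS, so Gamma = 1.
toy_props = {a: [float(i)] for i, a in enumerate(AMINO_ACIDS)}

vector = pseaa_composition("ACDA", toy_props, lam=2, w=0.1)
print(len(vector))   # dimension is 20 + lambda
print(vector[20:])   # the two sequence-order components
```

Passing the property table as an argument keeps the sketch independent of any particular choice of functions, so the same code covers Γ = 1 or Γ = 3. Residues outside the 20-letter alphabet are not handled and would raise a `KeyError` in `coupling`.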
Web server
Note that λ in Eq.3 is an integer parameter and that choosing a different value for λ leads to a PseAA composition of a different dimension [2]. Also note that Eq.6 is just one of the modes for deriving the correlation factors or PseAA components. Others, such as the physicochemical distance mode [4] and the amphiphilic pattern mode [5], can also be used to derive different types of PseAA composition. In 2008 a free web server called PseAAC [6] was made available at http://chou.med.harvard.edu/bioinf/PseAAC/. With this server, users can generate the PseAA composition of any given protein sequence in the mode of their choice.
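The effect of the parameter λ on the PseAA vector can be checked with a short derivation. It assumes that the components are normalized by the sum of the 20 frequencies plus w times the sum of the λ correlation factors, as in [1], and that the sequence contains only the 20 native amino acids. The symbol S is introduced here only for brevity.

```latex
% The f_i are normalized frequencies, so \sum_{i=1}^{20} f_i = 1 and the
% common denominator of all components reduces to
\begin{align}
S &= \sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k
   = 1 + w \sum_{k=1}^{\lambda} \tau_k , \\
% giving the 20 + \lambda components
p_u &= \frac{f_u}{S} \quad (1 \le u \le 20), \qquad
p_{20+k} = \frac{w \, \tau_k}{S} \quad (1 \le k \le \lambda), \\
% which always sum to one:
\sum_{u=1}^{20+\lambda} p_u &= \frac{1 + w \sum_{k=1}^{\lambda} \tau_k}{S} = 1 . \\
% For \lambda = 0 the sum over k is empty, so S = 1 and
p_u &= f_u \quad (1 \le u \le 20) .
\end{align}
```

With λ = 0 the PseAA composition therefore reduces to the conventional 20-component AA composition, and each increment of λ adds exactly one component, up to the limit λ < L.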
Figure 1. A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where R1 represents the amino acid residue at sequence position 1, R2 that at position 2, and so forth, and the coupling factors $\mbox{J}_{i,j}$ are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues. Adapted from [1] with permission.
References
[1] Kuo-Chen Chou, Prediction of protein cellular attributes using pseudo amino acid composition, PROTEINS: Structure, Function, and Genetics 43 (2001) 246-255 (Erratum: ibid. 44 (2001) 60).
[2] Kuo-Chen Chou, Hong-Bin Shen, Review: Recent progresses in protein subcellular location prediction, Analytical Biochemistry 370 (2007) 1-16.
[3] Kuo-Chen Chou, Hong-Bin Shen, Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms, Nature Protocols 3 (2008) 153-162.
[4] Kuo-Chen Chou, Prediction of protein subcellular locations by incorporating quasi-sequence-order effect, Biochemical & Biophysical Research Communications 278 (2000) 477-483.
[5] Kuo-Chen Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10-19.
[6] Hong-Bin Shen, Kuo-Chen Chou, PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition, Analytical Biochemistry 373 (2008) 386-388.


