Probalign is a sequence alignment tool that calculates a maximum expected accuracy alignment using partition function posterior probabilities.[1] Base pair probabilities are estimated using an estimate similar to Boltzmann distribution. The partition function is calculated using a dynamic programming approach.
The following describes the algorithm used by probalign to determine the base pair probabilities.[2]
To score an alignment of two sequences two things are needed:
- a similarity function
(e.g. PAM, BLOSUM,...)
- affine gap penalty:
data:image/s3,"s3://crabby-images/281a7/281a7730954563d2923b5d5aca3a706a66b5387d" alt="{\displaystyle g(k)=\alpha +\beta k}"
The score
of an alignment a is defined as:
Now the boltzmann weighted score of an alignment a is:
Where
is a scaling factor.
The probability of an alignment assuming boltzmann distribution is given by
Where
is the partition function, i.e. the sum of the boltzmann weights of all alignments.
Dynamic programming
[edit]
Let
denote the partition function of the prefixes
and
. Three different cases are considered:
the partition function of all alignments of the two prefixes that end in a match.
the partition function of all alignments of the two prefixes that end in an insertion
.
the partition function of all alignments of the two prefixes that end in a deletion
.
Then we have:
The matrixes are initialized as follows:
data:image/s3,"s3://crabby-images/9a9b5/9a9b59cfebd2f864afa02f88067e088f43a9ef68" alt="{\displaystyle Z_{0,j}^{M}=Z_{i,0}^{M}=0}"
data:image/s3,"s3://crabby-images/773bc/773bc37a0ff075b63f355c72a0fa25e84dc0cb94" alt="{\displaystyle Z_{0,0}^{M}=1}"
data:image/s3,"s3://crabby-images/2b4f0/2b4f0a646ef8f700db4dea59591d7a9f2c93db7f" alt="{\displaystyle Z_{0,j}^{D}=0}"
data:image/s3,"s3://crabby-images/ee5f3/ee5f39ea6e8e0937a75911fdc850102ec24947b6" alt="{\displaystyle Z_{i,0}^{I}=0}"
The partition function for the alignments of two sequences
and
is given by
, which can be recursively computed:
data:image/s3,"s3://crabby-images/6664e/6664e22e993218f69654eefbce89d1dcdb086041" alt="{\displaystyle Z_{i,j}^{M}=Z_{i-1,j-1}\cdot e^{\frac {\sigma (x_{i},y_{j})}{T}}}"
data:image/s3,"s3://crabby-images/06651/0665145a55b74b6c90bec4661dccf383bf99eaa2" alt="{\displaystyle Z_{i,j}^{D}=Z_{i-1,j}^{D}\cdot e^{\frac {\beta }{T}}+Z_{i-1,j}^{M}\cdot e^{\frac {g(1)}{T}}+Z_{i-1,j}^{I}\cdot e^{\frac {g(1)}{T}}}"
analogously
Base pair probability
[edit]
Finally the probability that positions
and
form a base pair is given by:
are the respective values for the recalculated
with inversed base pair strings.