How to find gaps without gap penalty?

Roytberg M.A., Semionenkov M.N., Tabolina O.Yu.

Institute of the mathematical problems in biology, 142292, Pushchino, Russia. E-mail: roytberg@impb.serpukhov.su

Sequence alignment is always based (i) on the substitution weight matrix (that has some biological motivation) and (ii) on the gap penalties (which are pre-determined from experience, even in the absence of any guarantee that they must be the same for all sequences). Our approach to constructing alignments without any pre-determined gap penalties is based on the following observation.

The gaps in the "random" and in the related sequences play fundamentally different roles. If two random sequences are bing aligned, a new gap does not usually lead to a considerable gain in the number of matches. On the other hand, correct gaps in related biological sequences reconstitute matching of rather long similar segments and, therefore, the number of matches increases substantially. Thus, an alignment with all "biologically correct" gaps corresponds to the "break point" of the plot which shows the maximal total substitution weight S(g) as a function of the allowed number of gaps g = 0, 1, ....

The approach is tested on artificial (random) sequences, on the protein sequences of different level of similarity, and on comparison of protein sequences with profiles. For the tested proteins, the constructed alignments are very close to those based on their 3D-structures. The algorithm computing the S(g) values and corresponding alignments is based on the dynamic programming applied to the vector alignment weights rather than usual scalar weights (here p is a gap penalty).