This overview also explains how to interpret a Dotplot and how to create global and local alignments.
ABSTRACT
Motivation:
Why is it so important for bioinformatics to get alignments? Where are the problems and how can they be solved?
Results:
This manuscript gives a short overview of some of the methods used to analyze sequences, as well as of the Needleman-Wunsch and the Smith-Waterman algorithms.
1 Introduction
“Sequence Alignment is the comparison of two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequence.” [1] Such a comparison can reveal functional, structural and evolutionary characteristics of the sequences. So the question arises: how many modifications are necessary to change one string into another using insertion, deletion or substitution? This number is called the edit distance; if only substitutions are allowed, it is the Hamming distance. The Hamming distance is defined for strings of the same length, while the edit distance also works for strings of different lengths. For example, the Hamming distance of BIOLOGY and ORLOGIC is seven: seven substitutions, as no position holds the same character in both words. An example for the edit distance is BIOLOGIES and ORLOGICS. These two words have different lengths, so gaps can be inserted. The result is BIOLOGIES aligned with -ORLOGICS, which needs four operations: one gap and three substitutions.
The given examples are very small, and this is where the problem lies. For very long strings, such as the sequences handled in bioinformatics, it takes a computer too long to compare every possible alignment. There are n^m possible matches, where n is the number of characters in the first sequence and m is the number of characters in the second. For the two seven-letter words above this is already 7^7 = 823,543 possible matches. For such cases there are algorithms that speed up the calculation. The Needleman-Wunsch and the Smith-Waterman algorithms are fast, but there are faster algorithms such as the Gotoh algorithm.
A further problem is that such a score is meaningful for plain strings but not necessarily for biology. It is important to know that the best-scoring match does not always represent the match expected in nature. For this reason bioinformatics uses substitution matrices such as PAM or BLOSUM, which represent the probability of one amino acid being exchanged for another. Parameters such as size, charge or polarity can be important for the exchange of amino acids in evolution.
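The two distance measures can be sketched in a few lines of Python. This is a minimal illustration under the standard definitions (Hamming distance: equal lengths, substitutions only; edit distance in the Levenshtein sense: insertions, deletions and substitutions, each with cost 1). The function names `hamming_distance` and `edit_distance` are chosen for this example and do not come from the manuscript.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; every insertion, deletion or substitution costs 1."""
    # previous[j] is the distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # delete x
                current[j - 1] + 1,           # insert y
                previous[j - 1] + (x != y),   # match (0) or substitution (1)
            ))
        previous = current
    return previous[-1]


print(hamming_distance("BIOLOGY", "ORLOGIC"))   # 7
print(edit_distance("BIOLOGIES", "ORLOGICS"))   # 4
```

The edit distance is filled in row by row, so only two rows of the table are kept in memory; the running time is still proportional to the product of the two string lengths.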
This is not the only way to analyze two sequences. Another way is to create a Dotplot, which shows at a glance how two sequences are structured and where they correspond.
2 Dotplot
Setting up a Dotplot is easy. First a table is created. The first sequence is written along the horizontal side and the second sequence along the vertical side. Afterwards every cell in which the two characters are identical is marked with a dot.
[Figure is omitted from this preview]
Figure 1: Dotplot [2]
The first sequence to be analyzed runs along the top and the second sequence down the left column. The dots in between mark every pair of identical characters.
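The construction described above can be sketched as follows. This is only an illustration: the names `dotplot` and `render` and the two example sequences are assumptions made for this sketch, not taken from the cited tutorial.

```python
def dotplot(seq_x: str, seq_y: str) -> list[list[bool]]:
    """Return a matrix with one row per character of seq_y and one column
    per character of seq_x; a cell is True where the characters are equal."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Draw the Dotplot as text: '*' marks a dot, '.' an empty cell."""
    lines = ["  " + " ".join(seq_x)]                 # first sequence on top
    for y, row in zip(seq_y, dotplot(seq_x, seq_y)):
        cells = " ".join("*" if dot else "." for dot in row)
        lines.append(y + " " + cells)                # second sequence on the left
    return "\n".join(lines)


# Hypothetical example sequences that differ in a single position
print(render("GATTACA", "GATCACA"))
```

The unbroken stretches on the main diagonal correspond to the regions in which the two sequences agree; the scattered dots elsewhere are the background noise that filters are meant to remove.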
After every dot is placed, filters can be applied. For example, every dot that does not have two dots next to it is deleted. This removes the background noise, so that the two sequences can be analyzed by their patterns. This is only a very weak filter. For genes or whole genomes stronger filters are used.
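A filter of this kind could look like the sketch below. It uses one possible reading of the rule above, namely that a dot survives only if it belongs to a diagonal run of at least `min_run` dots (three by default, that is, a dot together with two diagonal neighbours). This interpretation, the name `filter_short_runs` and the example sequences are assumptions made for illustration.

```python
def dotplot(seq_x: str, seq_y: str) -> list[list[bool]]:
    """Rows follow seq_y, columns follow seq_x; True marks equal characters."""
    return [[x == y for x in seq_x] for y in seq_y]


def filter_short_runs(matrix: list[list[bool]], min_run: int = 3) -> list[list[bool]]:
    """Keep only dots that lie on a top-left to bottom-right diagonal run
    of at least min_run consecutive dots."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    kept = [[False] * cols for _ in range(rows)]

    def flush(run):
        # Copy a finished run into the result if it is long enough
        if len(run) >= min_run:
            for r, c in run:
                kept[r][c] = True

    # Every diagonal starts either in the top row or in the left column
    starts = [(0, c) for c in range(cols)] + [(r, 0) for r in range(1, rows)]
    for row, col in starts:
        run = []
        while row < rows and col < cols:
            if matrix[row][col]:
                run.append((row, col))
            else:
                flush(run)
                run = []
            row += 1
            col += 1
        flush(run)
    return kept


raw = dotplot("ABCXD", "ABCYD")     # hypothetical sequences
clean = filter_short_runs(raw)
print(sum(map(sum, raw)), sum(map(sum, clean)))   # 4 3
```

In the example the isolated dot for the final D is removed, while the run of three dots for ABC is kept. Raising `min_run` gives a stronger filter.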
To analyze the pattern it is important to know some typical motifs. They are shown in Figure 2. If there is just one diagonal line of dots from the top to the bottom, the sequences are identical (Figure 2 a). If there is a diagonal line of dots from the top to the bottom together with an even number (two or more) of shorter diagonal lines, the sequences contain repeats (Figure 2 b). If there is the diagonal line and two lines running in the opposite diagonal direction, there are reverse repeats (Figure 2 c). Furthermore, a black area indicates microsatellites (Figure 2 d). Microsatellites are short tandem repeats of noncoding DNA with a length of 1-6 nucleotide pairs.
If the area is hatched, there are minisatellites (Figure 2 f). Minisatellites are short tandem repeats as well, but with a length of 10-50 nucleotide pairs. If the diagonal line has some gaps, sequence replacements have taken place (Figure 2 g). If the diagonal line has a break with a shift, there was an insertion or a deletion (Figure 2 h). Sequence replacements, insertions and deletions are mostly the result of evolution.
[Figure is omitted from this preview]
Figure 2: Dotplot patterns [3]
This figure shows examples of typical patterns in Dotplots.
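Two of these patterns, the single diagonal of identical sequences and the shifted diagonal caused by an insertion or deletion, can be made visible by listing the diagonal runs of a Dotplot together with their offset from the main diagonal. The following is a minimal sketch; the function name `diagonal_runs`, the tuple layout and the example sequences are assumptions chosen for illustration.

```python
def diagonal_runs(seq_x: str, seq_y: str, min_run: int = 3):
    """Return (offset, row, column, length) for each diagonal run of matches.

    seq_x runs along the columns, seq_y along the rows, and
    offset = column - row; offset 0 is the main diagonal."""
    rows, cols = len(seq_y), len(seq_x)
    runs = []
    for offset in range(-(rows - 1), cols):
        row = max(0, -offset)        # first cell of this diagonal
        col = row + offset
        length = 0
        while row < rows and col < cols:
            if seq_x[col] == seq_y[row]:
                length += 1
            else:
                if length >= min_run:
                    runs.append((offset, row - length, col - length, length))
                length = 0
            row += 1
            col += 1
        if length >= min_run:        # run that reaches the border of the plot
            runs.append((offset, row - length, col - length, length))
    return runs


original = "ACDEFGHIKL"              # hypothetical sequence
with_insertion = "ACDEFWWGHIKL"      # the same sequence with WW inserted

print(diagonal_runs(original, original))
# [(0, 0, 0, 10)]
print(diagonal_runs(original, with_insertion))
# [(-2, 7, 5, 5), (0, 0, 0, 5)]
```

Identical sequences give a single run on the main diagonal. With the insertion, the second half of the match moves to a diagonal that is shifted by two cells, which is the length of the inserted segment.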
Usually a single Dotplot contains more than one pattern. The Dotplot described so far was a global Dotplot. It is also possible to create a local Dotplot, for example of a gene compared to its genome. The gaps and shifts in such a plot result from the introns and exons of a eukaryotic gene. Introns are removed during splicing, so they do not appear in the spliced gene sequence that is compared against the genome.
This method gives a lot of information, especially about evolutionary origin and structure. The Dotplot is a simple diagram, but it should not be ignored, as it can display obvious and important information that could otherwise be missed. [4]
[…]
[1] David W. Mount: Bioinformatics: Sequence and Genome Analysis, Second Edition, page 70
[2] http://webclu.bio.wzw.tum.de/binfo/edu/tutorials/pairalign/make_dp.html
[3] http://code10.info/index.php?option=com_content&view=article&id=64:introduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&itemid=76
[4] cf. Arthur M. Lesk: Bioinformatik – Eine Einführung, page 166
- Quote paper
- Markus Hoffmann (Author), 2016, Pairwise Alignment. Global and Local, Munich, GRIN Verlag, https://www.grin.com/document/346638