This presentation is focused on methods of clustering: a way of recognizing similarity between elements. The elements of our study are either individuals or populations for which we collect characters like alleles, DNA sequences, etc.

Slide 2

Organization of presentation

In this presentation we will see methods of phylogenetic reconstruction, concepts of clustering with discrete characters, multivariate analysis and networks. Among the methods for phylogenetic reconstruction we will see Distance methods, Maximum Likelihood and Maximum Parsimony.

Slide 3

Characters and other concepts

Experimental data are generated by a wide variety of techniques, some of which were introduced in Class II. The data generated are called characters. The main criteria for selection of characters are their independence and homology. Statistically, characters are called variables.

Characters can be discrete (qualitative) or continuous (quantitative). In this class we will focus only on discrete characters. These can be broadly grouped into multistate characters (with 3 or more possible states) and binary characters (only two possible states). The latter are exemplified by RAPD and AFLP data (Class II, Slide 12) as well as by restriction enzyme recognition sites in RFLP analysis (Class II, Slides 5 and 7): they are either present (1) or absent (0). Multistate characters are exemplified by multi-allelic loci, with as many states as alleles, and by DNA sequences, where each site takes one of 4 states (A, T, C or G).

In this course we will not present quantitative characters. Examples of continuous characters are height in people, skin color, etc. Levels of gene expression in different environments could also be taken as quantitative data.

Characters are presented in matrices. Here we present an example of a matrix of binary data, with 5 elements and 6 variables or characters. This is a typical example of RAPD-type (random amplified polymorphic DNA) or RFLP-type (restriction fragment length polymorphism) data.

Slide 4

DNA sequence characters

The phylogenetic analysis of DNA sequences requires the identification of homology among genes and among positions (sites) of genes / alleles among different individuals or species.

DNA sequences. Aligning DNA sequences amounts to hypothesizing a homologous relationship for each nucleotide base position (Mindell, 1991). Alignments are relatively easy for protein-coding genes. Non-protein-coding sequences such as the mtDNA control region (also called D-loop) and rDNA genes usually carry insertions or deletions, because there is no selective pressure to maintain a coding frame (Class I, Slide 29). Therefore, the alignment of this type of DNA sequence usually requires the addition of gaps indicating positions where an insertion or deletion event took place. These sites are called indels (Class I, Slide 30).

In practice, when we obtain new DNA sequences for our species or populations, the first validation of a newly obtained sequence is its comparison to similar sequences deposited in a database, e.g. GenBank (http://130.14.29.110/BLAST/). This search engine recovers the most similar sequences in the database. If we are working on a protein-coding gene we can do either of two searches: "blastn", which recovers matches between our DNA sequence (called the QUERY) and other DNA sequences in the database (the subjects); or "blastx", which translates our DNA sequence and recovers matching protein sequences. If our sequence is non-coding, such as rDNA, tRNA, ITS or D-loop, we use the first option (blastn).

In the following slide we show an example of a GenBank BLAST search.

Reference:

Mindell DP (1991). Aligning DNA sequences: homology and phylogenetic weighting. In: Phylogenetic analysis of DNA sequences, ed. by Miyamoto MM and Cracraft J. Oxford University Press.

Slide 5

Blastn

In this slide we show an example of a blastn search in the database "GenBank" using a mtDNA ND2 fragment of the frog Microbatrachella capensis.

The list shows the species that align best to our DNA sequence. All the species are frogs, so we can be confident in our result and discard the possibility of contamination (alignment to, e.g., fly or human DNA would imply contamination).

By clicking on the GenBank accession number, we obtain the complete information for that sequence in the database. By clicking on the score, we get the specific alignment to that species. This is shown in Slide 6.

Slide 6

Blastn result

This is the view we get after clicking on the bit score values. We see the alignment of our sequence to the sequence deposited in GenBank and additional information on the matching fragment. The alignment shows the proportion of matches, the matches in a graphical view, and information on which strand of the coding sequence was aligned (or sequenced). In this case we see Plus/Minus, indicating that we sequenced the strand opposite to that deposited in GenBank. The following step, if working with coding regions, is to obtain the reverse complement of this sequence and translate it, to confirm protein-sequence identity and, if needed, the position and type of the mutations (synonymous or non-synonymous).

Slide 7

Phylogenetic reconstruction

Distance methods

Distance methods start from a matrix of individuals (the independent variables) x characters (the dependent variables). This matrix is transformed into a second matrix that contains the similarity or distance between every pair of individuals. Below we will see a few examples of distance criteria for binary and DNA data.

Slide 8

Distance method for binary data

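Among the most utilized phenetic distances for binary data are Jaccard's and Manhattan distances. The first is based on Jaccard's coefficient, which is simply the proportion of shared bands over the total number of bands present in a pair of individuals x and y; it is calculated for all pairs. It is expressed as

$$J = \frac{a}{a + b + c}$$

where

a = number of bands common to x and y

b = number of bands exclusive to x

c = number of bands exclusive to y

The corresponding distance is 1 - J.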
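As a minimal sketch of the coefficient above, the following Python function counts the three band classes for a pair of binary profiles. The function name and the two band profiles are hypothetical and chosen only for illustration.

```python
def jaccard_similarity(x, y):
    """Jaccard coefficient J = a / (a + b + c) for two binary band profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must score the same characters")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # shared bands
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # only in first
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # only in second
    if a + b + c == 0:
        raise ValueError("no band is present in either profile")
    return a / (a + b + c)


# Hypothetical profiles (1 = band present, 0 = band absent)
ind1 = [1, 1, 0, 1, 0, 1]
ind2 = [1, 0, 0, 1, 1, 1]
similarity = jaccard_similarity(ind1, ind2)  # 3 shared, 1 + 1 exclusive: 0.6
distance = 1 - similarity
print(similarity, distance)
```

Note that shared absences (0/0) do not enter the coefficient, which is usually the desired behaviour for dominant markers.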

Manhattan distance: the distance between two points P1 and P2 is the sum of the absolute differences between their coordinates (x1, y1) and (x2, y2):

$$d_M = |x_1 - x_2| + |y_1 - y_2|$$

The yellow, blue and red lines in the picture have the same length. The diagonal represents the Euclidean distance.

The Euclidean distance joins any two points in space along a straight line, following the formula:

$$d_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

In the picture, the pink line represents the Euclidean distance between the elements P1 and P2.
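The two measures can be compared with a short Python sketch. The function names and the points P1 and P2 are illustrative assumptions; the functions accept any number of coordinates, so they also work on binary character profiles.

```python
import math


def manhattan_distance(p1, p2):
    """Sum of the absolute coordinate differences."""
    return sum(abs(u - v) for u, v in zip(p1, p2))


def euclidean_distance(p1, p2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p1, p2)))


# Hypothetical points
p1 = (0, 0)
p2 = (3, 4)
d_manhattan = manhattan_distance(p1, p2)  # 3 + 4 = 7
d_euclidean = euclidean_distance(p1, p2)  # sqrt(9 + 16) = 5.0
print(d_manhattan, d_euclidean)
```

For binary (0/1) profiles the Manhattan distance is simply the number of characters at which two individuals differ.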

http://en.wikipedia.org/wiki/Manhattan_distance

Slide 9

Distances with nucleotide data

The simplest distance measure is p, which is the number of nucleotides that differ between 2 sequences divided by the total number of nucleotides compared.
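A minimal Python sketch of the p distance follows. Skipping positions with gaps or ambiguous bases is an assumption made here, not something stated in the slide, and the two sequences are hypothetical.

```python
def p_distance(seq_x, seq_y):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_x.upper(), seq_y.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: gaps and ambiguous bases are not compared
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Hypothetical aligned sequences differing at 2 of 10 sites
p = p_distance("ACGTACGTAC", "ACGTACGAAT")
print(p)  # 0.2
```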

In the present slide we show models of DNA base change. For this we utilize a 4 x 4 divergence matrix Fxy that shows the relative frequency of each nucleotide (or amino acid) pair in a given alignment between two sequences X and Y.

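$$F_{xy} = \begin{pmatrix} f_{AA} & f_{AC} & f_{AG} & f_{AT} \\ f_{CA} & f_{CC} & f_{CG} & f_{CT} \\ f_{GA} & f_{GC} & f_{GG} & f_{GT} \\ f_{TA} & f_{TC} & f_{TG} & f_{TT} \end{pmatrix}$$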

Slide 10

Models of DNA substitution

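Jukes and Cantor (J&C) model: assumes that the substitution rate is the same for all possible pairs of nucleotides. Labelling the 16 cells of the divergence matrix Fxy from a to p, row by row (a = fAA, b = fAC, ..., p = fTT), the diagonal cells a, f, k and p hold the identical pairs, so the proportion of differing sites is

$$D = 1 - (a + f + k + p)$$

$$D_{xy} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$$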
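The calculation can be sketched in Python as follows: build the divergence matrix from two aligned sequences, take D as one minus the sum of its diagonal, and apply the correction. Function names and the sequences are hypothetical, and ignoring sites with gaps is an assumption of this sketch.

```python
import math

BASES = "ACGT"  # row and column order of the divergence matrix


def divergence_matrix(seq_x, seq_y):
    """Relative frequency of each nucleotide pair (rows: X, columns: Y)."""
    pairs = [(x, y) for x, y in zip(seq_x, seq_y) if x in BASES and y in BASES]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    return {(i, j): pairs.count((i, j)) / n for i in BASES for j in BASES}


def jukes_cantor_distance(freq):
    """Dxy = -3/4 ln(1 - 4/3 D), with D = 1 - sum of the diagonal of Fxy."""
    d = 1 - sum(freq[(b, b)] for b in BASES)
    if d >= 0.75:
        raise ValueError("distance undefined for D >= 0.75")
    return -0.75 * math.log(1 - 4 * d / 3)


# Hypothetical aligned sequences with D = 0.2
f_xy = divergence_matrix("ACGTACGTAC", "ACGTACGAAT")
d_jc = jukes_cantor_distance(f_xy)
print(round(d_jc, 4))  # about 0.2326, slightly larger than D
```

The corrected distance exceeds the observed proportion of differences because it accounts for multiple substitutions at the same site.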

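F81: under J&C the maximum expected proportion of differences D is 0.75. If this value is exceeded, the distance becomes undefined. F81 relaxes the assumption of equal base frequencies. With $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ the frequencies of the four bases:

$$B = 1 - (\pi_A^2 + \pi_C^2 + \pi_G^2 + \pi_T^2)$$

$$d_{xy} = -B \ln\left(1 - \frac{D}{B}\right)$$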

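The K2P model accounts for differences between transition and transversion rates (Slide 28, Class I). With the cells of the divergence matrix Fxy labelled a to p row by row, P is the proportion of transitions and Q the proportion of transversions:

$$P = c + h + i + n$$

$$Q = b + d + e + g + j + l + m + o$$

$$D_{xy} = \frac{1}{2} \ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4} \ln\left(\frac{1}{1 - 2Q}\right)$$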
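As an illustrative sketch, the K2P distance can be computed directly from two aligned sequences by counting transitions (changes within purines or within pyrimidines) and transversions. The function name and sequences are hypothetical; sites with gaps are skipped by assumption.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_x, seq_y):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    compared = transitions = transversions = 0
    for x, y in zip(seq_x, seq_y):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: gapped or ambiguous sites are ignored
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("distance undefined for these P and Q")
    return 0.5 * math.log(1 / (1 - 2 * p - q)) + 0.25 * math.log(1 / (1 - 2 * q))


# Hypothetical sequences: one transition (C/T) and one transversion (T/A) in 10 sites
d_k2p = k2p_distance("ACGTACGTAC", "ACGTACGAAT")
print(round(d_k2p, 4))  # about 0.2341
```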

There are more complex models of substitution that are derived from the basic models presented here: for example, a variation of K2P that takes unequal base frequencies into account (HKY model), and models that allow unequal rates for transitions among purines and among pyrimidines (e.g. Tamura and Nei 1993).

Slide 11

Distances for diploid data

Nei's 1972 distance is based on the infinite alleles model of mutation, in which there is a rate of neutral mutation and each mutant is a completely new allele. It is assumed that all loci have the same rate of neutral mutation, and that the genetic variability initially in the population is at equilibrium between mutation and genetic drift, with the effective population size of each population remaining constant.

It is described by the formula $D_n = -\ln I$, or

$$D_n = -\ln\left(\frac{J_{xy}}{\sqrt{J_x J_y}}\right)$$

where $x_i$ and $y_i$ are the frequencies of allele i in taxa x and y, and

$$J_x = \sum_i x_i^2$$

$$J_y = \sum_i y_i^2$$

$$J_{xy} = \sum_i x_i y_i$$

I varies between 0 and 1 and Dn varies between 0 and infinity.
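A minimal Python sketch of Nei's 1972 distance for a single locus is given below. Treating one locus at a time is a simplifying assumption of this sketch, and the allele frequencies are hypothetical.

```python
import math


def nei_1972_distance(x, y):
    """Dn = -ln(Jxy / sqrt(Jx * Jy)) from allele frequencies at one locus.

    x and y list the frequencies of the same alleles, in the same order.
    """
    j_x = sum(xi ** 2 for xi in x)
    j_y = sum(yi ** 2 for yi in y)
    j_xy = sum(xi * yi for xi, yi in zip(x, y))
    if j_xy == 0:
        return math.inf  # no shared alleles: identity I = 0
    identity = j_xy / math.sqrt(j_x * j_y)
    return -math.log(identity)


# Hypothetical allele frequencies in two taxa
same = nei_1972_distance([0.5, 0.5], [0.5, 0.5])    # identical frequencies
diverged = nei_1972_distance([1.0, 0.0], [0.5, 0.5])
no_shared = nei_1972_distance([1.0, 0.0], [0.0, 1.0])
print(same, round(diverged, 4), no_shared)
```

Identical frequencies give I = 1 and a distance of zero, while taxa with no alleles in common give I = 0 and an infinite distance.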

This distance is influenced by within-taxon heterozygosity.

The Cavalli-Sforza distance is a Euclidean distance measure that assumes that there is no mutation, and that all gene frequency changes are due to genetic drift alone. However, it does not assume that population sizes have remained constant and equal in all populations. It copes with changing population size by having expectations that rise linearly not with time, but with the sum over time of 1/N, where N is the effective population size. Thus, if population size doubles, genetic drift will take place more slowly, and the genetic distance is expected to rise only half as fast with respect to time. This measure overcomes the limitations of the Nei 1972 distance.

$$D_{arc} = \sqrt{\frac{1}{L} \sum_{l=1}^{L} \left(\frac{2\theta_l}{\pi}\right)^2}$$

$$\theta_l = \cos^{-1} \sum_i \sqrt{x_i y_i}$$

where L is the number of loci and $\theta_l$ is computed from the allele frequencies $x_i$ and $y_i$ at locus l.

http://evolution.genetics.washington.edu/phylip/doc/gendist.html

Slide 12

Phylogenetic reconstruction- Criterion for distance data

We are going to present only two main methods of phylogenetic reconstruction from distance matrices: one results in an ultrametric tree, and the other in an additive tree.

The first is exemplified by UPGMA (unweighted pair group method using arithmetic averages) and the second by Neighbor Joining (NJ) trees. UPGMA trees are rooted and assume a constant rate of evolution, so that any two taxa are at the same distance from their common ancestor.

In NJ trees, the distance between two taxa is the sum of the branches that connect them, without assumptions about rooting. The lengths of the branches in additive trees represent evolutionary distances.
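The ultrametric case can be illustrated with a compact UPGMA sketch in Python. This is a simplified illustration, not a replacement for dedicated software: the function name, the input format (a dictionary keyed by pairs of taxon names) and the distances are all assumptions made for the example.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster taxa by UPGMA.

    distances maps (taxon_a, taxon_b) tuples to pairwise distances.
    Returns the tree as nested tuples and the height of every node.
    """
    sizes = {label: 1 for label in labels}       # taxa per cluster
    heights = {label: 0.0 for label in labels}   # leaves sit at height 0
    dist = {frozenset(pair): d for pair, d in distances.items()}
    while len(sizes) > 1:
        # join the pair of clusters with the smallest distance
        a, b = min(combinations(sizes, 2), key=lambda pr: dist[frozenset(pr)])
        merged = (a, b)
        heights[merged] = dist[frozenset((a, b))] / 2
        for other in sizes:
            if other == a or other == b:
                continue
            # average distance weighted by the number of taxa in each cluster
            dist[frozenset((merged, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    (root,) = sizes
    return root, heights


# Hypothetical ultrametric distances among four taxa
pairwise = {
    ("A", "B"): 2.0, ("C", "D"): 4.0,
    ("A", "C"): 6.0, ("A", "D"): 6.0, ("B", "C"): 6.0, ("B", "D"): 6.0,
}
tree, node_heights = upgma(["A", "B", "C", "D"], pairwise)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

Each node is placed at half the distance between the clusters it joins, which is what forces all tips to be equidistant from the root.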

Slide 13

Maximum likelihood methods

This method evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model of the evolutionary process (Slide 10, this class) would give rise to the observed data. The hypothesis with the highest likelihood is the preferred one.

More formally, given some data D, and a hypothesis H, the likelihood of the data is given by

$L_D = \Pr(D \mid H)$, which is the probability of obtaining D given H.

The likelihood for a particular site is the sum of the probabilities of every possible reconstruction of ancestral states given the models of substitution utilized. The likelihood of a tree is the product of the likelihood at each site. The likelihood is evaluated by summing the log of the likelihood at each site, and reported as the log likelihood of the tree

$$L = L_1 \times L_2 \times L_3 \times \dots \times L_N = \prod_{j=1}^{N} L_j$$

$$\ln L = \ln L_1 + \ln L_2 + \dots + \ln L_N = \sum_{j=1}^{N} \ln L_j$$
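The practical reason for working with log likelihoods can be shown with a few lines of Python. The per-site likelihood values below are invented for illustration; computing them for a real tree requires a substitution model and is not attempted here.

```python
import math


def tree_log_likelihood(site_likelihoods):
    """ln L of a tree as the sum of the log likelihoods of its sites."""
    return sum(math.log(l_j) for l_j in site_likelihoods)


# Hypothetical likelihoods for four sites
site_likelihoods = [0.02, 0.05, 0.01, 0.04]
log_l = tree_log_likelihood(site_likelihoods)
log_of_product = math.log(math.prod(site_likelihoods))  # same quantity

# With many sites the raw product underflows to zero, the log sum does not
many_sites = [1e-5] * 1000
raw_product = math.prod(many_sites)
log_l_many = tree_log_likelihood(many_sites)
print(log_l, raw_product, log_l_many)
```

The product of a thousand small site likelihoods is too small to be represented as a floating-point number, whereas the sum of their logarithms remains an ordinary negative number.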

Slide 14

Hypothesis testing - Likelihood Ratio Test.

We can test alternative hypotheses concerning the same data using the LRT (likelihood ratio test). Because the likelihoods are usually very small, we use log likelihoods. First, we define a null hypothesis H0 and an alternative hypothesis H1.

$$\Delta = \ln L_1 - \ln L_0$$

where L1 and L0 are the maximum likelihoods for the alternative hypothesis and the null hypothesis, respectively.

For nested hypotheses, $2\Delta$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the two hypotheses.
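A minimal sketch of the test in Python, using only the standard library, is shown below. It covers only 1 or 2 degrees of freedom, for which the chi-square tail probability has a simple closed form; this restriction, the function name and the log likelihood values are assumptions of the example.

```python
import math


def likelihood_ratio_test(ln_l0, ln_l1, df):
    """Return the LRT statistic 2 * (ln L1 - ln L0) and its chi-square p-value."""
    statistic = 2 * (ln_l1 - ln_l0)
    x = max(statistic, 0.0)
    if df == 1:
        p_value = math.erfc(math.sqrt(x / 2))  # chi-square tail, 1 d.f.
    elif df == 2:
        p_value = math.exp(-x / 2)             # chi-square tail, 2 d.f.
    else:
        raise NotImplementedError("this sketch only covers 1 or 2 d.f.")
    return statistic, p_value


# Hypothetical log likelihoods: the alternative model has one extra parameter
statistic, p_value = likelihood_ratio_test(ln_l0=-1005.2, ln_l1=-1002.1, df=1)
print(statistic, p_value)  # statistic about 6.2, p-value between 0.01 and 0.025
```

With these values the null hypothesis would be rejected at the 5% level, so the extra parameter of the alternative model is justified by the data.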

The LRT can be used to test a model of substitution or rate variation. In the latter case, we test whether an ultrametric tree (molecular clock enforced) has a significantly lower likelihood than a tree estimated without that constraint. If it does not, the molecular clock hypothesis cannot be rejected. If sequences evolve at different rates, an ultrametric tree would be a poor representation of the relationships among sequences, and an additive tree would be more appropriate.

Slide 15

Bootstrapping

After constructing the tree, we need to know the reliability of the reconstructed branches or groups. One of the most widely used methods to assess how well supported a particular branch is, is non-parametric bootstrapping (or just bootstrapping). It consists of pseudo-replication of the character matrix: characters are sampled with replacement to create new matrices of the same size as the original. The proportion (or frequency) of pseudo-replicates in which a branch is recovered is called "bootstrap support".

We generally accept bootstrap support over 70%, and lower values indicate poor support.
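The resampling step can be sketched in Python as follows. Rebuilding a full tree for every pseudo-replicate is beyond a short example, so the function `pair_is_closest` is a hypothetical stand-in that only checks whether two taxa are each other's nearest neighbours; the matrix, function names and number of replicates are likewise assumptions.

```python
import random


def bootstrap_replicate(matrix, rng):
    """Resample characters (columns) with replacement; rows are taxa."""
    n_chars = len(matrix[0])
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return [[row[c] for c in picks] for row in matrix]


def pair_is_closest(matrix, i=0, j=1):
    """Stand-in for tree building: are taxa i and j closer to each other
    than either is to any other taxon (by number of mismatches)?"""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(matrix[a], matrix[b]))
    others = [k for k in range(len(matrix)) if k not in (i, j)]
    return all(
        mismatches(i, j) < min(mismatches(i, k), mismatches(j, k))
        for k in others
    )


def bootstrap_support(matrix, group_found, replicates=200, seed=1):
    """Proportion of pseudo-replicates in which the group is recovered."""
    rng = random.Random(seed)
    hits = sum(group_found(bootstrap_replicate(matrix, rng)) for _ in range(replicates))
    return hits / replicates


# Hypothetical binary matrix: 5 taxa x 6 characters
matrix = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
]
support = bootstrap_support(matrix, pair_is_closest)
print(f"support for grouping taxa 1 and 2: {support:.0%}")
```

In a real analysis the stand-in would be replaced by a complete tree reconstruction on each pseudo-replicate, followed by a check for the branch of interest.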

In the example, we see a phylogenetic reconstruction of the trumpetfish genus Aulostomus. All clades show high bootstrap support. A corresponding network is superimposed on the geographic map.

Image: www.nps.gov/kaho/KAHOckLs/KAHOreef/trumpet2.htm

Reference:

Bowen BW, Bass AL, Rocha LA, Grant WS, Robertson DR (2001). Phylogeography of the trumpetfishes (Aulostomus): ring species complex on a global scale. Evolution 55: 1029-1039.

Slide 16

Maximum Parsimony

The basic concept of this method is that the phylogenetic trees are reconstructed using the minimum number of steps, thus reducing the risk of homoplasy (equal state but unequal origin, e.g. A at a site that originated from T in one individual and from C in another).

By using an outgroup, the tree is rooted, and we assume that the ingroup taxa are monophyletic. The outgroup gives polarity to the character changes in the ingroup.

The example shows 4 sequences and the unrooted tree ((1,2), (3,4)), and two possible reconstructions of the evolution of the first site. Under the parsimony criterion, the reconstruction that requires 1 change is preferred over the one that requires 5 changes. Next we will see the steps required for the other sites.

Slide 17

Maximum Parsimony -example

Here we see the continuation of the example of the previous slide, the tree ((1,2),(3,4)). Site 2 requires 1 step, whereas two equally parsimonious reconstructions of two steps each are found for site 3. Site 4 requires 1 step. No changes are required for site 5.

The total length L of a tree is the sum of the changes required at all sites. It is expressed as the sum of the lengths $l_i$ of the k sites.

$$L = \sum_{i=1}^{k} l_i$$
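Tree length can be computed with Fitch's algorithm, a standard way of counting the minimum number of changes per site (the slides do not name the counting procedure, so its use here is an assumption). The sketch below uses hypothetical sequences built to match the step counts described above (1, 1, 2, 1 and 0 steps on the tree ((1,2),(3,4))); they are not the sequences shown in the slide.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, steps) for one site.

    tree is a taxon name or a pair of subtrees; states maps taxon -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # no state in common: one extra change is needed at this node
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, sequences):
    """Total length L = sum of the per-site lengths l_i."""
    n_sites = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in sequences.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical aligned sequences for taxa 1 to 4 (5 sites)
sequences = {"1": "ACAAG", "2": "ACCAG", "3": "GTGTG", "4": "GTGAG"}
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(t, sequences) for name, t in topologies.items()}
print(lengths)  # 5, 7 and 7 steps
```

With these data the topology ((1,2),(3,4)) is the shortest and would be chosen under the maximum parsimony criterion.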

Slide 18

Maximum Parsimony -example

Here, we see which of all the possible tree topologies would be accepted by the MP criterion. The table shows three possible topologies for 4 taxa, and counts the total number of mutations (or steps) required to agree with the hypothesised topology. The minimal length is recovered for the topology of our example, therefore the tree ((1,2), (3,4)) is accepted as the most parsimonious.

Sites 4 and 5 require the same number of steps under all tree topologies. Invariant sites, and sites where only one sequence has a different nucleotide, are called phylogenetically uninformative.

Slide 19

Networks

Some biological phenomena are better represented by networks, which allow cycles and reticulation in the graphic representation. These phenomena include recombination events (Class I, Slide 6), hybridisation between lineages, horizontal gene transfer (transfer of DNA between species) by retrotransposition (Class I, Slide 22), and polyploidization (Class I, Slide 32). Other causes of reticulation in some data are homoplasy and recurrent mutations.

Networks can be more appropriate to describe intra-specific phylogenies, where the ancestor may still be alive and alternative evolutionary paths can be presented in the form of cycles. See examples in Class II, Slides 26 and 28, and Class III, Slide 8.

Among the most utilized networks are Minimum Spanning Trees: the shortest subset of edges that keeps the graph in one connected component. They are utilized, e.g., in laying out wiring for telephone companies. There are several algorithms available to connect the elements.
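One of the available algorithms, Kruskal's, can be sketched in Python as follows. The haplotype names and the numbers of mutational differences between them are hypothetical.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm. edges is a list of (weight, node_u, node_v)."""
    parent = {node: node for node in nodes}

    def find(node):
        # follow parent links to the representative of the component
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical haplotypes; weights are numbers of mutational differences
haplotypes = ["H1", "H2", "H3", "H4"]
differences = [
    (1, "H1", "H2"), (2, "H2", "H3"), (3, "H1", "H3"),
    (1, "H3", "H4"), (4, "H1", "H4"), (3, "H2", "H4"),
]
mst = minimum_spanning_tree(haplotypes, differences)
total_length = sum(weight for _, _, weight in mst)
print(mst, total_length)  # 3 edges, total length 4
```

Edges are considered from shortest to longest, and an edge is kept only if it connects two groups of haplotypes that are not yet linked.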

In the slide we show an example of hybridisation represented by reticulation. Nodes are represented by numbers. At nodes 2 and 3, the lineages hybridize and combine their genetic information in node 4. The hybrid lineage continues to node 6 and contains two independent evolutionary histories.

The diagram in this slide depicts the simplest possible reticulation to generate a new lineage by hybrid speciation. The nodes are numbered for easy reference. Here, the otherwise independent lineages that were generated by a normal speciation event at the root of the tree, leading to the independently evolving black and green DNA sequences, have hybridized.

http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK4/NODE161.HTM

Slide 20

Multivariate Analysis

Multivariate methods aim to detect general patterns and to indicate potentially interesting relationships in data. The most valuable methods are those that display their results graphically, rather than just examining statistical properties derived from the data.

Multivariate statistics, or multivariate statistical analysis, describes a collection of procedures that involve the observation and analysis of more than one statistical variable at a time.

Among the most utilized methods of multivariate analysis are PCA (Principal Components Analysis) and MDS (Multidimensional Scaling). The basis of these methods is that a matrix of phylogenetic distances between OTUs (operational taxonomic units, or independent variables) can be used to position the OTUs in a low-dimensional space.

Ordination methods, such as PCA or MDS, group taxa with shared properties. The different coordinates usually correlate with different properties of the OTUs and, if the ordination was calculated from quantitative or multi-state characters, such as sequences or alleles, it is easy to calculate the correlation between each dimension and the original data. The major trends in the data set are expressed by the first few coordinates, while the finer details between closely related taxa are often hidden in minor coordinates; over two-thirds of the information in a well-structured data set, such as the phylogenetic relationships of a group of OTUs with clear lineages, will be represented by the first three coordinates.