R Hackathon 1/TransitionProbability

From Phyloinformatics
Jump to: navigation, search

Modeling Discrete Character Evolution in R

The evolution of a discrete character along a branch of a phylogeny can be modeled as a Markovian process, where the probability of moving the current state to a different state is governed by a rate matrix. We can learn about discrete character evolution by calculating and comparing the likelihoods under models which impose different requirements on this matrix. For example, we might compare a Jukes-Cantor model (where the probability of going from state 0 to state 1 is equal to the probability of going in the reverse direction) to an asymetrical model where the two rates are allowed to differ. If we found the latter model was a significantly better fit, we might conclude that the trait in question evolves in a directional fashion.

Estimating transition rates and calculating the likelihood of different models for discrete character evolution can be accomplished with the packages geiger (the fitDiscrete function) and ape (the ace function).

Estimating a Jukes-Cantor model with the geiger package (load the package before you begin)

We will use the Geospiza dataset for this example and we assume that you have already loaded your tree into the workspace as a phylo object (see Inputting Trees).

So first let's create a vector with character states for "Geospiza". The dataset has 14 species. We can see a list of the species with the following commands:

     data(geospiza)
     attach(geospiza)
     geospiza.tree$tip.label

The data command loads the data set, and the attach command allows us to access objects within the data set, such as the tree and the morphological data. The last command will print out the tip.labels for the taxa in the geospiza.tree, so you should now see a list of 14 taxa.

Now we can assign states for our discrete character to these 14 species with the commands:

     char1<-c(1,1,1,1,1,0,1,0,0,1,0,1,1,1)
     names(char1)<-geospiza.tree$tip.label
     char1

The first line binds the 14 character states into a single vector called x. Then we assign the names (tip labels) from the tree to these 14 values. So now when you type the last command, char1, R will show you the content of that object, a vector of the values with the names of species associated with the values.

More commonly you will not be typing the character states into the command line but extracting them from a table. To do this, the process would be similar.

   MyData=read.table("MyData.txt",row.names=1)
   char1<-MyData[,1]
   names(char1)<-row.names(MyData)

The first line reads in the table, indicating that the first column contains the row.names (here, your taxa with names matching the tip.labels on your tree). The next line designates the first column ([,1]) after the row names as character 1. Then you can associate the taxon names to the character states.

Now you can fit a Jukes-Cantor model to this data and the tree by typing:

    ER<-fitDiscrete(geospiza.tree, char1)
    ER

Geiger will return to you the log likelihood (lnL) for this model (-9.61) and the estimated transition rate, q (-17.06).

Estimating a model with assymetrical transition rates in geiger

Now let's apply a model where the forward (0->1) rate is allowed to differ from the reverse rate (1->0). Type:

    ARD<-fitDiscrete(geospiza.tree,char1,model="ARD")
    ARD

ARD stand for all rates different. Now the function returns a likelihood of -7.97 and an estimated rate matrix:

    $Trait1$q
               [,1]      [,2]
    [1,] -17.908356 17.908356
    [2,]   5.995629 -5.995629

So we see that the estimated q for 0->1 is 17.9 and for 1->0 is 6.

Comparing models with the likelihood ratio test

We can use a chi-squared test to compare the likelihoods of these two models by typing:

    1-pchisq(2*(ARD$Trait1$lnl-ER$Trait1$lnl),1)

We find that the p-value (with 1 degree of freedom) is 0.07, so the assymetrical model is not significantly better than the symmetrical (Jukes-Cantor) model.

Geiger includes other model options, such as transforming the tree with Pagel's lambda.

    LAMBDA<-fitDiscrete(geospiza.tree, char1, treeTransform="lambda")
    LAMBDA

Geiger estimates lambda as 1.06x10-6. We can again use the likelihood ratio test to compare this model with the Jukes-Cantor model.

    1-pchisq(2*(LAMBDA$Trait1$lnl-ER$Trait1$lnl),1)

The p-value is 0.295, so adding lambda to the model does not significantly improve its fit.