# PhyloSoC:An Extension of Mesquite based on PDSIMUL

## Contents

## News & Notes

### 2010

- 3/1: Begin maintenance.

### 2009

- hiatus

### 2008

- 10/16: Future time line proposed, project time line from GSoC removed.
- 8/21: Dynamic step sizes are being added which will (hopefully) dramatically increase the rate at which accurate distributions can be generated for large trees.
- 8/20: Well, pencils down day has come and passed, and now I'm going to post some data analysis!
- 6/16: Addition section about module architecture.

- 6/16: New updates to source code on Google code : You can now edit the covariance matrix, and create simulated matrices from the default Brownian motion model settings. The code is still very buggy, use only for development purposes.

- 6/9: Addition to the project overview.

- 5/22: added revisions to tree structure to project time-line, added links to various pages.

- 5/20: Looking at mesquite it does not appear to me that punctuated equilibrium, or models where change is tied to specialization events, can be modeled without revisions to the tree data structure. When a speciation event occur there is no way to specify which of the two contemporaneous daughter species represents the ancestral form, and has therefore not experienced a speciational even, and which represents the daughter species which has experienced a speciational event.

- 5/20: Set up project wiki.

## Data Analysis

In order to make the resure both myself and the end user of pdsims faithful implementation of PDSIMUL I have analyzed data generated from both programs from identical starting parameters.

### Brownian Motion

The following data was computed from 10,000 runs with default values for the 49LBR tree in both pdsim and PDSIMUL. The PDSIMUL data has two missing data points because differences in node naming prevented exporting values for those nodes.

Figure 1 shows individual node mean value surrounded by a 95% confidence interval. Nodes are arranged by distance from the root node. You can see that at 10 points the means leave the 95% confidence interval, but that is to be expected for 97 data nodes.

Figure 2 shows that pdsim generates a distribution of trait values at nodes which is almost identical to that generated by PDSIMUL. Using steps shows no affect on the distribution, which is to be expected since bounds were not used.

Figure 3 shows data from trees with a mean velocity of +2, again the results are again visually similar.

#### The Improtance of Step Size

One of the novel features of pdsim is the ability to check bounding of trait motion between nodes.

Because the distribution of the values at nodes is effected by step size choosing a step size can effect the experimental outcome.

Figure 4 shows data from truncation and flip bouding runs. When the truncation bounding rule is used we see a dramatic increase in the density of the nodes at the boundary values. The absorbing effect at the boundary is an artifact of large step size, since as the size of movement of the trait increase, the chance that it will step onto a boundary also increases. The figure also demonstrates that as smaller step size are used the results of truncation and flip bounding approach each other. This is true of all boundary rules which treat all trait values identically and are only applied when the tentative result actually crosses the boundary, however, the ways which large step sizes distort the distribution differs. Rules like flipping, and hard bounce that place transformed node values further away from the boundaries will lower the variance of the distribution, and rules that place the transformed value closer to the boundary, like truncation, will artificialy increase it.

I speculate that hard-bounce should approach the true distribution most quickly, since the rule results in an intermediate effect on the variance.

I am unsure how throw-out and replace will effect the distribution, however I believe they two should approach the hard-bounce distribution, for the previously mentioned reasons.

Since the variance of the distribution can either be increased or decreased the large step size can elevate rates of type I and II errors depending on the rule being applied.

(planed figure) Figure 6: Soft bounding also shows a dramatic step size affect, but does not approach the same distribution as truncation, replace, flip and hard bounce. This is because the bounding rule for soft bounding is applied over a large region of the graph, and as a result the true distribution is transformed as well.

### Ornstein-Uhlenbeck

Although in Figure 5: the two programs distributions also look visually similar, I am concerned that the variance of the distributions may be different.

## Project information

### Overview

This module will add the evolutionary simulation provided by PDAP to mesquite, as well as tweak some of the continuous character UI.

There are two fundamental simulations that can be run in this package, Brownian Motion and Ornstein-Uhlenbeck.

The Ornstein-Uhlenbeck process is a stochastic function that exhibits a mean reverting tendency. It can be thought of as a moving average.

Both models can be simulated in a gradual fashion, where the traits of daughter nodes depend on the branch length, and in a speciational fashion, where all non-zero branch lengths are set to 1.

Finally, there is a punctuated equilibrium model which implements a speciational Brownian motion model, with the exception that only one daughter species experiences a change of traits.

#### A User Guide

##### Instillation

- Download the latest code from here [1]

- unzip the contents of the file to a directory of choice.

- edit the classpaths.xml file in the directory Mesquite_Folder to add reference to the new direcrectoy.

###### Example

unzip pdapsim.07.07.zip to "C:\Program Files\Mesquite_Folder\mesquite\"

run notepad.exe and open "C:\Program_Files\Mesquite_Folder\classpaths.xml"

add the line

<classpath>C:\Program_Files\Mesquite_Folder\pdsim</classpath>

to the file.

After editing the file should look something like this:

<?xml version="1.0"> <mesquite> <classpath>C:\Program_Files\Mesquite_Folder\pdsim</classpath> </mesquite>

save and run mesquite.

#### Using PDAPsim

The program adds three new models for the user.

Under the "New Character Model" menu in the character tab you should now see a Covariance Matrix Model, Brownian Motion and Ornstein-Uhlenbeck menu item. The Ornstein-Uhlenbeck and Brownian Motion models can be used like continuous character models, and should be available for any calculations you would do with continuous character models.

###### Co Variance matrix

The Co Variance matrix requires a Tree for running simulations, and a continuous trait matrix to calculate default values for starting models. You should create these before creating a new Co Variance matrix.

Upon creating a co variance matrix, you will see a square matrix with the name of each trait in the continuous trait matrix as a column title, and all row titles set to "<trait name>: unassigned."

This is the covariance matrix showing the correlation between all pairs of traits, and the default values were calculated from the correlation of traits in the trait matrix. The row titles also tell you what model controls the evolution of a particular trait. To assign a model to control trait evolution, simply right click the row title.

When a model is assigned to control a trait, a dialog will ask you if you wish to assign default values for the character to that model. By doing so you will lose any current settings of the model. The default settings are designed to help create a null hypothesis for later statistic tests.

Evolution of the traits can be simulated on the tree once all the traits are assigned models. Under the Co Variance matrix menu chose the "Run Simulation" option. A diabog box will prompt you for a model name, the number of replicates to run, and a step size.

- The step size is only changes results when bounding is used. A non-zero step size causes values for all traits to be calculated each step along branches. This increases the amount of necessary processing time.

**Warning: Some branch lengths can be very long (10^6), and specify a small step size (10) will take an unreasonable amount of time to complete simulations. When specifying step sizes, make sure to take branch lengths into account.**

- .sim files only save the information for the first two traits, but can be opened by PDANOVA for statistical processing.
- .csv files save the results in a comma separated format that can be opened by most spreadsheet software.

Other Options: There is also an option to show the cholesky matrix used to generate correlated random numbers, and an option to add new continuous character simulation. Continuous character models can also be created with the new model option under character models.

###### Brownian Motion

**Trait variance:**This represents the variance around a mean arising from random motion of the trait. If Variance of mean at tips is 0 then this represents the expected variance of the trait at tips of the tree. I have tested to make sure this works for trees with contemporaneous tips, and I believe it should generally work, but temporally heterogeneous tips are untested.

**Root state:**Simply the value of the trait at the root node of the tree.

**Mean state at tips:**The expected mean state of the trait at the tips of the tree. If this is different than the root state there is directional component to the motion of traits. This is somewhat analogous to the mean velocity of a group molecules diffusing in a liquid. If a trait as a whole has a tendency to move over evolutionary time then this is reflected by a difference between root states and mean tip states.

**Variance of mean at tips**This value represents the variance of the mean velocity of traits. Though trait means do not exist for individual nodes per say, a variance of means at tips results in a distribution of velocity of traits. More practically the total variance of a trait at the tips will be the sum of the trait variance and the variance of the mean. Also, trait movement is drawn from a correlated normal distribution and the mean movement is not, so the correlation between traits will also be lower as the variance of the mean at tips is greater.

**Ignore branch length**Tells the simulator to treat all branches as 1 unit in length.

**Punctuated**Tells the simulator to treat nodes with the same name as representing the same unchanged species. All chance is distributed to nodes who's names differ from the parent nodes name.

**Use bounding**Specifies that traits cannot take on values outside of certain bounds. When telling the simulator to use bounding it is also

important to consider how often the simulator checks to see if traits have gone out of bounds on longer branches. I have not developed any guidelines on setting step size yet. When specified the lower and upper bound values are used, and of course the lower bound value must be smaller than the upper bound value. upper bound:

**Bounding Types**Types of bounding are described in the bounding section.

**Set to defaults**Prompts you for a matrix and character to set as the default values for this model.

##### Ornstein-Uhlenbeck

Most of the parameters of the Ornstein Uhlenbeck model are similar to those in the brownian motion model, the key differences are the use of peaks instead of means, and the addition of a decay variable.

- Peaks: peaks represent the mean the the Ornstein Uhlenbeck model has a tendency to revert to. Unlike means in brownain motion the peak variable is explicitly stored for each node during calculations.

- Decay: The decay is a measure of the strength of the tendency of this model to revert to the mean. Higher decays will lead to less variability around a peak.

#### Challenges

Currently I believe that two aspects of the UI which require the most planing are the way to specify speciation events for the Punctuated equilibrium model, and the form that output from simulation will take.

- A simple way to create output data would be to simply have linked character matrices, each matrix representing the results of the simulations for each particular trait. trait.

## Functions

### Simulating traits on a tree

One of the goals of this project is to allow the correlation of any number of continuous traits. This is accomplished by the user entering the portion of the CV matrix bellow the diagonal, though in a final version this may be done more indirectly. With a CV matrix a sequence of appropriately correlated variables is obtained by a relatively simple transformation of uncorrelated sequence obtained from a normal distribution.

#### Calculating traits.

Brownian motion is already implemented in Mesquite, but PDSIMUL allows a larger number of user specified values than mesquite. A number of variables are obtained from a standard normal distribution (which is quite easy in mesquite) and then transformed appropriately (according to the user specified variance, branch length, . . . ). These numbers are then simply added to the trait values of the ancestor node.

For OU simulations two variables are . . . .

#### Bounding procedures.

A number of bounding procedures exist, and are set by the user as an parameter of the simulation.

- "Truncate Change" sets an out of bound trait to the boundary value.
- "Flip" causes the trait to step away from the bound instead of over it, if this move also takes the trait out of bounds then the trunctation method is used.
- "Hard Bounce" the trait moves its full step distance, but reveres the direction of motion at the bound.
- "Softbounce" reduces motion toward the bound by a factor dependent on the distance of a trait from the bound.

- Caveat: These bounding procedures are somewhat unrealistic with gradual OU, and some of them may be unrealistic with gradual Brownian motion as well. The bounding procedures are applied to nodes, this changes the distribution of out comes differently than if the bounding terms had been applied continuously along the branch length (i.e. incorporated into the equations). As a result one clade with many internal nodes which starts near a bound should have a different mean and variance than a clade with few internal nodes starting at the same point (and this should not be the case).

### The Mesquite Module

In order to properly implement the functionality of PDAP:Simul: I will need to store a large number of character matrices from simulations. This should actually be fairly easy, simply declare an array of character matrices, and fill them in as the simulation progresses. Character matrices can already easily be created with an arbitrary number of traits associated with tips. One alteration that may be useful, but is not strictly necessary, is to allow the auto-magic naming of nodes and their inclusion in the character matrix.

Tentatively the UI will consist of a window to edit the covariance matrix, a new item on the menu bar for various options, a new window to edit each of the models, a new tool on the tree editing window, and likely a new window for displaying simulation results, though perhaps these would just be new entries on the analysis menu.

#### Module Architecture

- CoVarMatrixCurator : This contains functions necessary for creating and maintaining models that will use correlated traits. It implements
- CoVarMatrixModel : Extends ProbabilityContCharModel to handle correlated traits. Implements functions that all Correlated trait models will need, certain defaults, etc.
- CorTraits : The class that contains the core functions for creating correlated standard normal distributions.
- CoVarTable: The class that manages editing the Covariance matrix.
- CoVarTableWindow: The class that handles visual display of the covariance matrix.

Simulation Model:

- EvolveState(Double beginState, Tree tree, int node) : One difficulty that I'm running into is that a correlated trait model must keep track of which trait it is calculating, since different traits draw from different distribution of random numbers. This is currently implemented by creating an internal variable that keeps track of the number of times the EvolveState function has been called, assuming that the EvolveState function will be called exactly once for all nodes on the tree except the root node, and that the order of nodes which evolve state is passed is identical for each trait. If this behavior is kept for the final model, minimally I will need to verify these assumptions by keeping track of the node number passed to EvolveState and dealing properly with nodes which are called in an unusual order. Alternatively, it would produce more stable and predictable behavior if I replaced some of the higher level function with function more appropriate for calculating correlated trait distributions.

## Future Work

### 3/15/10

- Fix scroll buttons
- Fix references (loading pdsimul can cause unexplained crashes with some java/os combinations)

### 4/15/10

- Fix save files

### 5/15/10

- Fix dailog boxes.

### 6/15/10

- add new models?

### 11/15/10

- declare it done!
- Have a party.

## Request for features

Please please please bug me if you want me to add or change anynthing about this module.