APE incomplete distance extension

From Phyloinformatics
Revision as of 20:20, 19 May 2011 by OpenIDUser7 (talk)
Jump to: navigation, search

Abstract: Constructing phylogenetic trees from measures of dissimilarity (distances) between species is a popular approach for inferring phylogenies, as it is fast and provides a useful snapshot of the data. The most popular method for such reconstruction is Neighbour Joining (N.J). Many improvements have been made to this algorithm since its release, for instance Bio-NJ, which have sped NJ up and have improved its accuracy. However in practice distance matrices with missing values (due to noise or lack of confidence in readings) are encountered. For such cases modifications of the above methods have been made, such as NJ*, which given an incomplete distance matrix will return a phylogenetic tree from that distance matrix. The project aims to extends the APE R package (which currently contains implementations of NJ and Bio_NJ) to also include such methods for phylogenetic reconstruction. We will also add a non-agglomerative approach (the triangles method) to APE and later extend this to handle reconstruction from incomplete distances.

Project Plan:

Here is a week-by-week project schedule.

May 24th - May 31st:

Implementation of auxiliary methods for the triangles method. These are as follows:

- getPathBetween(int x,int y): get the path between the leaves x and y. All vertices will be identified by numbers in the spirit of the code that has already been written in APE. This method will take two integer values (representing the indices of the leaves) and return an array representing the path between the leaves.

- pred(int x): returns the index of the vertex which is the parent of the vertex indicated by x

- getDistanceBetween(int x, int y): returns the length of the path between x and y This is expected to take less than a week and any extra time will be used to start implementing the triangles method for complete distances.

June 1st - June 8th: In this week we begin implementation of the triangles method. This will involve writing code to do the following:

- construct an initial tree on three leaves that respects the distances in the input distance matrix

- choose a leaf l that is closest (as defined in the sense of this method) to the partially constructed tree, and also the two leaves x,y on the path between which this leaf is to be attached. Then attach l on the path between x and y such that the distances in the input distance matrix are respected (this can be done in a unique way).

- after adding a leaf l as above, recompute the distance between any leaves not yet added to the tree and the new tree

June 9th - June 15th: Extend the code for the triangles method to also allow non-binary phylogenetic trees to be returned. This will involve checking where the attachment point of a leaf l to be added the path between two leaves x and y is on that respective path and changing the new resulting tree accordingly (e.g if it is to be attached to a vertex already existent to that path, we just attach it to that vertex, rather than subdividing the path with an edge of weight 0).

This period will also serve to start interfacing the c code written up until now with R. Here we will be writing the 'wrapper' R function that will call our C code.

June 16th - June 23rd: This week will be for sorting any potential bugs which may have arisen from the interfacing of the C code with R. Also final testing of the code implemented so far will take place. If no issues are found the next segment of the timeline will begin.

June 24rd - July 1st: We will begin coding of the NJ* method for incomplete distance reconstruction. This will first involve implementing the new agglomeration criterion for this method, and also work on branch length estimation.

July 1st - July 8th: Here we will work on distance matrix reduction. We also begin writing the 'wrapper' R function that will call the NJ* C code.

July 9th - July 16th: This week will be spent on sorting out potential bugs that may have arisen from the NJ* code interfacing with R, and also final testing of this code.

July 17th - July 24th: We begin implementing the Bio-NJ* code. We begin by coding the agglomeration criterion and the branch length estimation.

July 25th -August 2nd: Coding the distance matrix reduction and writing the 'wrapper' R function, as well as debugging and testing.

August 3rd - August 10th: Here we begin extending the triangles method for incomplete distances. This time will be dedicated to writing a program to construct a phylogenetic tree from a subset of leaves such that the distances between them are complete (either using NJ or the triangles method). Also ranking leaves based on the number of known distances between them and the constructed subtree will be implemented here.

August 11th - August 18th: Here we write code that would use the methods from the previous week to reconstruct a phylogenetic tree from an incomplete distance matrix using the triangles method.

August 18th- August 23rd:

Slack week used to sort out any outstanding issues.