APE incomplete distance extension
Abstract: Constructing phylogenetic trees from measures of dissimilarity (distances) between species is a popular approach for inferring phylogenies, as it is fast and provides a useful snapshot of the data. The most popular method for such reconstruction is Neighbour Joining (N.J). Many improvements have been made to this algorithm since its release, for instance Bio-NJ, which have sped NJ up and have improved its accuracy. However in practice distance matrices with missing values (due to noise or lack of confidence in readings) are encountered. For such cases modifications of the above methods have been made, such as NJ*, which given an incomplete distance matrix will return a phylogenetic tree from that distance matrix. The project aims to extends the APE R package (which currently contains implementations of NJ and Bio_NJ) to also include such methods for phylogenetic reconstruction. We will also add a non-agglomerative approach (the triangles method) to APE and later extend this to handle reconstruction from incomplete distances.
Here is a week-by-week project schedule.
May 24th - May 31st:
Implementation of auxiliary methods for the triangles method. These are as follows: - getPathBetween(int x,int y): get the path between the leaves x and y. All vertices will be identified by numbers in the spirit of the code that has already been written in APE. This method will take two integer values (representing the indices of the leaves) and return an array representing the path between the leaves. - pred(int x): returns the index of the vertex which is the parent of the vertex indicated by x - getDistanceBetween(int x, int y): returns the length of the path between x and y This is expected to take less than a week and any extra time will be used to start implementing the triangles method for complete distances. June 1st - June 8th: In this week we begin implementation of the triangles method. This will involve writing code to do the following: - construct an initial tree on three leaves that respects the distances in the input distance matrix - choose a leaf l that is closest (as defined in the sense of this method) to the partially constructed tree, and also the two leaves x,y on the path between which this leaf is to be attached. Then attach l on the path between x and y such that the distances in the input distance matrix are respected (this can be done in a unique way). - after adding a leaf l as above, recompute the distance between any leaves not yet added to the tree and the new tree June 9th - June 15th: Extend the code for the triangles method to also allow non-binary phylogenetic trees to be returned. This will involve checking where the attachment point of a leaf l to be added the path between two leaves x and y is on that respective path and changing the new resulting tree accordingly (e.g if it is to be attached to a vertex already existent to that path, we just attach it to that vertex, rather than subdividing the path with an edge of weight 0).
This period will also serve to start interfacing the c code written up until now with R. Here we will be writing the 'wrapper' R function that will call our C code.
June 16th - June 23rd: This week will be for sorting any potential bugs which may have arisen from the interfacing of the C code with R. Also final testing of the code implemented so far will take place. If no issues are found the next segment of the timeline will begin.
June 24rd - July 1st: We will begin coding of the NJ* method for incomplete distance reconstruction. This will first involve implementing the new agglomeration criterion for this method, and also work on branch length estimation.
July 1st - July 8th: Here we will work on distance matrix reduction