APE incomplete distance extension
Abstract: Constructing phylogenetic trees from measures of dissimilarity (distances) between species is a popular approach for inferring phylogenies, as it is fast and provides a useful snapshot of the data. The most popular method for such reconstruction is Neighbour Joining (N.J). Many improvements have been made to this algorithm since its release, for instance Bio-NJ, which have sped NJ up and have improved its accuracy. However in practice distance matrices with missing values (due to noise or lack of confidence in readings) are encountered. For such cases modifications of the above methods have been made, such as NJ*, which given an incomplete distance matrix will return a phylogenetic tree from that distance matrix. The project aims to extends the APE R package (which currently contains implementations of NJ and Bio_NJ) to also include such methods for phylogenetic reconstruction. We will also add a non-agglomerative approach (the triangles method) to APE and later extend this to handle reconstruction from incomplete distances.
Here is a week-by-week project schedule.
What has been done up to this point
- triangles method for complete and incomplete distances implemented
- nj* and bio-nj* implemented
- all tested and ready for inclusion into APE
What will be done:
4th July - 11th July: begin implementing SDM, which is a super-matrix approach to phylogenetic reconstruction. This week will be dedicated to the implementation of the construction routine for the consesus matrix of the distance matrices on overlapping sets of species. The following steps need to be taken to do this:
- decide on a good way of storing and passing a set of matrices from R to a C function (this might involve some discussion with my mentors)
- construct a consesus matrix as shown in 
11th July - 18th July: this week will be dedicated to implementing MVR, which is a method of inferring a statistically viable phylogenetic tree from the above distance matrix. This will involve the following steps:
- calculation of the vector w of weights used to compute the lenghts of the edges joining the two agglomerated taxa to the tree
- calculation of the lambda value used to compute the distance between the taxa generated from agglomeration and the rest of the taxa
- implementation of variance matrix reduction
All of the above will be done using the guidelines shown in . If all goes well this week will also be used to start work on MVR*, which is just MVR adapted for incomplete distances. 18th July - 25th July:finish work on MVR*. This week will also be used for testing and debugging the above C code 25th July - 1st August:finish testing and interface the above C code with R 1st August - 8th August: We also plan to implement tree popping, which is a method for reconstructing phylogenetic trees from splits. For this purpose this week will be dedicated to the implementation of splits in the APE C code, as it does not have them modelled. 8th August - 15th August: This week will be used to implement the actual tree popping.
The last week will be a slack week used to sort out any outstanding issues.