Difference between revisions of "APE incomplete distance extension"

From Phyloinformatics
Jump to: navigation, search
Line 5: Line 5:
 
Here is a week-by-week project schedule.
 
Here is a week-by-week project schedule.
  
''May 24th - May 31st:''
+
What has been done up to this point
  
Implementation of auxiliary methods for the triangles method.
+
* triangles method for complete and incomplete distances implemented
These are as follows:
+
* nj* and bio-nj* implemented
 +
* all tested and ready for inclusion into APE
  
- getPathBetween(int x,int y): get the path between the leaves x and y. All vertices will be identified by numbers
+
What will be done:
in the spirit of the code that has already been written in APE. This method will take two integer values
 
(representing the indices of the leaves) and return an array representing the path between the leaves.
 
  
- pred(int x): returns the index of the vertex which is the parent of the vertex indicated by x
+
''4th July - 11th July'': begin implementing SDM, which is a super-matrix approach to phylogenetic reconstruction. This week will be dedicated to the implementation of the construction routine for the consesus matrix of the distance matrices on overlapping sets of species. The following steps need to be taken to do this:
 +
* decide on a good way of storing and passing a set of matrices from R to a C function (this might involve some discussion with my mentors)
 +
* construct a consesus matrix as shown in [1]
 +
''11th July - 18th July'': this week will be dedicated to implementing MVR, which is a method of inferring a statistically viable phylogenetic tree from the above distance matrix. This will involve the following steps:
 +
* calculation of the vector w of weights used to compute the lenghts of the edges joining the two agglomerated taxa to the tree
 +
* calculation of the lambda value used to compute the distance between the taxa generated from agglomeration and the rest of the taxa
 +
* implementation of variance matrix reduction
 +
All of the above will be done using the guidelines shown in [1].
 +
If all goes well this week will also be used to start work on MVR*, which is just MVR adapted for incomplete distances.
 +
''18th July - 25th July'':finish work on MVR*. This week will also be used for testing and debugging the above C code
 +
''25th July - 1st August'':finish testing and interface the above C code with R
 +
''1st August - 8th August'': We also plan to implement tree popping, which is a method for reconstructing phylogenetic trees from splits. For this purpose this week will be dedicated to the implementation of splits in the APE C code, as it does not have them modelled.
 +
''8th August - 15th August'': This week will be used to implement the actual tree popping.
  
- getDistanceBetween(int x, int y): returns the length of the path between x and y
+
The last week will be a slack week used to sort out any outstanding issues.
This is expected to take less than a week and any extra time will be used to start implementing the triangles
 
method for complete distances.
 
 
 
''June 1st - June 8th:''
 
In this week we begin implementation of the triangles method. This will involve writing code to do the following:
 
 
 
- construct an initial tree on three leaves that respects the distances in the input distance matrix
 
 
 
- choose a leaf l that is closest (as defined in the sense of this method) to the partially constructed tree, and
 
also the two leaves x,y on the path between which this leaf is to be attached. Then attach l on the path between x
 
and y such that the distances in the input distance matrix are respected (this can be done in a unique way).
 
 
 
- after adding a leaf l as above, recompute the distance between any leaves not yet added to the tree and the new tree
 
 
 
''June 9th - June 15th:''
 
Extend the code for the triangles method to also allow non-binary phylogenetic trees to be returned. This will involve
 
checking where the attachment point of a leaf l to be added the path between two leaves x and y is on that respective path
 
and changing the new resulting tree accordingly (e.g if it is to be attached to a vertex already existent to that path,
 
we just attach it to that vertex, rather than subdividing the path with an edge of weight 0).
 
 
 
This period will also serve to start interfacing the c code written up until now with R. Here we will be writing the 'wrapper'
 
R function that will call our C code.
 
 
 
''June 16th - June 23rd:''
 
This week will be for sorting any potential bugs which may have arisen from the interfacing of the C code with R. Also final testing of the code implemented so far will take place. If no issues are found the next segment of the timeline will begin.
 
 
 
''June 24rd - July 1st:''
 
We will begin coding of the NJ* method for incomplete distance reconstruction. This will first involve implementing the new agglomeration criterion for this method, and also work on branch length estimation.
 
 
 
''July 1st - July 8th:''
 
Here we will work on distance matrix reduction. We also begin writing the 'wrapper' R function that will call the NJ* C code.
 
 
 
''July 9th - July 16th:''
 
This week will be spent on sorting out potential bugs that may have arisen from the NJ* code interfacing with R, and also final testing of this code.
 
 
 
''July 17th - July 24th:''
 
We begin implementing the Bio-NJ* code. We begin by coding the agglomeration criterion and the branch length estimation.
 
 
 
''July 25th -August 2nd:''
 
Coding the distance matrix reduction and writing the 'wrapper' R function, as well as debugging and testing.
 
 
 
''August 3rd - August 10th:''
 
Here we begin extending the triangles method for incomplete distances. This time will be dedicated to writing a program to construct a phylogenetic tree from a subset of leaves such that the distances between them are complete (either using NJ or the triangles method). Also ranking leaves based on the number of known distances between them and the constructed subtree will be implemented here.
 
 
 
''August 11th - August 18th:''
 
Here we write code that would use the methods from the previous week to reconstruct a phylogenetic tree from an incomplete distance matrix using the triangles method.
 
 
 
''August 18th- August 23rd:''
 
 
 
Slack week used to sort out any outstanding issues.
 

Revision as of 05:39, 4 July 2011

Abstract: Constructing phylogenetic trees from measures of dissimilarity (distances) between species is a popular approach for inferring phylogenies, as it is fast and provides a useful snapshot of the data. The most popular method for such reconstruction is Neighbour Joining (N.J). Many improvements have been made to this algorithm since its release, for instance Bio-NJ, which have sped NJ up and have improved its accuracy. However in practice distance matrices with missing values (due to noise or lack of confidence in readings) are encountered. For such cases modifications of the above methods have been made, such as NJ*, which given an incomplete distance matrix will return a phylogenetic tree from that distance matrix. The project aims to extends the APE R package (which currently contains implementations of NJ and Bio_NJ) to also include such methods for phylogenetic reconstruction. We will also add a non-agglomerative approach (the triangles method) to APE and later extend this to handle reconstruction from incomplete distances.

Project Plan:

Here is a week-by-week project schedule.

What has been done up to this point

  • triangles method for complete and incomplete distances implemented
  • nj* and bio-nj* implemented
  • all tested and ready for inclusion into APE

What will be done:

4th July - 11th July: begin implementing SDM, which is a super-matrix approach to phylogenetic reconstruction. This week will be dedicated to the implementation of the construction routine for the consesus matrix of the distance matrices on overlapping sets of species. The following steps need to be taken to do this:

  • decide on a good way of storing and passing a set of matrices from R to a C function (this might involve some discussion with my mentors)
  • construct a consesus matrix as shown in [1]

11th July - 18th July: this week will be dedicated to implementing MVR, which is a method of inferring a statistically viable phylogenetic tree from the above distance matrix. This will involve the following steps:

  • calculation of the vector w of weights used to compute the lenghts of the edges joining the two agglomerated taxa to the tree
  • calculation of the lambda value used to compute the distance between the taxa generated from agglomeration and the rest of the taxa
  • implementation of variance matrix reduction

All of the above will be done using the guidelines shown in [1]. If all goes well this week will also be used to start work on MVR*, which is just MVR adapted for incomplete distances. 18th July - 25th July:finish work on MVR*. This week will also be used for testing and debugging the above C code 25th July - 1st August:finish testing and interface the above C code with R 1st August - 8th August: We also plan to implement tree popping, which is a method for reconstructing phylogenetic trees from splits. For this purpose this week will be dedicated to the implementation of splits in the APE C code, as it does not have them modelled. 8th August - 15th August: This week will be used to implement the actual tree popping.

The last week will be a slack week used to sort out any outstanding issues.