Difference between revisions of "APE incomplete distance extension"

From Phyloinformatics
Jump to: navigation, search
 
(2 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
Here is a week-by-week project schedule.
 
Here is a week-by-week project schedule.
  
''May 24th - May 31st:''
+
What has been done up to this point
  
Implementation of auxiliary methods for the triangles method.
+
* triangles method for complete and incomplete distances implemented
These are as follows:
+
* nj* and bio-nj* implemented
- getPathBetween(int x,int y): get the path between the leaves x and y. All vertices will be identified by numbers
+
* all tested and ready for inclusion into APE
in the spirit of the code that has already been written in APE. This method will take two integer values
 
(representing the indices of the leaves) and return an array representing the path between the leaves.
 
- pred(int x): returns the index of the vertex which is the parent of the vertex indicated by x
 
- getDistanceBetween(int x, int y): returns the length of the path between x and y
 
This is expected to take less than a week and any extra time will be used to start implementing the triangles  
 
method for complete distances.
 
''June 1st - June 8th:''
 
In this week we begin implementation of the triangles method. This will involve writing code to do the following:
 
- construct an initial tree on three leaves that respects the distances in the input distance matrix
 
- choose a leaf l that is closest (as defined in the sense of this method) to the partially constructed tree, and
 
also the two leaves x,y on the path between which this leaf is to be attached. Then attach l on the path between x
 
and y such that the distances in the input distance matrix are respected (this can be done in a unique way).
 
- after adding a leaf l as above, recompute the distance between any leaves not yet added to the tree and the new tree
 
''June 9th - June 15th:''
 
Extend the code for the triangles method to also allow non-binary phylogenetic trees to be returned. This will involve
 
checking where the attachment point of a leaf l to be added the path between two leaves x and y is on that respective path
 
and changing the new resulting tree accordingly (e.g if it is to be attached to a vertex already existent to that path,
 
we just attach it to that vertex, rather than subdividing the path with an edge of weight 0).
 
  
This period will also serve to start interfacing the c code written up until now with R. Here we will be writing the 'wrapper'
+
What will be done:
R function that will call our C code.
 
  
''June 16th - June 23rd:''
+
''4th July - 11th July'': begin implementing SDM, which is a super-matrix approach to phylogenetic reconstruction. This week will be dedicated to the implementation of the construction routine for the consesus matrix of the distance matrices on overlapping sets of species. The following steps need to be taken to do this:
This week will be for sorting any potential bugs which may have arisen from the interfacing of the C code with R. Also final testing of the code implemented so far will take place. If no issues are found the next segment of the timeline will begin.
+
* decide on a good way of storing and passing a set of matrices from R to a C function (this might involve some discussion with my mentors)
 +
* construct a consesus matrix as shown in [1]
 +
''11th July - 18th July'': this week will be dedicated to implementing MVR, which is a method of inferring a statistically viable phylogenetic tree from the above distance matrix. This will involve the following steps:
 +
* calculation of the vector w of weights used to compute the lenghts of the edges joining the two agglomerated taxa to the tree
 +
* calculation of the lambda value used to compute the distance between the taxa generated from agglomeration and the rest of the taxa
 +
* implementation of variance matrix reduction
 +
All of the above will be done using the guidelines shown in [1].
 +
If all goes well this week will also be used to start work on MVR*, which is just MVR adapted for incomplete distances.
  
''June 24rd - July 1st:''
 
We will begin coding of the NJ* method for incomplete distance reconstruction. This will first involve implementing the new agglomeration criterion for this method, and also work on branch length estimation.
 
  
''July 1st - July 8th:''
+
''18th July - 25th July'':finish work on MVR*. This week will also be used for testing and debugging the above C code
Here we will work on distance matrix reduction
+
 
 +
 
 +
''25th July - 1st August'':finish testing and interface the above C code with R
 +
 
 +
 
 +
''1st August - 8th August'': We also plan to implement tree popping, which is a method for reconstructing phylogenetic trees from splits. For this purpose this week will be dedicated to the implementation of splits in the APE C code, as it does not have them modelled.
 +
 
 +
 
 +
''8th August - 15th August'': This week will be used to implement the actual tree popping.
 +
 
 +
 
 +
The last week will be a slack week used to sort out any outstanding issues.
 +
 
 +
[1] http://www.biomedcentral.com/1471-2105/9/166

Latest revision as of 05:42, 4 July 2011

Abstract: Constructing phylogenetic trees from measures of dissimilarity (distances) between species is a popular approach for inferring phylogenies, as it is fast and provides a useful snapshot of the data. The most popular method for such reconstruction is Neighbour Joining (N.J). Many improvements have been made to this algorithm since its release, for instance Bio-NJ, which have sped NJ up and have improved its accuracy. However in practice distance matrices with missing values (due to noise or lack of confidence in readings) are encountered. For such cases modifications of the above methods have been made, such as NJ*, which given an incomplete distance matrix will return a phylogenetic tree from that distance matrix. The project aims to extends the APE R package (which currently contains implementations of NJ and Bio_NJ) to also include such methods for phylogenetic reconstruction. We will also add a non-agglomerative approach (the triangles method) to APE and later extend this to handle reconstruction from incomplete distances.

Project Plan:

Here is a week-by-week project schedule.

What has been done up to this point

  • triangles method for complete and incomplete distances implemented
  • nj* and bio-nj* implemented
  • all tested and ready for inclusion into APE

What will be done:

4th July - 11th July: begin implementing SDM, which is a super-matrix approach to phylogenetic reconstruction. This week will be dedicated to the implementation of the construction routine for the consesus matrix of the distance matrices on overlapping sets of species. The following steps need to be taken to do this:

  • decide on a good way of storing and passing a set of matrices from R to a C function (this might involve some discussion with my mentors)
  • construct a consesus matrix as shown in [1]

11th July - 18th July: this week will be dedicated to implementing MVR, which is a method of inferring a statistically viable phylogenetic tree from the above distance matrix. This will involve the following steps:

  • calculation of the vector w of weights used to compute the lenghts of the edges joining the two agglomerated taxa to the tree
  • calculation of the lambda value used to compute the distance between the taxa generated from agglomeration and the rest of the taxa
  • implementation of variance matrix reduction

All of the above will be done using the guidelines shown in [1]. If all goes well this week will also be used to start work on MVR*, which is just MVR adapted for incomplete distances.


18th July - 25th July:finish work on MVR*. This week will also be used for testing and debugging the above C code


25th July - 1st August:finish testing and interface the above C code with R


1st August - 8th August: We also plan to implement tree popping, which is a method for reconstructing phylogenetic trees from splits. For this purpose this week will be dedicated to the implementation of splits in the APE C code, as it does not have them modelled.


8th August - 15th August: This week will be used to implement the actual tree popping.


The last week will be a slack week used to sort out any outstanding issues.

[1] http://www.biomedcentral.com/1471-2105/9/166