PhyloSoC: Summary and visualization of phylogenetic tree sets

From Phyloinformatics
Revision as of 10:55, 20 July 2012 by Justs (talk) (-> Week 9)
Jump to: navigation, search


Justs Zariņš


Abstract: To create “Mastodon”, a Java tool that works with already established open-source programs to summarize and visualize a large set of phylogenetic trees. Specifically, I am interested in implementing the tree pruning algorithms presented in Karen Cranston and Bruce Rannala's paper with an emphasis on good scaling with larger input sets.

Downloads: Downloads@GitHub

Source Code: Mastodon@GitHub

Relevant projects:



Performance mostly depends on the type of data examined. Something with a lot of noise (unique clades) will take much longer to process than a data set where the clades only change positions.

A spreadsheet with some performance stats


Tree pruning





Say you have a set of trees:

tree1 = ((A,(B,D)),(C,(F,E)));

tree2 = ((A,B),(D,(C,(F,E))));

tree3 = (((A,D),B),(C,(F,E)));

tree4 = (((A,B),D),((C,F),E));

First create a list of all unique taxa (order must be maintained): [A,B,C,D,E,F]

BitSet - essentially an array of true/false or 1/0 values.

A clade can be represented by setting (making equal to 1) the bits in a BitSet that correspond to the list of unique taxa. So a clade (A,(B,D)) will be [110100]. Note that the structure of the clade is not specified, only the taxa it contains. To specify the nesting of clades, a list of all sub-clades needs to be made.

For example, (A,(E,(B,D))) is completely described by: {[110110], [010110], [010100]}

This is called a BitTree in MASTodon. The actual BitTree object can hold other needed information about the tree.


This is the main working part of the wiki page. The following sections will be updated regularly as the project unfolds. The past, present and future collide: details of what has happened in previous weeks and what is planned for the next ones will be added, along with current developments.

Community bonding period

  • April 23 - April 27
    • Signed up for the wg-phyloinformatics mailing list
    • Met up with Andrew to discuss logistics for the summer
  • April 30 - May 4
    • Created wiki page for project
    • Created GitHub code repository
  • May 7 - May 11
    • Had a G+ meeting with the mentors, Andrew, Karen and Ben
      • Periodic meetings will be held on Friday mornings at about 09:15EST / 14:15BST
      • Main points of meetings will be summarized in this wiki page. As is being done here
      • Decided to use a circle on G+ for project discussions
  • May 14 - May 18
    • Set up the work computer. Installed Eclipse and the GitHub tool for commiting etc
    • Created Eclipse project for Mastodon and tested that committing to the repository works
    • Downloaded JEBL source code from SVN repository
    • Improved the wiki page with better formatting and more details

Week 1


  • Obtain datasets for testing
    • a small (about 10 trees) for quick testing
    • a large (about 10000 trees) for testing how large input affects performance
    • (maybe) one of the original datasets used in Karen's paper to compare results
  • Incorporate JEBL. The goal is to have a framework where handling tree data is easy and universally available throughout the program, perhaps with an imported class that contains all necessary methods. Most important classes from JEBL seem to be, for I/O and along with implementations to store tree data, for example
  • Explore if trees should be stored as an object in memory or, if the trees are large or many, serialize and store on the hard drive.
  • Write a test to import the small set and export it again. Check consistency.


  • Obtained Karen's carnivore data.
  • Made a class for reading in Nexus trees into memory. Trees can be stored as mutable, immutable or compact rooted trees as defined in JEBL.
  • Made a class for writing Nexus trees to file.
  • Wrote simple reading/writing and consistency checking test cases.

Week 2


  • Implement a simple algorithm interface. This would mandate that an algorithm implementing this interface:
    • accepts a set of trees as input
    • has a run method
    • has an output method that returns a list of pruned taxon and the resulting trees and their MAP scores
  • Write a MAP score calculator (needed for algorithms) for a set of trees.
  • Implement performance measurements.
    • Output results in a CSV file.
    • Check for overall running time and see approximately how it scales with the size of input. More importantly, record intermediate measurements in equal increments (say, after the processing of each tree) to detect if there is a slow-down as the program keeps running.
  • Start implementing the MH algorithm.


  • Extended the tree reader to also support Newick files.
  • Implemented a simple algorithm interface.
  • Wrote a MAP score calculator for a set of trees. Found out that I'll need more than one version, e.g. weighted/unweighted trees.
  • Started implementing the MH algorithm. Made a working version but not quite ready to handle real data.
  • Extra: largely implemented the "BitTree" framework for storing and manipulating trees as BitSet objects. This should help with handling large sets of trees.

Week 3


  • Finish up the simple tree MH algorithm implementations for weighted trees.
  • Work on an implementation of the above for BitTrees.
  • Clean up some of the lingering oddities in the BitTree code.
  • Test MH algorithm with the small set.
  • Check performance with the large set.


  • Fleshed out the initial implementation of the MH algorithm using Tree objects.
  • Implemented the algorithm using BitTrees. I'll focus on using BitTrees from now on as they are easy to work with and have performance benefits.
  • Tested the program with a 10k tree data set. It appears stable, no odd slow-downs or memory leaks. Unsure if the solutions are being found optimally.

Week 4


  • Create a simple command line interface that collects user input (set of trees, threshold variables), runs algorithm and outputs an annotated NEXUS most-supported tree along with a text list of pruned taxon.
  • Reduce memory usage by piping through data that doesn't need to be stored.
  • Hammer bugs and refactor.


  • Completed a first version of the program and put in the shared DropBox folder. It can search a set of trees for a common subtree. Output is a text file that lists pruned taxa together with the subtree's score and number of matching subtrees in the set, along with an annotated nexus file that colors the path from pruned leaf down(or is it up?) to the root when opened in FigTree.
  • Implemented piping in data in batches as it was possible to do everything required without holding the whole set of trees. At the moment I'm piping in a 100 at a time which seems to work alright, a set with 1441 tips hovers at around 200MB of memory which seems acceptable.

Bits and pieces around the code:

  • Reverted to using SimpleRootedTree instead of MutableRootedTree everywhere. There was some shady code I had to add to MutableRootedTree for it to work, so using SimpleRootedTrees again removes one possible source of pain.
  • Found a redundancy that made turning the Tree objects to BitTrees faster.
  • People on the internet complain about iterating through a Map using .keySet() so I changed to using .getEntries() as recommended. Not too much performance gain here though.
  • Dropped the classes that dealt with plain Tree objects and finally merged the Experimental branch back into the master branch.
  • Filled in most of the missing JavaDoc.

Week 5

1st deliverable - a working MASTodon


  • Continue working on the user application. Specifically, implement a version where all input can be given through flags at the time of launching the program. This will enable the user to put the program as a part of a workflow.
  • Write some user documentation/instructions and attach to the executable program.
  • Some areas still remain where performance can be squeezed out. At the moment the program scales something like a quadratic with the number of leaf nodes.


Week 6


  • Do a bit of code cleanup, need to merge in the new tree comparison algorithm.
  • Package and upload MASTodon v0.2
  • Explore how to collect data about which taxa are pruned most frequently and see if there is good material here for a heatmap visualization.


  • Improved the pruning list perturbation code. It didn't always alter the correct number of taxa as prescribed by the Poisson distribution.
  • Fixed a maths error that caused bias to swap out taxa at the beginning of the list. Made it uniform across the whole list.

The two above have made the search algorithm converge better as it moves more freely across the space of taxa.

  • Deleted a bunch of dead code.
  • Uploaded a version of mastodon with the new pruning mechanism to my GitHub downloads page.
  • Put in some code that collects information about pruning combinations that produce a local improvement in the map score. Used this to display a basic heat map of pruning combinations of 2 taxa.
  • Added support for weighted trees.
  • Implemented a bisection search algorithm. It takes as inputs the target map score and how many iterations to stay in each step. Starting out at half of the taxa pruned, it moves in half jumps to find the number of taxa that need to be pruned in order to obtain the desired map score and then stays there until the target map score is achieved. Seems quite useful for Andrew's virus data as quite a lot of taxa have to pruned away there to see any sort of skeleton tree emerge.
  • Found and fixed a bug that shows up when many taxa are being pruned as a bonus.
  • Found a bug where a tree is declared equal to the map tree when it's not. Put in a fix that deals with this. Need to investigate why this seems to happen only rarely.
  • Learned how to use the Eclipse debugger.

Week 7


Start working on visual features.

  • GUI for user input, allow to run the program multiple times without rebuilding the tree set.
  • Integration with FigTree for results.
  • Click-Pruning.


  • Spent some time learning about the more advanced features of visual Java.
  • Plugged in bits of code in FigTree's source to make a new menu item that allows to execute the pruning algorithm and send the output trees straight to the display.
  • Found other instances where the new pruning mechanism calls two trees equal when they are not. This is good because previously there was only one instance of this so I wasn't sure if it's an error in the code or a concept.
  • Moved the above implemented functionality to a simpler template. Mastodon probably won't need all the tools that are available in FigTree.

Week 8 - Midterm evaluation


  • Implement a stand-alone Mastodon graphical application.
  • Implement a simulated annealing pruning algorithm.
  • Implement workflow:
    • select tree set
    • run bisection algorithm to find a reasonable number of taxa to prune
    • options to run a MH or SA algorithm (exploratory or aggressive)
    • manual pruning options

A sketch of the app and the workflow:

Sketch of App


  • Preliminary simulated annealing implementation.
  • Functional GUI.

Mastodon GUI v0.1

-> Week 9

2nd deliverable


  • Modify the algorithms interface to make it more modular.
  • Make user interface more intuitive. (probably means getting rid of list of runs)
  • Allow the user to specify what combination of algorithm and pruning number finder to use.
  • Aim to deliver a program with native tree visualization and variation exploration capabilities.


  • Rewrote the algorithm interface to factor out some of the repeated code.
  • Changed algorithm selection dialog to allow any combination of linear/constant/bisection searching and Metropolis Hastings /Simulated annealing algorithms.
  • Added 3 tree coloring buttons: none, pruned, pruning frequency.
  • Added clade probability display option.
  • Changed results table to list all taxa together with their pruning frequencies.
  • Other bits and pieces, like table sorting, a selected taxon in the table highlights in the tree and bugfixes.
  • Created a [runnable Mastodon App]

MastodonApp v0.1

Weeks 10 and 11

  • Investigate and implement additional visualizations.
  • If work has gone quicker than expected, add the TA algorithm. This depends on how much time is available and how long the TA algorithm took to implement, as both seem rather similar on the surface.
  • Begin code clean-up work.

Week 12

  • Streamline code and finish documentation.

Week 13

Soft pencils down week. Margin for error.

Week 14 - Final evaluation