MASTodon

Project

Abstract: To create “Mastodon”, a Java tool that works with already established open-source programs to summarize and visualize a large set of phylogenetic trees. Specifically, I am interested in implementing the tree pruning algorithms presented in Karen Cranston and Bruce Rannala's paper with an emphasis on good scaling with larger input sets.

Source Code: Mastodon@GitHub

Relevant projects:

Performance

Performance mostly depends on the type of data examined. Something with a lot of noise (unique clades) will take much longer to process than a data set where the clades only change positions.

A spreadsheet with some performance stats [some additional checks have since slowed down the algorithm]

Ideas

BitTrees

Say you have a set of trees:

tree1 = ((A,(B,D)),(C,(F,E)));

tree2 = ((A,B),(D,(C,(F,E))));

tree3 = (((A,D),B),(C,(F,E)));

tree4 = (((A,B),D),((C,F),E));

First create a list of all unique taxa (order must be maintained): [A,B,C,D,E,F]

BitSet - essentially an array of true/false or 1/0 values.

A clade can be represented by setting (making equal to 1) the bits in a BitSet that correspond to the list of unique taxa. So a clade (A,(B,D)) will be [110100]. Note that the structure of the clade is not specified, only the taxa it contains. To specify the nesting of clades, a list of all sub-clades needs to be made.

For example, (A,(E,(B,D))) is completely described by: {[110110], [010110], [010100]}

This is called a BitTree in MASTodon. The actual BitTree object can hold other needed information about the tree.

Timeline

This is the main working part of the wiki page. The following sections will be updated regularly as the project unfolds. The past, present and future collide: details of what has happened in previous weeks and what is planned for the next ones will be added, along with current developments.

Community bonding period

• April 23 - April 27
• Signed up for the wg-phyloinformatics mailing list
• Met up with Andrew to discuss logistics for the summer
• April 30 - May 4
• Created wiki page for project
• Created GitHub code repository
• May 7 - May 11
• Had a G+ meeting with the mentors, Andrew, Karen and Ben
• Periodic meetings will be held on Friday mornings at about 09:15EST / 14:15BST
• Main points of meetings will be summarized in this wiki page. As is being done here
• Decided to use a circle on G+ for project discussions
• May 14 - May 18
• Set up the work computer. Installed Eclipse and the GitHub tool for commiting etc
• Created Eclipse project for Mastodon and tested that committing to the repository works
• Improved the wiki page with better formatting and more details

Week 1

Planned

• Obtain datasets for testing
• a small (about 10 trees) for quick testing
• a large (about 10000 trees) for testing how large input affects performance
• (maybe) one of the original datasets used in Karen's paper to compare results
• Incorporate JEBL. The goal is to have a framework where handling tree data is easy and universally available throughout the program, perhaps with an imported class that contains all necessary methods. Most important classes from JEBL seem to be NexusImporter.java, NexusExporter.java for I/O and Tree.java along with implementations to store tree data, for example RootedTree.java
• Explore if trees should be stored as an object in memory or, if the trees are large or many, serialize and store on the hard drive.
• Write a test to import the small set and export it again. Check consistency.

Completed

• Obtained Karen's carnivore data.
• Made a class for reading in Nexus trees into memory. Trees can be stored as mutable, immutable or compact rooted trees as defined in JEBL.
• Made a class for writing Nexus trees to file.
• Wrote simple reading/writing and consistency checking test cases.

Week 2

Planned

• Implement a simple algorithm interface. This would mandate that an algorithm implementing this interface:
• accepts a set of trees as input
• has a run method
• has an output method that returns a list of pruned taxon and the resulting trees and their MAP scores
• Write a MAP score calculator (needed for algorithms) for a set of trees.
• Implement performance measurements.
• Output results in a CSV file.
• Check for overall running time and see approximately how it scales with the size of input. More importantly, record intermediate measurements in equal increments (say, after the processing of each tree) to detect if there is a slow-down as the program keeps running.
• Start implementing the MH algorithm.

Completed

• Extended the tree reader to also support Newick files.
• Implemented a simple algorithm interface.
• Wrote a MAP score calculator for a set of trees. Found out that I'll need more than one version, e.g. weighted/unweighted trees.
• Started implementing the MH algorithm. Made a working version but not quite ready to handle real data.
• Extra: largely implemented the "BitTree" framework for storing and manipulating trees as BitSet objects. This should help with handling large sets of trees.

Week 3

Planned

• Finish up the simple tree MH algorithm implementations for weighted trees.
• Work on an implementation of the above for BitTrees.
• Clean up some of the lingering oddities in the BitTree code.
• Test MH algorithm with the small set.
• Check performance with the large set.

Completed

• Fleshed out the initial implementation of the MH algorithm using Tree objects.
• Implemented the algorithm using BitTrees. I'll focus on using BitTrees from now on as they are easy to work with and have performance benefits.
• Tested the program with a 10k tree data set. It appears stable, no odd slow-downs or memory leaks. Unsure if the solutions are being found optimally.

Week 4

Planned

• Create a simple command line interface that collects user input (set of trees, threshold variables), runs algorithm and outputs an annotated NEXUS most-supported tree along with a text list of pruned taxon.
• Reduce memory usage by piping through data that doesn't need to be stored.
• Hammer bugs and refactor.

Completed

• Completed a first version of the program and put in the shared DropBox folder. It can search a set of trees for a common subtree. Output is a text file that lists pruned taxa together with the subtree's score and number of matching subtrees in the set, along with an annotated nexus file that colors the path from pruned leaf down(or is it up?) to the root when opened in FigTree.
• Implemented piping in data in batches as it was possible to do everything required without holding the whole set of trees. At the moment I'm piping in a 100 at a time which seems to work alright, a set with 1441 tips hovers at around 200MB of memory which seems acceptable.

Bits and pieces around the code:

• Reverted to using SimpleRootedTree instead of MutableRootedTree everywhere. There was some shady code I had to add to MutableRootedTree for it to work, so using SimpleRootedTrees again removes one possible source of pain.
• Found a redundancy that made turning the Tree objects to BitTrees faster.
• People on the internet complain about iterating through a Map using .keySet() so I changed to using .getEntries() as recommended. Not too much performance gain here though.
• Dropped the classes that dealt with plain Tree objects and finally merged the Experimental branch back into the master branch.
• Filled in most of the missing JavaDoc.

Week 5

1st deliverable - a working MASTodon

Planned

• Continue working on the user application. Specifically, implement a version where all input can be given through flags at the time of launching the program. This will enable the user to put the program as a part of a workflow.
• Write some user documentation/instructions and attach to the executable program.
• Some areas still remain where performance can be squeezed out. At the moment the program scales something like a quadratic with the number of leaf nodes.

Completed

Week 6

Planned

• Do a bit of code cleanup, need to merge in the new tree comparison algorithm.
• Package and upload MASTodon v0.2
• Explore how to collect data about which taxa are pruned most frequently and see if there is good material here for a heatmap visualization.

Completed

• Improved the pruning list perturbation code. It didn't always alter the correct number of taxa as prescribed by the Poisson distribution.
• Fixed a maths error that caused bias to swap out taxa at the beginning of the list. Made it uniform across the whole list.

The two above have made the search algorithm converge better as it moves more freely across the space of taxa.

• Deleted a bunch of dead code.
• Put in some code that collects information about pruning combinations that produce a local improvement in the map score. Used this to display a basic heat map of pruning combinations of 2 taxa.
• Added support for weighted trees.
• Implemented a bisection search algorithm. It takes as inputs the target map score and how many iterations to stay in each step. Starting out at half of the taxa pruned, it moves in half jumps to find the number of taxa that need to be pruned in order to obtain the desired map score and then stays there until the target map score is achieved. Seems quite useful for Andrew's virus data as quite a lot of taxa have to pruned away there to see any sort of skeleton tree emerge.
• Found and fixed a bug that shows up when many taxa are being pruned as a bonus.
• Found a bug where a tree is declared equal to the map tree when it's not. Put in a fix that deals with this. Need to investigate why this seems to happen only rarely.
• Learned how to use the Eclipse debugger.

Week 7

Planned

Start working on visual features.

• GUI for user input, allow to run the program multiple times without rebuilding the tree set.
• Integration with FigTree for results.
• Click-Pruning.

Completed

• Spent some time learning about the more advanced features of visual Java.
• Plugged in bits of code in FigTree's source to make a new menu item that allows to execute the pruning algorithm and send the output trees straight to the display.
• Found other instances where the new pruning mechanism calls two trees equal when they are not. This is good because previously there was only one instance of this so I wasn't sure if it's an error in the code or a concept.
• Moved the above implemented functionality to a simpler template. Mastodon probably won't need all the tools that are available in FigTree.

Week 8 - Midterm evaluation

Planned

• Implement a stand-alone Mastodon graphical application.
• Implement a simulated annealing pruning algorithm.
• Implement workflow:
• select tree set
• run bisection algorithm to find a reasonable number of taxa to prune
• options to run a MH or SA algorithm (exploratory or aggressive)
• manual pruning options

A sketch of the app and the workflow:

Completed

• Preliminary simulated annealing implementation.
• Functional GUI.

Week 9

2nd deliverable

Planned

• Modify the algorithms interface to make it more modular.
• Make user interface more intuitive. (probably means getting rid of list of runs)
• Allow the user to specify what combination of algorithm and pruning number finder to use.
• Aim to deliver a program with native tree visualization and variation exploration capabilities.

Completed

• Rewrote the algorithm interface to factor out some of the repeated code.
• Changed algorithm selection dialog to allow any combination of linear/constant/bisection searching and Metropolis Hastings /Simulated annealing algorithms.
• Added 3 tree coloring buttons: none, pruned, pruning frequency.
• Changed results table to list all taxa together with their pruning frequencies.
• Other bits and pieces, like table sorting, a selected taxon in the table highlights in the tree and bugfixes.
• Created a runnable Mastodon App

Week 10

Planned

• Get some feedback on how user friendly the app is and change it accordingly.
• Implement a pruned tree output where the pruned taxa are actually removed from the tree, not just highlighted.
• Implement some manual pruning tools.
• Investigate how the pruning algorithms behave. They don't seem to be handling large datasets quite right.
• Add burn-in and data set sampling options.
• Make a mac and pc executable app.
• Support for unrooted trees.

Completed

• Implemented manual pruning. You can select a taxon or a whole clade and chop it off or add back. MAP score gets updated immediately. Coloring now colors an entire clade if all its tips are pruned, not just the tips.
• Added ability to set Burn-in when opening a new data set. No Sampling yet, though I think burn-in is more useful anyway.
• Added a popup that comes up after loading a data set and tells the user how many trees, taxa and unique clades were found.
• Added a "commit" (a bit misleading because no information is lost) button. This takes the taxa marked for pruning and removes them from all trees and brings up this new tree set in a new window. The user can continue pruning the subtree further as with a full tree.
• Adapted Andrew's build file (aside from readme's etc.) to build executable applications of MASTodon for Mac/Windows/Linux. Mac works, Windows probably does, not sure about Linux because I don't really understand the shell script it uses during building.
• Started deleting unused classes and libraries.
• Implemented a Rooting option. It takes a taxon from the tree set that the user specifies and roots all trees using it as an outgroup.
• Worked on lots of bits and pieces to make the user experience smoother.
• Added a "Weighting power" option to MH algorithm. The power is used to modify the objective function: (newScore/oldScore)^p .... A higher power makes the algorithm more aggressive.
• Explored a bit how the algorithms behave, and they seem to be doing what one would expect for the most part. The Specific issues observed before:

MH sampling uniformly - I got this to change by exaggerating the new/old score ratio. This is actually what Karen did in the paper too, she mentions that in this case she's using MH as an optimization algorithm. SA sampling uniformly - a lower starting temperature does the trick. Clades with probability 1 getting pruned - usually, if you track up the tree, there will be a clade with a low probability. So the clade with probability 1 is in all trees, but probably moving around. Unfortunately the best thing our approach can do is cut the clade out altogether. I've put in place a manual pruning option, pruning and un-pruning these clades in some cases shows a large change in MAP score. Taxa with high pruning frequency not making into the final pruning set - not too sure about this one. There are also some taxa included that don't change the map score at all, or make no sense (like the outgroup in the rooted set). Perhaps it is worthwhile to add a post-processing step that follows some simple algorithm to go through the pruned set and optimize it a bit. Though this can also be done manually quite easily, unless there are a lot of tips.

Week 11

Planned

• Implement a new search algorithm: Flip Penalty.
• Update CLI version.
• Bunch of interface things.

Completed

• Added export of the tree view to lots of formats.
• Put in Undo/Redo buttons for manual pruning to guard against clumsy users. I did a tree-like undo history: if you do a bunch of actions, then undo a few and then do another action, this is the new top of the action stack.
• Modified windows and linux builds to work correctly.
• Many GUI improvements and tweaks.
• Spent a day to do a large chunk of the code cleanup/documentation.
• Implemented a flip-penalty algorithm partially. It prunes and unprunes taxa at random, but there is a penalty for pruning more taxa if it doesn't produce significant MAP score improvement. The idea is to find a balance between pruning count and MAP score. The choice functions haven't been quite chosen yet.

Updating the CLI version has been postponed because algorithms are still changing.

-> Week 12

Planned

• Polish user interface.

Completed

• Finished ~90% of the documentation and cleanup tasks.