MASTodon

# Author

Justs Zariņš from Latvia, a Computational Physics student at the University of Edinburgh, developed this application as part of Google Summer of Code 2012. Please don't hesitate to contact me with any comments at all. Parts of the code were provided by the main project supervisor Andrew Rambaut.

# Project

Abstract: To create “Mastodon”, a Java tool that works with already established open-source programs to summarize and visualize a large set of phylogenetic trees. Specifically, I am interested in implementing the tree pruning algorithms presented in Karen Cranston and Bruce Rannala's paper with an emphasis on good scaling with larger input sets.

Relevant projects:

## Resources

Source Code: Mastodon@GitHub

## Performance

Performance mostly depends on the type of data examined. Something with a lot of noise (unique clades) will take much longer to process than a data set where the clades only change positions.

A spreadsheet with some performance stats [some additional checks have since slowed down the algorithm]

## Application details / Developer manual

### What does the code do

The main purpose of MASTodon is to look for a common subtree in a large set of phylogenetic trees. The suitability of a subtree is measured with a MAP (maximum a-posteriori probability) score. For unweighted trees this is just the number of trees that contain the subtree divided by the total number of trees. If the trees are weighted, the MAP score is the sum of the weights of the trees that contain the subtree. The MAP tree here is the either the tree with highest weight or the tree that has the highest product of clade probabilities. MASTodon works by removing (pruning) some taxa from all trees and then comparing each against the MAP tree. This is the same as counting the trees that contain the subtree that is obtained if you remove the selected taxa from the trees. The process goes on picking different pruning combinations and finding the MAP score for each until some stopping criteria is achieved. In addition to the core functionality, MASTodon also provides a Command Line Interface and a Graphical User Interface to allow deeper analysis and search for common subtrees. The source code is freely available and has facilites to implement additional pruning algorithms.

### How pruning works

The important classes here are BitTreeSystem, BitTree and Clade.

When loading trees into MASTodon they must first be parsed as Tree objects as implemented by JEBL. Then the addTrees method in BitTreeSystem converts to a format that MASTodon likes to work with. First, a LinkedHashSet (to preserve order) is used to store all unique Taxon objects that are found in the trees. The taxa are later referred to by their position in the order. Then the trees are converted to BitTree object. A BitTree represents a tree as a list of clades with size bigger than 1 leaf. This is stored as a list of BitSet where the bits that are 1 match the positions of the initially created Taxa set. So if you have taxa A,B,C and a clade B,C, then the clade is represented as 0,1,1.

Another thing that is created during adding of trees is a Map of all unique Clades. A Clade object has a BitSet representing it's taxa just like above, but it also keeps a BitSet that specifies in which trees (in the order they are added) this clade can be found. The BitTrees are built up of references to this central list.

After all trees have been processed in this way, the findMapTree() method can be called to find the tree with maximum sum (could also be product) of clade probabilities.

There are two ways to prune. One is to call prune(BitSet pruner) where pruner are the taxa that should be removed from the tree. This is a pretty basic method that removes the taxa from all clades in the central list, which automatically prunes all trees as well. Then an external score calculator needs to be called to calculate the pruning score. Finally, unPrune() should be called to restore the initial status of the trees.

The other method is the preferred way. pruneFast(BitSet pruner) again starts by pruning all the central clades, but then it focuses on the clade-to-tree relationships. It goes through all the clades of the MAP tree and intersects the clade-to-tree BitSets in order to eventually find which trees contain all of the (now pruned) MAP tree clades. However, a tree can contain all the MAP tree clades but some of its own too. To avoid this, the program then checks that the MAP tree and candidate tree have the same number of clades. If both these conditions are met, then the trees are equal and pruneFast also goes ahead to calculate the MAP score of this pruning and returns it. Again, unPrune() should be called afterwards.

### How the currently available algorithms are implemented

[k - number of taxa pruned]

Rather than explaining each algorithm completely (except for FlipPenalty), I will say something about the components and note in which algorithms they are used. There is a fair bit of overlap, hence possibility for more abstraction in a future update.

• Choosing how many taxa to prune

In linear algorithms you start at the initial k and move up in steps of one up to the final k. In bisection algorithms the starting point is the middle of the min and max k range. Some pruning is done at this k and a best score recorded. If the best score is more than the minimum MAP score, less taxa is pruned in the next round and vice versa if the score is less than the minimum. The jumps are done to the midpoint of an interval and in such a way that the interval always decreases, thus theoretically zoning in on the ideal pruning count. Example: |-----*-----| initial |--------*--| need to prune more |------*----| need to prune less [end of search because a half-way jump is less than 1]

• Distributing total iterations between steps

In bisection algorithms each step is given an equal share of the total iterations. The number of steps is ln(max k - min k) / ln(2) [rounded down to nearest integer]. In linear algorithms each step with pruning number k has a different number of iterations allocated. These are chosen proportional to the natural log of the binomial coefficients (totalNumberOfTaxa choose k).

• Choosing taxa to prune

Some initial random combination is present. The basic choosing mechanism is the same for both SA and MH algorithms as follows. The previous pruning set has produced a MAP score, we want to build on it. So, instead of generating a new random pruning combination, the previous one is perturbed. The number of taxa to perturb is chosen from a poisson distribution with mean of (k-1)/2. This number of taxa to perturb is removed from the current pruning set and then the same number of taxa is added from the remaining taxa. Thus you won't have removing A from the pruned set only to have it immediately added back in. Note: when going up in k (Linear algorithms) the previous pruning is also preserved and simply one extra pruned. However, in bisection algorithms for each new k (after a jump) a new initial random combination is generated.

• Scoring and moving

The maximum score is always stored. If there are more than one pruning combination that produce that score, they are saved as well. Moving to a new point for MH: if (currentScore/previousScore)^power > uniform(0,1) Moving to a new point for SA: if exp((currentScore - previousScore)/currentTemperature) > uniform(0,1)

• Temperature and cooling

The cooling schedule is exponential, so currentT=rate * previousT. rate = (finalT/initialT)^(1/allocatedIterations) This ensures that the temperature will fall from the initial temperature to the final one within the allocated iterations. The initial temperature and cooling rate is reset each time k changes.

• End

All algoritms finish when the totalIterations limit is reached. Linear algorithms stop as soon as the minimum MAP score is reached. The pruning that produced the highest MAP score is saved as the final pruning.

• FlipPenalty algorithm*

This algorithm works a bit differently so has its own section. Each iteration a random taxon is chosen and its pruning status flipped. Then a score is calculated where an increase in k penalised while an increase in MAP score is rewarded. This is compared against uniform(0,1) to decide whether to accept the pruning. The idea is to find a balance between k and MAP score. The exact function to use in deciding the penalty is not fully decided yet. Right it is exponential.

### Overview of packages and classes

• figtree.*

Contains imported code that handles displaying of the Tree View panel.

• mastodon

Contains the main GUI classes. Large components are separated into their own classes, for example TopToolbar and Algorithm Dialog. The various menu factories deal with the drop-down menus of the application. MastodonApp contains the main entry point for the GUI. MastodonFrame handles all communication between components, graphcis and algorithms. It's probably doing too much so there is opportuniry to refactor here.

• mastodon.algorithms

Contains implementations of various pruning algorithms. Some redundancy can be found here, possibility for another level of abstraction.

• mastodon.core

Contains classes for the core pruning code, algorithm interface, RunResult object and tree I/O. There is also a Command Line Interface entry point. Launcher is the link between the GUI and algorithms. It could possibly be used by the CLI as well. GUIInputVerifier is a rather hard-coded input verifier for the currently implemented algorithms.

• mastodon.graphics

This package is for some of the more interesting visualizations. For now there is only a pruning frequency heat-map class here, it didn't seem very useful at the time of development.

• mastodon.scoreCalculators

Initially score calculation was separate from pruning, but now the main pruning method (pruneFast) includes score calculation.

• mastodon.tests

Some early tests for MASTodon. Not JUnit tests, more manual ones. Most focus was shifted on the GUI testing later in the project.

• mastodon.util

Some utilites used by the imported application code.

### How to implement an algorithm

To implement a pruning algorithm in MASTodon it's best to follow the pattern of the existing ones. An algorithm has to extend the Algorithm abstract class. There are a number of methods and variables that should be implemented. They are either self explanatory or documented in the code.

Setting limits is done by passing a <String, Object> Map. In general you can name the input variables however you want, but there are certain keywords (like minMapScore) that the GUI implementation expects. See inputVerifiers and the algorithm related methods in MastodonFrame.

An algorithm follows the pattern:

initialize() - set up everything needed for execution of the algorithm

the algorithm then loops until either the finished() method returns true or forceStop has been set to true from outside (like the user pressing the "cancel" button)

within the algorithm each iteration starts with choosePruningCount(), then tryPruning() which generates a new combination of taxa to prune and computes the MAP score that the combination gives. Finally, setNewBest() has to decide whether this new pruning is better or worse and what to do with it.

after the iteration loop is exited, afterActions() is called for any final touches. One thing that should probably be done is setting of the finalPruning variable.

After an algorithm has been successfully executed, you may run getRunResult() to obtain a RunResult object that contains the needed data for the GUI.

To add an algorithm to the GUI popup algorithm chooser, see AlgorithmDialog, inputVerifiers and the algorithm methods in MastodonFrame.

### A bit more about BitTrees

Say you have a set of trees:

tree1 = ((A,(B,D)),(C,(F,E)));

tree2 = ((A,B),(D,(C,(F,E))));

tree3 = (((A,D),B),(C,(F,E)));

tree4 = (((A,B),D),((C,F),E));

First create a list of all unique taxa (order must be maintained): [A,B,C,D,E,F]

BitSet - essentially an array of true/false or 1/0 values.

A clade can be represented by setting (making equal to 1) the bits in a BitSet that correspond to the list of unique taxa. So a clade (A,(B,D)) will be [110100]. Note that the structure of the clade is not specified, only the taxa it contains. To specify the nesting of clades, a list of all sub-clades needs to be made.

For example, (A,(E,(B,D))) is completely described by: {[110110], [010110], [010100]}

This is called a BitTree in MASTodon. The actual BitTree object can hold other needed information about the tree.

# Timeline

This is the main working part of the wiki page. The following sections will be updated regularly as the project unfolds. The past, present and future collide: details of what has happened in previous weeks and what is planned for the next ones will be added, along with current developments.

## Community bonding period

• April 23 - April 27
• Signed up for the wg-phyloinformatics mailing list
• Met up with Andrew to discuss logistics for the summer
• April 30 - May 4
• Created wiki page for project
• Created GitHub code repository
• May 7 - May 11
• Had a G+ meeting with the mentors, Andrew, Karen and Ben
• Periodic meetings will be held on Friday mornings at about 09:15EST / 14:15BST
• Main points of meetings will be summarized in this wiki page. As is being done here
• Decided to use a circle on G+ for project discussions
• May 14 - May 18
• Set up the work computer. Installed Eclipse and the GitHub tool for commiting etc
• Created Eclipse project for Mastodon and tested that committing to the repository works
• Improved the wiki page with better formatting and more details

## Week 1

Planned

• Obtain datasets for testing
• a small (about 10 trees) for quick testing
• a large (about 10000 trees) for testing how large input affects performance
• (maybe) one of the original datasets used in Karen's paper to compare results
• Incorporate JEBL. The goal is to have a framework where handling tree data is easy and universally available throughout the program, perhaps with an imported class that contains all necessary methods. Most important classes from JEBL seem to be NexusImporter.java, NexusExporter.java for I/O and Tree.java along with implementations to store tree data, for example RootedTree.java
• Explore if trees should be stored as an object in memory or, if the trees are large or many, serialize and store on the hard drive.
• Write a test to import the small set and export it again. Check consistency.

Completed

• Obtained Karen's carnivore data.
• Made a class for reading in Nexus trees into memory. Trees can be stored as mutable, immutable or compact rooted trees as defined in JEBL.
• Made a class for writing Nexus trees to file.
• Wrote simple reading/writing and consistency checking test cases.

## Week 2

Planned

• Implement a simple algorithm interface. This would mandate that an algorithm implementing this interface:
• accepts a set of trees as input
• has a run method
• has an output method that returns a list of pruned taxon and the resulting trees and their MAP scores
• Write a MAP score calculator (needed for algorithms) for a set of trees.
• Implement performance measurements.
• Output results in a CSV file.
• Check for overall running time and see approximately how it scales with the size of input. More importantly, record intermediate measurements in equal increments (say, after the processing of each tree) to detect if there is a slow-down as the program keeps running.
• Start implementing the MH algorithm.

Completed

• Extended the tree reader to also support Newick files.
• Implemented a simple algorithm interface.
• Wrote a MAP score calculator for a set of trees. Found out that I'll need more than one version, e.g. weighted/unweighted trees.
• Started implementing the MH algorithm. Made a working version but not quite ready to handle real data.
• Extra: largely implemented the "BitTree" framework for storing and manipulating trees as BitSet objects. This should help with handling large sets of trees.

## Week 3

Planned

• Finish up the simple tree MH algorithm implementations for weighted trees.
• Work on an implementation of the above for BitTrees.
• Clean up some of the lingering oddities in the BitTree code.
• Test MH algorithm with the small set.
• Check performance with the large set.

Completed

• Fleshed out the initial implementation of the MH algorithm using Tree objects.
• Implemented the algorithm using BitTrees. I'll focus on using BitTrees from now on as they are easy to work with and have performance benefits.
• Tested the program with a 10k tree data set. It appears stable, no odd slow-downs or memory leaks. Unsure if the solutions are being found optimally.

## Week 4

Planned

• Create a simple command line interface that collects user input (set of trees, threshold variables), runs algorithm and outputs an annotated NEXUS most-supported tree along with a text list of pruned taxon.
• Reduce memory usage by piping through data that doesn't need to be stored.
• Hammer bugs and refactor.

Completed

• Completed a first version of the program and put in the shared DropBox folder. It can search a set of trees for a common subtree. Output is a text file that lists pruned taxa together with the subtree's score and number of matching subtrees in the set, along with an annotated nexus file that colors the path from pruned leaf down(or is it up?) to the root when opened in FigTree.
• Implemented piping in data in batches as it was possible to do everything required without holding the whole set of trees. At the moment I'm piping in a 100 at a time which seems to work alright, a set with 1441 tips hovers at around 200MB of memory which seems acceptable.

Bits and pieces around the code:

• Reverted to using SimpleRootedTree instead of MutableRootedTree everywhere. There was some shady code I had to add to MutableRootedTree for it to work, so using SimpleRootedTrees again removes one possible source of pain.
• Found a redundancy that made turning the Tree objects to BitTrees faster.
• People on the internet complain about iterating through a Map using .keySet() so I changed to using .getEntries() as recommended. Not too much performance gain here though.
• Dropped the classes that dealt with plain Tree objects and finally merged the Experimental branch back into the master branch.
• Filled in most of the missing JavaDoc.

## Week 5

1st deliverable - a working MASTodon

Planned

• Continue working on the user application. Specifically, implement a version where all input can be given through flags at the time of launching the program. This will enable the user to put the program as a part of a workflow.
• Write some user documentation/instructions and attach to the executable program.
• Some areas still remain where performance can be squeezed out. At the moment the program scales something like a quadratic with the number of leaf nodes.

Completed

## Week 6

Planned

• Do a bit of code cleanup, need to merge in the new tree comparison algorithm.
• Package and upload MASTodon v0.2
• Explore how to collect data about which taxa are pruned most frequently and see if there is good material here for a heatmap visualization.

Completed

• Improved the pruning list perturbation code. It didn't always alter the correct number of taxa as prescribed by the Poisson distribution.
• Fixed a maths error that caused bias to swap out taxa at the beginning of the list. Made it uniform across the whole list.

The two above have made the search algorithm converge better as it moves more freely across the space of taxa.

• Deleted a bunch of dead code.
• Put in some code that collects information about pruning combinations that produce a local improvement in the map score. Used this to display a basic heat map of pruning combinations of 2 taxa.
• Added support for weighted trees.
• Implemented a bisection search algorithm. It takes as inputs the target map score and how many iterations to stay in each step. Starting out at half of the taxa pruned, it moves in half jumps to find the number of taxa that need to be pruned in order to obtain the desired map score and then stays there until the target map score is achieved. Seems quite useful for Andrew's virus data as quite a lot of taxa have to pruned away there to see any sort of skeleton tree emerge.
• Found and fixed a bug that shows up when many taxa are being pruned as a bonus.
• Found a bug where a tree is declared equal to the map tree when it's not. Put in a fix that deals with this. Need to investigate why this seems to happen only rarely.
• Learned how to use the Eclipse debugger.

## Week 7

Planned

Start working on visual features.

• GUI for user input, allow to run the program multiple times without rebuilding the tree set.
• Integration with FigTree for results.
• Click-Pruning.

Completed

• Spent some time learning about the more advanced features of visual Java.
• Plugged in bits of code in FigTree's source to make a new menu item that allows to execute the pruning algorithm and send the output trees straight to the display.
• Found other instances where the new pruning mechanism calls two trees equal when they are not. This is good because previously there was only one instance of this so I wasn't sure if it's an error in the code or a concept.
• Moved the above implemented functionality to a simpler template. Mastodon probably won't need all the tools that are available in FigTree.

## Week 8 - Midterm evaluation

Planned

• Implement a stand-alone Mastodon graphical application.
• Implement a simulated annealing pruning algorithm.
• Implement workflow:
• select tree set
• run bisection algorithm to find a reasonable number of taxa to prune
• options to run a MH or SA algorithm (exploratory or aggressive)
• manual pruning options

A sketch of the app and the workflow:

Completed

• Preliminary simulated annealing implementation.
• Functional GUI.

## Week 9

2nd deliverable

Planned

• Modify the algorithms interface to make it more modular.
• Make user interface more intuitive. (probably means getting rid of list of runs)
• Allow the user to specify what combination of algorithm and pruning number finder to use.
• Aim to deliver a program with native tree visualization and variation exploration capabilities.

Completed

• Rewrote the algorithm interface to factor out some of the repeated code.
• Changed algorithm selection dialog to allow any combination of linear/constant/bisection searching and Metropolis Hastings /Simulated annealing algorithms.
• Added 3 tree coloring buttons: none, pruned, pruning frequency.
• Changed results table to list all taxa together with their pruning frequencies.
• Other bits and pieces, like table sorting, a selected taxon in the table highlights in the tree and bugfixes.
• Created a runnable Mastodon App

## Week 10

Planned

• Get some feedback on how user friendly the app is and change it accordingly.
• Implement a pruned tree output where the pruned taxa are actually removed from the tree, not just highlighted.
• Implement some manual pruning tools.
• Investigate how the pruning algorithms behave. They don't seem to be handling large datasets quite right.
• Add burn-in and data set sampling options.
• Make a mac and pc executable app.
• Support for unrooted trees.

Completed

• Implemented manual pruning. You can select a taxon or a whole clade and chop it off or add back. MAP score gets updated immediately. Coloring now colors an entire clade if all its tips are pruned, not just the tips.
• Added ability to set Burn-in when opening a new data set. No Sampling yet, though I think burn-in is more useful anyway.
• Added a popup that comes up after loading a data set and tells the user how many trees, taxa and unique clades were found.
• Added a "commit" (a bit misleading because no information is lost) button. This takes the taxa marked for pruning and removes them from all trees and brings up this new tree set in a new window. The user can continue pruning the subtree further as with a full tree.
• Adapted Andrew's build file (aside from readme's etc.) to build executable applications of MASTodon for Mac/Windows/Linux. Mac works, Windows probably does, not sure about Linux because I don't really understand the shell script it uses during building.
• Started deleting unused classes and libraries.
• Implemented a Rooting option. It takes a taxon from the tree set that the user specifies and roots all trees using it as an outgroup.
• Worked on lots of bits and pieces to make the user experience smoother.
• Added a "Weighting power" option to MH algorithm. The power is used to modify the objective function: (newScore/oldScore)^p .... A higher power makes the algorithm more aggressive.
• Explored a bit how the algorithms behave, and they seem to be doing what one would expect for the most part. The Specific issues observed before:

MH sampling uniformly - I got this to change by exaggerating the new/old score ratio. This is actually what Karen did in the paper too, she mentions that in this case she's using MH as an optimization algorithm. SA sampling uniformly - a lower starting temperature does the trick. Clades with probability 1 getting pruned - usually, if you track up the tree, there will be a clade with a low probability. So the clade with probability 1 is in all trees, but probably moving around. Unfortunately the best thing our approach can do is cut the clade out altogether. I've put in place a manual pruning option, pruning and un-pruning these clades in some cases shows a large change in MAP score. Taxa with high pruning frequency not making into the final pruning set - not too sure about this one. There are also some taxa included that don't change the map score at all, or make no sense (like the outgroup in the rooted set). Perhaps it is worthwhile to add a post-processing step that follows some simple algorithm to go through the pruned set and optimize it a bit. Though this can also be done manually quite easily, unless there are a lot of tips.

## Week 11

Planned

• Implement a new search algorithm: Flip Penalty.
• Update CLI version.
• Bunch of interface things.

Completed

• Added export of the tree view to lots of formats.
• Put in Undo/Redo buttons for manual pruning to guard against clumsy users. I did a tree-like undo history: if you do a bunch of actions, then undo a few and then do another action, this is the new top of the action stack.
• Modified windows and linux builds to work correctly.
• Many GUI improvements and tweaks.
• Spent a day to do a large chunk of the code cleanup/documentation.
• Implemented a flip-penalty algorithm partially. It prunes and unprunes taxa at random, but there is a penalty for pruning more taxa if it doesn't produce significant MAP score improvement. The idea is to find a balance between pruning count and MAP score. The choice functions haven't been quite chosen yet.

Updating the CLI version has been postponed because algorithms are still changing.

## -> Week 12

Planned

• Polish user interface.

Completed

• Finished ~90% of the documentation and cleanup tasks.