PhyloSoC:Biodiversity Conservation Algorithms and GUI

This project is part of the 2007 Phyloinformatics Summer of Code, which is itself part of the Google Summer of Code. This web page serves as the central resource for information relating to the project "Biodiversity conservation algorithms and GUI", which is being developed by Klaas Hartmann.

News

Project Overview

Klaas will implement various algorithms that utilise phylogenetic information to prioritise species for biodiversity conservation. A GUI will also be developed allowing these algorithms to be utilised by conservation managers. The overall goal is to provide a package that brings together as many existing algorithms and methods as possible and provides an interface between mathematical results and their intended final audience.

More detail at: SoC Application

Appropriate bio* package:

bioPerl is the package stated in the project proposal. Whilst this is most likely to remain the final choice, it is worth briefly reconsidering it here, as the decision was a bit rushed when the original project proposal was compiled.

  • Yes, and this was probably also partially influenced by my perception that Rutger would play a more significant role in this project than it appears he will now. Also, BioPerl is the most advanced Bio* toolkit and BioJava doesn't offer much functionality or object models to assist with this project, so if the project was to be written in Java then potentially it would utilise only a limited set of existing functionality from BioJava. It is unclear to me how widespread changes to the project plan the Google SoC rules allow at this point in the project. --Tobias.thierer 16:44, 8 May 2007 (EDT)
  • I imagine if we had compelling reasons to choose a different Bio* package it would be okay as this is only a change of the project within the mentoring organisation -- especially if Hilmar and others in NESCent were happy for us to do this. If we wanted to utilise Mesquite or Geneious this might be a very different matter (I think this is probably an unlikely choice due to this and license issues for Geneious). --Klaas 17:50, 8 May 2007 (EDT)

bioPerl

advantages

  • most comprehensive bio* package
  • Rutger's Bio::Phylo already implements many useful indices (which would otherwise need to be implemented)
  • Rutger is very familiar with this language

disadvantages

  • Perl has a tendency to encourage unclear code (Klaas' opinion)

bioPython

advantages

  • Python (especially with scipy) is great for scientific computing (Klaas' personal opinion)
  • Klaas has some experience creating cross-platform stand-alone GUIs with Python and wxPython

disadvantages

  • Tobias has only about 2 hours' experience with Python (writing one script of probably less than 100 lines)

bioJava

advantages

  • Tobias is very familiar with Java
  • Less need to worry about platform cross compatibility

mesquite

This was suggested to Klaas by Arne Mooers.

  • I don't have much experience with Mesquite myself but have heard fairly negative comments about it at the Phyloinformatics hackathon. I think the main point of criticism was that the GUI was overloaded and busy with too many options in an unstructured way. I'm not sure if we want to follow that path. If we went for Java then my personal preference would probably be Geneious (which also has an open API and everyone with the free Geneious Basic could run your plugin), but of course I'm 100% biased because I work for Biomatters and know the Geneious API very well. --Tobias.thierer 16:58, 8 May 2007 (EDT)
  • I agree with your comments regarding Mesquite. Utilising the Geneious GUI might prove difficult under this SoC project -- it doesn't really fall under the Bio* umbrella? Also Geneious license probably disqualifies it as a candidate for Google SoC work. I will have a play with the Geneious interface anyway --Klaas 17:40, 8 May 2007 (EDT)
  • You are right, given that we are in the NESCent group and under the Bio* umbrella this probably won't work. The plugin itself could be open source even if Geneious isn't, but it is true that this may also cause problems. I was aware of this already, but for some reason I didn't consider it when I wrote the comment above; I'd say you can ignore my comment as it is probably not worth pursuing. --Tobias.thierer 21:24, 8 May 2007 (EDT)

advantages

  • Existing GUI
  • Existing user base
  • Cross platform compatible (Java)

disadvantages

  • User base does not really overlap with target audience
  • GUI too extensive and complicated -- our features would be buried in it

Klaas' concluding thoughts

The existing code in the bio* packages that will affect this project includes:

  • Tree objects
  • Routines for loading trees from files/databases
  • Routines for displaying trees/creating graphical representations
  • Existing implementations of indices (Bio::Phylo)

The extent to which these features are available in the Bio* packages is something I have yet to fully investigate. Perhaps Rutger or Tobias have some ideas and could list these under advantages/disadvantages?

  • BioJava doesn't have any of these but JEBL (Java Evolutionary Biology Library) can read and write trees from/to Nexus files, and has a fairly advanced tree viewer GUI component (I think this alone took longer to develop than this entire project aims to). I'm not entirely convinced of the JEBL tree model (it appears hard to create a good tree object model, at least in Java), but it shouldn't be too hard to write code for. I'm not sure what you mean with those species indices - I'll try calling you again in your office later today and we can discuss. --Tobias.thierer 16:49, 8 May 2007 (EDT)
  • It would be great to have a tree viewing component that can just be dropped in place. Species indices are species specific measures that measure the relative biodiversity of a species. Examples include Equal Splits, Pendant Edge and Quadratic Entropy. They are all pretty easy to code up (probably knock them off in a day). --Klaas 17:38, 8 May 2007 (EDT)
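
As a rough illustration of two of the species indices named above, the following Python sketch computes Pendant Edge and Equal Splits on a small rooted tree. It is not the project's Perl code and does not use Bio::Phylo; the tree, its branch lengths and the function names are all invented for the example.

```python
# Toy rooted tree: each node maps to (parent, length of the edge above it).
# The topology and branch lengths are invented for illustration.
TREE = {
    "root": (None, 0.0),
    "X": ("root", 1.0),
    "A": ("X", 2.0),
    "B": ("X", 2.0),
    "C": ("root", 3.0),
}


def children(tree, node):
    """Return the child nodes of `node`."""
    return [n for n, (parent, _) in tree.items() if parent == node]


def leaves(tree):
    """Return the nodes that have no children (the species)."""
    return [n for n in tree if not children(tree, n)]


def pendant_edge(tree, leaf):
    """Pendant Edge: length of the terminal branch leading to the leaf."""
    return tree[leaf][1]


def equal_splits(tree, leaf):
    """Equal Splits: each edge's length is shared equally among the
    child clades at every node between that edge and the leaf."""
    score, share, node = 0.0, 1.0, leaf
    while tree[node][0] is not None:
        parent, length = tree[node]
        score += share * length
        # Edges above `parent` are divided among all of parent's children.
        share /= len(children(tree, parent))
        node = parent
    return score


if __name__ == "__main__":
    for species in sorted(leaves(TREE)):
        print(species, pendant_edge(TREE, species), equal_splits(TREE, species))
```

On this toy tree the Equal Splits scores are 2.5 for A and B and 3.0 for C. They sum to 8.0, the total branch length of the tree, which is a handy sanity check for an implementation.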

The features of the programming languages themselves that are desirable are:

  • A good GUI library
  • Interface to C/C++
  • Cross platform binary compilation (except Java)
  • Although that is possible, Perl also doesn't need cross platform binary compilation, does it (I think it's ok to require a perl interpreter)? Besides, for Java it is also possible to create installers (e.g. with install4j) that bundle the JRE and therefore appear just like a native application to the end user. --Tobias.thierer 16:54, 8 May 2007 (EDT)
  • I was wanting to get away from requiring a Perl interpreter. However if we require a Perl installation (I believe OS X has one by default?) then PAR can be used to create a platform independent jar style archive with all necessary dependencies. However I don't want to create any barriers that may discourage users that just want to have a quick play for example -- requiring a Perl installation is certainly a barrier. To this end PAR should also be able to produce stand-alone binaries. I was hoping for the resulting program to be fairly compact in size and installation to be dead simple. The latter seems true for install4j but the size seems substantial (15MB extra for bundling Java judging by Geneious). --Klaas 17:38, 8 May 2007 (EDT)

As far as I can tell Python, Perl and Java all have several options for doing these things.

Given my personal interests and skill base, my preference is bioPython followed by bioPerl. However, there is some overlap between this project and Bio::Phylo, and further overlap between the two projects is likely in future. This provides a compelling reason to stick with bioPerl (despite my Python preference).

GUI Options

NB: This section would be largely irrelevant if bioJava were chosen.

The algorithms implemented in this project are aimed at two distinct user groups. The first group consists of researchers that are actively involved in developing, testing and comparing algorithms in this field. Most of these users will be accessing functions directly and not using the GUI. They will therefore need to have Perl and bioPerl installed anyway.

The second user group consists of conservation managers and other ecologists who simply want to apply these methods to some dataset. These users will want something that works out of the box with minimal effort -- they should not have to install Perl, bioPerl or anything else. The GUI should also work on all major platforms -- Windows, Linux, OS X (should anything else be added to this list, e.g. FreeBSD?).

To this end there are two main options that can be pursued: creating binaries for all target platforms or creating a web interface with the calculations being performed on the server.

Creating Binaries

advantages

  • resulting code possibly faster
  • stable for long computations
  • program will remain available and functional in its last form as long as the bio* website still operates.

disadvantages

  • producing binaries for several platforms may be a pain
  • will still omit some platforms (but users of strange platforms will presumably be used to compiling things from scratch and installing programs like Perl)
  • possible platform dependent bugs
  • unsure if incorporating C/C++ code will be problematic

Options for producing binaries

Web interface

Options include XUL::Node (which requires Firefox) and the Google Web Toolkit (GWT).

advantages

  • full control over the environment in which the algorithms run
  • relatively simple interface construction

disadvantages

  • need to find a server to run this program on
  • the program may eventually stop working when development is no longer active and the server is upgraded/changed
  • if the program actually gets used this may cause excessive load on the server (particularly for large problems which may take days to solve)
  • possible stability issues for problems that take a long time to solve
  • less control over the interface design

Klaas' concluding thoughts

Creating binaries seems the most reliable method to me and, provided all works well, will give the end user a relatively pain-free experience. If we identify this as a desirable option, I will try building a simple GUI that calls C/C++ code on multiple platforms before writing any real code and committing to that option.

Gateway to other code

Some algorithms have been and continue to be developed in C/C++. It would therefore be useful to have an interface from the package to C/C++ code. Options for doing this include SWIG and Inline::C.

Advantages

  • may not need to implement some algorithms in Perl
  • if Perl versions of all algorithms are implemented, comparisons with their C/C++ counterparts will prove useful
  • C/C++ algorithms may be faster for large problems
  • further work done by others in C/C++ can easily be incorporated

Disadvantages

  • Need to ensure the C/C++ code is suitably compiled for all target platforms
  • May cause problems with cross-platform binaries for the GUI

Timeline of Goals for Klaas

I have modified the timeline from the original outline as I think it is important to get a semi-functional preview of the GUI working earlier. This will allow the GUI to be reviewed by some of the end users. The timeline is relatively ambitious -- according to it, the core functionality I want to produce from this project will be implemented by the halfway point. Some of the goals mentioned after this point are optimistic and may not be realistically achievable.

Weeks < 1

Goals

  • Prior to project start an appropriate GUI development package and cross platform compilation system should have been determined
  • Klaas tells me that wxPerl looks like the most promising widget/GUI library, and that he wants to use PAR for compilation. Some of the Perl packages he intends to use contain C code which needs to be compiled specifically for each platform; at the end of the project we will need to create specific installers for each platform (Windows/MacOS X/Linux). --Tobias.thierer 00:03, 30 May 2007 (EDT)

Outcome

  • wxPerl provides the most OS-consistent interface and there are several programs that utilise PAR to provide a standalone GUI. Although these are mostly Windows-focused, this should also be possible on Linux.
  • GD::Graph will be used for producing graphs (for comparing indices etc.)
  • PAR will be used to create a standalone program. This basically creates a zip file containing a Perl interpreter and all dependencies for that program. The final size should be between 5 and 10MB. A development system is required for each OS for which this will be distributed. I definitely want to target WinXP and Linux (I have build environments for both). OS X may have a default Perl installation with sufficient libraries to make a simpler approach possible; otherwise I'll have to create an OS X build environment too.
  • Bio::Phylo provides most of the phylogenetic structure necessary.
  • Bio::Tree::Draw should provide sufficient tree drawing capabilities (not a hugely important aspect of this program)
  • wxGlade will be used for designing the GUI. This is free, runs under multiple operating systems and generates Perl code for the GUI.

Week 1

Goals

  • Determine how to integrate the project with BioPerl and Bio::Phylo
  • Implement a mock up of the GUI
  • Familiarise myself with the Google Code repository (I have not used version control systems before)
  • Evaluate the data structure currently used to represent trees; a more efficient data structure may be required for some algorithms

Outcome

  • I've gained some familiarity with the Bio packages
  • Bio::Phylo::Treedrawer will be used for drawing trees instead of Bio::Tree::Draw, however this package requires a little work to resolve some bugs
  • The mock up of the GUI is well underway and I am starting to get a good understanding of the wxPerl toolkit
  • I am starting to use the Google SVN repository
  • For clarity I will initially program the algorithms using the Bio::Phylo::Forest::Tree object. For most applications/algorithms this should be sufficiently fast. Algorithms that require better efficiency should probably be implemented in C/C++ anyway. (A gateway to C/C++ algorithms is part of this project)

Week 2

Goals

  • Implement any species specific indices not already included in the package
  • Produce the C wrapper for existing algorithm implementations
  • Contact Fabio regarding making his code compatible with the C wrapper (it presently reads data from files and outputs to stdout)
  • Resolve issues with Bio::Phylo::Treedrawer

Outcome

NB: I decided to delay the interface to Fabio's algorithm as he will be visiting Christchurch in the next week, so I will contact him then.

  • Produced a patch for wxGlade (a tool I am using for GUI design). It was not producing correct source code for Perl menus.
  • Fixed some issues with Bio::Phylo::Treedrawer. Will shortly pass these back to Rutger for inclusion.
  • Created a Wx drawing class for Bio::Phylo::Treedrawer.
  • Set up a development environment for Windows and succeeded in compiling Windows binaries.
  • Created a SurvivalProbability class that encapsulates linear survival probabilities to be assigned to species
  • Implemented the portion of the GUI for the SurvivalProbability class
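
The page does not describe the interface of the SurvivalProbability class, so the following Python sketch only illustrates the idea of a linear survival probability: a species' chance of survival rises in a straight line from a baseline value to a protected value as spending on it increases. The class name, its fields and the clamping behaviour are all assumptions made for the example, not the project's Perl API.

```python
from dataclasses import dataclass


@dataclass
class LinearSurvival:
    """Hypothetical stand-in for a linear survival probability."""

    baseline: float   # survival probability with no funding
    protected: float  # survival probability at full funding
    max_cost: float   # expenditure that buys full protection

    def probability(self, spend: float) -> float:
        """Interpolate linearly, clamping spend to [0, max_cost]."""
        spend = min(max(spend, 0.0), self.max_cost)
        # A species that costs nothing to protect is treated as fully funded.
        fraction = spend / self.max_cost if self.max_cost > 0 else 1.0
        return self.baseline + (self.protected - self.baseline) * fraction


if __name__ == "__main__":
    kiwi = LinearSurvival(baseline=0.2, protected=0.8, max_cost=10.0)
    for spend in (0.0, 5.0, 10.0, 20.0):
        print(spend, round(kiwi.probability(spend), 3))
```

Spending beyond max_cost buys no further benefit in this sketch, which is one possible reading of "linear"; the project's class may treat the upper limit differently.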

Week 3

Goals

  • Implement greedy algorithms for the NAP
  • Create the GUI interface to the greedy algorithms

Outcomes

  • Further work on the GUI. I am starting to get a good handle on wxWidgets, so hopefully further GUI programming will be faster.
  • I've started implementing the greedy algorithms
  • Presented a talk including this project at Evolution 2007 (Christchurch); some interest was expressed
  • Discussed the inclusion of other algorithms with Fabio and Steffen at Evolution

Week 4

Goals

NB: This is a shortened week due to the Evolution 2007 conference.

  • Finish implementing greedy algorithms for the NAP
  • Preliminary interface of these with the GUI

Outcomes

  • Implemented a class for storing greedy solutions; this handles problems regarding multiple equally good solutions
  • Implemented greedy algorithms for the NAP
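
To give a feel for the greedy approach, the Python sketch below handles the simplest setting only: every species costs the same, a chosen species survives with certainty, and the aim is to pick k species that maximise phylogenetic diversity (the total length of the branches connecting them to the root). This is a simplification of the NAP rather than the project's implementation, and the tree and function names are invented for the example.

```python
# Toy rooted tree: each node maps to (parent, length of the edge above it).
# The topology and branch lengths are invented for illustration.
TREE = {
    "root": (None, 0.0),
    "X": ("root", 1.0),
    "A": ("X", 2.0),
    "B": ("X", 2.0),
    "C": ("root", 3.0),
}


def path_to_root(tree, leaf):
    """Return the nodes whose parent edges lie between the leaf and the root."""
    nodes, node = set(), leaf
    while tree[node][0] is not None:
        nodes.add(node)
        node = tree[node][0]
    return nodes


def new_length(tree, leaf, covered):
    """Branch length a leaf would add beyond the edges already covered."""
    return sum(tree[n][1] for n in path_to_root(tree, leaf) - covered)


def greedy_pd(tree, species, k):
    """Pick k species one at a time, always taking the largest gain."""
    chosen, covered = [], set()
    remaining = sorted(species)
    for _ in range(min(k, len(remaining))):
        # max() keeps the first maximum, so ties go to the earliest name.
        best = max(remaining, key=lambda s: new_length(tree, s, covered))
        chosen.append(best)
        covered |= path_to_root(tree, best)
        remaining.remove(best)
    return chosen, sum(tree[n][1] for n in covered)


if __name__ == "__main__":
    print(greedy_pd(TREE, ["A", "B", "C"], 2))
```

On this toy tree all three species tie on the first step (each would add 3.0), so the sketch takes A by name order and then C, for a diversity of 6.0. A real implementation would need to record such ties rather than silently breaking them, which is the problem the solution-storage class above addresses.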

Week 5

Goals

  • Incorporate Fabio's algorithm
  • Create useful algorithm output for the GUI

Outcome

  • Implemented an XS interface to Fabio's algorithm
  • Implemented a data structure for non-greedy solutions
  • Implemented an XS interface to wxGraphics* in wxPerl (primarily to learn XS, but wxGraphics* will prove useful for the GUI implementation)
  • Work on GUI delayed due to complications with some intricacies of XS

Week 6

Goals

  • Work on GUI front-end for algorithm output
  • Re-evaluate project goals -- it may be better to implement a smaller number of GUI features well than to implement the initial ambitious functionality somewhat poorly. This is also motivated by conversations at the Evolution conference -- there seems to be more interest in simple methods than in more complicated alternatives.

Outcomes

  • Productive meeting with Tobias. We discussed the interface and have come up with some ideas that I think will make it much more accessible.
  • Making good progress towards implementing the above ideas.

Mid term Evaluation (end Week 6)

Goals

  • Most of the algorithms should be completed and a preliminary GUI should be available

Outcomes

  • The algorithms are in place. Some final re-organisation is still necessary.
  • The GUI is still in its early stages. My goal of having a mostly functional implementation at this point was probably a bit overambitious (I was aware that the original timeline I proposed was ambitious).

Week 7

Goals

  • Respond to any mid-term evaluation comments
  • Continue GUI implementation

Outcome

The following outcomes are for greedy algorithms. The extension to non-greedy algorithms is straightforward and simply involves creating the appropriate hooks for that results data structure.

  • Basic solution analysis is implemented
  • Solution/species specific index comparison is nearly implemented

Week 8

Goals

  • Finish work on solution comparison panel
  • Create hooks for non-greedy solutions
  • Create solution generation GUI panel
  • Update cost/benefit and tree selection panels

Outcomes

  • Solution comparison panels are nearly finished, but I now have a better idea for displaying the information
  • Cost/benefit and tree selection panels reimplemented in a much snazzier way!
  • GD::Graph was creating all sorts of problems, so I created a line graphing class specifically for the project from scratch. The output is much improved.

Week 9

Goals

  • Create hooks for non-greedy solutions
  • Create solution generation GUI panel

Week 10

Goals

  • Time permitting, implement some basic sensitivity analysis


Week 11

Goals

  • Document project
  • Tidy up loose ends

Week 12

Goals

  • Document project
  • Tidy up loose ends

Beyond SoC

This work is of great interest to me and I will continue adding to this project. The following features are of particular interest:

  • Multiple tree algorithms
  • Conservation measures that are not species-specific
  • Sensitivity analysis

Literature References

The following papers contain most of the current results regarding the NAP that this project would seek to implement:

My work:

http://www.kavenga.com/klaas/papers/HartmannSteel2006_NAP.pdf

http://www.kavenga.com/klaas/papers/HartmannSteel_Chapter.pdf

Other work:

Moulton, V., Semple, C., Steel, M. Optimizing phylogenetic diversity under constraints. Journal of Theoretical Biology.

Minh, B. Q., S. Klaere, and A. von Haeseler. 2006. Phylogenetic diversity within seconds. Systematic Biology in press.

Pardi, F. and N. Goldman. 2007. Resource aware taxon selection for maximising phylogenetic diversity. Systematic Biology in press.

Pardi, F. and N. Goldman. 2005. Species choice for comparative genomics: no need for cooperation. PLoS Genetics 1:e71.

Simianer, H., S. Marti, J. Gibson, O. Hanotte, and J. Rege. 2003. An approach to the optimal allocation of conservation funds to minimize loss of genetic diversity between livestock breeds. Ecological Economics 45:377-392.

Steel, M. 2005. Phylogenetic diversity and the greedy algorithm. Systematic Biology 54:527-529.


External Links

Original Proposal

Project blog

Source code