PhyloSoC:Biodiversity Conservation Algorithms and GUI

From Phyloinformatics
Revision as of 21:10, 29 May 2007 by Hilmar (talk)

This project is part of the 2007 Phyloinformatics Summer of Code, which in turn is part of the Google Summer of Code. This page serves as the central resource for information relating to the project "Biodiversity conservation algorithms and GUI", which is being developed by Klaas Hartmann.

News

Project Overview

Klaas will implement various algorithms that utilise phylogenetic information to prioritise species for biodiversity conservation. A GUI will also be developed allowing these algorithms to be utilised by conservation managers. The overall goal is to provide a package that brings together as many existing algorithms and methods as possible and provides an interface between mathematical results and their intended final audience.

More detail at: SoC Application

Appropriate bio* package:

bioPerl is the package named in the project proposal. Whilst it is most likely to remain the final choice, it is worth briefly reconsidering here, as the decision was somewhat rushed when the original project proposal was compiled.

  • Yes, and this was probably also partially influenced by my perception that Rutger would play a more significant role in this project than it appears he will now. Also, BioPerl is the most advanced Bio* toolkit and BioJava doesn't offer much functionality or object models to assist with this project, so if the project was to be written in Java then potentially it would utilise only a limited set of existing functionality from BioJava. It is unclear to me how extensive a change to the project plan the Google SoC rules allow at this point in the project. --Tobias.thierer 16:44, 8 May 2007 (EDT)
  • I imagine if we had compelling reasons to choose a different Bio* package it would be okay as this is only a change of the project within the mentoring organisation -- especially if Hilmar and others in NESCent were happy for us to do this. If we wanted to utilise Mesquite or Geneious this might be a very different matter (I think this is probably an unlikely choice due to this and license issues for Geneious). --Klaas 17:50, 8 May 2007 (EDT)

bioPerl

advantages

  • most comprehensive bio* package
  • Rutger's Bio::Phylo already implements many useful indices (which would otherwise need to be implemented)
  • Rutger is very familiar with this language

disadvantages

  • Perl has a tendency to encourage unclear code (Klaas' opinion)

bioPython

advantages

  • Python (especially with scipy) is great for scientific computing (Klaas' personal opinion)
  • Klaas has some experience creating cross-platform stand-alone GUIs with Python and wxPython

disadvantages

  • Tobias has only about 2 hours' experience with Python (writing one script of probably less than 100 lines)

bioJava

advantages

  • Tobias is very familiar with Java
  • Less need to worry about cross-platform compatibility

mesquite

This was suggested to Klaas by Arne Mooers

  • I don't have much experience with Mesquite myself but have heard fairly negative comments about it at the Phyloinformatics hackathon. I think the main point of criticism was that the GUI was overloaded and busy with too many options in an unstructured way. I'm not sure if we want to follow that path. If we went for Java then my personal preference would probably be Geneious (which also has an open API and everyone with the free Geneious Basic could run your plugin), but of course I'm 100% biased because I work for Biomatters and know the Geneious API very well. --Tobias.thierer 16:58, 8 May 2007 (EDT)
  • I agree with your comments regarding Mesquite. Utilising the Geneious GUI might prove difficult under this SoC project -- it doesn't really fall under the Bio* umbrella? Also, the Geneious license probably disqualifies it as a candidate for Google SoC work. I will have a play with the Geneious interface anyway. --Klaas 17:40, 8 May 2007 (EDT)
  • You are right, given that we are in the NESCent group and under the Bio* umbrella this probably won't work. The plugin itself could be open source even if Geneious isn't, but it is true that this may also cause problems. I was aware of this already, but for some reason I didn't consider it when I wrote the comment above; I'd say you can ignore my comment as it is probably not worth pursuing. --Tobias.thierer 21:24, 8 May 2007 (EDT)

advantages

  • Existing GUI
  • Existing user base
  • Cross-platform compatible (Java)

disadvantages

  • User base does not really overlap with target audience
  • GUI too extensive and complicated -- our features would be buried in it

Klaas' concluding thoughts

The existing code in the bio* packages that will affect this project includes:

  • Tree objects
  • Routines for loading trees from files/databases
  • Routines for displaying trees/creating graphical representations
  • Existing implementations of indices (Bio::Phylo)

The extent to which these features are available in the Bio* packages is something I have yet to fully investigate (perhaps Rutger or Tobias have some ideas and could list these under advantages/disadvantages)?

  • BioJava doesn't have any of these but JEBL (Java Evolutionary Biology Library) can read and write trees from/to Nexus files, and has a fairly advanced tree viewer GUI component (I think this alone took longer to develop than this entire project aims to). I'm not entirely convinced of the JEBL tree model (it appears hard to create a good tree object model, at least in Java), but it shouldn't be too hard to write code for. I'm not sure what you mean by those species indices - I'll try calling you again in your office later today and we can discuss. --Tobias.thierer 16:49, 8 May 2007 (EDT)
  • It would be great to have a tree viewing component that can just be dropped in place. Species indices are species-specific measures that quantify the relative biodiversity of a species. Examples include Equal Splits, Pendant Edge and Quadratic Entropy. They are all pretty easy to code up (probably knock them off in a day). --Klaas 17:38, 8 May 2007 (EDT)
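
As a rough illustration of what such indices involve, the sketch below computes Pendant Edge and Equal Splits on a small rooted tree. It is written in Python purely for illustration and does not use the Bio::Phylo API. The dictionary-based tree layout (`TREE`, mapping each node to its parent and branch length) and the function names are assumptions made for this example. Equal Splits is taken here to mean that each branch on the path from a species to the root is shared equally among the lineages at every split below it. Quadratic Entropy is not covered.

```python
from collections import defaultdict

# Hypothetical example tree ((A:1,B:2):2,C:3) with root "R".
# Each node maps to (parent, length of the branch leading to that parent).
TREE = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("R", 2.0),
    "C": ("R", 3.0),
}


def children_of(tree):
    """Build a parent -> list of children lookup from the child -> parent map."""
    kids = defaultdict(list)
    for node, (parent, _length) in tree.items():
        kids[parent].append(node)
    return kids


def pendant_edge(tree, species):
    """Pendant Edge: the length of the branch attached directly to the species."""
    return tree[species][1]


def equal_splits(tree, species):
    """Equal Splits: walk from the species to the root, dividing each branch
    by the product of the number of children at every split already passed."""
    kids = children_of(tree)
    score = 0.0
    divisor = 1
    node = species
    while node in tree:  # the root has no entry, so the walk stops there
        parent, length = tree[node]
        score += length / divisor
        divisor *= len(kids[parent])
        node = parent
    return score


for name in ("A", "B", "C"):
    print(name, pendant_edge(TREE, name), equal_splits(TREE, name))
```

For this example tree the Equal Splits scores are 2, 3 and 3 for A, B and C, which sum to the total branch length of 8.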

The features of the programming languages themselves that are desirable are:

  • A good GUI library
  • Interface to C/C++
  • Cross platform binary compilation (except Java)
  • Although that is possible, Perl also doesn't need cross platform binary compilation, does it (I think it's ok to require a perl interpreter)? Besides, for Java it is also possible to create installers (e.g. with install4j) that bundle the JRE and therefore appear just like a native application to the end user. --Tobias.thierer 16:54, 8 May 2007 (EDT)
  • I was wanting to get away from requiring a Perl interpreter. However if we require a Perl installation (I believe OS X has one by default?) then PAR can be used to create a platform independent jar style archive with all necessary dependencies. However I don't want to create any barriers that may discourage users that just want to have a quick play for example -- requiring a Perl installation is certainly a barrier. To this end PAR should also be able to produce stand-alone binaries. I was hoping for the resulting program to be fairly compact in size and installation to be dead simple. The latter seems true for install4j but the size seems substantial (15MB extra for bundling Java judging by Geneious). --Klaas 17:38, 8 May 2007 (EDT)

As far as I can tell Python, Perl and Java all have several options for doing these things.

Given my personal interests and skill base, my preference is bioPython followed by bioPerl. However, there is some overlap between this project and Bio::Phylo; together with the likely future overlap between these projects, this provides a compelling reason to stick with bioPerl (despite my Python preference).

GUI Options

NB: This section would be largely irrelevant if bioJava were chosen

The algorithms implemented in this project are aimed at two distinct user groups. The first group consists of researchers that are actively involved in developing, testing and comparing algorithms in this field. Most of these users will be accessing functions directly and not using the GUI. They will therefore need to have Perl and bioPerl installed anyway.

The second user group consists of conservation managers and other ecologists who simply want to apply these methods to a dataset. These users will want something that works out of the box with minimal effort -- they should not have to install Perl, bioPerl or anything else. The GUI should also work on all major platforms -- Windows, Linux, OS X (should anything else be added to this list, e.g. FreeBSD?).

To this end there are two main options that can be pursued: creating binaries for all target platforms or creating a web interface with the calculations being performed on the server.

Creating Binaries

advantages

  • resulting code possibly faster
  • stable for long computations
  • program will remain available and functional in its last form as long as the bio* website still operates

disadvantages

  • producing binaries for several platforms may be a pain
  • will still omit some platforms (but users of strange platforms will presumably be used to compiling things from scratch and installing programs like Perl)
  • possible platform dependent bugs
  • unsure if incorporating C/C++ code will be problematic

Options for producing binaries

Web interface

Options include XUL::Node (which requires Firefox!) and the Google Web Toolkit (GWT).

advantages

  • full control over the environment in which the algorithms run
  • relatively simple interface construction

disadvantages

  • need to find a server to run this program on
  • the program may eventually stop working when development is no longer active and the server is upgraded/changed
  • if the program actually gets used this may cause excessive load on the server (particularly for large problems which may take days to solve)
  • possible stability issues for problems that take a long time to solve
  • less control over the interface design

Klaas' concluding thoughts

Creating binaries seems the most reliable method to me and, provided all works well, will give the end user a relatively pain-free experience. If we identify this as a desirable option, I will try building a simple GUI which calls C/C++ code for multiple platforms before writing any real code and committing to that option.

Gateway to other code

Some algorithms have been and continue to be developed in C/C++. It would therefore be useful to have an interface from the package to C/C++ code. Options for doing this include SWIG and Inline::C.

Advantages

  • may not need to implement some algorithms in Perl
  • if Perl versions of all algorithms are implemented, comparisons with their C/C++ counterparts will prove useful
  • C/C++ algorithms may be faster for large problems
  • further work done by others in C/C++ can easily be incorporated

Disadvantages

  • Need to ensure the C/C++ code is suitably compiled for all target platforms
  • May cause problems with cross-platform binaries for the GUI

Timeline of Goals for Klaas

I will spend the first half of my time on the project implementing the various algorithms that have been developed in the literature. The second half of the project will be spent developing a GUI, documenting the project and developing a test suite.

Weeks < 1

  • Prior to project start an appropriate GUI development package and cross platform compilation system should have been determined
  • Klaas tells me that wxPerl looks like the most promising widget/GUI library. I have to ask him about compilation. --Tobias.thierer 08:07, 28 May 2007 (EDT)

Outcome Week < 1

  • wxPerl provides the most OS-consistent interface and there are several programs that utilise PAR to provide a standalone GUI. Although these are mostly Windows-focused, this should also be possible on Linux.
  • GD::Graph will be used for producing graphs (for comparing indices etc.)
  • PAR will be used to create a standalone program. This basically creates a zip file containing a Perl interpreter and all dependencies for that program. The final size should be between 5 and 10MB. A development system is required for each OS for which this will be distributed. I definitely want to target WinXP and Linux (I have build environments for both). OS X may have a default Perl installation with sufficient libraries to make a simpler approach possible; otherwise I'll have to create an OS X build environment too.
  • Bio::Phylo provides most of the phylogenetic structure necessary.
  • Bio::Tree::Draw should provide sufficient tree drawing capabilities (not a hugely important aspect of this program).
  • wxGlade will be used for designing the GUI. This is free, runs under multiple operating systems and generates Perl code for the GUI.

--Klaas 23:58, 28 May 2007 (EDT)

Week 1

  • Determine how to integrate the project with BioPerl and Bio::Phylo
  • Determine how the GUI will be developed and where this fits in with BioPerl
  • Evaluate the data structure currently used to represent trees; a more efficient data structure may be required for some algorithms
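
To give an idea of what a more compact representation might look like (an illustration only, not the structure used by Bio::Phylo or BioPerl), a tree can be stored as flat arrays ordered so that every node appears after its parent. Quantities that accumulate from the leaves towards the root can then be computed in a single backwards pass without recursion. The array names `PARENT` and `LENGTH` and the function `subtree_lengths` are hypothetical, and the sketch is in Python rather than Perl.

```python
# Hypothetical flat-array layout for the tree ((A:1,B:2):2,C:3).
# Index:  0 = root, 1 = X, 2 = A, 3 = B, 4 = C
# Every node is listed after its parent; the root's parent is -1.
PARENT = [-1, 0, 1, 1, 0]
LENGTH = [0.0, 2.0, 1.0, 2.0, 3.0]  # branch leading to each node's parent


def subtree_lengths(parent, length):
    """Return, for each node, the total branch length strictly below it.

    Because children always have larger indices than their parents, one
    pass from the last node to the first visits every child before its
    parent, so each subtotal is complete by the time it is passed upwards.
    """
    total = [0.0] * len(parent)
    for node in range(len(parent) - 1, 0, -1):  # index 0 is the root
        total[parent[node]] += total[node] + length[node]
    return total


print(subtree_lengths(PARENT, LENGTH))
```

Whether such a layout is actually needed depends on the outcome of the evaluation; the existing object model may well be fast enough for trees of the size conservation managers work with.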

Week 2

  • Implement any species-specific indices not already included in the package (e.g. Quadratic Entropy, although I think there is Perl code for this somewhere)

Weeks 2, 3, 4, 5

  • Implement algorithms for solving the NAP, i.e. the Noah's Ark Problem (various greedy and dynamic programming algorithms)
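
The flavour of the greedy approach can be sketched on the simplest special case: choosing k species so that the phylogenetic diversity (PD) they span is as large as possible. The Steel (2005) paper in the reference list discusses the greedy algorithm for this kind of problem. The sketch below is in Python for illustration only; the tree layout, the function names and the choice to measure PD as the total length of branches connecting the chosen species to the root are all assumptions of this example. The general NAP also involves costs and survival probabilities, which this sketch ignores.

```python
# Hypothetical example tree ((A:1,B:2):2,C:3) with root "R".
# Each node maps to (parent, length of the branch leading to that parent).
TREE = {
    "A": ("X", 1.0),
    "B": ("X", 2.0),
    "X": ("R", 2.0),
    "C": ("R", 3.0),
}


def path_to_root(tree, species):
    """Return the set of nodes whose parent branch lies between species and root."""
    nodes = set()
    node = species
    while node in tree:  # the root has no entry, so the walk stops there
        nodes.add(node)
        node = tree[node][0]
    return nodes


def greedy_pd(tree, species, k):
    """Pick k species greedily: at each step add the species that contributes
    the most branch length not already covered by earlier picks."""
    covered = set()  # branches (identified by their lower node) already counted
    chosen = []
    for _ in range(min(k, len(species))):
        best, best_gain = None, -1.0
        for candidate in sorted(species):  # sorted for repeatable tie-breaking
            if candidate in chosen:
                continue
            new_branches = path_to_root(tree, candidate) - covered
            gain = sum(tree[n][1] for n in new_branches)
            if gain > best_gain:
                best, best_gain = candidate, gain
        chosen.append(best)
        covered |= path_to_root(tree, best)
    return chosen, sum(tree[n][1] for n in covered)


print(greedy_pd(TREE, ["A", "B", "C"], 2))
```

On the example tree with k = 2 the sketch picks B first (4 units of branch length) and then C (3 new units), for a total PD of 7; the alternatives {A, B} and {A, C} give 5 and 6 respectively.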

Week 6

  • Finalise implementation of algorithms and check their validity.
  • Plan the GUI and do a mock up

End week 6: Mid term evaluation

  • Most of the algorithms should be completed and a plan for the GUI should be available

Weeks 7, 8, 9

  • Implement the GUI
  • Distribute a preliminary version to peers for comment

Week 10

  • Act on comments from peers
  • Document the project

Weeks 11-12

  • Develop a set of test cases and test routines for the project
  • Tie up loose ends

Beyond SoC

This work is of great interest to me and I will continue adding to this project as results become available.

Literature References

The following papers contain most of the current results regarding the NAP that this project would seek to implement:

My work:

http://www.kavenga.com/klaas/papers/HartmannSteel2006_NAP.pdf

http://www.kavenga.com/klaas/papers/HartmannSteel_Chapter.pdf

Other work:

Moulton, V., Semple, C., Steel, M. Optimizing phylogenetic diversity under constraints. Journal of Theoretical Biology.

Minh, B. Q., S. Klaere, and A. von Haeseler. 2006. Phylogenetic diversity within seconds. Systematic Biology in press.

Pardi, F. and N. Goldman. 2007. Resource aware taxon selection for maximising phylogenetic diversity. Systematic Biology in press.

Pardi, F. and N. Goldman. 2005. Species choice for comparative genomics: no need for cooperation. PLoS Genetics 1:e71.

Simianer, H., S. Marti, J. Gibson, O. Hanotte, and J. Rege. 2003. An approach to the optimal allocation of conservation funds to minimize loss of genetic diversity between livestock breeds. Ecological Economics 45:377-392.

Steel, M. 2005. Phylogenetic diversity and the greedy algorithm. Systematic Biology 54:527-529.


External Links

Original Proposal

Project blog

Source code