Phyloinformatics Summer of Code 2007

From Phyloinformatics
Revision as of 04:12, 12 March 2007 by Rholland (talk)
Jump to: navigation, search

We are applying to the Google Summer of Code (GSoC) program for the first time this year. On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, other people or channels to contact for more information or to bounce ideas off of, etc


  • 21:20, 2 March 2007 (EST) Created page, added a couple of links, outline, and started filling in some bits. Hlapp


We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for basically three reasons.

  1. The code that students will write will facilitate new and increasingly complex questions to be asked in comparative biology, one of the central disciplines in understanding the evolution of life. As part of its agenda, the Phyloinformatics group has already collected the use-cases from the research community that represent the most common and pervasive problems in phyloinformatics. Work on the suggested or similar projects is bound to make an impact. In addition, NESCent is committed to train end-users and scientific software developers in using the resulting work through summer courses and conference tutorials.
  2. The range of problems that students can make meaningful contributions to is diverse, enabling us to accommodate different areas of interest and skills. The participating toolkits, and the GSoC project ideas we have generated, cover a variety of programming languages and tasks, yet are all directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to entire developer communities of the participating projects.
  3. Aside from gaining solutions to problems in phyloinformatics, we view our GSoC participation as an opportunity to gain future contributors to reusable open-source software components in phyloinformatics from the pool of future researchers in comparative biology. Once accepted, we will advertise this program through appropriate channels to reach undergrad and grad students interested in computational comparative biology. Some of the mentors can relate particularly well to students who are novices in research programming.


Enable multi-language bindings to the C++ NEXUS Class Library

The hackathon revealed that consistent behavior from NEXUS parsers is need in all of the Bio* toolkits as well as many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the Nexus Class Library) is a flexible C++ library for supporting the parsing of NEXUS. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks have been handled. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.

Reusing this library could improve code-reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.

SWIG is a system for writing C/C++ extensions to multiple scripting languages. By exposing the high level interface of NCL to scripting languages, the NCL can perform parsing and allow the scripting languages to simply extract the data they need. (The Inline architecture might also be used for easy perl bindings.)
NCL in written in portable C++ (using automake and autoconf for all non-windows platforms, and including hand-maintained build system for windows builds), but distributing code that depends on C/C++ libraries is always more difficult than pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL

Evolve Unix phyloinformatics tools into Ajax applications

Many powerful new tools for phylogenetic stochastic grammar analysis of multiple alignments, such as xrate or PHAST, as well as other phyloinformatic tools like PAML, PHYLIP, GeneTree, HyPhy etc., are available only from the Unix command line. These tools need to become operable over the web, especially via Ajax platforms such as the new Google Maps-like interface to GBrowse.
Use toolkits such as dojo to build asynchronous javascript wrappers for Unix tools (probabilistic modeling & phlogeny tools, format conversion utilities, sequence analysis & alignment software, genome annotation pipelines, grids & job queues, realtime parallelizable systems); other Javascript/web components (alignment viewers, tree viewers & navigators, genome browsers); and bioinformatics "mashups". Interface with gmod-ajax, Amigo and other web-based bioinformatics platforms.
Adapting command-line tools to for web use; creating an asynchronous user interface; developing infrastructure for mashable bioinformatics -- see for more ideas/details
Involved toolkits or projects 
BioPerl/Biopython/Bioruby; SWIG; dojo; Sun Grid Engine; Erlang
Ian Holmes, Mitch Skinner, Chris Mungall, Jason Stajich

Extending the "make" paradigm for bioinformatics annotation pipelines

Annotating a genome, or performing other large-scale bioinformatics analyses, typically involves a series of operations with sequential dependencies but also strong parallelism. The "make" program is one robust approach that is often used to build such analysis pipelines, but suffers serious drawbacks for bioinformatics (e.g. no built-in database access; extremely limited pattern-matching; language is not extensible; dependencies are triggered only by file timestamps and not e.g. MD5 hash indicating file contents have changed).
The project will involve building a replacement or upgrade to "make". One possible approach will be to use a declarative language with (i) strong support for distributed processing, (ii) easy-to-use Unix "hooks" (c.f. make), (iii) database and filesystem access. Examples of candidate languages include Erlang and Termite Scheme. Alternatively, C-inclined students may start with an existing parallel "make" clone, such as qmake or distmake -- see for more ideas/details
The first challenge is to get something that is as convenient to use as "make" for migrating throwaway command-lines and analysis scripts into robust pipeline stages. Subsequent challenges will include database access, flexible pattern-matching and enhanced dependency triggers.
Involved toolkits or projects 
Erlang, Termite Scheme, distmake/qmake, or other.
Ian Holmes, Chris Mungall

AJAX widgets for phylo-informatics

The AJAX model offers exciting new opportunities for working with phylo-informatics data.
To build javascript widgets (visualization components with active hooks for clicking on various parts of the data structure) for structures such as phylogenetic trees and multiple sequence alignments.
The main challenge will be creating components that are responsively interactive, e.g. to change the order of nodes in a tree (or explode/implode nodes), and to design a clean interface to other js components.
Involved toolkits or projects 
Suzi Lewis, Chris Mungall

Create a usable phyloinformatics API for BioJava

Biojava has considerable sequence manipulation capability combined with distributions and good support for dynamic programming. It would be highly desirable to extend this capability to provide a phyloinformatics API.
Start with a simple object model that can represent phyloinformatics objects and concepts and provide basic I/O with common formats. Build on the experimental code that is currently in packages.
A workable, and elegant data model with a flexible I/O that is consistent with other biojava patterns. Good documentation and Unit tests need to be built in right from the start. Concurrent example code and tutorials will be needed to maximise adoption of the API. Target JDK = 1.5?
Involved toolkits or projects 
Richard Holland, Tobias Thierer, Mark Schreiber

What should prospective students know?

  • When applying, provide
    1. your interests, what makes you excited
    2. why you are interested in the project, and what you anticipate to gain from it
    3. a summary of your programming skills
    4. programs or projects you have previously authored or contributed to, in particular those available as open-source
  • Once we are accepted, we will create a mailing list through which you can ask any questions or propose a project of your own to the mentors.

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, organizes the Bioinformatics Open Source Conference (BOSC) on an annual basis since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the latter, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is a container architecture for Character Data And Trees. CDAT is in the early stages of its development. The perceived end result is an architecture that keeps track of Bio::* data objects in ways that are applicable to the task at hand. For example, a CDAT subclass could manage the relationships between a tree and multiple data columns for input in comparative analyses. Bio::Phylo is an API for phylogenetic data used by the CIPRES project. Bio::Phylo is intended to be compatible with BioPerl and CDAT, while functioning as a petri dish for new object designs.

Google Summer of Code