Phyloinformatics Summer of Code 2007

Revision as of 18:23, 18 March 2007 by Hilmar (talk) (What should prospective students know?)

We are applying to the Google Summer of Code (GSoC) program for the first time this year. On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off.


  • 12:39, 15 March 2007 (EDT) We have been accepted! The mailing list is on-line and open to anyone to post questions or suggestions. I also published the application document (after obfuscating email addresses). Hlapp
  • 17:32, 12 March 2007 (EDT) Finally pulled everything together, and submitted application! Hlapp
  • 21:20, 2 March 2007 (EST) Created page, added a couple of links, outline, and started filling in some bits. Hlapp


We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for three main reasons.

  1. The code that students will write will make it possible to ask new and increasingly complex questions in comparative biology, one of the central disciplines in understanding the evolution of life. As part of its agenda, the Phyloinformatics group has already collected the use-cases from the research community that represent the most common and pervasive problems in phyloinformatics. Work on the suggested or similar projects is bound to make an impact. In addition, NESCent is committed to training end-users and scientific software developers in using the resulting work through summer courses and conference tutorials.
  2. The range of problems that students can make meaningful contributions to is diverse, enabling us to accommodate different areas of interest and skills. The participating toolkits, and the GSoC project ideas we have generated, cover a variety of programming languages and tasks, yet are all directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to entire developer communities of the participating projects.
  3. Aside from gaining solutions to problems in phyloinformatics, we view our GSoC participation as an opportunity to gain future contributors to reusable open-source software components in phyloinformatics from the pool of future researchers in comparative biology. Once accepted, we will advertise this program through appropriate channels to reach undergrad and grad students interested in computational comparative biology. Some of the mentors can relate particularly well to students who are novices in research programming.


Note: if there is more than one mentor for a project, the primary mentor is in bold font. Biographical and other information on the mentors is linked to in the Mentors section.

Enable multi-language bindings to the C++ NEXUS Class Library

The hackathon revealed that consistent behavior from NEXUS parsers is needed in all of the Bio* toolkits as well as in many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the Nexus Class Library) is a flexible C++ library for parsing NEXUS. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks are supported yet. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.

Reusing this library could improve code-reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.

SWIG is a system for writing C/C++ extensions to multiple scripting languages. By exposing the high-level interface of NCL to scripting languages, NCL can perform the parsing and allow the scripting languages to simply extract the data they need. (The Inline architecture might also be used for easy Perl bindings.)
NCL is written in portable C++ (using automake and autoconf for all non-Windows platforms, and including a hand-maintained build system for Windows builds), but distributing code that depends on C/C++ libraries is always more difficult than distributing pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL
Mark Holder

Evolve Unix phyloinformatics tools into Ajax applications

Many powerful new tools for phylogenetic stochastic grammar analysis of multiple alignments, such as xrate or PHAST, as well as other phyloinformatic tools like PAML, PHYLIP, GeneTree, HyPhy etc., are available only from the Unix command line. These tools need to become operable over the web, especially via Ajax platforms such as the new Google Maps-like interface to GBrowse.
Use toolkits such as dojo to build asynchronous JavaScript wrappers for Unix tools (probabilistic modeling & phylogeny tools, format conversion utilities, sequence analysis & alignment software, genome annotation pipelines, grids & job queues, realtime parallelizable systems); other JavaScript/web components (alignment viewers, tree viewers & navigators, genome browsers); and bioinformatics "mashups". Interface with gmod-ajax, Amigo and other web-based bioinformatics platforms.
Adapting command-line tools for web use; creating an asynchronous user interface; developing infrastructure for mashable bioinformatics -- see for more ideas/details
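As a rough illustration of the server-side half of such a wrapper, the sketch below starts a command-line tool in the background and lets a caller poll for completion, which is the pattern an asynchronous browser client would rely on. The class name JobRunner and its methods are hypothetical and not part of any toolkit named above; a real deployment would also need a web layer, a job queue, and input validation.

```python
import subprocess
import tempfile
import uuid


class JobRunner:
    """Hypothetical helper: run command-line tools without blocking the caller."""

    def __init__(self):
        self._jobs = {}  # job id -> (process, output file)

    def submit(self, argv):
        """Start the tool and return immediately with a job id."""
        # Output goes to a temporary file so a chatty tool cannot fill a pipe
        # and stall while nobody is reading from it.
        output = tempfile.TemporaryFile(mode="w+")
        process = subprocess.Popen(argv, stdout=output, stderr=subprocess.STDOUT)
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = (process, output)
        return job_id

    def status(self, job_id):
        """Non-blocking check that a web client could poll."""
        process, _ = self._jobs[job_id]
        return "running" if process.poll() is None else "finished"

    def result(self, job_id):
        """Wait for the job if necessary, then return (exit code, output text)."""
        process, output = self._jobs[job_id]
        process.wait()
        output.seek(0)
        return process.returncode, output.read()
```

A web handler would call submit() when a form is posted, return the job id to the browser, and answer later polling requests from status() and result().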
Involved toolkits or projects 
BioPerl/Biopython/Bioruby; SWIG; dojo; Sun Grid Engine; Erlang
Ian Holmes, Mitch Skinner, Chris Mungall, Jason Stajich, Lincoln Stein

Extending the "make" paradigm for bioinformatics annotation pipelines

Annotating a genome, or performing other large-scale bioinformatics analyses, typically involves a series of operations with sequential dependencies but also strong parallelism. The "make" program is one robust approach that is often used to build such analysis pipelines, but it suffers from serious drawbacks for bioinformatics (e.g. no built-in database access; extremely limited pattern-matching; the language is not extensible; dependencies are triggered only by file timestamps and not, e.g., by an MD5 hash indicating that file contents have changed).
The project will involve building a replacement for or upgrade to "make". One possible approach will be to use a declarative language with (i) strong support for distributed processing, (ii) easy-to-use Unix "hooks" (cf. make), (iii) database and filesystem access. Examples of candidate languages include Erlang and Termite Scheme. Alternatively, C-inclined students may start with an existing parallel "make" clone, such as qmake or distmake -- see for more ideas/details
The first challenge is to get something that is as convenient to use as "make" for migrating throwaway command-lines and analysis scripts into robust pipeline stages. Subsequent challenges will include database access, flexible pattern-matching and enhanced dependency triggers.
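To make the idea of an enhanced dependency trigger concrete, here is a minimal sketch of a content-based staleness check, in which a target is rebuilt only when the MD5 hash of a source file differs from the one recorded at the last build. The function names and the JSON state file are assumptions made for illustration; they do not come from make, qmake, or distmake.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def _load_state(state_file):
    """Read the recorded hashes, or start empty if nothing was recorded yet."""
    state_path = Path(state_file)
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def needs_rebuild(target, sources, state_file):
    """True if the target is missing or any source's content hash has changed."""
    recorded = _load_state(state_file)
    current = {str(source): file_digest(source) for source in sources}
    return not Path(target).exists() or recorded.get(str(target)) != current


def record_build(target, sources, state_file):
    """Store the source hashes that the freshly built target was made from."""
    recorded = _load_state(state_file)
    recorded[str(target)] = {str(source): file_digest(source) for source in sources}
    Path(state_file).write_text(json.dumps(recorded, indent=2))
```

With this check, touching a file without changing its contents does not trigger a rebuild, whereas a timestamp comparison would. The price is that every source file is read and hashed on each check.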
Involved toolkits or projects 
Erlang, Termite Scheme, distmake/qmake, or other.
Ian Holmes, Chris Mungall

AJAX widgets for phylo-informatics

The AJAX model offers exciting new opportunities for working with phylo-informatics data.
Build JavaScript widgets (visualization components with active hooks for clicking on various parts of the data structure) for structures such as phylogenetic trees and multiple sequence alignments.
The main challenge will be creating components that are responsively interactive, e.g. to change the order of nodes in a tree (or explode/implode nodes), and designing a clean interface to other JavaScript components.
Involved toolkits or projects 
Suzi Lewis, Chris Mungall

Create a usable phyloinformatics API for BioJava

Biojava has considerable sequence manipulation capability combined with distributions and good support for dynamic programming. It would be highly desirable to extend this capability to provide a phyloinformatics API.
Start with a simple object model that can represent phyloinformatics objects and concepts and provide basic I/O with common formats. Build on the experimental code that is currently in packages.
The goal is a workable and elegant data model with flexible I/O that is consistent with other Biojava patterns. Good documentation and unit tests need to be built in right from the start. Concurrent example code and tutorials will be needed to maximise adoption of the API. Target JDK = 1.5?
Involved toolkits or projects 
Richard Holland, Tobias Thierer, Mark Schreiber

Topological query application for BioSQL

The phyloinformatics hackathon created a Phylogenetic Tree extension model for BioSQL and formulated a series of topological queries on top of the model. This is the foundation of a "MyTreebase" that researchers can use to populate, manipulate, and query their own phylogenetic trees. However, an end-user application with a command-line UI and/or interactive web-based UI is still missing.
An end-user application, whether command-line or web-based, should probably accomplish at least the following functions: importing trees, precomputing optimization values (nested set and transitive closure), querying trees, modifying trees (delete branch, add branch, move branch), and exporting trees.
A clean object-relational mapping has not yet been done for either BioPerl or Biojava. Depending on the skills of the student(s), this could be a separate project, part of the project, or omitted in favor of a simplified (non-generic) mapping.
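For readers unfamiliar with the two optimization values mentioned above, the following sketch shows one common way to precompute them for a rooted tree held in memory. The dictionary-of-children representation and the function names are assumptions for illustration only; they are not the BioSQL extension's schema or any toolkit's API.

```python
def nested_set(children, root):
    """Number nodes depth-first so each node gets a (left, right) interval
    that encloses the intervals of all of its descendants."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def transitive_closure(children, root):
    """List every (ancestor, descendant, distance) triple in the tree."""
    rows = []

    def visit(node, ancestors):
        # The nearest ancestor comes first and is one edge away.
        for distance, ancestor in enumerate(reversed(ancestors), start=1):
            rows.append((ancestor, node, distance))
        for child in children.get(node, []):
            visit(child, ancestors + [node])

    visit(root, [])
    return rows


def is_descendant(bounds, node, ancestor):
    """Answer an ancestry query using only the precomputed intervals."""
    return bounds[ancestor][0] < bounds[node][0] and bounds[node][1] < bounds[ancestor][1]


# Hypothetical example tree: root has children A and X; X has children B and C.
tree = {"root": ["A", "X"], "X": ["B", "C"]}
bounds = nested_set(tree, "root")
closure = transitive_closure(tree, "root")
```

Once the intervals or closure rows are stored alongside the nodes, a query such as "all descendants of X" becomes a range comparison or a simple lookup rather than a recursive traversal. Both functions are recursive, so very deep trees would need an iterative rewrite, and both values must be recomputed when branches are added, deleted, or moved.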
Involved toolkits or projects 
BioSQL, BioPerl/Bioperl-db or Biojava
Hilmar Lapp, Bill Piel

Phylogenetic XML <--> Object serialization

Phylogenetic data (i.e. evolutionary trees and the data they are based on) is commonly stored in the NEXUS flat text file format. Unfortunately, this format is fairly difficult to parse due to incompatible "dialects" that have evolved over time. Also, a formal means to validate all flavors of NEXUS does not exist. Several XML formats to replace NEXUS have been proposed, none of which, for practical and historical reasons, have been widely adopted. One reason why none of these efforts have been successful is that they never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. We seek a student who will remedy this and design an XML format that:
  1. can express (at least the biological data part of) the NEXUS specification,
  2. can be validated using DTD or XML schema,
  3. is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences),
  4. is optimized for minimal memory requirements and limited file size for large character matrices or trees (e.g., no deep nesting, so that stream-based parsers do not need to hold large chunks in memory),
  5. uses uids and id references to link annotations to entities, nodes to taxa, etc.,
  6. is efficiently serialized into objects of one of the commonly used toolkits (see below).
There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these, which can serve as a basis to design a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.
Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition - and an optimized parser implementation that is nevertheless robust.
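The sketch below illustrates the kind of flat, id-referenced layout and stream-based parsing described in the requirements above. The element and attribute names (phylo, taxon, node, parent, length) are invented for this example and are not taken from PhyloXML or any other proposed format; the parser uses only the Python standard library.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical flat document: nodes refer to their parent and taxon by id
# instead of being nested inside one another.
SAMPLE = b"""<phylo>
  <taxon id="t1" label="Homo sapiens"/>
  <taxon id="t2" label="Pan troglodytes"/>
  <node id="n1"/>
  <node id="n2" parent="n1" taxon="t1" length="0.1"/>
  <node id="n3" parent="n1" taxon="t2" length="0.2"/>
</phylo>"""


def read_flat_tree(stream):
    """Stream through the document, resolving id references into dictionaries."""
    taxa = {}       # taxon id -> label
    parent_of = {}  # node id -> parent node id (None for the root)
    taxon_of = {}   # node id -> taxon id, for nodes that carry one
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "taxon":
            taxa[elem.get("id")] = elem.get("label")
        elif elem.tag == "node":
            node_id = elem.get("id")
            parent_of[node_id] = elem.get("parent")
            if elem.get("taxon") is not None:
                taxon_of[node_id] = elem.get("taxon")
        # Discard the element's attributes and children once they are handled.
        elem.clear()
    return taxa, parent_of, taxon_of


taxa, parent_of, taxon_of = read_flat_tree(io.BytesIO(SAMPLE))
```

Because no element is nested more than one level deep, the parser can handle each element as soon as it closes. A deeply nested tree encoding would instead force it to keep whole subtrees in memory until the outermost element ends.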
Involved toolkits or projects 
PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
Rutger Vos, Hilmar Lapp, Bill Piel


What should prospective students know?

  • When applying, provide
    1. your interests, what makes you excited
    2. why you are interested in the project, and what you anticipate to gain from it
    3. a summary of your programming skills
    4. programs or projects you have previously authored or contributed to, in particular those available as open-source
  • Please send any questions you have, or projects you would like to propose, to This will reach all mentors and our PhyloSoC administrators.
  • You can visit our application document with Google's questions and our answers. The part that isn't on this page or linked from here already is mostly in the last couple of questions on mentors and students.
  • Google has now put up instructions for how students apply on-line. Student application is now open and runs through March 24, 2007.
  • For eligibility, see the GSoC eligibility requirements for students. These requirements and the age restriction of 18 years or older must be met on April 9, 2007. High school students meeting the age requirement are also eligible.
  • Students will receive a stipend from Google.

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the latter, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is a container architecture for Character Data And Trees. CDAT is in the early stages of its development. The envisioned end result is an architecture that keeps track of Bio::* data objects in ways that are applicable to the task at hand. For example, a CDAT subclass could manage the relationships between a tree and multiple data columns for input in comparative analyses. Bio::Phylo is an API for phylogenetic data used by the CIPRES project. Bio::Phylo is intended to be compatible with BioPerl and CDAT, while functioning as a petri dish for new object designs.

Google Summer of Code