Phyloinformatics Summer of Code 2007

Revision as of 22:39, 4 March 2007 by Ianholmes (talk)

We are applying to the Google Summer of Code (GSoC) program for the first time this year. On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off.


  • 21:20, 2 March 2007 (EST) Created page, added a couple of links, outline, and started filling in some bits. Hlapp


We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for three main reasons.

  1. The code that students would write will make a significant impact on evolutionary and comparative biology research, scientific disciplines that are driving forward the understanding of life in the postgenomic era. Part of the prior work of this group has been to collect use-cases from the evolutionary biology community at large and flesh them out with the group's own expertise, identifying the most common and pervasive problems caused by the lack of phyloinformatics cyberinfrastructure. Work to fill in the gaps is bound to make an impact. NESCent is also committed to disseminating the contributions and their value to the research community through summer courses and conference tutorials.
  2. The range of problems on which students can work productively and make meaningful contributions is diverse, and can accommodate different areas of interest and levels of prior skill. The participating toolkits cover a variety of programming languages and tasks, yet are directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to the entire developer communities of the participating toolkits, which in the past have been very supportive of newbie programmers.
  3. We view this program as an opportunity to attract and train future researchers in phylogenetics or related disciplines who will not only become better aware of phyloinformatics infrastructure gaps, but will also be able to fill those in themselves, in a manner that is increasingly standards-compliant and reusable by others. In particular, once accepted we will extensively advertise this program through channels commonly read by undergrad and grad students in biology, bioinformatics, and computational biology. When reviewing students who apply, we will weigh, depending on the applicant, not only relevant programming knowledge but also genuine interest in comparative biology research. Some of the mentors have been selected because they can relate particularly well to students who are novices in research programming.


Below is a template for how the student project ideas could be presented. Feel free to copy, paste, and edit, and feel free to adjust the format ...

Write a NEXUS parser in C&

C& is an amp'ed-up programming language that has not been invented yet but in a few years will dominate the programming world. The best way to prevent broken non-compliant NEXUS parsers written in C& from appearing is to write a good one now.
Re-implementations of NEXUS parsers inevitably tend to be broken or non-compliant. Hence, the best approach is to write a translator that translates a reference implementation to C&.
C& has not been invented yet, so a lot of assumptions will have to be made.
Involved toolkits or projects 
The BioC& toolkit has much of the needed framework.
Mike&, founder of BioC&

Enable multi-language bindings to the C++ NEXUS Class Library

The hackathon revealed that consistent behavior from NEXUS parsers is needed in all of the Bio* toolkits as well as in many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the Nexus Class Library) is a flexible C++ library for supporting the parsing of NEXUS. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks have been handled. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.

Reusing this library could increase code reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.

SWIG is a system for writing C/C++ extensions to multiple scripting languages. Exposing the high-level interface of NCL to scripting languages would let NCL perform the parsing while the scripting languages simply extract the data they need.
NCL is written in portable C++ (using automake and autoconf for all non-Windows platforms, and including a hand-maintained build system for Windows builds), but distributing code that depends on C/C++ libraries is always more difficult than distributing pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL
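
To make the phrase "simply extract the data they need" more concrete, here is a minimal sketch of the kind of high-level view a scripting language might receive from a shared parser. It is a pure-Python stand-in written only for illustration: it is not NCL, it is not a compliant NEXUS parser, and the names `read_blocks` and `NexusError` are hypothetical. It assumes comments are not nested and that no quoted token contains a semicolon, both of which real NEXUS files may violate.

```python
import re


class NexusError(ValueError):
    """Raised when the input is not recognisably NEXUS (hypothetical name)."""


def read_blocks(text):
    """Return {BLOCK_NAME: [command, ...]} for a simple NEXUS string.

    Assumptions (illustration only): comments in square brackets are not
    nested, and quoted tokens never contain ';'.
    """
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop non-nested comments
    stripped = text.lstrip()
    if not stripped.upper().startswith("#NEXUS"):
        raise NexusError("missing #NEXUS header")
    body = stripped[len("#NEXUS"):]

    blocks = {}
    current = None  # name of the block we are inside, if any
    for raw in body.split(";"):
        command = " ".join(raw.split())  # normalise whitespace
        if not command:
            continue
        words = command.split()
        keyword = words[0].upper()
        if keyword == "BEGIN":
            if current is not None or len(words) != 2:
                raise NexusError("unexpected BEGIN: %r" % command)
            current = words[1].upper()
            blocks.setdefault(current, [])
        elif keyword in ("END", "ENDBLOCK"):
            if current is None:
                raise NexusError("END without BEGIN")
            current = None
        elif current is None:
            raise NexusError("command outside any block: %r" % command)
        else:
            blocks[current].append(command)
    if current is not None:
        raise NexusError("block %s is not terminated" % current)
    return blocks


EXAMPLE = """#NEXUS
[a tiny example file]
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN TREES;
  TREE t1 = ((A,B),C);
END;
"""

if __name__ == "__main__":
    for name, commands in read_blocks(EXAMPLE).items():
        print(name, commands)
```

With real bindings, the tokenizing and error reporting would happen inside the C++ library, and each scripting language would only see a structure comparable to the dictionary returned here.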

Write AJAX front-ends to Unix tools for phyloinformatics

Many powerful new tools for phylogenetic stochastic grammar analysis of multiple alignments, such as xrate or PHAST, as well as PAML and others, are available only from the Unix command line. These tools need to become operable through the web, and on platforms such as GBrowse and AJAX.
Use toolkits such as dojo to build asynchronous JavaScript wrappers for probabilistic modeling and phylogeny tools, format conversion utilities, alignment viewing/coloring systems, and general AJAX components for bioinformatics "mashups". Interface with AJAX GBrowse and other web-based bioinformatics services.
Adapting command-line tools for web use; creating an asynchronous user interface.
Involved toolkits or projects 
BioPerl/Biopython/Bioruby; SWIG; dojo; Sun Grid Engine; Erlang
Ian Holmes, Mitch Skinner, Chris Mungall
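
The server-side half of an asynchronous front-end could look roughly like the sketch below: a command-line tool is started in the background, the caller gets a job id back immediately, and a browser script would then poll for the result. This is a sketch under stated assumptions, not a proposed design. The function names (`submit_job`, `job_status`, `wait_for`) are hypothetical, jobs are kept in memory only, and the demo runs the Python interpreter as a stand-in command because no particular command line for xrate, PHAST, or PAML is assumed here.

```python
import json
import subprocess
import sys
import threading
import uuid

_jobs = {}     # job id -> status dictionary (what a poll request would return)
_threads = {}  # job id -> worker thread
_lock = threading.Lock()


def _run(job_id, argv):
    """Run the command to completion and record its result."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
        result = {"state": "done", "returncode": proc.returncode,
                  "stdout": proc.stdout, "stderr": proc.stderr}
    except OSError as exc:  # e.g. the tool is not installed
        result = {"state": "failed", "error": str(exc)}
    with _lock:
        _jobs[job_id].update(result)


def submit_job(argv):
    """Start argv in the background and return a job id immediately."""
    job_id = uuid.uuid4().hex
    worker = threading.Thread(target=_run, args=(job_id, argv), daemon=True)
    with _lock:
        _jobs[job_id] = {"state": "running"}
        _threads[job_id] = worker
    worker.start()
    return job_id


def job_status(job_id):
    """Return a copy of the job's current status (safe to serialise as JSON)."""
    with _lock:
        return dict(_jobs[job_id])


def wait_for(job_id, timeout=None):
    """Block until the job finishes or the timeout expires, then report status."""
    _threads[job_id].join(timeout)
    return job_status(job_id)


if __name__ == "__main__":
    # Stand-in for a real phylogenetics tool: print a Newick tree.
    job = submit_job([sys.executable, "-c", "print('((A,B),C);')"])
    print(json.dumps(job_status(job)))       # usually still "running"
    print(json.dumps(wait_for(job, timeout=30)))
```

In a web deployment, `submit_job` and `job_status` would sit behind two HTTP endpoints, and long-running analyses would more likely be handed to a queueing system such as Sun Grid Engine than to in-process threads.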

What should prospective students know?

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for BioSQL, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is <please fill in here>. Bio::Phylo is <please fill in here>.
CIPRES project 
please fill in here; is this participating in the PhyloSoC program?

Google Summer of Code