Phyloinformatics Summer of Code 2007

Revision as of 17:04, 4 March 2007 by Ianholmes (talk)

We are applying to the Google Summer of Code (GSoC) program for the first time this year. On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off.

News

  • 21:20, 2 March 2007 (EST) Created page, added a couple of links, outline, and started filling in some bits. Hlapp

Why

We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for three reasons.

  1. The code that students write will have a significant impact on evolutionary and comparative biology research, scientific disciplines that are driving forward the understanding of life in the postgenomic era. Part of the prior work of this group has been to collect use cases from the evolutionary biology community at large and flesh them out with the group's own expertise, resulting in a list of the most common and pervasive problems caused by the lack of phyloinformatics cyberinfrastructure. Work to fill in these gaps is bound to make an impact. NESCent is also committed to disseminating the contributions and their value for the research community through summer courses and conference tutorials.
  2. The range of problems on which students can work productively and make meaningful contributions is diverse and can accommodate different areas of interest and levels of prior skill. The participating toolkits cover a variety of programming languages and tasks, yet are directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to the entire developer communities of the participating toolkits, which in the past have been very supportive of novice programmers.
  3. We view this program as an opportunity to attract and train future researchers in phylogenetics or related disciplines who will not only become better aware of phyloinformatics infrastructure gaps, but will also be able to fill them in themselves, in a manner that is increasingly standards-compliant and reusable by others. In particular, once accepted we will extensively advertise this program through channels commonly read by undergraduate and graduate students in biology, bioinformatics, and computational biology. When reviewing students who apply, we will weigh not only relevant programming knowledge, but also genuine interest in comparative biology research. Some of the mentors have been selected because they can relate particularly well to students who are novices in research programming.

Ideas

Below is a template for how the student project ideas could be presented. Feel free to copy, paste, and edit it, and feel free to adjust the format.

Write a NEXUS parser in C&

Rationale 
C& is an amp'ed-up programming language that has not been invented yet but in a few years will dominate the programming world. The best way to prevent broken non-compliant NEXUS parsers written in C& from appearing is to write a good one now.
Approach 
Re-implementations of NEXUS parsers inevitably tend to be broken or non-compliant. Hence, the best approach is to write a translator that translates a reference implementation to C&.
Challenges 
C& has not been invented yet, so a lot of assumptions will have to be made.
Involved toolkits or projects 
The BioC& toolkit has much of the needed framework.
Mentors 
Mike&, founder of BioC&

Enable multi-language bindings to the C++ NEXUS Class Library

Rationale 
The hackathon revealed that consistent behavior from NEXUS parsers is needed in all of the Bio* toolkits as well as in many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the NEXUS Class Library) is a flexible C++ library for parsing NEXUS files. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks are supported yet. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.

Reusing this library could improve code reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.

Approach 
SWIG is a system for writing C/C++ extensions to multiple scripting languages. If the high-level interface of NCL is exposed to scripting languages, NCL can perform the parsing and the scripting languages can simply extract the data they need.
Challenges 
NCL is written in portable C++ (using automake and autoconf for all non-Windows platforms, and including a hand-maintained build system for Windows builds), but distributing code that depends on C/C++ libraries is always more difficult than distributing pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL
Mentors 
MarkHolder

Write AJAX front-ends to Unix tools for phyloinformatics

Rationale 
Many powerful new tools for phylogenetic analysis of multiple alignments, such as [xrate], are available only from the Unix command line. These tools should be accessible from web-based systems such as GBrowse.
Approach 
Use toolkits such as [dojo] to build asynchronous JavaScript wrappers for robust informatics analysis tools; interface with the [AJAX GBrowse] and other interfaces.
Challenges 
Adapting the rich command-line interface so that its full power can be expressed over the web; managing asynchronicity issues (for example, long-running analyses that must not block the browser).
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, [dojo]
Mentors 
IanHolmes
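
As a rough illustration of the asynchronicity challenge in this project idea, the following is a minimal server-side sketch in Python of how a long-running command-line tool might be started without blocking, so that a browser front-end could poll for completion. Everything in it is an assumption made for illustration: the `JobRunner` class and its `submit`, `status`, and `result` methods are hypothetical names, and the Python interpreter itself stands in for the analysis tool, since the actual command-line options of tools such as xrate are not described on this page. The web layer (for example, a handler answering requests from dojo) is left out entirely.

```python
import subprocess
import sys
import tempfile
import uuid


class JobRunner:
    """Hypothetical helper: run command-line tools in the background."""

    def __init__(self):
        # Maps a job id to (process handle, temporary output file).
        self._jobs = {}

    def submit(self, argv):
        """Start the command without waiting for it; return a job id."""
        # Output goes to a temporary file rather than a pipe, so a tool
        # that prints a lot cannot stall while nobody is reading.
        out = tempfile.TemporaryFile(mode="w+")
        proc = subprocess.Popen(argv, stdout=out, stderr=subprocess.STDOUT)
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = (proc, out)
        return job_id

    def status(self, job_id):
        """Return 'running' or 'done'; cheap enough to call on every poll."""
        proc, _ = self._jobs[job_id]
        return "running" if proc.poll() is None else "done"

    def result(self, job_id):
        """Wait for the job, then return its exit code and combined output."""
        proc, out = self._jobs.pop(job_id)
        proc.wait()
        out.seek(0)
        text = out.read()
        out.close()
        return {"returncode": proc.returncode, "output": text}


if __name__ == "__main__":
    runner = JobRunner()
    # Stand-in for a real analysis tool invocation.
    job = runner.submit([sys.executable, "-c", "print('analysis finished')"])
    print(runner.status(job))
    print(runner.result(job))
```

In a design along these lines, the front-end would call something like `submit` once, then request `status` periodically and fetch `result` only when the job reports that it is done. A student project would still need to decide how jobs are persisted, limited, and cleaned up.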

What should prospective students know?

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the latter, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is <please fill in here>. Bio::Phylo is <please fill in here>.
CIPRES project 
please fill in here; is this participating in the PhyloSoC program?

Google Summer of Code