Difference between revisions of "Phyloinformatics Summer of Code 2007"

(Ideas: biojava)

Revision as of 01:47, 5 March 2007

We are applying to the Google Summer of Code (GSoC) program for the first time this year. On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, other people or channels to contact for more information or to bounce ideas off, etc.

News

  • 21:20, 2 March 2007 (EST) Created page, added a couple of links, outline, and started filling in some bits. Hlapp

Why

We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for basically three reasons.

  1. The code that students would write will make a significant impact in evolutionary and comparative biology research, scientific disciplines that are driving forward the understanding of life in the postgenomic era. Part of the prior work of this group has been to collect use cases from the evolutionary biology community at large and flesh them out with the group's own expertise, resulting in a list of the most common and pervasive problems caused by the lack of phyloinformatics cyberinfrastructure. Work to fill in the gaps is bound to make an impact. NESCent is also committed to disseminating the contributions and their value for the research community through summer courses and conference tutorials.
  2. The range of problems on which students can work productively and make meaningful contributions is diverse and can accommodate different areas of interest and levels of prior skill. The participating toolkits cover a variety of programming languages and tasks, yet are directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to the entire developer communities of the participating toolkits, which in the past have been very supportive of newbie programmers.
  3. We view this program as an opportunity to attract and train future researchers in phylogenetics or related disciplines who not only will become better aware of phyloinformatics infrastructure gaps, but will also be able to fill those gaps themselves, in a manner that is increasingly standards-compliant and reusable by others. In particular, once accepted we will extensively advertise this program through channels commonly read by undergrad and grad students in biology, bioinformatics, and computational biology. When reviewing students who apply, depending on the applicant we will weigh not only relevant programming knowledge, but also genuine interest in comparative biology research. Some of the mentors have been selected because they can relate particularly well to students who are novices in research programming.

Ideas

Below is a template for how the student project ideas could be presented. Feel free to copy & paste & edit, and feel free to adjust the format ...

Write a NEXUS parser in C&

Rationale 
C& is an amp'ed-up programming language that has not been invented yet but in a few years will dominate the programming world. The best way to prevent broken non-compliant NEXUS parsers written in C& from appearing is to write a good one now.
Approach 
Re-implementations of NEXUS parsers inevitably tend to be broken or non-compliant. Hence, the best approach is to write a translator that translates a reference implementation to C&.
Challenges 
C& has not been invented yet, so a lot of assumptions will have to be made.
Involved toolkits or projects 
The BioC& toolkit has much of the needed framework.
Mentors 
Mike&, founder of BioC&

Enable multi-language bindings to the C++ NEXUS Class Library

Rationale 
The hackathon revealed that consistent behavior from NEXUS parsers is needed in all of the Bio* toolkits as well as in many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the Nexus Class Library) is a flexible C++ library for parsing NEXUS files. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks are supported yet. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.

Reusing this library could improve code-reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.

Approach 
SWIG is a system for writing C/C++ extensions to multiple scripting languages. If the high-level interface of NCL is exposed to scripting languages, NCL can perform the parsing and the scripting languages can simply extract the data they need.
Challenges 
NCL is written in portable C++ (using automake and autoconf for all non-Windows platforms, and including a hand-maintained build system for Windows builds), but distributing code that depends on C/C++ libraries is always more difficult than distributing pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL
Mentors 
Mark Holder
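
As a rough illustration of the kind of core task (tokenizing a file into blocks and commands) that a shared parsing library would take off each toolkit's hands, here is a deliberately naive sketch in Python. It is not NCL and not any toolkit's actual parser; the function name `split_nexus_blocks` is hypothetical, and the sketch ignores nested comments and quoted tokens, which is exactly where independent reimplementations tend to diverge.

```python
import re


def split_nexus_blocks(text):
    """Split a simple NEXUS string into {BLOCK_NAME: [command, ...]}.

    Naive on purpose: nested [comments] and quoted tokens containing
    ';' or '[' are NOT handled. A real parser such as NCL has to cope
    with those cases.
    """
    # Remove (non-nested) bracketed comments.
    text = re.sub(r"\[[^\]]*\]", "", text)
    stripped = text.lstrip()
    if not stripped.upper().startswith("#NEXUS"):
        raise ValueError("missing #NEXUS header")
    body = stripped[len("#NEXUS"):]

    blocks = {}
    current = None
    # NEXUS commands are terminated by semicolons.
    for command in body.split(";"):
        command = " ".join(command.split())  # normalize whitespace
        if not command:
            continue
        words = command.split()
        keyword = words[0].upper()
        if keyword == "BEGIN" and len(words) > 1:
            current = words[1].upper()
            blocks[current] = []
        elif keyword in ("END", "ENDBLOCK"):
            current = None
        elif current is not None:
            blocks[current].append(command)
    return blocks


EXAMPLE = """#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN TREES; [a comment]
  TREE t1 = ((A,B),C);
END;
"""
```

Calling `split_nexus_blocks(EXAMPLE)` yields one entry for the TAXA block and one for the TREES block, each holding the commands found inside it.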

Evolve Unix phyloinformatics tools into Ajax applications

Rationale 
Many powerful new tools for phylogenetic stochastic grammar analysis of multiple alignments, such as xrate or PHAST, as well as other phyloinformatic tools like PAML, PHYLIP, GeneTree, HyPhy etc., are available only from the Unix command line. These tools need to become operable over the web, especially via Ajax platforms such as the new Google Maps-like interface to GBrowse.
Approach 
Use toolkits such as dojo to build asynchronous JavaScript wrappers for Unix tools (probabilistic modeling & phylogeny tools, format conversion utilities, sequence analysis & alignment software, genome annotation pipelines, grids & job queues, realtime parallelizable systems); other JavaScript/web components (alignment viewers, tree viewers & navigators, genome browsers); and bioinformatics "mashups". Interface with gmod-ajax, Amigo and other web-based bioinformatics platforms.
Challenges 
Adapting command-line tools for web use; creating an asynchronous user interface; developing infrastructure for mashable bioinformatics...
Involved toolkits or projects 
BioPerl/Biopython/Bioruby; SWIG; dojo; Sun Grid Engine; Erlang
Mentors 
Ian Holmes, Mitch Skinner, Chris Mungall, Jason Stajich
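
The server-side half of such a wrapper could look roughly like the following minimal sketch, written in Python using only the standard library. It assumes a submit-then-poll design, which the idea above does not prescribe; the class `JobQueue` and its methods are hypothetical names, and a real deployment would more likely hand jobs to a grid scheduler than to a local thread pool.

```python
import subprocess
import sys
import uuid
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait as wait_for_futures


class JobQueue:
    """Run command-line tools in the background and let callers poll.

    A web handler would call submit() when the browser posts a job and
    status() each time the browser's asynchronous request asks for news.
    """

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, argv):
        """Start argv (a list of strings) and return an opaque job id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(
            subprocess.run, argv, capture_output=True, text=True
        )
        return job_id

    def status(self, job_id):
        """Return a JSON-friendly dict describing the job's state."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        error = future.exception()
        if error is not None:  # e.g. the executable was not found
            return {"state": "failed", "error": str(error)}
        result = future.result()
        return {
            "state": "done",
            "returncode": result.returncode,
            "stdout": result.stdout,
        }

    def block_until_done(self, job_id, timeout=None):
        """Convenience for scripts and tests; a web UI would poll instead."""
        wait_for_futures([self._jobs[job_id]], timeout=timeout)
        return self.status(job_id)


if __name__ == "__main__":
    queue = JobQueue()
    # Stand-in for a real tool: run the Python interpreter itself.
    job = queue.submit([sys.executable, "-c", "print('(A,B);')"])
    print(queue.block_until_done(job, timeout=30))
```

The point of the design is that `submit` returns immediately, so a long-running analysis never blocks the page; the browser side only needs to ask for `status` periodically and render the result once the state becomes "done".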

Create a usable phyloinformatics API for BioJava

Rationale 
Biojava has considerable sequence manipulation capability combined with distributions and good support for dynamic programming. It would be highly desirable to extend this capability to provide a phyloinformatics API.
Approach 
Start with a simple object model that can represent phyloinformatics objects and concepts and provide basic I/O with common formats. Build on the experimental code that is currently in the org.biojavax.bio.phylo packages.
Challenges 
A workable and elegant data model with flexible I/O that is consistent with other biojava patterns. Good documentation and unit tests need to be built in right from the start. Example code and tutorials will need to be written alongside the API to maximise its adoption. Target JDK = 1.5?
Mentors 
Richard Holland, Tobias Thierer, Mark Schreiber
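
To give a feel for how small the starting point of "a simple object model plus basic I/O" can be, here is a language-neutral sketch, written in Python only for brevity. It does not reflect the existing org.biojavax.bio.phylo code or any BioJava API; `Node`, `parse_newick`, and `to_newick` are hypothetical names, Newick is picked merely as one common format, and quoted labels and comments are not handled.

```python
class Node:
    """A tree node: optional name, optional branch length, child nodes."""

    def __init__(self, name="", branch_length=None, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = list(children or [])

    def is_leaf(self):
        return not self.children

    def leaves(self):
        """Return all leaf nodes below (or equal to) this node, left to right."""
        if self.is_leaf():
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def parse_newick(text):
    """Parse a plain Newick string (no quotes, no comments) into a Node."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    node, pos = _parse_subtree(text, 0)
    if text[pos] != ";":
        raise ValueError("unexpected character at position %d" % pos)
    return node


def _parse_subtree(text, pos):
    children = []
    if text[pos] == "(":
        pos += 1
        while True:
            child, pos = _parse_subtree(text, pos)
            children.append(child)
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError("unbalanced parentheses near position %d" % pos)
    # Optional label, then optional ':length'. The trailing ';' stops both scans.
    start = pos
    while text[pos] not in ",():;":
        pos += 1
    name = text[start:pos].strip()
    length = None
    if text[pos] == ":":
        start = pos + 1
        pos = start
        while text[pos] not in ",();":
            pos += 1
        length = float(text[start:pos])
    return Node(name, length, children), pos


def to_newick(node):
    """Serialize a Node back to a Newick string."""
    return _format(node) + ";"


def _format(node):
    text = ""
    if node.children:
        text = "(" + ",".join(_format(c) for c in node.children) + ")"
    text += node.name
    if node.branch_length is not None:
        text += ":%g" % node.branch_length
    return text
```

A round trip such as `to_newick(parse_newick("((A:0.1,B:0.2)AB:0.05,C:0.3);"))` is the sort of behavior that unit tests could pin down from the very first commit, whatever the final class names turn out to be.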

What should prospective students know?

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the last, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is <please fill in here>. Bio::Phylo is <please fill in here>.
CIPRES project 
please fill in here; is this participating in the PhyloSoC program?

Google Summer of Code