Phyloinformatics Summer of Code 2007


We applied to the Google Summer of Code (GSoC) program for the first time in 2007. On this page we collected ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off of.


Why

We believe that the goals, targets, and prior work of this Phyloinformatics working group make it particularly well suited as a mentoring organization for the GSoC program, for three main reasons.

  1. The code that students write will enable new and increasingly complex questions to be asked in comparative biology, one of the central disciplines in understanding the evolution of life. As part of its agenda, the Phyloinformatics group has already collected use-cases from the research community that represent the most common and pervasive problems in phyloinformatics. Work on the suggested or similar projects is bound to make an impact. In addition, NESCent is committed to training end-users and scientific software developers in using the resulting work through summer courses and conference tutorials.
  2. The range of problems that students can make meaningful contributions to is diverse, enabling us to accommodate different areas of interest and skills. The participating toolkits, and the GSoC project ideas we have generated, cover a variety of programming languages and tasks, yet are all directed towards the same overall goal. A diverse group of mentors is on hand and can quickly be expanded to entire developer communities of the participating projects.
  3. Aside from gaining solutions to problems in phyloinformatics, we view our GSoC participation as an opportunity to gain future contributors to reusable open-source software components in phyloinformatics from the pool of future researchers in comparative biology. Once accepted, we will advertise this program through appropriate channels to reach undergrad and grad students interested in computational comparative biology. Some of the mentors can relate particularly well to students who are novices in research programming.

Accepted Projects

Note that the accepted projects, students, and their mentors are also published by Google.

A Perl-based Command Line Interface to a Topological Query Application for BioSQL in Support of High Throughput Classification and Analysis of LTR Retrotransposons in Plant Genomes

This is an application for the Topological query application for BioSQL project idea.

Abstract: I will use Perl to create a set of command line programs for topological queries in BioSQL. The goal of this project is to create an interface that is suitable for high-throughput creation and modification of SQL-based phylogenies. I will use this interface to further my research on the classification of plant LTR retrotransposons.

Student: James Estill

Mentor(s): Hilmar Lapp (primary), Weigang Qiu, Bill Piel, Mike Muratet (secondary)

Project Blog: http://phylosoc2007jestill.blogspot.com/

Project homepage: Command Line Topological Query Application for BioSQL

Source code: http://code.google.com/p/phylosoc2007jestill/

Phyloinformatics Web Tools: PhyloWidget

This application proposes a new project that builds upon the Topological query application for BioSQL project idea.

Abstract: The purpose of this project is to (a) create a user-friendly and web-accessible GUI for creating phylogenetic tree queries, and (b) implement a SOAP-based client and server protocol for transmitting phylogenetic queries and results.

Student: Gregory Jordan

Mentor(s): Bill Piel (primary), Hilmar Lapp (secondary)

Project homepage: http://www.phylowidget.org/

Source code: http://code.google.com/p/phylowidget/source

APIs for BioJava

This is an application for the Create a usable phyloinformatics API for BioJava project idea.

Abstract: We are planning to develop APIs for BioJava. In particular, we will focus on phylogeny reconstruction methods such as UPGMA, Maximum Parsimony, and Maximum Likelihood.

Student: Bohyun Lee

Mentor(s): Richard Holland

Project homepage: http://biojava.org/wiki/BioJava:PhyloSOC07

Source code: org.biojavax.bio.phylo package (SVN)

Multi-language bindings to the C++ NEXUS Class Library

This is an application for the Enable multi-language bindings to the C++ NEXUS Class Library project idea.

Abstract: The goal of this project is the development of bindings to NCL for three scripting languages (Perl, Python, Ruby) employing SWIG, the Simplified Wrapper and Interface Generator, which is an open source tool designed to facilitate the development of extensions from C/C++ to other languages. The bindings will provide a way to rapidly prototype and easily develop applications that support the NEXUS format.

Student: David Suárez Pascal

Mentor: Mark Holder

Project homepage: Multi-language bindings to the C++ NEXUS Class Library

Source code: http://svn.pascal-in.net/nclbindings (subversion repository)

Phylogenetic XML <--> Object serialization

This is an application for the Phylogenetic XML <--> Object_serialization project idea.

Abstract: The goal of this project is to develop an XML based file format for use in phylogenetic analysis. This file type will contain most of the functionality of the current NEXUS file format. Additionally, a parser for this file type will be developed for the BioPerl package.

Student: Jason Caravas

Mentor(s): Rutger Vos

Project homepage: Phylogenetic XML <--> Object serialization

Source code: http://code.google.com/p/nexml07gsoc/source

Ajax interface for the XRate command-line tool

This is an application for the Evolve Unix phyloinformatics tools into Ajax applications project idea.

Abstract: This project focuses on building an easy-to-use, visual interface for Xrate. It will bring together several tools from the dart bioinformatics package to annotate multiple sequence alignments, train phylogenetic grammars, and visualize this information in an Ajax-enabled asynchronous web interface. As an initial application, it will be used to provide a web interface to explore and use the Rfam collection of grammars.

Student: Lars Barquist

Mentor(s): Ian Holmes

Project homepage: xREI biowiki page (includes instructions for downloading the current source via cvs)

Project Plan: http://ajax-xrate.googlecode.com/files/plan.pdf

Source code: LarsEric_Barquist.tar.gz (original google repository, currently out of date)

Demo: http://harmony.biowiki.org/xrei/

Implementing a web interface for command line-based bioinformatics tools

This application is being revised for the AJAX widgets for phylo-informatics project idea.

Abstract: This project aims to construct a web-based interface for the tree display and annotation feature of the xrate tool. Through an easy-to-use web interface, those who are not familiar with compiling their own software in a Unix environment, but are comfortable using a web browser, will be able to use xrate's tree tools. It is hoped that such an interface will be portable enough to be used with other similar tools, allowing easy deployment of any command-line-based bioinformatics utility. It uses Perl scripts to generate either .png or .svg images of trees to be displayed on a webpage. These trees are generated either from raw .tre files (Newick format) or from special .stk files, which are processed by xrate to generate trees.

Student: James Leung

Mentor(s): Suzanna Lewis

Project homepage: http://code.google.com/p/xrateparser/

Source code: http://code.google.com/p/xrateparser/source

Phylogenetic & haplotype displays for GBrowse

This is an application for the Phylogenetic & haplotype displays for GBrowse project idea.

Abstract: This is a project to extend the visualization for the Genome Browser by dividing the single large image into multiple tracks. Phylogenetic information will also be represented as a new data track as part of this extension.

Student: Hisanaga Mark Okada

Mentor(s): Lincoln Stein

Project homepage: Phylogenetic and Haplotype Displays for GBrowse

Source code: phylo_align.pm in GBrowse and Bio::Graphics code within GMOD on SourceForge

Estimation of divergence time priors from fossil occurrence data

This application proposes a new project idea.

Abstract: I will use C++ to develop a software tool that analyzes fossil occurrence data from the Paleobiology Database to calculate informative priors on divergence times. These priors will be implemented in the open source Bayesian molecular divergence time package BEAST.

Student: Michael Nowak

Mentor(s): Derrick Zwickl

Project homepage: Estimation of divergence time priors from fossil occurrence data

Project blog: PhyloSoC: Divergence Time Priors from Fossil Occurrence Data

Source code: MichaelDennis_Nowak.tar.gz

Visualizing Phylogeographic Information

This application proposes a new project idea.

Abstract: This project will develop a web based application that generates geographic maps of DNA haplotype data that are often used in phylogeographic analysis. The application will create maps of pie charts viewable through Google Earth that show the spatial distributions of haplotypes, the frequency of each haplotype within each population, and the number of samples per population.

Student: Yi-Hsin Erica Tsai

Mentor(s): David Kidd

Project homepage: Visualizing Phylogeographic Information

Project blog: http://phylogeoviz.blogspot.com/

Source code: http://code.google.com/p/phylogeoviz/

Biodiversity conservation algorithms and GUI

This application proposes a new project idea.

Abstract: I will implement various algorithms that utilise phylogenetic information to prioritise species for biodiversity conservation. A GUI will also be developed allowing these algorithms to be utilised by conservation managers. The overall goal is to provide a package that brings together as many existing approaches as possible and provides an interface between mathematical results and their intended final audience.

Student: Klaas Hartmann

Mentor(s): Tobias Thierer (primary), Rutger Vos (secondary)

Project Blog: http://phylosoc2007khartmann.blogspot.com/

Project homepage: Biodiversity Conservation Algorithms and GUI

Source code: http://code.google.com/p/phylosoc2007khartmann/

Ideas

Note: if there is more than one mentor for a project, the primary mentor is in bold font. Biographical and other information on the mentors is linked to in the Mentors section.

The student selection is now final and Google has published our accepted projects, the students, and their mentors.

Enable multi-language bindings to the C++ NEXUS Class Library

Rationale 
The hackathon revealed that consistent behavior from NEXUS parsers is needed in all of the Bio* toolkits as well as many of the primary analysis tools in phylogenetics. Rather than reimplement NEXUS parsing in each language, we could develop multiple language bindings to a common parsing library. NCL (the Nexus Class Library) is a flexible C++ library for supporting the parsing of NEXUS. Core tasks (such as tokenizing and reporting errors) are handled well, though not all public blocks have been handled. NCL is already used in several C/C++ programs (e.g. Brownie, TreeView) and is soon to be used by MrBayes4 and GARLI for NEXUS importing. The CIPRES NEXUS parser is also derived from a version of NCL.
Reusing this library could improve code-reuse in phyloinformatics and lead to more reliable behavior for tools that use NEXUS files in a pipeline.
Approach 
SWIG is a system for writing C/C++ extensions to multiple scripting languages. By exposing the high-level interface of NCL to scripting languages, NCL can perform the parsing and allow the scripting languages to simply extract the data they need. (The Inline architecture might also be used for easy Perl bindings.)
Challenges 
NCL is written in portable C++ (using automake and autoconf for all non-Windows platforms, and including a hand-maintained build system for Windows builds), but distributing code that depends on C/C++ libraries is always more difficult than distributing pure-{python, perl, ruby, java} modules. Assuring a smooth platform-independent installation procedure will require work and lots of testing.
Involved toolkits or projects 
BioPerl, Biopython, Bioruby, SWIG, NCL
Mentors 
Mark Holder, Mike Muratet

Evolve Unix phyloinformatics tools into Ajax applications

Rationale 
Many powerful new tools for phylogenetic stochastic grammar analysis of multiple alignments, such as xrate or PHAST, as well as other phyloinformatic tools like PAML, PHYLIP, GeneTree, HyPhy etc., are available only from the Unix command line. These tools need to become operable over the web, especially via Ajax platforms such as the new Google Maps-like interface to GBrowse.
Approach 
Initial approach outlined at biowiki.org/AjaxPhyloinformatics -- General idea: use toolkits such as dojo to build asynchronous JavaScript wrappers for Unix tools (probabilistic modeling & phylogeny tools, format conversion utilities, sequence analysis & alignment software, genome annotation pipelines, grids & job queues, realtime parallelizable systems); other JavaScript/web components (alignment viewers, tree viewers & navigators, genome browsers); and bioinformatics "mashups". Interface with gmod-ajax, Amigo and other web-based bioinformatics platforms.
Challenges 
Adapting command-line tools for web use; creating an asynchronous user interface; developing infrastructure for mashable bioinformatics -- see http://biowiki.org/AjaxPhyloinformatics and http://biowiki.org/SummerOfCode for more ideas/details
Involved toolkits or projects 
BioPerl/Biopython/Bioruby; SWIG; dojo; Sun Grid Engine; Erlang
Mentors 
Ian Holmes, Mitch Skinner, Chris Mungall, Jason Stajich, Lincoln Stein

Extending the "make" paradigm for bioinformatics annotation pipelines

Rationale 
Annotating a genome, or performing other large-scale bioinformatics analyses, typically involves a series of operations with sequential dependencies but also strong parallelism. The "make" program is one robust approach that is often used to build such analysis pipelines, but suffers serious drawbacks for bioinformatics (e.g. no built-in database access; extremely limited pattern-matching; language is not extensible; dependencies are triggered only by file timestamps and not e.g. MD5 hash indicating file contents have changed).
Approach 
The project will involve building a replacement for or upgrade to "make". One possible approach will be to use a declarative language with (i) strong support for distributed processing, (ii) easy-to-use Unix "hooks" (cf. make), (iii) database and filesystem access. Examples of candidate languages include Erlang and Termite Scheme. Alternatively, C-inclined students may start with an existing parallel "make" clone, such as qmake or distmake -- see http://biowiki.org/SummerOfCode for more ideas/details and http://biowiki.org/MakefileLabMeeting for lots of relevant info (lists of existing make clones/alternatives, etc.)
Challenges 
The first challenge is to get something that is as convenient to use as "make" for migrating throwaway command-lines and analysis scripts into robust pipeline stages. Subsequent challenges will include database access, flexible pattern-matching and enhanced dependency triggers.
Involved toolkits or projects 
Erlang, Termite Scheme, distmake/qmake, or other.
Mentors 
Ian Holmes, Chris Mungall
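
As an illustration of the "enhanced dependency triggers" mentioned above, the following is a minimal Python sketch of a rebuild check driven by file contents (MD5) rather than by timestamps. It is not part of any existing make clone; the function names and the JSON state file are assumptions made purely for illustration.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _load_state(state_file):
    """Read the (hypothetical) JSON state file, or return an empty record."""
    state_path = Path(state_file)
    return json.loads(state_path.read_text()) if state_path.exists() else {}


def needs_rebuild(target, dependencies, state_file):
    """Report whether `target` is stale, judged by dependency contents.

    A target is stale if it does not exist, or if the MD5 digests of its
    dependencies differ from those recorded at the last successful build.
    File timestamps are ignored entirely.
    """
    recorded = _load_state(state_file).get(str(target), {})
    current = {str(dep): md5_of(dep) for dep in dependencies}
    return not Path(target).exists() or current != recorded


def record_build(target, dependencies, state_file):
    """Store the current dependency digests after building `target`."""
    state = _load_state(state_file)
    state[str(target)] = {str(dep): md5_of(dep) for dep in dependencies}
    Path(state_file).write_text(json.dumps(state, indent=2))
```

A pipeline stage would call needs_rebuild before running its command and record_build afterwards. Touching a dependency without changing its contents then no longer triggers a rebuild, while a content change does.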

AJAX widgets for phylo-informatics

Rationale 
The AJAX model offers exciting new opportunities for working with phylo-informatics data. Ultimately we aim to provide biological curators with a tool for functionally annotating as many proteins and protein families as possible, based on homology to proteins where the function has been experimentally determined. The desired tool will be useful to anyone who is doing functional annotation based on shared evolutionary history. It will be of particular use in comparing human disease genes to the homologous genes in other organisms.
Approach 
To build javascript widgets (visualization components with active hooks for clicking on various parts of the data structure) for structures such as phylogenetic trees and multiple sequence alignments.
Challenges 
The main challenge will be creating components that are responsively interactive, e.g. to change the order of nodes in a tree (or explode/implode nodes), and to design a clean interface to other js components.
Involved toolkits or projects 
dojo
Mentors 
Suzi Lewis, Chris Mungall

Phylogenetic & haplotype displays for GBrowse

Rationale 
The Generic Genome Browser is an open source browser for genome annotations. The power of the tool to visualize evolutionary changes among multiple genomes would be greatly enhanced by placing interactive phylogenetic trees directly into GBrowse tracks.
Approach 
To use AJAX to embed JavaScript widgets for phylogenetic trees, genetic haplotypes, and multiple sequence alignments into GBrowse tracks.
Challenges 
The main challenge will be creating a sufficiently generic framework to handle multiple species and data types.
Involved toolkits or projects 
GBrowse, BioPerl
Mentors 
Lincoln Stein (Cold Spring Harbor Laboratory)

Storing Phylogenetic Trees in Chado

Rationale 
Chado is a generic relational schema used for many model organism databases. It needs to be extended to represent arbitrary graphs, including phylogenetic trees, reaction networks, and interaction graphs.
Approach 
Develop a UML model to describe various types of graphs, and then implement it in the PostgreSQL data definition language, along with any required views and stored procedures.
Challenges
Having sufficient generality to represent multiple types of biological graphs, but having sufficient efficiency to run typical queries at interactive speeds.
Mentors 
Lincoln Stein (Cold Spring Harbor Laboratory)

Create a usable phyloinformatics API for BioJava

Goals 
Start with a simple object model that can represent phyloinformatics objects and concepts and provide basic I/O with common formats. Build on the experimental code that is currently in org.biojavax.bio.phylo packages.
Challenges 
A workable, and elegant data model with a flexible I/O that is consistent with other biojava patterns. Good documentation and Unit tests need to be built in right from the start. Concurrent example code and tutorials will be needed to maximise adoption of the API. Target JDK = 1.5.
Involved toolkits or projects 
biojava
Mentors 
Richard Holland (primary), Tobias Thierer, Mark Schreiber
Student 
Boh-Yun Lee
Project documentation and wiki
http://biojava.org/wiki/BioJava:PhyloSOC07
Source code
Will be stored in the BioJava CVS under biojava-live, package org.biojavax.phylo.

Topological query application for BioSQL

Rationale 
The phyloinformatics hackathon created a Phylogenetic Tree extension model for BioSQL and formulated a series of topological queries on top of the model. This is the foundation of a "MyTreebase" that researchers can use to populate, manipulate, and query their own phylogenetic trees. An end-user application with a command-line UI and/or interactive web-based UI is still missing, though.
Approach 
An end-user application, whether command-line or web-based, should probably accomplish at least the following functions: importing trees, precomputing optimization values (nested set and transitive closure), querying trees, modifying trees (delete branch, add branch, move branch), and exporting trees.
Challenges 
A clean object-relational mapping has not yet been done for either BioPerl or Biojava. Depending on the skills of the student(s), this could be a separate project, part of the project, or omitted in favor of a simplified (non-generic) mapping.
Involved toolkits or projects 
BioSQL, BioPerl/Bioperl-db or Biojava
Mentors 
Hilmar Lapp, Bill Piel, Mike Muratet
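
To make the two precomputed optimization values named in the approach more concrete, here is a minimal in-memory Python sketch of nested-set numbering and a transitive closure for a rooted tree. It does not use the BioSQL extension's actual tables or column names; the tree representation (a dict mapping each node to its list of children) and all function names are assumptions for illustration only.

```python
def nested_set(children, root):
    """Assign (left, right) numbers to each node by depth-first traversal.

    Node D lies in the subtree of node A exactly when
    left[A] < left[D] and right[D] < right[A].
    """
    left, right = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        counter += 1
        if visited:
            right[node] = counter
        else:
            left[node] = counter
            stack.append((node, True))
            # Push children in reverse so they are visited in listed order.
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return left, right


def is_descendant(left, right, node, ancestor):
    """Subtree membership test using only the nested-set numbers."""
    return left[ancestor] < left[node] and right[node] < right[ancestor]


def transitive_closure(children, root):
    """List every (ancestor, descendant, distance) triple in the tree."""
    paths = []
    stack = [(root, [])]
    while stack:
        node, ancestors = stack.pop()
        for distance, ancestor in enumerate(reversed(ancestors), start=1):
            paths.append((ancestor, node, distance))
        for child in children.get(node, []):
            stack.append((child, ancestors + [node]))
    return paths


# Hypothetical example tree: root -> (A -> (x, y), B)
tree = {"root": ["A", "B"], "A": ["x", "y"]}
left, right = nested_set(tree, "root")
closure = transitive_closure(tree, "root")
```

Both structures trade extra storage and recomputation after edits for fast queries: subtree membership becomes a pair of integer comparisons, and ancestor lookups become simple row selections, which is why modifying trees (delete, add, or move branch) would require updating the precomputed values.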

Phylogenetic XML <--> Object serialization

Rationale
Phylogenetic data (i.e. evolutionary trees and the data they are based on) is commonly stored in the NEXUS flat text file format. Unfortunately, this format is fairly difficult to parse due to incompatible "dialects" that have evolved over time. Also, a formal means to validate all flavors of NEXUS does not exist. Several XML formats to replace NEXUS have been proposed, none of which, for practical and historical reasons, have been widely adopted. One reason why none of these efforts have been successful is that they never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. We seek a student who will remedy this and design an XML format that:
  1. can express (at least the biological data part of) the NEXUS specification,
  2. can be validated using DTD or XML schema,
  3. is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences),
  4. is optimized for minimal memory requirements and to limit file size for large character matrices or trees (e.g., no deep nestings to obviate the need for stream-based parsers to hold large chunks in memory),
  5. uses uids and id references to link annotations to entities, nodes to taxa, etc.,
  6. is efficiently serialized into objects of one of the commonly used toolkits (see below).
See also the announcement on PerlMonks.
Approach 
There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these, which can serve as a basis to design a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.
Challenges 
Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition - and an optimized parser implementation that is nevertheless robust.
Involved toolkits or projects 
PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
Mentors 
Rutger Vos, Hilmar Lapp, Bill Piel, Mike Muratet
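
Requirements 4 and 5 above (shallow nesting, and ids with id references) can be illustrated with a small Python sketch. The element and attribute names below are invented for this example and do not represent PhyloXML or any format produced by this project; the point is only that a flat, id-linked layout lets a stream-based parser handle one small element at a time.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical flat format: taxa, nodes, and edges are siblings, and
# relationships are expressed through id references rather than nesting.
EXAMPLE = """<phylo>
  <taxon id="t1" label="Homo sapiens"/>
  <taxon id="t2" label="Pan troglodytes"/>
  <node id="n1"/>
  <node id="n2" taxon="t1"/>
  <node id="n3" taxon="t2"/>
  <edge source="n1" target="n2" length="0.1"/>
  <edge source="n1" target="n3" length="0.2"/>
</phylo>"""


def parse_flat_tree(stream):
    """Stream over the document, collecting taxa, nodes, and edges."""
    taxa, node_taxon, edges = {}, {}, []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "taxon":
            taxa[elem.get("id")] = elem.get("label")
        elif elem.tag == "node":
            # Internal nodes carry no taxon reference, so this may be None.
            node_taxon[elem.get("id")] = elem.get("taxon")
        elif elem.tag == "edge":
            edges.append(
                (elem.get("source"), elem.get("target"), float(elem.get("length")))
            )
        # Drop the element's attributes and text once it has been processed.
        elem.clear()
    return taxa, node_taxon, edges


def leaf_labels(taxa, node_taxon):
    """Resolve node -> taxon id references into node -> taxon label."""
    return {
        node: taxa[taxon_id]
        for node, taxon_id in node_taxon.items()
        if taxon_id is not None
    }


taxa, node_taxon, edges = parse_flat_tree(io.BytesIO(EXAMPLE.encode("utf-8")))
labels = leaf_labels(taxa, node_taxon)
```

In a deeply nested tree encoding, a streaming parser would have to keep an entire clade open until its closing tag arrives. With the flat layout, each element is complete as soon as it is read, at the cost of a separate pass to resolve the id references.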

Mentors

What should prospective students know?

Note that the final acceptance of students has been made by Google. The information below is kept mostly for historical reasons.

  • Student application is now open and runs through Tuesday, March 27, 2007, 9am PDT (12pm EDT). Google has instructions for how students apply on-line, and there is an FAQ answer on what applications should look like. You may apply for more than one project; simply submit multiple applications.
  • When applying, (aside from the information requested by Google) provide
    1. your interests, what makes you excited
    2. why you are interested in the project, and what you anticipate to gain from it
    3. a summary of your programming skills
    4. programs or projects you have previously authored or contributed to, in particular those available as open-source
    5. a project plan (i.e., what you expect to be doing when over the course of the project) for any project proposal different or modified from the ideas above; even if you propose to work directly on one of the ideas above, presenting a project plan will strengthen your application, as it will show that you have thought about how one might want to go about the work.
  • Please send any questions you have, or projects you would like to propose, to phylosoc@nescent.org. This will reach all mentors and our PhyloSoC administrators. We recommend you do this even if you want to work directly on one of our project ideas above; it gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics. Also, you can bounce your project plan off of us.
  • You can visit our application document with Google's questions and our answers. The part that isn't on this page or linked from here already is mostly in the last couple of questions on mentors and students.
  • For eligibility, see the GSoC eligibility requirements for students. These requirements and the age restriction of 18 years or older must be met on April 9, 2007. High school students meeting the age requirement are also eligible.
  • There is also a Google group for posting GSoC questions (and receiving answers; note that you will need to sign up for the group) that relate to the program itself (rather than to our organization).
  • Students will receive a stipend from Google.

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the latter, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is a container architecture for Character Data And Trees. CDAT is in the early stages of its development. The perceived end result is an architecture that keeps track of Bio::* data objects in ways that are applicable to the task at hand. For example, a CDAT subclass could manage the relationships between a tree and multiple data columns for input in comparative analyses. Bio::Phylo is an API for phylogenetic data used by the CIPRES project. Bio::Phylo is intended to be compatible with BioPerl and CDAT, while functioning as a petri dish for new object designs.

Google Summer of Code