Phyloinformatics Summer of Code 2008

We have been accepted into the Google Summer of Code (GSoC) program again this year. It is our second time in the program. Please see Phyloinformatics Summer of Code 2007 for information on our participation last year, and we also wrote up the results of the projects for the NESCent newsletter.

On this page we are collecting ideas, possible projects, prerequisites, possible solution approaches, mentors, other people or channels to contact for more information or to bounce ideas off of, etc.


Our organization administrators are Hilmar Lapp and Todd J. Vision.

If you are a student interested in applying for a Google Summer of Code project in our organization, please send any questions you have, projects you would like to propose, etc. to our mailing list. This will reach all mentors and our Summer of Code organization administrators. The (slightly larger) community of previous mentors, students, hackathon participants, and other phyloinformatics software developers can be reached on a separate list.

We aim to hang out regularly on IRC, at least on weekdays during working hours (EDT), in #phylosoc on Freenode. You're welcome to join us at any time, but be prepared that outside of those hours we may not be online. Email will typically work well in either case. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

For applying, please make sure you read our documentation on what students should know before you apply. We don't have a format template for application that you need to adhere to, but we do ask you to include specific kinds of information, and what those are is documented under "When you apply."


We are participating in the Google Summer of Code as a mentoring organization to introduce future researchers in comparative biology to open-source and open, collaborative development of reusable, interoperable, and standards-supporting software that enables new and increasingly complex scientific questions to be addressed. We are particularly interested in training future researchers not only to be aware of and understand the value of open-source and collaboratively developed software, but also to gain the programming and remote collaboration skills needed to successfully contribute to such projects.

We believe that the area of phyloinformatics is particularly well suited as a focus for our participation and to give our projects a coherent bigger picture:

  1. The code that students will write will facilitate new and increasingly complex questions to be asked in comparative biology, one of the central disciplines in understanding the evolution of life. We have already collected use-cases from the research community, and our mentors are active and leading among efforts that are trying to address cyberinfrastructure challenges in phyloinformatics. Work on projects along the lines of our suggested project ideas is bound to make an impact.
  2. The range of problems that students can make meaningful contributions to is diverse, enabling us to accommodate different areas of interest and skills. The GSoC project ideas we have generated cover a variety of programming languages, tasks, and skill levels, yet are all directed towards the same overall goals.
  3. We view our GSoC participation as an opportunity to gain future contributors to reusable open-source software components in phyloinformatics. Once accepted, we will advertise this program through appropriate channels to reach undergrad and grad students interested in computational comparative biology.

Accepted Projects

Note that the accepted projects, students, and their mentors are also published by Google, except for the externally funded project Matrix display of phenotype annotations using ontologies in Phenote.

Tree and data plotting in the phylobase project

This is an application for the Phylogeny plotting for the R package phylobase idea.

Abstract: This project will implement a new plotting function for visualizing the phylo4 and phylo4d data objects. A key feature of the functions is the use of the grid graphics interface. This interface will allow more flexible plot resizing and the pushing of arbitrary plots to tree tips.

Student: Peter Cowan

Mentor(s): Ben Bolker (primary), Steven Kembel

Project Homepage: PhyloSoC:Tree and data plotting in the phylobase project

Source Code:

Demo page: PhyloSoC:Tree and data plotting demonstration

Enhancing the representation of ecophylogenetic tools in R

This is a new project idea that proposes to build on and enhance the Picante R-package.

Abstract: Methods will be coded to ask questions such as: are closely related species found within the same community; are closely related species found in the same environments; and do closely related species interact with the same species. The product of my Summer of Code experience will be a comprehensive R package of phylogenetic comparative methods that will facilitate the use of phylogenetics in ecological research.

Student: Matthew Helmus

Mentor(s): Steven Kembel (primary), Ben Bolker

Project Homepage: Enhancing the representation of ecophylogenetic tools in R

Source Code:

Demo page: description of implemented functions

Graphviz navigation of databases

This is an application for the WebDot navigation of databases idea.

Abstract: A simple-to-integrate PHP framework which queries a database, maps database entry relationships with Graphviz, presents a clickable image map to the end-user, and then restarts the process centered around a user-selected node. Additional features will include "clickable edges" and user customization of graphing parameters and queries.

Student: Paul McMillan

Mentor(s): Brian O'Meara

Project Blog: Project related entries on my personal blog

Project Homepage: DBGraphNav on Google Code

Source Code: DBGraphNav on Google Code

Demo page: It has been implemented on the people and references pages of TreeTapper, for example, here

An Extension of Mesquite based on PDSIMUL

Abstract: PDAP implements models for continuous character evolution currently unavailable in Mesquite. PDAP is written for DOS and may be inconvenient for users. I will integrate PDAP's character simulations into Mesquite, and extend the Mesquite UI, the file format of saved projects, and the data structure Mesquite uses for character simulation.

Student: Matthew Ackerman

Mentor(s): Peter Midford (primary), Wayne Maddison

Project Homepage: An Extension of Mesquite based on PDSIMUL

Source Code: PDsimul

Demo page: Tentatively on the project home page

SAX based phyloXML support in BioPerl

This is an application for the phyloXML support in BioPerl idea.

Abstract: PhyloXML is an XML document model for phylogenetic data that incorporates various annotation types, including user-customized data. The format is currently not supported by BioPerl. I propose a SAX-based data structure and interface for PhyloXML support in BioPerl. I will use most of the existing IO structures such as TreeIO and TreeEventBuilder and subclass them to extend the functions specific to PhyloXML. The objects will be connected to various existing BioPerl modules, such as SeqI, TaxonI, and AnnotationI, by reference in order to accommodate different phyloXML elements.

Student: Mira Han

Mentor(s): Chris Fields (primary), Jason Stajich, Rutger Vos, Christian Zmasek

Project Homepage: PhyloXML support in BioPerl

Source Code: Bioperl SVN repository

Demo page: phyloXML project Demo

Matrix display of phenotype annotations using ontologies in Phenote

This is an application for the Matrix display of phenotype annotations using ontologies in Phenote idea. (This project is supported by the Phenoscape project, not Google.)

Abstract: The matrix view will show character values grouped by taxon and character. It will be populated by consolidating phenotype annotations into characters by finding values which descend from the same attribute within the PATO phenotype ontology. Data export compatibility, flexible sorting and grouping options, and user-definable characters are also planned.

Student: Michael Wallinga

Mentor(s): Jim Balhoff (primary), Mark Gibson, Nicole Washington

Project Homepage: Matrix display of phenotype annotations using ontologies in Phenote

Source Code:

Demo page: Located on the project homepage


Note that the acceptance decisions for projects and students have meanwhile been made and have been published by Google.

Note: if there is more than one mentor for a project, the primary mentor is in bold font. Biographical and other information on the mentors is linked to in the Mentors section.

Students: The below are only our project ideas, albeit well thought-out ones. You are welcome to propose your own project if none of those below catches your interest, or your idea is more exciting to you, provided it still falls into the area of phyloinformatics. Just be aware that we can't guarantee finding an appropriate mentor, but if we like your proposal we will try, or better yet, propose a mentor yourself. Regardless of what you decide to do, make sure you read (and follow) the guidelines for students below.

Phylogeny plotting for the R package phylobase

Methods for visualizing phylogenetic trees and associated data have not kept pace with growth in tree or dataset size. At a recent NESCent hackathon a new R package for comparative methods, phylobase, was designed. The phylobase package uses modern R classes and methods to store and manipulate phylogenetic trees. Currently plotting of phylogenetic trees in phylobase relies on the R base graphics, and functions in other packages. These implementations are not well suited to displaying either large phylogenetic trees or trees with large amounts of associated data.
The R programming language has two primary plot device interfaces, the base graphics interface and the newer, more extensible grid system. The grid system will allow for consistent scaling and resizing of trees and data. Current plot methods in phylobase are based on base graphics and suffer from resizing and layout difficulties. This project will develop new plot methods based on the grid system.
The primary challenge will be writing algorithms for efficiently converting tree structures to the grid language. Examples of similar algorithms exist for plotting using the old base graphics interface. This project will likely require programming skills not only in R, but also in C/C++.
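The conversion step described above can be illustrated independently of R. The following is a minimal Python sketch, not phylobase code: it assumes a hypothetical edge-list input of `(parent, child, branch_length)` tuples and computes coordinates for a rectangular phylogram, then rescales them to the unit square, similar in spirit to the viewport-relative units a resizable plotting system works with. The function names are illustrative only.

```python
from collections import defaultdict


def layout_tree(edges, root):
    """Return {node: (x, y)} for a rectangular phylogram.

    edges: list of (parent, child, branch_length) tuples (assumed input format).
    x is the cumulative branch length from the root. Tips receive
    y = 0, 1, 2, ... in traversal order, and each internal node is placed
    midway between its first and last child.
    """
    children = defaultdict(list)
    length = {root: 0.0}
    for parent, child, branch_length in edges:
        children[parent].append(child)
        length[child] = branch_length

    coords = {}
    next_tip_y = 0

    def visit(node, x_parent):
        nonlocal next_tip_y
        x = x_parent + length[node]
        kids = children.get(node, [])
        if not kids:
            y = float(next_tip_y)
            next_tip_y += 1
        else:
            ys = [visit(kid, x) for kid in kids]
            y = (ys[0] + ys[-1]) / 2.0
        coords[node] = (x, y)
        return y

    visit(root, 0.0)
    return coords


def to_unit_square(coords):
    """Rescale all coordinates to the range [0, 1] on both axes."""
    max_x = max(x for x, _ in coords.values()) or 1.0
    max_y = max(y for _, y in coords.values()) or 1.0
    return {node: (x / max_x, y / max_y) for node, (x, y) in coords.items()}


example_edges = [("root", "A", 1.0), ("root", "n1", 0.5),
                 ("n1", "B", 0.5), ("n1", "C", 1.5)]
example_coords = to_unit_square(layout_tree(example_edges, "root"))
```

The recursive traversal keeps the sketch short but would hit Python's recursion limit on very deep trees; an explicit stack would be needed for the large trees this project targets.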
Involved toolkits or projects 
R, phylobase, grid
Ben Bolker, Steven Kembel

Optimizing phylogenetic data representation in R

One result of the recent NESCent Hackathon on Comparative Methods in R has been the development of the phylobase package, which seeks to provide a set of S4 classes and methods for representing and manipulating phylogenetic trees and associated data in R. Phylobase contains structures for representing phylogenetic trees and associated data, but methods for tree manipulation, representation of multiple trees and metadata, and interfaces with other data formats (e.g., NEXUS, NeXML) remain incomplete or have not been optimized for use with the large, multi-tree datasets that are increasingly common in bioinformatics and comparative biology. While the R language is extremely powerful and provides a rich feature set, it is inefficient at handling very large objects and heavy computational lifting (such as recursion and for-loops).
The methods for tree/data manipulation and import in phylobase are currently a mixture of S3 and S4 methods and C/C++ extensions. The goal for this project will be to implement efficient algorithms for tree and data representation and manipulation using object-oriented S4 classes and methods, and C/C++ extensions where necessary for performance. We suggest focusing on methods such as tree pruning, subsetting, and manipulation of multiple tree objects that are currently incomplete and will have the greatest impact on the ability to work with very large trees and datasets.
It would also be useful to improve interfaces with other data formats such as NEXUS and NeXML, which will be the likely sources for import of trees, data, and metadata.
The general challenge for this project will be to identify and implement optimized data structures for trees, multi-trees, associated data, and metadata, and methods (e.g., pruning, subsetting of trees and data). Identifying and evaluating the critical bottlenecks will require profiling and testing of code. This project will likely require programming skills not only in R, but also in C/C++.
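As an illustration of what one of the named operations involves, here is a minimal Python sketch of tip pruning on an edge list. It is not how phylobase implements pruning; the input format `(parent, child, branch_length)` and the function name `prune_tips` are assumptions made for this example. Dropped tips are removed, internal nodes left without descendants disappear, and internal nodes left with a single child are collapsed, with branch lengths summed.

```python
from collections import defaultdict


def prune_tips(edges, root, drop):
    """Return a new edge list with the tips in `drop` removed.

    edges: list of (parent, child, branch_length) tuples (assumed format).
    drop:  set of tip labels to remove.
    """
    children = defaultdict(list)
    for parent, child, branch_length in edges:
        children[parent].append((child, branch_length))

    kept = []

    def visit(node):
        """Return (representative_node, extra_length) or None if empty."""
        kids = children.get(node, [])
        if not kids:
            return None if node in drop else (node, 0.0)
        survivors = []
        for child, branch_length in kids:
            result = visit(child)
            if result is not None:
                representative, extra = result
                survivors.append((representative, branch_length + extra))
        if not survivors:
            return None
        if len(survivors) == 1 and node != root:
            # Collapse this node: its only remaining child takes its place.
            return survivors[0]
        for representative, branch_length in survivors:
            kept.append((node, representative, branch_length))
        return (node, 0.0)

    visit(root)
    return kept


example_edges = [("root", "A", 1.0), ("root", "n1", 0.5),
                 ("n1", "B", 0.5), ("n1", "C", 1.5)]
pruned = prune_tips(example_edges, "root", {"B"})
```

In the example, removing tip `B` leaves `n1` with a single child, so `C` is attached directly to the root with the two branch lengths added together.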
Involved toolkits or projects
R, phylobase, Nexus Class Library, NeXML
Steven Kembel, Ben Bolker

Phyloinformatics Web-Services API

There is currently no common or even emerging web-service API for phylogenetic data (phylogenies, character data, transition models, node and tree metadata) or operations (comparing trees, inferring trees, inferring character evolution, supertree calculation, divergence time estimation, etc.). As a result, almost all phyloinformatics data and analytical service providers, even major ones such as TreeBASE, lack interoperability and largely also programmability. This presents a major obstacle to building distributed applications and to automated data and service discovery in phyloinformatics. Members of NESCent's Phyloinformatics and EvoInformatics groups started to remedy this situation at the BioHackathon 2008 by developing use-cases, requirements, and an initial specification called PhyloWS. This effort has progressed far enough to implement a prototype that will demonstrate and at the same time validate the API.
Because the number of potentially useful analyses and queries on phylogenetic trees is enormous, properly scoping the project is one of the most important steps. We suggest focusing on the Phylogenetic Tree Database scope, for which the needed technologies, resources, and data are easily obtained, and which would also be an important building block for a generically deployable phylogenetic tree database. A draft specification for a RESTful phylogenetic tree provider API exists already. The implementation would include a server-side library and web-service endpoints, as well as a client-support library. The latter, while ideally relatively generic in the chosen programming language, should probably focus on an existing client platform, for example Perl modules integrated into Bio::Tree and Bio::DB::Tree with a demonstration script, or Java-client support using the Java Processing applet resulting from the PhyloWidget project from last year's Summer of Code.
Although a specification for the RESTful API exists, it is a draft and hasn't been implemented yet. Hence, following the specification may raise different kinds of issues that weren't initially anticipated, up to having to revise the specification.
Though scripts that implement most of the querying functionality were developed last year by Jamie Estill, they have not been tested or run in a server-environment, and they need at a minimum updating to be compliant with the current BioSQL/PhyloDB model.
If Java is chosen as the implementation language with an auto-generated persistence tier (see below), some of the topological queries may require manual optimization to achieve sufficient performance.
Involved toolkits or projects 
BioSQL with its PhyloDB extension can be used for the database. Implementation of the data queries could be built on top of the Phy* scripts written by Jamie Estill as one of our 2007 Summer of Code projects; better yet, this code could be integrated into Bioperl-db, which is the BioSQL language-binding for BioPerl; this would open up the full functionality of the Bio::Tree classes within BioPerl, and by extension also that of the Bio::Phylo library.
Alternatively, using a Java-based approach much of the persistence and middleware tiers could be auto-generated (e.g., using NetBeans), taking advantage of JPA and Java5-style annotations for declarative persistence and web-service publishing.
Trees should probably be transmitted in an XML format, for which either NeXML or PhyloXML could be used. NeXML has received a lot of support lately, and one of the mentors is also the developer of NeXML.
Hilmar Lapp, Rutger Vos, William Piel

WebDot navigation of databases

GraphViz and utilities built on it are very useful for visualizing database structures (via sqlt-diagram); WebDot allows general networks to be navigated using a web interface. Many databases, including that of TreeTapper, could be helped by such an interface. For example, TreeTapper contains tables for questions, methods, software, and authors: each question may have zero or more methods to answer it, each method may be implemented in zero or more software packages, and the authors may have written methods and/or software and have networks of coauthors. Visualizing these relationships may help identify patterns; using them to navigate the website will allow users to zoom in to gather more information (on what platforms a program runs, what data a method requires). Just as sqlt-diagram allows easy creation of visual database overviews, there is a need for a framework to easily create navigation images for pages generated automatically from entries in tables in relational databases. Mapping of knowledge domains has become increasingly useful (see, for example, mapofscience), and an easy way to do this would help many website developers and users.
Create a PHP script containing functions to take information from a relational database, format it for WebDot, and create code that can be inserted into a web page to create a navigation graphic.
Making the script flexible enough to work across a variety of databases.
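The idea proposes PHP; purely to illustrate the data flow, the following Python sketch reads relationships from a relational table and emits Graphviz DOT text in which every node carries a `URL` attribute, which Graphviz uses when generating image maps. The table name, column names, row values, and URL pattern are all hypothetical and are not taken from TreeTapper.

```python
import sqlite3
from urllib.parse import quote as url_quote


def dot_quote(text):
    """Quote a label for use as a DOT identifier or attribute value."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rows_to_dot(center, links, url_for):
    """Build a DOT digraph from (source, target) pairs.

    center:  label of the node the page is currently centered on.
    links:   list of (source, target) label pairs from the database.
    url_for: callable mapping a node label to the page it should link to.
    """
    nodes = sorted({label for pair in links for label in pair} | {center})
    lines = ["digraph nav {"]
    for label in nodes:
        attrs = ["URL=" + dot_quote(url_for(label))]
        if label == center:
            attrs.append("style=filled")  # highlight the current node
        lines.append(f"  {dot_quote(label)} [{', '.join(attrs)}];")
    for source, target in links:
        lines.append(f"  {dot_quote(source)} -> {dot_quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE method_software (method TEXT, software TEXT);
    INSERT INTO method_software VALUES ('method A', 'software X');
    INSERT INTO method_software VALUES ('method A', 'software Y');
""")
rows = conn.execute("SELECT method, software FROM method_software").fetchall()
dot_text = rows_to_dot("method A", rows,
                       lambda label: "/browse?node=" + url_quote(label))
```

The resulting text could then be handed to Graphviz to render an image together with a matching image map, after which selecting a node would rerun the query centered on that node.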
Involved toolkits or projects 
TreeTapper, WebDot, GraphViz
Brian O'Meara

Implement PDSIMUL for Mesquite

The Mesquite tool for comparative phylogenetic data analysis and visualization currently has relatively limited facilities for simulation of continuous-valued evolutionary character traits. This project would implement the functionality of the PDSIMUL application from Theodore Garland's PDAP package (which is for MS-DOS). Although PDSIMUL is relatively old, it does implement a number of common types of continuous-valued evolutionary processes, including Ornstein-Uhlenbeck and various types of bounded Brownian motion. The original PDSIMUL implementation is limited to 2 variables, which can be modeled with a user-specified correlation. This project should extend this to a generalized CV matrix, in keeping with the Mesquite architecture paradigm.
Implement the PDSIMUL modeling functionality as a Mesquite package, consisting of a set of Mesquite modules, written in the Java language. Hopefully, students proposing to work on this or other Mesquite packages would know Java and ideally either Eclipse or another IDE, since most of the people involved in developing the Mesquite system use Eclipse. The main reason for desiring previous Java knowledge is that writing a Mesquite package requires learning portions of the Mesquite API, and trying to learn both Java and Mesquite's API at the same time would pose an extra challenge. A possible alternative would be for a student to learn Java in the process of coding a standalone Java version of PDSIMUL, which could then be converted into a Mesquite package in the future. If a student would like to consider this alternative, they should include a plan for learning Java as part of their proposal. Of course, any previous programming experience outside of Java, particularly in object-oriented languages, should considerably shorten this learning process and any such experience should be mentioned in a proposal.
Extending the simulations to more than two dimensions.
Developing a user interface that is Mesquite-compatible from the previous UI that mostly revolved around a single dialog screen.
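To make the extension beyond two variables concrete, here is a minimal sketch of unbounded Brownian motion for any number of correlated traits along a tree. It is written in Python rather than Java for brevity, it is not PDSIMUL's or Mesquite's code, and it reads "CV matrix" as a variance-covariance matrix, which is an assumption. Along a branch of length t, the change in the trait vector is drawn from a multivariate normal with covariance t times that matrix, using a Cholesky factor to introduce the correlations.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular L with L * L^T == matrix.

    The matrix must be symmetric and positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate_bm(edges, root, root_state, cov, rng):
    """Simulate correlated Brownian motion down a tree.

    edges: list of (parent, child, branch_length), ordered so that every
           parent appears before its children (assumed input format).
    cov:   variance-covariance matrix of the traits per unit branch length.
    Returns {node: [trait values]}.
    """
    lower = cholesky(cov)
    n = len(cov)
    states = {root: list(root_state)}
    for parent, child, branch_length in edges:
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scale = math.sqrt(branch_length)
        step = [scale * sum(lower[i][k] * z[k] for k in range(i + 1))
                for i in range(n)]
        states[child] = [p + s for p, s in zip(states[parent], step)]
    return states


# Three traits with a hypothetical covariance matrix.
example_cov = [[1.0, 0.5, 0.2],
               [0.5, 2.0, 0.3],
               [0.2, 0.3, 1.5]]
example_edges = [("root", "n1", 1.0), ("n1", "A", 0.5),
                 ("n1", "B", 0.5), ("root", "C", 1.5)]
example_states = simulate_bm(example_edges, "root", [0.0, 0.0, 0.0],
                             example_cov, random.Random(42))
```

Bounded Brownian motion and Ornstein-Uhlenbeck processes, which PDSIMUL also covers, would need additional handling that this sketch leaves out.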
Involved toolkits or projects 
Peter Midford, Wayne Maddison

Parametric bootstrapping using CIPRES ReST services

Parametric bootstrapping is a flexible, powerful approach to assessing the statistical significance of a result. It entails generating the null distribution for a test statistic using Monte Carlo simulation. The approach has been advocated in the context of phylogenetics (e.g. the discussion of the SOWH test by Goldman et al., Systematic Biology, 2000) because it is often impossible to derive the null distribution of phylogenetic statistics analytically. Despite its strengths, the technique has been underutilized because it requires mastery of several software tools, for example phylogeny inference packages and tree-dependent data simulators. These tools can be computationally expensive and are not at all user-friendly. With the recently released ReST-based API of the CIPRES portal, and data simulators being planned for deployment to the portal this spring, there is an opportunity to develop a general and reusable (i.e., developer-friendly) library and a user-friendly front-end tool that enable widespread use of parametric bootstrapping for calculating support of branches on a tree.
Develop a reusable and lightweight library and proof-of-concept tool, for example in a chosen Bio* toolkit, that can calculate support for a selected branch on a tree by parametric bootstrapping. Data simulation and tree inference steps can be performed on the CIPRES public portal using the recently released ReST APIs.
Data simulators are planned for deployment only this spring and not yet available on the CIPRES portal (though the tools exist, for instance SeqGen and AASeqGen, and are robust). This project will inevitably be one of the original testers of that service.
The description of a model of character evolution in an application-neutral way is not trivial. On the plus side, the development of a "transition model language" is one of the targets being addressed by the Evolutionary Informatics working group, with which CIPRES collaborates. The specification of the model (an output of the original inference steps, and an input into the data simulators) may, however, change during the life of this project.
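The statistical core of the approach is small and can be sketched without any phylogenetic software. In the Python sketch below, `simulate_null` and `statistic` are placeholders: in the project they would stand for data simulation under the null model and for tree inference plus computation of the test statistic, both of which would be delegated to remote services. The toy stand-ins used here (normally distributed values and their mean) are assumptions chosen only to make the example runnable.

```python
import random


def parametric_bootstrap_p(observed, simulate_null, statistic, replicates, rng):
    """One-sided Monte Carlo p-value for an observed test statistic.

    simulate_null: callable taking an RNG and returning one simulated
                   dataset under the null hypothesis.
    statistic:     callable mapping a dataset to the test statistic.
    Returns the fraction of null statistics at least as large as the
    observed one, with the usual +1 correction so p is never exactly 0.
    """
    at_least_as_extreme = 0
    for _ in range(replicates):
        dataset = simulate_null(rng)
        if statistic(dataset) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (replicates + 1)


# Toy stand-ins for the simulation and inference steps.
def toy_simulate_null(rng):
    return [rng.gauss(0.0, 1.0) for _ in range(20)]


def toy_statistic(dataset):
    return sum(dataset) / len(dataset)


p_value = parametric_bootstrap_p(0.8, toy_simulate_null, toy_statistic,
                                 999, random.Random(2008))
```

Because each replicate is independent, the loop is straightforward to distribute, which is what makes a remote compute service a natural fit for the expensive steps.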
Involved toolkits or projects 
CIPRES, the appropriate Bio* toolkit (depending on choice of implementation language).
Mark Holder

A service for annotating digital objects with ontology terms

Phyloinformatics data are increasingly being typed and semantically "tagged" using ontologies. For example, see the PhyloDB model of BioSQL, and the EvoInformatics General Ontology group. As more documents and datasets are submitted to digital archives, standard "ontology services" for lookup of vocabularies, terms, and hierarchical relationships, as well as structured navigation, become increasingly important. Ontology services may be used by disciplines ranging from bioinformatics to digital library and information sciences. Although ontologies provide better organization and discovery than simple tagging systems, it is currently too difficult for individual users to apply ontology terms to their digital objects. A flexible system for applying ontologies will improve scientific data (and document) repositories as well as general-purpose repositories that currently rely on tagging systems.
The LexGrid framework and BioPortal interface will form the core of this project. The student will create an interface that allows users to submit a text document (e.g., as extracted from the digital object) to LexGrid, browse the relevant ontologies for term matches and the sections they are in, and select terms to be applied as annotation to the digital object.
LexGrid is a relatively large project, and it will take some time to become accustomed to the codebase. A system must be developed for browsing multiple ontologies on a single web page, allowing users to select between similar terms with minimal effort.
Involved toolkits or projects 
The LexGrid framework, using the LexBIG and BioPortal source packages, will form the core of the service. Some features of the Ontology Lookup Service may be useful for creating the browsing interface.
Ryan Scherle, Daniel Rubin

Enhancing the PhyloCode Registry with Topological Query Capabilities

The PhyloCode is a set of rules governing clade-based taxonomy. A prototype database called RegNum has been developed to serve as a registry of PhyloCode names. To aid in automatic resolving of PhyloCode names on locally and remotely stored trees, PhyloCode specifiers need name-string intelligence using name-based web services and PhyloCode definition types need to be translated into topological queries.
Given that RegNum already exists as a fairly stable basic registry database, we suggest that the student take this code-base and enhance it with the following features: name-string intelligence for specifiers, i/o and storage of trees as transitive closure paths, and development of queries for resolving PhyloCode names to clades on trees. Development of topological queries can benefit from existing efforts, such as GSoC 2007 additions to BioSQL. Name string services are available from uBIO and emerging from EoL. Large trees that span all of life are available from Tree of Life Web Project, NCBI, ITIS, etc.
Building on the existing RegNum codebase means becoming familiar with Hibernate and Cocoon. Gathering name-string intelligence from name services is non-trivial, as these services are in flux. Translation of PhyloCode definitions into topological queries is fairly straightforward except where specifiers involve synapomorphies, which will require an imaginative solution.
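For the straightforward case, a node-based definition (the least inclusive clade containing all specifiers) reduces to finding the most recent common ancestor. The Python sketch below illustrates this with an in-memory stand-in for a transitive-closure table; it is not RegNum or BioSQL code, and the data layout and function names are assumptions. In a database, the `(ancestor, descendant, distance)` triples would be rows, and the lookups would become joins.

```python
def ancestor_paths(parent_of):
    """Map every node to its path [node, parent, ..., root].

    parent_of: {child: parent}; the root has no entry (assumed format).
    """
    nodes = set(parent_of) | set(parent_of.values())
    paths = {}
    for node in nodes:
        path, current = [], node
        while current is not None:
            path.append(current)
            current = parent_of.get(current)
        paths[node] = path
    return paths


def closure_rows(paths):
    """Transitive closure as (ancestor, descendant, distance) triples."""
    return {(ancestor, node, distance)
            for node, path in paths.items()
            for distance, ancestor in enumerate(path)}


def resolve_node_based(paths, specifiers):
    """Root of the least inclusive clade containing all specifiers."""
    common = set(paths[specifiers[0]])
    for specifier in specifiers[1:]:
        common &= set(paths[specifier])
    # Walking up from any specifier, the first shared ancestor is the MRCA.
    for ancestor in paths[specifiers[0]]:
        if ancestor in common:
            return ancestor
    return None


def clade_members(paths, clade_root):
    """All nodes whose path to the root passes through clade_root."""
    return {node for node, path in paths.items() if clade_root in path}


# Toy tree with neutral labels: root -> (X -> (a, b), c).
toy_parent_of = {"X": "root", "c": "root", "a": "X", "b": "X"}
toy_paths = ancestor_paths(toy_parent_of)
toy_clade = clade_members(toy_paths, resolve_node_based(toy_paths, ["a", "b"]))
```

Other definition types would need different queries, and definitions whose specifiers involve synapomorphies are not addressed by this sketch at all.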
Involved toolkits or projects 
RegNum can be used as a core registry database and has excellent documentation; the BioSQL PhyloDB extension can be adapted to RegNum for indexing tree topologies. ReST web services from uBIO can be used for name-string intelligence. A tree-of-life is served by the Tree of Life Project, which can be enhanced by fusion with available classifications.
William Piel, Shirley Cohen, Hilmar Lapp

phyloXML support in BioPerl

Currently phylogenetic tree formats have limited support for attaching annotation and cross-references to the data. The recent development of an XML standard for phylogenetic trees, phyloXML, provides a format to encode phylogenetic relationships and additional information. Support from various toolkits has yet to be developed. BioPerl is one toolkit that has support for many aspects of phylogenetic trees and API and objects for representing sequences, alignments, annotation, and phylogenetic information. By building support for phyloXML into BioPerl, tools can be built which encode the links between the data, analyses, and a resulting phylogenetic tree providing an encapsulated result beyond simply the tree.
A straightforward approach to implement an XML parser in BioPerl can build on the existing Tree Parsing modules (see Bio::TreeIO) and utilize pre-existing XML parsing support in Perl.
Efficiently implementing the parser should be very easy. Additional data structures may need to be designed or reused from existing BioPerl objects to fully capture all annotation that could be encoded in the phyloXML format. Some work to support an ontology for annotated data could also be explored.
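The event-driven parsing style that such a parser would use can be shown in a few lines. The sketch below uses Python's standard `xml.sax` module rather than Perl, and it handles only nested `clade` elements with `name` and `branch_length` children; those element names reflect my reading of the phyloXML schema and should be checked against it. The example document is made up for illustration.

```python
import xml.sax


class CladeHandler(xml.sax.ContentHandler):
    """Collect nested clades as dictionaries while streaming the document."""

    def __init__(self):
        super().__init__()
        self.roots = []    # top-level clades of each phylogeny
        self._clades = []  # stack of currently open <clade> elements
        self._path = []    # stack of currently open element names
        self._text = []    # character data of the current element

    def startElement(self, name, attrs):
        self._path.append(name)
        self._text = []
        if name == "clade":
            clade = {"name": None, "branch_length": None, "children": []}
            if self._clades:
                self._clades[-1]["children"].append(clade)
            else:
                self.roots.append(clade)
            self._clades.append(clade)

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip()
        parent = self._path[-2] if len(self._path) > 1 else None
        if name == "clade":
            self._clades.pop()
        elif parent == "clade" and name == "name":
            # Only a <name> directly inside <clade> names the clade.
            self._clades[-1]["name"] = text
        elif parent == "clade" and name == "branch_length":
            self._clades[-1]["branch_length"] = float(text)
        self._path.pop()
        self._text = []


EXAMPLE = b"""<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example tree</name>
    <clade>
      <clade><name>A</name><branch_length>0.1</branch_length></clade>
      <clade><name>B</name><branch_length>0.2</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>"""

handler = CladeHandler()
xml.sax.parseString(EXAMPLE, handler)
```

Tracking the parent element matters because `name` can also appear in other places, such as directly under `phylogeny`, where it does not describe a clade.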
Involved toolkits or projects 
BioPerl and phyloXML.
Jason Stajich, Chris Fields, Rutger Vos, Christian Zmasek

phyloXML support in BioRuby

Phylogenetic trees are at the center of comparative genomics studies. Trees used in this context (sometimes as input, but mostly as results) are annotated with a variety of data elements, such as taxonomic information, genome-related data (gene names, functional annotation), and information related to the phylogeny itself (branch lengths, support values). phyloXML is an XML data exchange standard that can represent this data. phyloXML can be displayed via ATV, which also allows manipulation and navigation of the tree. Currently, there is no support for phyloXML in any of the open source Bio* projects.
Build phyloXML support in the increasingly popular object-oriented dynamic language Ruby. More specifically, extend the open source BioRuby project to support phyloXML. This will probably entail (i) the development of objects to represent all the elements of phyloXML (sequences, taxonomic data, annotations, etc.), (ii) the development of a parser to read in phyloXML, and (iii) a phyloXML writer.
Relating the data elements specific to phyloXML to the tree classes already in BioRuby (while "keeping it simple"). Development of the parser.
Involved toolkits or projects 
BioRuby, phyloXML
Christian Zmasek, Pjotr Prins

A Phylogenetic Tree Mining Service Over MapReduce

Mining patterns in large numbers of phylogenetic trees requires the ability to search for topologies and compute parameters over large datasets of trees. Distributing both data and tasks across commodity servers promises to greatly accelerate these queries and calculations.
We propose to build a tree mining service in Java, on top of Hadoop, an open-source implementation of MapReduce, and deploy this over a cluster of machines. We will be using either the CIPRES compute cluster or the NESCent shared cluster resource for testing this on an actual computing cluster.
  1. Learn about the MapReduce architecture and play with Hadoop's implementation of MapReduce.
  2. Design a simple and intuitive API for users who are both computational biologists and technically savvy enough to program. The API will include a topological tree matching option with some advanced search criteria like max/min branch lengths and number of taxa to search.
  3. Implement appropriate hashing functions for grouping large numbers of trees such that they can be split as evenly as possible across the machines in the cluster. This requirement corresponds to the map function in the map-reduce framework. Note that the choice of a hashing function will depend on the input query.
  4. Implement both standard and non-standard reducer functions. Standard reducer functions will compute phylogenetically useful statistics over groups of trees, such as average tree height, maximum likelihood/parsimony score, or minimum branch length. Non-standard reducer functions will match tree instances given a target tree. Note: A standard reducer function returns a single value per mapped key, whereas a non-standard reducer function operates over all the keys in the map but returns more than one value per mapped key.
  5. Conduct performance tests on different shapes of trees (e.g. long and bushy) and different sizes of datasets (small, medium, large).
  6. Document the code such that it can be disseminated to the computational biology community.
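The map and reduce roles in steps 3 and 4 can be sketched in a single process, without Hadoop. The Python example below is only an illustration of the division of labour: the tree records, the choice of hashing the sorted taxon set, and all function names are assumptions, and a real implementation would be written in Java against Hadoop's API.

```python
import hashlib
from collections import defaultdict


def partition_key(tips, n_partitions):
    """Stable hash of a taxon set, reduced to a partition number."""
    digest = hashlib.md5("|".join(sorted(tips)).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions


def make_mapper(n_partitions):
    """Map step: emit (partition, tree) so trees spread across partitions."""
    def mapper(tree):
        yield partition_key(tree["tips"], n_partitions), tree
    return mapper


def mean_height_reducer(key, trees):
    """Standard reducer: a single statistic per key."""
    return sum(tree["height"] for tree in trees) / len(trees)


def make_matching_reducer(target_tips):
    """Non-standard reducer: may return several matching trees per key."""
    target = set(target_tips)

    def reducer(key, trees):
        return [tree["id"] for tree in trees if set(tree["tips"]) == target]
    return reducer


def run_mapreduce(records, mapper, reducer):
    """In-process stand-in for the framework's shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical tree records, for illustration only.
toy_trees = [
    {"id": "t1", "tips": ["A", "B", "C"], "height": 1.0},
    {"id": "t2", "tips": ["C", "B", "A"], "height": 3.0},
    {"id": "t3", "tips": ["A", "B", "D"], "height": 5.0},
]
heights = run_mapreduce(toy_trees, make_mapper(4), mean_height_reducer)
matches = run_mapreduce(toy_trees, make_mapper(4),
                        make_matching_reducer(["A", "B", "C"]))
```

Hashing the taxon set sends trees over the same taxa to the same partition, which suits taxon-set matching; as step 3 notes, a different query would call for a different hashing function.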
Involved toolkits or projects 
Hadoop with synergy from other projects that involve developing a phyloinformatic API, such as Phyloinformatics Web-Services API
Shirley Cohen, Hilmar Lapp, and William Piel

Matrix display of phenotype annotations using ontologies in Phenote

Phenote is an application which facilitates the annotation of biological phenotypes using ontologies. While initially developed for model organism mutants, the Phenoscape project has extended Phenote to efficiently annotate evolutionary phenotypes. Phenotype annotations are currently edited within Phenote in a list interface, allowing new phenotypic states to be annotated without pre-defining characters. However, a way to view annotations in a matrix form, showing character values grouped by taxon and character, would be familiar to evolutionary biologists and aid in evaluation of taxonomic annotation coverage and phylogenetic patterns. Development of this view would also lead to data export compatibility with matrix-oriented evolutionary software.
Implement a matrix view within Phenote using Java Swing. Develop algorithms for populating this view, by consolidating phenotype annotations into characters using the ontology hierarchical structure. Phenotypic states can be grouped into characters algorithmically by finding values which descend from the same attribute within the PATO phenotype ontology, as implemented in Phenote's prototype NEXUS exporter. Given enough time, it would also be valuable to allow the user to define characters by hand.
Become familiar with OBO ontologies and their use in phenotypic annotation. Add flexibility to the matrix view to narrow the display to subsets of particular ontologies or to provide options for how annotations are grouped into characters.
Involved toolkits or projects 
Phenote, OBO-Edit
Jim Balhoff, Mark Gibson, Nicole Washington

Freeing Geophylobuilder: Implementing a Web-Based Java Client (GeoPhyloBuilder-J)

Geophylobuilder for ArcGIS (Kidd and Liu, 2008) creates geophylogenies from input evolutionary trees and associated distribution data. Geophylogenies are spatiotemporally georeferenced phylogenies; the same data structure can also be used to represent reticulate spatiotemporal relationships between any type of geographical entities including islands, continents, watersheds and areas of endemism. Geophylobuilder for ArcGIS, which itself is open-source, was developed in .Net using the ESRI ArcObjects COM library, hence an (expensive) software license is required for its use. Two websites, the (experimental) CIPRES tree server and SupraMap, already support the creation of simple geophylogenies in KML format from a tree file and associated tip point locations. Neither of these is open-source, and SupraMap is further restricted by requiring data to be in POY4 format with an apomorphy list.
This project will develop an open-source web-application version of Geophylobuilder written in Java (Geophylobuilder-J). Geophylobuilder-J should support input of a tree in NEWICK format and sample locations as points, lines, or polygons encoded as a .csv file or shapefile. It should also support n:m relationships between tree tips and observations. Allowing user-defined node positioning would allow geophylogenies to be built using the output of ecological niche models or other historical biogeographical reconstruction approaches. Geophylogenies should be output as KML files for viewing in earth browsers, or in the shapefile format, which can be read by many programs including the R statistics package.
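To make the simplest case concrete (a tree plus one point location per tip, written out as KML), here is a minimal Python sketch. It is not Geophylobuilder's algorithm: placing each internal node at the mean of its children's coordinates is an assumption chosen for brevity, the parser handles only plain Newick without quoted labels or comments, and all names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string (must end in ';') into nested dicts."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        if text[pos] == ":":              # branch lengths are read past, not kept
            pos += 1
            while text[pos] not in ",();":
                pos += 1
        return {"name": name, "children": children}

    return node()

def locate(node, tip_coords):
    """Assign (lon, lat) to every node; internal nodes get the mean of
    their children's positions (a naive placement, assumed here)."""
    if not node["children"]:
        node["lonlat"] = tip_coords[node["name"]]
    else:
        pts = [locate(child, tip_coords) for child in node["children"]]
        node["lonlat"] = (sum(p[0] for p in pts) / len(pts),
                          sum(p[1] for p in pts) / len(pts))
    return node["lonlat"]

def branches(node):
    """Yield (parent position, child position) for every branch."""
    for child in node["children"]:
        yield node["lonlat"], child["lonlat"]
        yield from branches(child)

def to_kml(root):
    """Write each branch as a KML LineString (KML order is lon,lat)."""
    placemarks = [
        "<Placemark><LineString><coordinates>"
        f"{lon1},{lat1} {lon2},{lat2}"
        "</coordinates></LineString></Placemark>"
        for (lon1, lat1), (lon2, lat2) in branches(root)
    ]
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
            + "\n".join(placemarks) + "\n</Document></kml>")

tips = {"A": (-80.0, 35.0), "B": (-78.0, 35.0), "C": (-79.0, 39.0)}
root = parse_newick("((A:1,B:1):1,C:2);")
locate(root, tips)
kml = to_kml(root)
```

The sketch ignores the harder parts named above: n:m relationships between tips and observations, line and polygon locations, user-defined node positions, and metadata in the KML. Averaging longitudes is also unreliable near the date line, so a real implementation would need proper geometry handling.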
The main challenges are in designing the web interface and developing the KML structure that contains data and metadata. Skills in (simple) computational geometry will help. Java and JavaScript programming skills will be needed.
Involved toolkits or projects 
Jump Shapefile API, Google Earth/NASA WorldWind
David Kidd, Xianhua Liu

Web-based application for representing, manipulating and visualizing comparative data

The scale and integrative nature of contemporary evolutionary research call for easy-to-use, interactive, and dynamic software that lets scientists visually interact with and manipulate complex phylogenetic data (in particular character data and trees), instead of cumbersome hands-on manipulation. The ideal application is cross-platform and web-based, highly interactive with quick response times, and provides efficient remote data-management features. The recent availability of the Google Web Toolkit (GWT), a Java-based framework for developing dynamic AJAX-based web applications, and the Bio::NEXUS Perl library for efficiently representing and manipulating comparative data greatly eases the development of such software.
Develop a web application to upload, generate graphical views of, and to manipulate in an integrated way trees with character data (from, and as contained in, NEXUS-formatted input files). To be useful, the application should generate a view of a tree with its tips aligned to the relevant rows of a sequence alignment (or other character matrix), and support basic tasks such as re-rooting the tree, or selecting (excluding) a subset of data defined via a subtree. These operations should be conveniently controlled by point-and-click with a mouse, for example on the relevant node of the tree.
The Nexplorer server provides an example of this kind of interface, using a JavaScript front end to manipulations carried out by Bio::NEXUS methods. A more modular design based on GWT (instead of JavaScript) will allow a much more powerful application that is easier to maintain and extend. Depending on the rate of progress, the student may be able to implement more sophisticated interface elements (e.g., a horizontally scrollable pane for viewing long sequence alignments), or to provide interfaces to external phylogenetic tools that carry out computations on the tree or data (e.g., reconstruct ancestral sequences that can be viewed via the web application).
  1. Understand the integrated use of Java, JavaScript, HTML, CGI and Perl to build a dynamic, easy-to-use, and easy-to-maintain web application.
  2. Develop XML or JSON object models to exchange data and actions between GWT modules and Bio::NEXUS objects.
  3. Understand the Bio::NEXUS and GWT APIs to efficiently design browser-based UIs and CGI queries.
  4. Implement a mod_perl or FastCGI extension of CGI to manage data on the server efficiently and reduce CGI startup/initialization overhead.
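For step 2, a JSON object model might look roughly like the following. The sketch is in Python purely to keep it short and language-neutral; it uses neither the GWT nor the Bio::NEXUS API, and the message fields (`action`, `node_id`, `selected_tips`) and the tree layout are assumptions, not an agreed format.

```python
import json

def find_node(node, node_id):
    """Depth-first search for the node with the given id."""
    if node["id"] == node_id:
        return node
    for child in node.get("children", []):
        found = find_node(child, node_id)
        if found is not None:
            return found
    return None

def tip_names(node):
    """Names of all tips at or below this node, in tree order."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    names = []
    for child in children:
        names.extend(tip_names(child))
    return names

def handle_request(tree, request_json):
    """Server-side handler: decode an action message, return a JSON reply."""
    request = json.loads(request_json)
    if request.get("action") == "select_subtree":
        node = find_node(tree, request["node_id"])
        if node is None:
            return json.dumps({"status": "error", "message": "unknown node"})
        return json.dumps({"status": "ok", "selected_tips": tip_names(node)})
    return json.dumps({"status": "error", "message": "unsupported action"})

# Hypothetical tree model: every node has an id; tips also have a name.
tree = {"id": 0, "children": [
    {"id": 1, "children": [{"id": 2, "name": "A"}, {"id": 3, "name": "B"}]},
    {"id": 4, "name": "C"},
]}

# What a click on node 1 in the browser might send to the server.
reply = handle_request(tree, json.dumps({"action": "select_subtree", "node_id": 1}))
```

The client would then highlight the alignment rows named in `selected_tips`. Keeping the exchange to small action messages like this, rather than resending the whole data set, is one way to get the quick response the project calls for.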
Understanding of Object Oriented Programming concepts and programming skills in Java or Perl will be required.
Involved Projects 
GWT (Google Web Toolkit), Bio::NEXUS, Nexplorer and FastCGI
Vivek Gopalan, Arlin Stoltzfus


What should prospective students know?

Note that the acceptance decisions for projects and students have meanwhile been made and have been published by Google.

When you apply

When applying, please provide the following in your application material (in addition to the information requested by Google).

  1. Your interests, what makes you excited
  2. Why you are interested in the project, why you are uniquely suited to undertake it, and what you expect to gain from it
  3. A summary of your programming experience and skills
  4. Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal of the project. Put another way, a project plan explains what you expect you will need to be doing, and what you expect to have accomplished by which time, so that at the end you reach the goals of the project.
    • A compelling project plan will require you to spend significant thought on the project itself and how one might want to go about the work. While we won't outright reject student applications without a project plan, we do expect that without one your application won't be competitive, even if your skills seem to match or beat others. We are most interested in students who are excited and attracted by the project they are proposing, and if you aren't willing to spend the time to come up with a project plan, your excitement will be hard to demonstrate.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing (see below).
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • While there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That said, if you have the time-management skills to manage additional jobs or obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust.

Get in touch with us

Please send any questions you have, and projects you would like to propose, to This will reach all mentors and our Summer of Code organization administrators. You can also reach the (a little larger) community of previous mentors, students, hackathon participants, and other phyloinformatics software developers at

  • We strongly recommend you do this even if you want to work directly on one of our project ideas above. It gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics. Also, you can bounce your project plan off of us.
  • Keep in mind that we are most interested in students who give us evidence that they already have, or might develop, a sustained interest in becoming future contributors to our open source projects. If the first time we see something from you is when we score your application, it is much more difficult for you to convey that.
  • A major component in becoming part of a project and a community, even if only temporarily, and also of working remotely is communication. If you can't be bothered to engage in communication with your prospective organization and/or your mentor before you have been accepted, with what evidence will you argue convincingly that you are a good communicator?

Other information

  • You can visit our application document with Google's questions and our answers. The part that isn't on this page or linked from here already is mostly in the last couple of questions on mentors and students.
  • For questions of eligibility, see the GSoC eligibility requirements for students. These requirements must be met on April 14, 2008.
  • There is also a Google group for posting GSoC questions (and receiving answers; note that you will need to sign up for the group) that relate to the program itself (rather than being specific to our organization).
  • Students receive a stipend from Google if accepted. See the Google SoC FAQ on payments for full documentation.

Reference Facts & Links

Projects involved

Bio* projects 
The umbrella organization for the Bio* projects is the Open Bioinformatics Foundation (O|B|F). O|B|F is governed by a Board of Directors, has organized the Bioinformatics Open Source Conference (BOSC) annually since 2001, and provides hardware and system administration for the member projects.
The individual member projects are BioPerl, Biojava, Biopython, Bioruby, and BioSQL. Except for the latter, which provides a generic schema for certain life science data types, each of these projects represents the largest and most widely used toolkit in its respective language in the life sciences.
Perl Bio:: projects 
Bio::NEXUS is the only Level-III compliant parser of the NEXUS file format for phylogenetic data in the Perl programming language. Bio::CDAT is a container architecture for Character Data And Trees. CDAT is in the early stages of its development. The envisioned end result is an architecture that keeps track of Bio::* data objects in ways that are applicable to the task at hand. For example, a CDAT subclass could manage the relationships between a tree and multiple data columns for input in comparative analyses. Bio::Phylo is an API for phylogenetic data used by the CIPRES project. Bio::Phylo is intended to be compatible with BioPerl and CDAT, while functioning as a petri dish for new object designs.

Google Summer of Code 2008