Phyloinformatics Summer of Code 2009

We were accepted again as a mentoring organization for the 2009 Google Summer of Code (GSoC) program. This was our third time in the program.

On this page we first collected project ideas, along with prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off of. You can now also find the accepted projects, with brief summaries of their goals, the student and mentor for each project, and the locations of the project page, the source code repository, and, if the student kept one, a development blog. After completion of the program in August 2009 the students wrote up a short summary for each project.

Please see Phyloinformatics Summer of Code 2007 and Phyloinformatics Summer of Code 2008 for information about our participation during the previous two years. We also wrote up the project results in the fall editions of our 2007 and 2008 NESCent newsletters.



Our organization administrators are Hilmar Lapp and Todd J. Vision.

If you are a student interested in applying for a Google Summer of Code project in our organization, please send any questions you have, projects you would like to propose, etc. to This will reach all mentors and our Summer of Code organization administrators. You can also reach the (larger) community of previous mentors, students, hackathon participants, and other phyloinformatics software developers at

During the student application period (which is now over) we hang out regularly on IRC at least on weekdays during working hours (EDT) in #phylosoc on Freenode. You're welcome to join us at any time, though we may not be on-line outside of those times. Alternatively, there's always email. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

Before you apply, please read our guidelines and expectations. We don't have an application format that you need to adhere to, but we do ask that you include specific kinds of information (see "When you apply").


Our aim is to cultivate the next generation of open-source software developers in evolutionary biology and promote the open, collaborative development of reusable, interoperable, standards-supporting informatics tools. We aim to increase the awareness and understanding of the value of open-source and collaborative development, and also impart the programming and remote collaboration skills needed to successfully contribute to such projects.

The disciplinary focus of our participation is phyloinformatics and evolutionary biology, and our mentors are leaders in this field. The code that students write will help answer new, and increasingly complex, scientific questions about the patterns and processes underlying the history of life. Work on projects along the lines of our suggested project ideas is bound to make an impact. The range of problems to which students can make meaningful contributions is diverse, enabling us to accommodate students with varying interests. Similarly, the project ideas cover a variety of programming languages and skill levels.

Accepted Projects

Note that the accepted projects, students, and their mentors are also published by Google. There is also a tabular version. After the completion of the 2009 program, we worked with the students to write up a short summary for each project.

GPU acceleration for phylogenetic inference using OpenCL

This is an application for the GPU acceleration for phylogenetic inference using OpenCL project idea.

Abstract: This project will implement an open-source C++ library to compute phylogenetic tree likelihoods on the GPU, with the core loops in OpenCL. Currently in phylogenetic inference there is a strong demand for more computational speed and typically likelihood calculation is the main bottleneck. Moving to GPU acceleration is a natural step to address that issue and this project has the potential to benefit several state-of-the-art evolutionary analysis packages.

Student: Daniel Ayres

Mentor(s): Aaron Darling (primary), Marc Suchard

Project Homepage: beagle-lib GSoC 2009 page on Google Code

Source Code: beagle-lib OpenCL branch on Google Code

Biogeographical Phylogenetics for BioPython

This project was proposed by Nick Matzke.

Abstract: Create a BioPython module that will enable users to automatically access and parse species locality records from online biodiversity databases; link these to user-specified phylogenies; calculate basic alpha- and beta-phylodiversity summary statistics; produce input files for the various algorithms available for inferring historical biogeography; and convert output from these programs into files suitable for mapping, e.g. in Google Earth (KML files).

Student: Nick Matzke

Mentor(s): Stephen Smith (primary), Brad Chapman, David Kidd

Project Homepage: BioGeography

Source Code: The source code is in the Bio/Geography directory of the Geography fork of the nmatzke branch on GitHub, and you can see a timeline and other info about ongoing development of the module here. The new module is being documented on the BioPython wiki as BioGeography.

Biopython support for parsing and writing phyloXML

This is an application for a modification of the phyloXML support in BioRuby project idea (namely, choosing Biopython as the Bio* library instead of BioRuby).

Abstract: PhyloXML is an XML format for phylogenetic trees, designed to allow storing information about the trees themselves (such as branch lengths and multiple support values) along with data such as taxonomic and genomic annotations. Connecting these pieces of evolutionary information in a standard format is key for comparative genomics. A Bioperl driver for phyloXML was created during the 2008 Summer of Code; this project aims to build a similar module for the popular Biopython package.

Student: Eric Talevich

Mentor(s): Brad Chapman (primary), Christian Zmasek

Project Homepage: PhyloSoC:Biopython support for parsing and writing phyloXML

Source Code: Biopython phyloxml branch on GitHub

Implementing phyloXML support in BioRuby

This is an application for the phyloXML support in BioRuby project idea.

Abstract: Phylogenetic trees have numerous applications in biological and biomedical research, including comparative genomics, phylogenomics, phylogeography, gene function analysis, systematics, and the study of molecular evolution. Phylogenetic trees used in these applications are annotated, on the branch and node level, with a variety of data elements, such as taxonomic information, genome-related data (gene names, functional annotation), and information related to the phylogeny itself (branch lengths, support values). phyloXML is an XML data exchange standard that can represent this data. The goal of this project is to implement support for phyloXML in BioRuby.

Student: Diana Jaunzeikare

Mentor(s): Christian Zmasek (primary), Pjotr Prins, Naohisa Goto

Project Homepage: PhyloSoC:PhyloXML_support_in_BioRuby

Project Blog: Personal Blog tag GSOC 2009

Source Code: BioRuby PhyloXML branch on GitHub

BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

This is an application for the BioPerl integration of the NeXML exchange standard + Bio::Phylo toolkit project idea.

Abstract: This project will integrate the NeXML exchange standard into BioPerl, facilitating the adoption of this standard and easing the transition from the overworked NEXUS standard. A wrapper will be used to allow BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl. Additionally, test cases and example sets will be developed that target real-world uses.

Student: Chase Miller

Mentor(s): Mark A. Jensen (primary), Rutger Vos

Project Homepage: PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

Project Blog: GSOC Nexml Blog

Source Code: Bioperl-live branch

Mapping the Bio++ Phylogenetics toolkit to R/BioConductor and BioJAVA using BioLib

This is an application for the Mapping the Bio++ Phylogenetics toolkit to BioPerl, R/BioConductor and BioJAVA using BioLib project idea.

Abstract: Bio++ is a C++ library for phylogenetics algorithms. Unfortunately, few bioinformatics scientists work in C++. The plan is therefore to translate the library into multiple other languages, including Java and R. (Perl, Python, and Ruby have been done.) Doing so should allow us to reach the majority of active bioinformaticians.

Student: Adam A. Smith

Mentor(s): Pjotr Prins (primary), Chris Fields

Project Homepage: PhyloSoC:Mapping the Bio++ Phylogenetics toolkit to R/BioConductor and BioJAVA using BioLib

Project Blog: Blog

Source Code: Git Repository

A BioLib mapping for the libsequence population genetic libraries

This is an application for the A BioLib mapping for the libsequence population genetics libraries project idea.

Abstract: BioLib brings together a set of open-source C/C++ libraries and makes them available to all Bio languages, like BioPerl and BioPython, using SWIG to map the C/C++ code to those languages. Libsequence is a C++ library for modern genetic simulation. The overall goal of the project is to add libsequence to BioLib and build Python/Perl mappings for libsequence, so that researchers can use modules from libsequence directly in scripting-language programs without writing a special interface program.

Student: Xin Shuai

Mentor(s): Chris Fields (primary), Mark A. Jensen and Pjotr Prins

Project Homepage: PhyloSoC:A BioLib mapping for the libsequence population genetic libraries

Project Blog: Mapping Libsequence in BioLib

Source Code: libseqence branch of BioLib on github

Build a Mesquite Package to view Phenex-generated Nexml files

This is an application for the Build a Mesquite Package to view Phenex-generated Nexml files project idea.

Abstract: Extend Mesquite's capabilities to represent character matrices by allowing for the addition of ontology terms produced by Phenex in the Nexml format. Mesquite currently supports the Nexml file format and by the end of this project would be able to display Phenex's files directly in Mesquite, and depending on the speed of development, would display chained ontology concepts in graphical format with the use of a library like GraphViz.

Student: Kasia Hayden

Mentor(s): Peter Midford (primary), Jim Balhoff

Project Homepage: PhyloSoC:Build a Mesquite Package to view Phenex-generated Nexml files

Project Blog: Wordpress GSoC blog

Source Code: Google code

Enhance the searching functionality of Phylr

This is an application for the Enhance the searching functionality of Phylr project idea.

Abstract: PhyloWS is an emerging standard for interacting with phylogenetic data via web services. Phylr is an initial implementation of the SRU search in PhyloWS. The goals of this project are to enhance the Java code that translates between the SRU server and a Lucene index and make it fully conform to the PhyloWS specs, modify the XSL stylesheets to present a more intuitive interface to the users, and write a new Java adapter that translates between the SRU server and a relational database.

Student: Dazhi Jiao

Mentor(s): Ryan Scherle (primary), Lucie Chan

Project Homepage: PhyloSoC:Enhance the searching functionality of Phylr

Project Blog: Phylr GSoC blog

Source Code: Phylr GoogleCode site

Semantic phyloinformatic web services using the EvoInfo stack

Student John Harney also submitted an application for the Semantic phyloinformatic web services using the EvoInfo stack project idea to the 2009 Cyberinfrastructure Summer Traineeship program, and was accepted there.


Note: if there is more than one mentor for a project, the primary mentor is in bold font. Additional information about the mentors is linked to in the Mentors section.

Students: we have listed only our own project ideas, albeit well thought-out ones. You are welcome to propose your own project provided it still falls within the scope of the aims described above. If we like your proposal, we will try to find an appropriate mentor. Better yet, propose a mentor yourself! Regardless of what you decide to do, make sure you read and follow the guidelines for students below.

Semantic phyloinformatic web services using the EvoInfo stack

The evolutionary informatics working group at NESCent has produced a stack of deliverables to promote interoperability in evolutionary analysis. Among these are the comparative data analysis ontology (CDAO) and the NeXML exchange format. In addition, at a recent hackathon at NESCent, supporting tools were produced to transform NeXML to CDAO RDF/XML triples. Subsequently, at a hackathon in Japan, it was shown that these technologies can be leveraged for the development of semantic web services, which would open up a rich field of possibilities, including smarter service discovery and complex reasoning over web service workflows. This project idea is for a student interested in bringing all these parts together and implementing a proof-of-concept web service that accepts and emits NeXML in such a way that a standards-compliant client (e.g. SADI) can use this service in a larger semantic workflow.
One part of this project involves the creation of a simple web service that produces and consumes NeXML, either in Java or in Perl. The exact function of the service is for the successful applicant to decide, though a requirement is that the implementation doesn't take a lot of time, as the meat of this project lies in making the infrastructure work. (In fact, the applicant could also decide to collaborate with the Phylr applicant and use phylr as a back-end.) The project would involve extending the work done on transformation of nexml to cdao and back. These transforms would be made available on the web, and referenced by a SAWSDL-annotated web service description, respectively as the value of the liftingSchema and loweringSchema attributes. In addition, the service description would identify the NeXML being consumed and produced with reference to the CDAO ontology using modelReference attributes. In principle, this should be sufficient for SADI to consume the service description and use it in its workflows. In practice, a large part of the work will involve debugging the various moving parts, in fine-tuning the NeXML<=>CDAO mappings and in debugging the NeXML<=>CDAO transforms.
A lot of new semantic web technologies are involved: SAWSDL, OWL, WSDL, XSLT, possibly SPARQL. This makes this project an excellent opportunity for learning about the Semantic Web and its practical application.
Involved toolkits or projects 
NeXML, CDAO, web service toolkits (e.g. applicable eclipse plugins or CPAN modules), client usage of SADI.
Degree of difficulty and needed skills 
Medium. The successful applicant needs a solid background in XML processing and web services (not necessarily SOAP, could be REST/HTTP) and be interested in extending his/her knowledge of the semantic web (ontologies, semantic annotations). In addition, the student needs a solid background in either Perl or Java.
Rutger Vos

GPU acceleration for phylogenetic inference using OpenCL

The Maximum Likelihood and Bayesian techniques for phylogenetic inference use intensive floating-point arithmetic to compute the likelihood of a tree topology, branch lengths, and other model parameters given the data. Nearly all implementations of such methods compute tree likelihoods via the peeling algorithm, and usually computing the tree likelihood is the computational bottleneck. In practice, this calculation gets coded as a loop over products and sums of floating point numbers.
Meanwhile, GPUs have advanced to the point that a single card may have 512 or more floating point processors. Recent cards can do double-precision math. GPU implementations of Profile HMMs for sequence alignment have demonstrated 30-100x speedup on inexpensive video cards, and implementations of Molecular Dynamics engines have shown 100-700x speedups.
This project would involve implementing a small C or C++ library to compute tree likelihoods. The core loops will be implemented in OpenCL, which is the first unified, hardware-vendor-independent API for programming parallel applications on GPUs. The first OpenCL implementation is available in the latest Apple Developer seeds of Mac OS X 10.6, and access to a Mac running a licensed prerelease instance of OS X 10.6 will be provided. By late this year, OpenCL should be implemented more widely, eventually on Windows, Linux, and other major OSes. For datasets with many short sequences, parallelization on such hardware may not be attractive because the tree structure imposes dependencies in the computation. But for the long sequences in genomic datasets and for multi-gene studies that have become commonplace, the loop can be parallelized across the sequence dimension instead of across tree branches.
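To make the peeling computation described above concrete, here is a minimal sketch in plain Python (not OpenCL, and not the project's library). It assumes a Jukes-Cantor substitution model, a uniform root distribution, and a toy tree encoding in which a tip is a sequence string and an internal node is a list of (child, branch_length) pairs; all function names are hypothetical.

```python
import math

STATES = "ACGT"

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partials(node, site):
    """Conditional likelihoods of the data below `node` at one site."""
    if isinstance(node, str):  # tip: the node is the observed sequence
        return [1.0 if s == node[site] else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:  # internal node: list of (child, branch_length)
        p = jc69_matrix(t)
        below = partials(child, site)
        for i in range(4):
            # The "loop over products and sums" that dominates run time.
            result[i] *= sum(p[i][j] * below[j] for j in range(4))
    return result

def site_log_likelihood(tree, site):
    """Log-likelihood of one alignment column, uniform root frequencies."""
    return math.log(sum(0.25 * x for x in partials(tree, site)))

def log_likelihood(tree, n_sites):
    # Sites do not depend on each other, so this loop is the natural
    # dimension to spread across GPU threads.
    return sum(site_log_likelihood(tree, s) for s in range(n_sites))

tree = [([("ACGT", 0.1), ("ACGA", 0.1)], 0.05), ("ACTT", 0.2)]
print(log_likelihood(tree, 4))
```

Each call to site_log_likelihood touches only one alignment column, which is why long alignments parallelize well even though the recursion over the tree is inherently ordered.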
The major design goals of the library would be:
  1. that it implement the absolute minimum interface to compute tree likelihoods
  2. that the interface be clean and well-documented
  3. that it provide a good speedup, of course :)
  4. that bindings for higher-level languages (Java, Python, etc.) be easy to create
The tree likelihood library can then be integrated with existing ML and Bayesian codes, or be used as the basis for totally new and experimental methods.
Learning a new API (OpenCL) that very few people know may be challenging.
Involved toolkits or projects 
BEAST, which is an open-source tree inference package and will be used as a test implementation for the library. BEAST already implements a likelihood calculator that can be adapted to OpenCL, and a development project with mailing lists and Subversion repositories exists on Google Code.
Degree of difficulty and needed skills 
Medium-to-hard. Prior experience with C or C++ is necessary, and prior experience programming graphics hardware would be helpful. Prior experience with Nvidia's CUDA language for graphics programming would make this an easy-to-medium project. Knowledge of automake and GNU library packaging standards would be helpful or can be easily learned along the way.
Other requirements 
Apple hardware with a suitable graphics card. The student can either develop on their own hardware or we will provide remote access to hardware.
Aaron Darling and Marc Suchard

phyloXML support in BioRuby

Evolutionary trees are central to comparative genomics studies. Trees used in this context are usually annotated with a variety of data elements, such as taxonomic information, genome-related data (gene names, functional annotations) and gene duplication events, as well as information related to the evolutionary tree itself (branch lengths, support values). phyloXML is an XML data exchange standard that can represent this data. Trees in phyloXML format can be displayed and analyzed with Archaeopteryx (the successor to ATV), which also allows manipulation and navigation of the tree. While tools exist to convert other formats (such as the widely used Newick and Nexus formats) to phyloXML, there is currently support for phyloXML in only one of the open source Bio* projects (in BioPerl, as a result of Google's Summer of Code 2008).
Build phyloXML support in the increasingly popular, dynamic, and fully object-oriented language Ruby. More specifically, extend the open source BioRuby project to support phyloXML (BioRuby 1.3.0 has just been released). This will entail (i) the development of objects to represent all the elements of phyloXML (sequences, taxonomic data, annotations, etc.), (ii) the development of a parser to read in phyloXML, and (iii) a phyloXML writer.
Relating the data elements specific to phyloXML to the tree classes already in BioRuby while maintaining the standards of the BioRuby project. Development of a time and memory efficient phyloXML parser (the parser has to be able to process trees with thousands of external nodes, at least).
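As a rough illustration of steps (ii) and (iii), the sketch below reads a tiny phyloXML document and writes it back out as Newick. It is written in Python with the standard library rather than in Ruby, handles only the clade, name, and branch_length elements, and loads the whole document into memory; a parser meant for trees with thousands of external nodes would more likely be event-based. The helper names and the example tree are made up for illustration.

```python
import xml.etree.ElementTree as ET

# phyloXML elements live in this XML namespace.
NS = {"phy": "http://www.phyloxml.org"}

EXAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade>
        <name>A</name>
        <branch_length>0.1</branch_length>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.2</branch_length>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""

def read_clade(elem):
    """Turn a <clade> element into a nested dict (hypothetical structure)."""
    name = elem.findtext("phy:name", default=None, namespaces=NS)
    length = elem.findtext("phy:branch_length", default=None, namespaces=NS)
    return {
        "name": name,
        "branch_length": float(length) if length is not None else None,
        "children": [read_clade(c) for c in elem.findall("phy:clade", NS)],
    }

def to_newick(clade):
    """Serialize the nested dict as a Newick string (without the final ';')."""
    label = clade["name"] or ""
    if clade["children"]:
        inner = ",".join(to_newick(c) for c in clade["children"])
        label = "(" + inner + ")" + label
    if clade["branch_length"] is not None:
        label += ":" + str(clade["branch_length"])
    return label

root = ET.fromstring(EXAMPLE)
tree = read_clade(root.find("phy:phylogeny/phy:clade", NS))
print(to_newick(tree) + ";")
```

A full implementation would also need objects for the remaining phyloXML elements (taxonomy, sequences, confidence values, and so on), which this sketch ignores.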
Involved toolkits or projects 
BioRuby, phyloXML
Degree of difficulty and needed skills 
Medium. Requires experience in an object oriented programming language (such as C++, Java, or, ideally, Ruby). Experience in genomics or a related biological field is also critical. Knowledge of BioRuby will obviously help, as well as familiarity with XML.
Christian Zmasek, Pjotr Prins (looking for co-mentors, especially experts on BioRuby)

Extend the NEXUS Class Library to parse nexml and the NOTES block in NEXUS

nexml is emerging as a standard format for expressing phylogenetic data. The NEXUS Class Library (NCL) is a mature parser for the NEXUS format (and some other simple formats). Currently NCL does not do a rich parse of the NOTES block in NEXUS (a block that stores metadata about taxa, characters, and trees), nor does it read nexml. By adding a generic interface for associating metadata (as strings) with core objects, NCL can provide an API rich enough to parse the NOTES block of NEXUS and a nexml file. This would make this data available to client code, which includes several projects used throughout the phylogenetics community (GARLI, phycas, Brownie and phylobase, the CIPRES portal, MrBayes v4), and would allow that software to read nexml.
Design an API for returning hierarchical notes for core parts of the phylogenetic data analysis (a group of taxa, a taxon, a set of trees, a tree, a node, a cell of a character matrix...). Implement parsing of a NOTES block so that the content of notes can be returned in this way. Implement parsing of nexml so that content can be put into NCL's data structures (and queried through the extended API).
nexml is quite rich in terms of accommodating metadata. While attaching content to the core data is easy in dynamic languages, it will be a challenge to keep track of the metadata within C++. Fortunately, the core data models of nexml and NEXUS are similar (the core collections are taxa, character matrices, and trees). Adding metadata about these three entities is supported in NCL, but the types of metadata are not terribly rich. Fortunately, storing hierarchical structured strings (like a DOM subtree) connected to the correct object will be sufficient in most cases. Documentation of the newest NCL features is not great (fortunately the author of those extensions, Mark Holder, is a proposed mentor, so questions can be handled efficiently while the docs are improved).
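The idea of a generic interface that attaches hierarchical string metadata to core objects can be sketched as follows. This is a Python illustration of the concept only, not NCL's C++ API; the NoteStore class, its method names, and the slash-separated path convention are all assumptions made for the example.

```python
from collections import defaultdict

class NoteStore:
    """Hierarchical string notes keyed by (kind, object id). Hypothetical API."""

    def __init__(self):
        self._notes = defaultdict(dict)

    def add(self, kind, obj_id, path, text):
        """Store `text` under a slash-separated path such as 'locality/country'."""
        node = self._notes[(kind, obj_id)]
        *parents, leaf = path.split("/")
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError("path passes through an existing text note")
        node[leaf] = text

    def get(self, kind, obj_id, path=None):
        """Return a note or a whole subtree; None if the path does not exist."""
        node = self._notes.get((kind, obj_id), {})
        if path is None:
            return node
        for part in path.split("/"):
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node

store = NoteStore()
store.add("taxon", "t1", "locality/country", "Peru")
store.add("cell", ("t1", 4), "comment", "ambiguous base call")
print(store.get("taxon", "t1"))
print(store.get("cell", ("t1", 4), "comment"))
```

Keeping the notes in a side table keyed by object identity, rather than inside the taxa, tree, and matrix classes themselves, is one way to add metadata without changing the existing data structures.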
Involved toolkits or projects 
NCL, SWIG, nexml, Xerces (or some other XML parsing library for C/C++)
Degree of difficulty and needed skills 
Medium difficulty for a programmer who is comfortable with C++. NCL uses templates, some multiple inheritance, and a tiny bit of compile-time polymorphism (template-based function dispatching). SWIG wrappers are in place, so the goal will be to expose a few new interfaces without breaking the old wrappers (as opposed to building SWIG support from the ground up). Existing nexml parsers in Python, Perl, and Java will make it relatively easy to check for correct behavior.
Mark Holder and Rutger Vos

Extend Jalview Alignment visualization tool

Jalview is a Java based alignment visualization and editing workbench which includes a light-weight applet specifically designed to be incorporated into bioinformatics web applications. Jalview is widely used (over 2000 launches of the application per week), and is probably best-positioned for enhancement and incorporation of new features to support gene curation and cope with the new challenges arising from Next Generation Sequencing.
A number of areas of development have been suggested.
  • Enhance the Jalview applet in order to improve its functionality as an embedded web application component in an AJAX gene curation application. This would involve introducing new editing and nucleotide alignment specific visualization capabilities, enhancing existing sequence and alignment annotation components, and developing the javascript/Java API further to enable interaction with Distributed Annotation System datasources via the web server.
  • Optimise the Jalview data and I/O model and extend the user interface further to support access to very large datasets.
  • Extend the existing phylogenetic analysis and visualization capabilities
  • Enable the exchange of new phylogenetic file formats (e.g. neXML)
  • Develop or improve interoperation with other NESCent projects
  • Implementation of UI and visualization features requested by expert annotation curators
  • Extension of a Java Applet API for effective AJAX interaction
  • Refactoring for speed and memory efficient operation
  • I/O models to support large scale data visualization
  • User interface design for large datasets
  • Phylogenetic file format parsing and generation
  • Visualization in AWT and Swing - Extension of existing views, or development of new ones
  • API development for pluggable incorporation of other GPL code for visualization and analysis
Degree of difficulty and needed skills 
  • Medium to hard
  • Good Java programming skills including GUI development, and at least a basic knowledge of AWT.
  • Working knowledge of JavaScript and AJAX approaches for accessing XML data sources.
Jim Procter, Albert Vilella

Integrate a reusable ontology term markup module into Phenex

Ontologies and controlled vocabularies provide consistency and interoperability through use of common terms. Many applications display and edit free text descriptions of data. It would be useful to match words and phrases in free text with those found in relevant ontologies, so that these applications could display controlled terms in a special way (color-coding, hyperlinking, etc.). A simple module performing this text matching, or wrapping a 3rd party tool, would make this functionality easier to add to existing applications, promoting the use of controlled vocabularies in biological data.
Investigate approaches to matching ontology terms and phrases to free-text content. This may involve development of an algorithm, or integration of an existing method or tool, perhaps from the natural language processing community. Design and implement a flexible API which allows the matched text to be incorporated into applications in an extensible way. Design and implement a means to specify different controlled vocabulary sources for the module. Integrate this module into Phenex, a desktop application for annotating free text character descriptions with ontological phenotype descriptions, by colorizing ontology terms found in the free text fields.
It will be ideal to design an API that is simple but flexible enough to employ in diverse client applications.
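One simple approach to the text matching described above is a case-insensitive, longest-phrase-first search on word boundaries. The sketch below is in Python rather than Java and is only a baseline; the function names, the example vocabulary, and the term identifiers (EX:...) are invented for illustration and do not come from any real ontology.

```python
import re

def build_matcher(terms):
    """Compile one regex that prefers longer phrases over their sub-phrases."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def find_terms(text, term_ids):
    """Return (start, end, term_id) for every vocabulary phrase found in text."""
    if not term_ids:
        return []
    matcher = build_matcher(term_ids)
    lookup = {term.lower(): term_id for term, term_id in term_ids.items()}
    return [
        (m.start(), m.end(), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]

# Hypothetical vocabulary; the identifiers are placeholders.
vocabulary = {
    "dorsal fin": "EX:0001",
    "fin": "EX:0002",
    "pectoral fin ray": "EX:0003",
}
text = "Dorsal fin present; pectoral fin rays absent."
for start, end, term_id in find_terms(text, vocabulary):
    print(text[start:end], term_id)
```

Returning character offsets instead of marked-up text leaves the presentation (color-coding, hyperlinking, and so on) to the client application. Note that exact matching misses the plural "pectoral fin rays" and falls back to "fin", which is the kind of gap that stemming or a natural language processing tool would have to close.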
Involved toolkits or projects 
Java, Phenex, OBO-Edit
Degree of difficulty and needed skills 
Medium. This project will involve programming in Java.
Jim Balhoff

Build a Mesquite Package to view Phenex-generated Nexml files

Phenex is a tool used by the Phenoscape project to annotate character matrices with ontology terms. Phenex's native file format is the emerging Nexml format for phylogenetic data, which Mesquite also supports via an existing package and java library. This project will allow users to view Phenex annotated matrices with Mesquite. The value of this is to allow people familiar with Mesquite to view and explore phenex files within an environment they are familiar with.
This will involve building on Mesquite's existing tools for displaying annotations on character matrices and Mesquite's support for the Nexml file format. Ideally, the display could be extended to display chained ontology concepts in a graphical format, perhaps using an existing library such as GraphViz.
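To give a feel for the GraphViz option, the sketch below turns a chain of ontology concepts into a description in GraphViz's DOT language, which the dot program can render. It is in Python rather than Java, and the triple format, the function names, and the example terms and relations are assumptions for illustration rather than Phenex's actual annotation model.

```python
def quote(label):
    """Quote a label for DOT, escaping embedded double quotes."""
    return '"' + label.replace('"', '\\"') + '"'

def chain_to_dot(links):
    """Render (source, relation, target) triples as a DOT digraph."""
    lines = ["digraph annotation {", "  rankdir=LR;"]
    for source, relation, target in links:
        lines.append(
            "  %s -> %s [label=%s];" % (quote(source), quote(target), quote(relation))
        )
    lines.append("}")
    return "\n".join(lines)

# Hypothetical chained concepts for one annotated matrix cell.
links = [
    ("fin ray", "part_of", "dorsal fin"),
    ("dorsal fin", "part_of", "body"),
]
print(chain_to_dot(links))
```

Saving the printed text to a file and running it through dot (for example with the -Tpng option) produces a left-to-right diagram of the chain.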
  • Making the extensions work seamlessly with the existing Mesquite annotations
  • Learning Nexml and particularly its annotation syntax (which may still be in flux)
Involved toolkits or projects
Java, Mesquite, Phenex, Nexml
Degree of difficulty and needed skills
  • Difficulty: Easy to moderate
  • OO language experience (Java preferred)
  • Understanding of ontologies and experience with Mesquite also desirable.
Peter Midford, Jim Balhoff

Web-based application for representing, manipulating and visualizing comparative data

The scale and integrative nature of contemporary evolutionary research call for easy-to-use, interactive, and dynamic software that lets scientists visually interact with and manipulate complex phylogenetic data (in particular character data and trees), instead of relying on cumbersome hands-on manipulation. The ideal application is a cross-platform, web-based application that is highly interactive with quick response and provides efficient remote data-management features. The recent availability of the Google Web Toolkit (GWT), a Java-based framework for developing dynamic AJAX-based web applications, and the Bio::NEXUS Perl library for efficiently representing and manipulating comparative data, greatly eases developing such software.
Develop a web application to upload, generate graphical views of, and to manipulate in an integrated way trees with character data (from, and as contained in, NEXUS-formatted input files). To be useful, the application should generate a view of a tree with its tips aligned to the relevant rows of a sequence alignment (or other character matrix), and support basic tasks such as re-rooting the tree, or selecting (excluding) a subset of data defined via a subtree. These operations should be conveniently controlled by point-and-click with a mouse, for example on the relevant node of the tree.
The Nexplorer server provides an example of this kind of interface, using a JavaScript front end to manipulations carried out by Bio::NEXUS methods. A more modular design based on GWT (instead of JavaScript) will allow a much more powerful application that is easier to maintain and extend. Depending on the rate of progress, the student may be able to implement more sophisticated interface elements (e.g., a horizontally scrollable pane for viewing long sequence alignments), or to provide interfaces to external phylogenetic tools that carry out computations on the tree or data (e.g., reconstruct ancestral sequences that can be viewed via the web application).
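The "select (or exclude) a subset of data defined via a subtree" operation mentioned above boils down to collecting the tips below the clicked node and filtering the matrix rows accordingly. Here is a minimal sketch in Python (not GWT or Bio::NEXUS); the parent-to-children dictionary, the node labels, and the function names are hypothetical.

```python
def tips_below(children, node):
    """Return the tip labels in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:  # a node without children is a tip
        return [node]
    tips = []
    for kid in kids:
        tips.extend(tips_below(children, kid))
    return tips

def select_rows(matrix, children, node, exclude=False):
    """Keep (or, with exclude=True, drop) the rows for tips below `node`."""
    chosen = set(tips_below(children, node))
    return {
        taxon: row
        for taxon, row in matrix.items()
        if (taxon in chosen) != exclude
    }

# Toy tree ((A,B)n1,C)root and a matching character matrix.
children = {"root": ["n1", "C"], "n1": ["A", "B"]}
matrix = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}

print(select_rows(matrix, children, "n1"))
print(select_rows(matrix, children, "n1", exclude=True))
```

In a web application, the clicked node's identifier would be sent to the server, which would apply an operation like this and return the updated tree and matrix for redrawing.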
  1. Ability to build a dynamic, easy-to-use, and easy-to-maintain web application.
  2. Develop XML or JSON object models to exchange data and actions between GWT modules and Bio::NEXUS objects
  3. Understand the Bio::NEXUS and GWT APIs to efficiently design browser-based UIs and CGI queries.
Understanding of Object Oriented Programming concepts and programming skills in Java or Perl will be required.
  1. Easier project if student has Java GUI development experience.
  2. Easier project if student has GWT experience.
  3. Moderate project if student has only web development experience with Java.
  4. Moderate-Hard project if student has only Perl experience.
Involved Projects 
GWT (Google Web Toolkit), Bio::NEXUS, Nexplorer
Vivek Gopalan, Arlin Stoltzfus

Enhance the searching functionality of Phylr

Many online databases contain phylogenetic information, but to date they have not been interoperable. PhyloWS is an emerging standard for interacting with phylogenetic data via web services. At the recent Database Interoperability Hackathon, a group of participants created Phylr, an initial implementation of the SRU search functionality in PhyloWS. Phylr is a general toolkit that allows searching over phylogenetic data stored in the Lucene indexing system. It can be easily extended to work with other indexing systems. Much work remains to make Phylr fully useful, including modifications to make outputs fully conform to the PhyloWS specification, an adapter to allow searching over content in relational databases, and implementation of related PhyloWS features.
Enhance the Java code that translates between the core system and a Lucene index, making it fully conform to the PhyloWS specification. Modify the XSL stylesheets to present a more intuitive interface to the users. Write a new Java adapter that translates between the core system and a relational database.
It is unclear how flexible the database adapter must be to accommodate the target databases. As a first pass, the project may implement a simple adapter that assumes all content is available from a single table, but a more configurable adapter that generates complex SQL may be required for some databases.
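As a rough illustration of the "single table" first pass described above, here is a minimal sketch. It is written in Python with the standard sqlite3 module purely for brevity (the actual adapter would be Java), and the table name `trees`, its columns, and the function `search_single_table` are hypothetical, not part of Phylr or the PhyloWS specification.

```python
import sqlite3

# Hypothetical schema: a single table "trees" holds every searchable field.
SEARCHABLE_COLUMNS = {"label", "taxa", "newick"}

def search_single_table(conn, column, term):
    """Return (id, label) rows whose `column` contains `term`.

    SQLite's LIKE is case-insensitive for ASCII text by default.
    """
    if column not in SEARCHABLE_COLUMNS:
        raise ValueError(f"unsupported search index: {column}")
    # Column names cannot be bound as parameters, so the column is checked
    # against a whitelist above; the search term itself is bound safely.
    sql = f"SELECT id, label FROM trees WHERE {column} LIKE ? ORDER BY id"
    return conn.execute(sql, (f"%{term}%",)).fetchall()
```

A more configurable adapter would replace the fixed table and column set with a mapping from search indexes to joins, which is where the complex SQL generation mentioned above would come in.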
Involved toolkits or projects 
Phylr, OCLC's SRU server and Lucene adapter
Degree of difficulty and needed skills 
Medium. Requires familiarity with Java and relational databases.
Ryan Scherle and Lucie Chan

Convert tree file to 3D physical object

image of 3D tree on screen and 3D physical object (Clematis from Shapeways Gallery)
Depiction of phylogenetic trees is a growing problem in biology (see article in NY Times). There are dozens of approaches. Several authors have pursued ways to show trees in 3D (e.g., Sanderson's paloverde, Kim & Lee's 3DPE), but, while useful, these suffer from limited monitor resolution and the lack of true stereoscopic displays by most potential users. At the same time, computer fabrication of 3D objects has become rapidly cheaper and more accessible, with complex objects costing $10 or less and various options available (Shapeways, RepRap, $4,995 desktop printer). 3D printing may avoid some of the resolution issues associated with display on monitors. It also allows the possibility of representing more information in a tree: z-axis for time, x and y-axes for values of two different continuous characters reconstructed on the tree, branch diameters being proportional to species diversity, and so forth. Finally, and most importantly, there are immense broader impacts possible: imagine students around the world touching and learning from physical representations of the complex history of life costing just $15 per classroom and created with the software you write this summer.
There are at least two possible approaches. The first is to take existing 2D tree software and extrude and export solid shapes from the graphics created from those, modifying the output so that one object is created (making sure labels are attached to the tree, for example). A more interesting approach is to either modify existing 3D software or write your own (perhaps starting with R and the RGL library or Processing and the superCAD library) to take a tree file (in Newick and perhaps other formats) and export it to a 3D file. The output from either approach should be a file in a standard CAD format (STL, Collada, X3D, etc.) suitable for uploading and printing, with no errors, to a site such as Shapeways.
  • Conversion of string to 3D object
  • Design decisions about means to include taxon or other labels on the tree
  • Assurance of compliance with chosen output format regardless of (valid) input tree
  • Design decisions about ways to draw trees (traditional square or triangle tree, bushlike, disc, cone, etc.)
  • Allowing users to adjust parameters of output (size of final object, presence of time scale, etc.)
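To give a feel for the "conversion of string to 3D object" challenge, here is a minimal sketch in Python (chosen only for illustration; it is not one of the languages the mentor lists). It parses a simple Newick string (no quoted labels or comments), lays the tree out as a traditional square tree, and writes one rectangular bar per branch to ASCII STL. All function names are hypothetical. The bars overlap rather than forming a single merged solid, so a real tool would still need to union the shapes, attach labels, and validate the result before printing.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos:pos + 1] == "(":
            pos += 1  # consume "("
            children.append(node())
            while text[pos:pos + 1] == ",":
                pos += 1
                children.append(node())
            if text[pos:pos + 1] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name, length = text[start:pos], 0.0
        if text[pos:pos + 1] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

def layout(root):
    """Set x to cumulative branch length and y to tip order (mean of children)."""
    next_tip = 0

    def place(n, x0):
        nonlocal next_tip
        n["x"] = x0 + n["length"]
        if n["children"]:
            for c in n["children"]:
                place(c, n["x"])
            n["y"] = sum(c["y"] for c in n["children"]) / len(n["children"])
        else:
            n["y"] = float(next_tip)
            next_tip += 1

    place(root, 0.0)

def branch_boxes(n, parent_x, r):
    """Yield (min corner, max corner): one bar per branch, one connector per internal node."""
    yield (parent_x - r, n["y"] - r, 0.0), (n["x"] + r, n["y"] + r, 2 * r)
    if n["children"]:
        ys = [c["y"] for c in n["children"]]
        yield (n["x"] - r, min(ys) - r, 0.0), (n["x"] + r, max(ys) + r, 2 * r)
        for c in n["children"]:
            yield from branch_boxes(c, n["x"], r)

# Outward normal plus two triangles per face; vertices are listed
# counter-clockwise when viewed from outside the box.
FACES = [((0, 0, -1), (0, 3, 2), (0, 2, 1)), ((0, 0, 1), (4, 5, 6), (4, 6, 7)),
         ((0, -1, 0), (0, 1, 5), (0, 5, 4)), ((0, 1, 0), (3, 7, 6), (3, 6, 2)),
         ((-1, 0, 0), (0, 4, 7), (0, 7, 3)), ((1, 0, 0), (1, 2, 6), (1, 6, 5))]

def box_facets(lo, hi):
    """Yield (normal, triangle) pairs for an axis-aligned box."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    v = [(x0, y0, z0), (x1, y0, z0), (x1, y1, z0), (x0, y1, z0),
         (x0, y0, z1), (x1, y0, z1), (x1, y1, z1), (x0, y1, z1)]
    for normal, *triangles in FACES:
        for tri in triangles:
            yield normal, [v[i] for i in tri]

def newick_to_stl(newick, radius=0.1):
    """Return ASCII STL text for a square-tree rendering of a Newick string."""
    root = parse_newick("".join(newick.split()))
    layout(root)
    out = ["solid tree"]
    for lo, hi in branch_boxes(root, 0.0, radius):
        for normal, tri in box_facets(lo, hi):
            out.append("facet normal %g %g %g" % normal)
            out.append("  outer loop")
            out += ["    vertex %.4f %.4f %.4f" % p for p in tri]
            out += ["  endloop", "endfacet"]
    out.append("endsolid tree")
    return "\n".join(out) + "\n"
```

Calling `newick_to_stl("((A:1,B:1):1,C:2);")` returns text that can be saved with an `.stl` extension. The `radius` argument is one example of the user-adjustable output parameters mentioned above.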
Degree of difficulty and needed skills 
  • Medium
  • good knowledge of at least one CAD format
  • experience with 3D modeling (ideally at a low level, not just manipulating objects in SketchUp)
  • knowledge of relevant language (I can mentor in R, C, C++, Processing, PHP, and Perl, but not in Java or Python, though contact me if interested. For example, Python scripting for Blender might allow easy creation of 3D trees in Python while allowing users to then modify the trees in a friendly GUI (thanks to Kevin Lam of A*STAR Singapore for the suggestion))
Brian O'Meara

BioPerl integration of the NeXML exchange standard + Bio::Phylo toolkit

Note that the Bio::Phylo code links have been updated to reflect the most recent release. --Majensen 16:47, 2 April 2009 (EDT)
NeXML is an emerging XML standard for the serialization and exchange of phylogenetic information. In Perl, the Bio::Phylo toolkit is the preferred parser/writer interface for NeXML. While Bio::Phylo contains methods that will operate on BioPerl objects [such as sequences (Bio::Seq), alignments (Bio::SimpleAlign), or trees (Bio::Tree)], a set of methods to wrap Bio::Phylo functionality into BioPerl in a systematic and updateable way would lower barriers to broader use of this useful standard.
We would like to explore a couple of ways to form the linkage between BioPerl and Bio::Phylo, while still maintaining Bio::Phylo's independence as a module. Since it is part of the implementation side of a rapidly evolving standard, it is more mutable than the average BioPerl module, and should be more nimble. One method would be to implement a thin BioPerl wrapper around Bio::Phylo that allows BioPerl objects to be passed easily in and out, and maintains a stable BioPerl-compliant API, hiding Bio::Phylo API changes. However, since this project is exploratory, we could also prototype a version of Bio::Phylo that is directly implemented as a BioPerl module. We would also develop appropriate usage tests, test data sets, target audience use cases, benchmarks and profiles to compare the approaches we come up with.
  • Designing a relatively stable wrapper around a relatively dynamic module;
  • Designing tests that cover important use case scenarios meaningful to BioPerl users;
  • Identifying and interfacing Bio::Phylo output and NeXML-serialized data with up- and downstream BioPerl operations; e.g., adding a Bio::SeqIO::nexml module for doing BioPerl-native NeXML IO.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Easy to medium difficulty.
  • Perl fluency required;
  • Experience with object-oriented Perl very helpful;
  • Experience with biological data (sequences, sequence alignments, phylogenetic trees) a plus;
  • Experience with BioPerl itself will flatten the learning curve.
Mark Jensen

Visualization Tool for Protein 3D Structure Evolution

The evolution of protein structures (as opposed to DNA or protein sequences) is an important, but relatively poorly studied subject. To address visualization needs in this research area, we propose to develop a visualization tool that overlays images of phylogenetic trees with images of 3D protein structures. An example of a phylogenetic tree overlaid with protein domain architectures can be seen here.
To aid the study of 3D protein structure evolution, this project aims to extend the Archaeopteryx tree explorer (the successor to ATV) in such a way that it will be capable of: (i) displaying static images of protein structures for each node of the displayed tree which has a 3D structure associated with it, and (ii) by selecting/clicking such a static image, an open source Java-based structural viewer (such as Jmol) will be employed to display the protein structure dynamically. Ideally, changing the view of a structure in the dynamic viewer (e.g. Jmol) will update the corresponding static image, giving the user a way to view all structures in the same way. To describe and store phylogenetic trees overlaid with protein structures, the phyloXML format will be used.
Evaluation of a suitable 3D structure viewer. Need to understand parts of the Archaeopteryx tree explorer source code, as well as of the chosen 3D structure viewer.
Involved toolkits or projects 
Degree of difficulty and needed skills 
  • Proficiency in the Java language, as well as proven experience with the Java Swing and 2D Graphics APIs is absolutely mandatory.
  • Experience in structural biology is critical.
  • Experience with Jmol is a big plus.
Christian Zmasek (and anyone else who wants to help)

Adding metadata support to phylogenetic data representations in R

One result of the recent NESCent Hackathon on Comparative Methods in R has been the development of the phylobase package, which seeks to provide a set of S4 classes and methods for representing and manipulating phylogenetic trees and associated data in R. Phylobase contains structures for representing phylogenetic trees and associated data, but methods for representation of metadata and multiple trees, and interfaces with other data formats (e.g., Nexus, NeXML) remain incomplete.
The methods for tree/data manipulation and import in phylobase are currently a mixture of S3 and S4 methods and C/C++ extensions. The goal for this project will be to implement metadata support for trees and associated data using object-oriented S4 classes and methods. It would also be necessary to extend/rewrite the interface with the Nexus Class Library which phylobase uses for reading trees in the Nexus format. A final part of this project could be writing and implementing a multi-tree data format for phylobase. Nexus files and the Nexus Class Library both support multi-tree files, but phylobase has only a rudimentary implementation.
The general challenge for this project will be to identify and implement data structures for metadata that accommodate current metadata needs yet are flexible enough for future use. Additional challenges involve interfacing R and the C++ Nexus Class Library, and will likely require programming skills not only in R, but also in C/C++ and make.
Involved toolkits or projects 
R, phylobase, Nexus Class Library, make.
Degree of difficulty and needed skills 
Easy to medium. Need knowledge of R and S4 classes, C++, and make
Peter Cowan, Steve Kembel

Enhancing the PhyloCode Registry with PhyloWS Querying

The PhyloCode is a set of rules governing clade-based taxonomy. A prototype database called RegNum has been developed to serve as a registry of PhyloCode definitions, but to date it does not include the ability to translate these definitions into queries that can locate clades in sources of trees. To aid in the resolving of PhyloCode names on locally and remotely stored trees, PhyloCode definitions need to be translated into topological queries over the PhyloWS API protocol.
Given that RegNum already exists as a fairly stable basic registry database, we suggest that the student take this product and enhance it with the capability to translate PhyloCode definitions into PhyloWS queries, and using this syntax, interface with BioSQL. We suggest a principal use case: to allow RegNum to PUT user-supplied trees into an instantiation of BioSQL, as well as GET specific clades from BioSQL, or find any clades retrieved by a PhyloCode definition.
Building on the existing RegNum codebase means becoming familiar with Hibernate and Cocoon. Gathering name-string intelligence from name services is non-trivial, as these services are in flux. Translation of PhyloCode definitions into topological queries is fairly straightforward except where specifiers involve synapomorphies, which will require an imaginative solution. Finally, a RESTful service that speaks PhyloWS needs to be built on top of BioSQL with trees returned in NeXML.
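For the straightforward case, a node-based definition ("the least inclusive clade containing A and B") resolves to the most recent common ancestor of its specifiers. The following is a minimal sketch of that topological query, in Python for brevity, over a tree held in memory as a child-to-parent dictionary. The function names and the example tree are hypothetical. It does not show PhyloWS query syntax or the BioSQL schema, and it does not address apomorphy-based specifiers.

```python
def ancestors(parent, node):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def node_based_clade(parent, specifiers):
    """Resolve a node-based definition: the most recent common ancestor
    of all specifiers, i.e. the root of the least inclusive clade."""
    paths = [ancestors(parent, s) for s in specifiers]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first specifier, the first shared node is the MRCA.
    return next(n for n in paths[0] if n in common)

def clade_members(parent, ancestor):
    """List the tips that descend from `ancestor`."""
    tips = set(parent) - set(parent.values())
    return sorted(t for t in tips if ancestor in ancestors(parent, t))

# Hypothetical example tree: (((Homo,Pan)n1,Gorilla)n2,Pongo)root
EXAMPLE_TREE = {"Homo": "n1", "Pan": "n1", "n1": "n2",
                "Gorilla": "n2", "n2": "root", "Pongo": "root"}
```

With this example tree, `node_based_clade(EXAMPLE_TREE, ["Homo", "Gorilla"])` resolves to `n2`, and `clade_members(EXAMPLE_TREE, "n2")` lists Gorilla, Homo, and Pan.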
Degree of difficulty and needed skills 
Medium. This project will involve programming in Java for RegNum and a suitable language (e.g., Perl) for building RESTful services on BioSQL.
Involved toolkits or projects 
RegNum can be used as a core registry database and has excellent documentation; the BioSQL PhyloDB extension can be adapted to RegNum for indexing tree topologies. Web services from uBIO can be used for name-string intelligence. Documentation and libraries are available for PhyloWS and NeXML.
William Piel, Hilmar Lapp

Mapping the Bio++ Phylogenetics toolkit to BioPerl, R/BioConductor and BioJAVA using BioLib

Bio++ is a library for reusable phylogenetics routines written in C++ for performance, including models for phylogenetic inference. This functionality is of great interest to many scientists working in molecular biology with application in biology and biomedical research. Unfortunately, few bioinformaticians work in C/C++. Therefore language bindings to popular scripting languages are important for general usage. To be fully effective, support should be made available for, at least, Perl, R and JAVA. These three together probably represent over 90% of bioinformaticians. The BioLib project successfully provides the 'mapping' infrastructure to map complex libraries against many computer languages using SWIG (the Simplified Wrapper and Interface Generator). Basically one mapping suffices to support all popular languages. Apart from improving the accessibility of Bio++ to many developers, this mapping encourages sharing source code, thereby preventing duplication of effort in the different scripting language domains.
Special interfaces need to be developed to map Bio++ libraries against Perl initially. Once the OOP model has been mapped against Perl, mapping against Ruby and Python is trivial. However, at this point, BioLib support for R and JAVA needs to be developed. A proof-of-concept to map against one of these languages can be part of this project. Finally, SWIG mappings can be used to create automated documentation and testing of BioLib code.
The main challenge is to provide nice and consistent interfaces in high-level languages against the Bio++ library. This requires OOP design and unit testing of existing functionality. Also some SWIG hacking may be involved to provide decent mappings for R and JAVA, as well as SWIG auto generated documentation and testing.
Involved toolkits or projects 
Bio++, BioLib, BioPerl, SWIG (and optionally BioRuby, R/Bioconductor, BioJAVA or BioPython)
Degree of difficulty and needed skills 
This is a challenging project as it crosses computer languages. It requires experience in C++, OOP and a wish for deeper understanding of at least one high-level language like Perl, Python, JAVA, R or Ruby. The student will gain both an understanding of the design of a good phylogenetics toolkit and will learn to easily adapt existing C/C++ libraries to his/her favorite scripting language.
Pjotr Prins, Chris Fields

Building out BioPerl PopGen::Simulation modules for infectious disease

Many important infectious organisms, like Staphylococcus aureus and HIV, develop genetic variation so quickly that their evolution affects the nature of the disease they cause from transmission to transmission, or even the pathogenesis of the disease within a single infected individual. Traditional techniques of phylogenetic analysis have difficulty dealing with evolution that takes place on the time scale of observation. In the past ten years, population genetic methods to analyze these so-called measurably evolving populations have been developed and used very successfully to understand the influence of pathogen evolution on pathogenesis, and to understand and predict the dynamics of infectious disease epidemics. The serial coalescent has provided the basis for many simulations in this effort (see examples here and here), as have versions of the coalescent that are able to incorporate more general models of the underlying populations (another example). A set of tools added to the population simulation modules of BioPerl that implement a few of these modern approaches to the molecular evolution of pathogens will help make these developments accessible to a wider community of researchers.
This project has a couple of components. Either or both could be tackled as time and interest allows.
  • Add expanded coalescent capability
This would entail at least
  • adding the serial coalescent, and
  • incorporating coalescent simulation under exponential growth
These would be "modeled" in a pure-Perl implementation first, with the help of Jonathan Leto's Math::GSL interface to the GNU Scientific Library. The ultimate goal would be to employ bindings to coalescent C libraries; see the BioLib/libsequence idea below.
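To make the exponential-growth component concrete, here is a minimal sketch of drawing coalescence times, written in Python for illustration only (the project itself targets pure Perl and, later, C library bindings). The function name and parameters are hypothetical. It assumes a population that shrinks backwards in time as N(t) = N0·exp(-beta·t), with time measured in coalescent units of the present-day size, and uses the standard time rescaling for a variable-size population. It does not cover the serial coalescent (samples taken at different times).

```python
import math
import random

def coalescent_times(n, beta=0.0, rng=None):
    """Draw the n-1 successive coalescence times for a sample of n lineages.

    beta is the exponential growth rate per coalescent time unit;
    beta == 0 gives the standard constant-size coalescent.
    """
    rng = rng or random.Random()
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        # With k lineages, the constant-size waiting time is exponential
        # with rate k*(k-1)/2.
        u = rng.expovariate(k * (k - 1) / 2.0)
        if beta == 0.0:
            t += u
        else:
            # Rescale: solve (exp(beta*t_new) - exp(beta*t)) / beta = u.
            t = math.log(math.exp(beta * t) + beta * u) / beta
        times.append(t)
    return times
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which helps when checking that a pure-Perl implementation and a library-bound implementation agree.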
  • Expand and reorient the PopGen::Simulation API
We want to give serious thought to developing a more general API for coalescent and other simulation implementations. This will likely involve:
  • Developing new modules that rationally collate simulated population gene samples and associate these with basic evolutionary sample statistics (such as Watterson's theta and Nei's pi);
  • Engineering output methods using standard BioPerl IO for these data objects, e.g., writers that format the data for downstream analysis by external programs. This would include wrappers for output in standard sequence formats, as well as input format generators for specialty analysis programs such as HyPHY. Much of the machinery for this currently exists in BioPerl; the work will be in organizing simulated data in a natural and rational way.
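As a reference point for the sample statistics named above, the following minimal sketch computes Watterson's theta and Nei's pi from a set of aligned sequences. It is in Python for illustration only. The function names are hypothetical and do not correspond to existing BioPerl methods. Both statistics are reported per locus rather than per site, and gaps or ambiguity codes are treated as ordinary characters.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns that contain more than one character state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def wattersons_theta(seqs):
    """Watterson's estimator: S divided by the harmonic number a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n

def nucleotide_diversity(seqs):
    """Nei's pi as the mean number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)
```

For the four sequences `AAAA`, `AAAT`, `AATT`, `AATT` there are 2 segregating sites, so theta is 2 / (1 + 1/2 + 1/3) = 12/11, and the six pairwise comparisons differ at 7 positions in total, so pi is 7/6.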

  • Grokking the idea of the coalescent;
  • Developing a flexible API for extensions of the simple coalescent idea that will naturally support future development;
  • Creating a seamless BioPerl interface between simulated data and downstream BioPerl-y analyses;
  • Writing good tests based on realistic use cases;
  • Ensuring that pure-Perl and library-bound implementations are always concordant in output;
  • Don't forget the POD!
Involved toolkits or projects 
Degree of difficulty and needed skills 
Medium to hard.
  • Perl fluency required, experience with object-oriented Perl very helpful;
  • Basic population genetics, such as that provided in an undergrad evolution course, is necessary;
  • Strong mathematics background is important;
  • Familiarity with the coalescent a plus, but this will be part of the homework (see Michael Cummings' (UMD) beautiful lecture slides);
  • Experience with biological data (sequences, sequence alignments, phylogenetic trees) a plus;
  • Experience with BioPerl itself will flatten the learning curve.
  • Interest in infectious disease, particularly HIV, will provide a larger perspective and increase motivation
Mark Jensen

A BioLib mapping for the libsequence population genetics libraries

Kevin Thornton's libsequence is an extensive C library of routines for modern population genetic simulation. The code was written by some of the best in the business (Richard Hudson, Jeff Wall, KT himself, and others), and includes (in particular) fast and sophisticated algorithms for coalescent simulations. BioLib, on the other hand, was founded to make useful routines like these available to researchers working in high-level languages like Perl and Python, by creating easy-to-use mappings to C and C++ libraries. The purpose of this project is to create BioLib mappings for libsequence routines, perhaps with an eye towards testing them in code developed in the BioPerl PopGen::Simulation project above.
BioLib bindings are specified by means of the SWIG (Simplified Wrapper and Interface Generator) toolkit. (SWIG is also a participant in the Summer of Code; see their announcement). The tutorial gives a nice idea of how it works.
Once the bindings are in place, we'll want to test them thoroughly. The mentor is a BioPerl devotee, but bindings to other languages should be created and tested as well. One desirable is to create test programs that are useful in their own right. The PopGen::Simulation project could provide this, but any useful simulation/analysis problem would be fair game.
  • Tackling the inevitable glitches that arise in an interoperability problem. This may involve digging into libsequence internals, but this is hard to predict;
  • Going after complete wrapper coverage of libsequence;
  • Writing a test suite that helps us with the previous challenge;
  • Identifying and specifying good use cases to program towards.
Degree of difficulty and needed skills 
  • C fluency required;
  • Fluency in a "high-level" language required; Perl fits best with the mentor (but I hope to learn something new too, of course);
  • Basic understanding of population genetics very helpful, but not absolutely required;
  • Experience with interfaces such as SWIG or XS a plus.
Mark Jensen and Pjotr Prins

Semantic Interoperability between Data Formats using CDAO

The evolutionary informatics working group at NESCent has developed an ontology, called the Comparative Data Analysis Ontology (CDAO), aimed at providing a controlled vocabulary and structure to knowledge commonly used in the field of phylogenetic inference and comparative data analysis. The specification of CDAO has been formalized in OWL using the Protege ontology editor.
One of the frequent uses of ontologies is to enhance inter-operation between applications -- an ontology can provide an elegant bridge between data formats, enabling the interconversion between them.
This project aims at demonstrating the use of CDAO as an inter-operation mechanism across data formats commonly used in the phylogenetic analysis applications.
The project aims at developing a web service (e.g., using a REST protocol) that can provide interconversion between a number of data formats, relying on the mapping of each data format to CDAO. An input request will indicate a file to convert, the input data format, and the requested output data format; the output of the service will be the file expressed in the requested data format. The interconversion is realized by mapping the input file to instances of CDAO, and in turn the instances of CDAO are mapped to the output data format.
Preliminary experiments could involve formats like MEGA, Phylip, and ClustalW.
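The design choice behind this approach is that each format needs only one reader and one writer against the shared model, rather than a dedicated converter for every pair of formats. The following minimal sketch shows that shape in Python, for illustration only. A plain list of (name, sequence) pairs stands in for CDAO instances, which in the real project would be OWL/RDF individuals. The function names are hypothetical, the PHYLIP reader handles only the simple sequential layout, and the MEGA writer emits a simplified MEGA-style file that should be checked against the actual format documentation.

```python
def read_phylip(text):
    """Parse sequential PHYLIP into a list of (name, sequence) pairs."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split())
    records = []
    for ln in lines[1:1 + ntax]:
        name, seq = ln.split(None, 1)
        seq = seq.replace(" ", "")
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(seq)}")
        records.append((name, seq))
    return records

def write_phylip(records):
    """Write (name, sequence) pairs as sequential PHYLIP."""
    header = f"{len(records)} {len(records[0][1])}"
    return "\n".join([header] + [f"{name} {seq}" for name, seq in records]) + "\n"

def write_mega(records, title="Converted alignment"):
    """Write (name, sequence) pairs in a simplified MEGA-style layout."""
    lines = ["#mega", f"!Title {title};", ""]
    for name, seq in records:
        lines += [f"#{name}", seq]
    return "\n".join(lines) + "\n"

# One reader and one writer per format, all meeting at the shared model.
READERS = {"phylip": read_phylip}
WRITERS = {"phylip": write_phylip, "mega": write_mega}

def convert(text, source, target):
    """Convert between formats by way of the intermediate representation."""
    return WRITERS[target](READERS[source](text))
```

A REST service would wrap `convert`, taking the file and the two format names from the request and returning the converted text.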
  • Understanding how to parse the various formats to/from CDAO.
  • Learning about OWL and RDF.
  • Learning about how to provide a REST web service.
Involved toolkits or projects 
CDAO, web service toolkits, XSLT and/or Perl
Degree of difficulty and needed skills 
Medium. The successful applicant needs some background in XML processing, parsing technologies, and possibly web services. Knowledge of XSLT and/or Perl is important. Interest in learning about semantic web technologies would be helpful.
Enrico Pontelli, Brandon Chisham

Cyberinfrastructure Summer Traineeships for Data Repository Interoperability

In collaboration with a consortium of data and metadata repositories for biodiversity, earth and environmental science, ecological, and evolutionary data, we offer a summer traineeship program for students and postdocs interested in informatics as applied to creating an interoperating network of centers ("nodes") that enable discovery and open access to data in any of its member nodes. This summer internship program is inspired by and closely modeled after the Google Summer of Code™ program; however, it is neither part of the Google Summer of Code, nor endorsed by Google or any of its affiliates, and some of the rules differ. See the Cyberinfrastructure Summer Traineeships 2009 program page for more information.


What should prospective students know?

Before you apply

  • Look at our project ideas to get an overall sense for the kinds of projects we support.
  • Pick the idea that appeals most to you in terms of goals, context, and required skills, or you can apply with your own project idea.
  • If you want to apply with your own idea, contact us early on so we can try to find a mentor. Alternatively, you can suggest a candidate mentor to us if you know one.
  • Ask us questions about the project idea you have in mind.
  • Write a project proposal draft, include a project plan (see below), and bounce those off of us.

Have I mentioned yet that you should be in touch with us before you apply?

When you apply

When applying, (aside from the information requested by Google) please provide the following in your application material.

  1. Your interests, what makes you excited.
  2. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  3. A summary of your programming experience and skills
  4. Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced skills.
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing (see below). Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.

Get in touch with us

Please send any questions you have and ideas and work plans for projects you would like to propose to This will reach all mentors and our organization administrators. Alternatively, you can reach the (slightly larger) community comprising not only current but also previous mentors, students, hackathon participants, and other phyloinformatics software developers at

  • We strongly recommend you do this even if you want to work directly on one of our project ideas above. It gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics.
  • Also, you can bounce your project plan off of us. In the past those students who engaged with us before submitting their application invariably had vastly better applications and project plans.
  • Keep in mind that we are most interested in students who give us evidence that they already have, or might develop, a sustained interest in becoming future contributors to our open source projects. How do you convey that if the first time we see something from you is when we score your application?
  • The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily. If you can't find the time to engage in communication with your prospective organization before you have been accepted, how will you argue that you will be a good communicator once accepted?

Other information

Reference Facts & Links: Google Summer of Code 2009

  • Google has accepted 150 mentoring organizations out of nearly 400 that applied between March 9-13, 2009.
  • Students apply online between March 23-April 3, 2009. The eligibility requirements for students are in the GSoC FAQ. Note that the student application period ended April 3, and accepted students have been announced by Google on April 20.
  • After close of student applications, the mentors of the organizations score the applications, and Google allocates slots to each organization. The top scoring N students for each organization, N being the number of slots allocated, get accepted and announced on April 20, 2009.
  • The coding period is from May 23 to August 17, 2009. Between acceptance and the start of coding, students are expected to acquaint themselves with the project, the code bases involved, and the respective developer community, as well as set up everything needed for developing and committing (code repository access, mailing list subscription, software and dependencies installation, compiling, etc).
  • Development occurs on-line; there is no requirement or expectation to travel, neither for students nor for mentors.