Phyloinformatics Summer of Code 2010

We were accepted as a mentoring organization for the 2010 Google Summer of Code (GSoC) program for the fourth time in a row.

On this page we first collected ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off of. You can now also find the projects that were accepted, with brief summaries of the goals, the student and mentor for each project, and the locations of the project page, the source code repository, and, if the student kept one, the development blog. After completion of the program in August 2010, the students wrote up a short summary for each project.

For information on previous years, please see our ideas pages and accepted projects for our 2007 (summaries), 2008 (summaries), and 2009 (summaries) program participations.




Our organization administrators are Hilmar Lapp and Todd J. Vision.

If you are a student interested in applying for a Google Summer of Code project in our organization, please send any questions you have, projects you would like to propose, etc. to This will reach all mentors and our Summer of Code organization administrators.

During the student application period we will hang out regularly on IRC at least on weekdays during working hours (EDT) in #phylosoc on Freenode. You're welcome to join us at any time, though we may not be on-line outside of those times. Alternatively, there's always email. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. Freenode also has a web-chat interface (enter #phylosoc for the channel) that should work from any browser. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

Before you apply, please read our guidelines and expectations. We don't have an application format that you need to adhere to, but we do ask that you include specific kinds of information (see "When you apply").


Our aim is to cultivate the next generation of open-source software developers in evolutionary biology and promote the open, collaborative development of reusable, interoperable, standards-supporting informatics tools. We aim to increase the awareness and understanding of the value of open-source and collaborative development, and also impart the programming and remote collaboration skills needed to successfully contribute to such projects.

The disciplinary focus for our participation is phyloinformatics and evolutionary biology, and our community of mentors are leaders in this field. The code that students write will enable new and increasingly complex scientific questions to be answered about the patterns and processes underlying the history of life. Work on projects along the lines of our suggested project ideas is bound to make an impact. The range of problems to which students can make meaningful contributions is diverse, enabling us to accommodate students with varying interests. Similarly, the project ideas cover a variety of programming languages and skill levels.

Accepted Projects

Note that the accepted projects, students, and their mentors are also published by Google.

Ancestral State Reconstruction in R

This is an application for the Do ancestral state reconstruction in R by porting Brownie (C++) (R/C/C++) project idea.

Abstract: The main goal of this project will be the integration of important phylogenetic tools for ancestral state reconstruction and rate testing (via Brownie) into a commonly used scripting environment (R). R is a cross-platform scripting environment for which there are currently external packages to read in and manipulate genetic and phylogenetic data. R is an ideal place to port the functions contained in Brownie due to its robust linkage to C/C++ (innately and through external packages) and because of its potential to act as a staging area for all steps in phylogenetic data analysis, from reading in genetic data to running analyses to plotting the results.

Student: Conrad Stack

Mentor(s): Brian O'Meara (primary), Luke Harmon

Project Homepage: PhyloSoC:Ancestral State Reconstruction in R

Project blog: Check here for updates

Source Code: Brownie - googlecode

Extending Jalview Capabilities to Support RNA Sequence Alignment Annotation and Secondary Structure Visualization

This is an application for the Google maps-like multi-genome browsing in Jalview project idea.

Abstract: The overall goal of this project is to extend many of the useful features Jalview has for protein sequence alignments to support RNA sequence analysis. I will add support for RNA secondary structure alignment annotation, embed an RNA secondary structure viewer, and add the ability to import existing RNA sequences and alignments from the Rfam database.

Student: Lauren Lui

Mentor(s): Jim Procter (primary), Albert Vilella

Project Homepage: PhyloSoC:Extending Jalview to Support RNA Alignment Annotation and Secondary Structure Visualization

Project blog: GSOC 2010 - Adding RNA Support to Jalview

Source Code: Hosted by Google

Georeferencing Library Implemented in Java

This is an application for the GeoPhyloBuilder in Java (GeoPhyloBuilder4J) and Google Earth Browser and Query Interface for TreeBASE project ideas.

Abstract: Java library for mapping phylogenetic trees and geographical information in KML or shapefile. Given an input of a phylogenetic tree file and geographical information, the library will return a KML or shapefile with the phylogenetic tree mapped in 3D space.

Student: Kathryn Iverson

Mentor(s): David Kidd (primary), Karen Cranston, Xianhua Liu, Bill Piel

Project Homepage: phyloGeoRef

Project blog:

Source Code:

Galaxy phylogenetics pipeline development

This is an application for the Galaxy phylogenetics pipeline development project idea.

Abstract: The goal of this project is to integrate HyPhy sequence analysis tools into Galaxy.

Student: Filip Balejko

Mentor(s): Sergei Kosakovsky Pond (primary), Brad Chapman, Anton Nekrutenko

Project Homepage: PhyloSoC:Galaxy phylogenetics pipeline development

Project blog:

Source Code:

Develop an API for NeXML I/O, and, RDF triples for BioRuby

This is an application for the Evolution and the semantic web in Ruby: NeXML I/O for BioRuby project idea.

Abstract: Add NeXML parsing and serializing support, and an RDF triples API to BioRuby.

Student: Anurag Priyam

Mentor(s): Rutger Vos (primary), Jan Aerts

Project Homepage: Develop an API for NeXML and RDF triples for BioRuby

Project blog: My Weblog (phylosoc label)

Source Code: Github


Note: if there is more than one mentor for a project, the primary mentor is in bold font. Additional information about the mentors is linked to in the Mentors section.

Students: we have listed only our own project ideas, albeit well thought-out ones. You are welcome to propose your own project provided it still falls within the scope of the aims described above. If we like your proposal, we will try to find an appropriate mentor. Better yet, propose a mentor yourself! Regardless of what you decide to do, make sure you read and follow the guidelines for students below.

Evolution and the semantic web in Ruby: NeXML I/O for BioRuby

This project will be an outstanding opportunity to contribute to the next generation of the internet, as described here by Tim Berners-Lee, inventor of the World Wide Web.

NeXML is a data format for evolutionary informatics that can readily be transformed to RDF/XML and JSON. The format allows for semantic annotation with predicates and objects from any namespace, thereby providing unparalleled flexibility and extensibility of evolutionary data on the semantic web. BioRuby is a fast-growing toolkit for bioinformatics currently lacking NeXML I/O, and, more generally, an API for triples. This project will remedy both, providing not just the immediate - but narrow - advantage of read and write capability of this new data format, but also a more generally applicable framework for dealing with RDF triples in BioRuby.
The successful applicant will design a set of parser/serializer classes similar to Bio::Nexus and Bio::PhyloXML, and a rich API for representing RDF triples (subject, predicate, object).
  • Firstly, the successful applicant will need to understand NeXML file structure well enough to process and serialize it correctly. This will most likely be an iterative process, making the simplest objects work first, then expanding compatibility to cover the full specification (as far as is possible in the time available).
  • Secondly, the successful applicant will need to design an API for RDF triples that the rest of the BioRuby community is happy with. This will involve generating a draft early on in the project, submitting this as a request for comments to the appropriate mailing lists, then refining it as community feedback comes in.
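To make the (subject, predicate, object) idea concrete, here is a minimal sketch of an in-memory triple store with wildcard matching. It is written in Python purely for illustration; the project itself targets Ruby and BioRuby. The names `Triple`, `TripleStore`, `add` and `match`, and the example predicates, are hypothetical and do not come from BioRuby or any NeXML library.

```python
from typing import Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Hypothetical in-memory store; not part of BioRuby or NeXML."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        """Record one statement; duplicates are silently ignored."""
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self._triples:
            if subject is not None and triple.subject != subject:
                continue
            if predicate is not None and triple.predicate != predicate:
                continue
            if obj is not None and triple.obj != obj:
                continue
            yield triple


store = TripleStore()
store.add("tree1", "rdf:type", "nex:Tree")
store.add("otu1", "rdf:type", "nex:OTU")
store.add("tree1", "dc:creator", "A. Student")

# Everything known about tree1, in a stable order for printing.
for triple in sorted(store.match(subject="tree1")):
    print(triple)
```

A real API would also need namespaces, typed literals, and serialization to RDF/XML, which is where the community request for comments described above would come in.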
Involved toolkits or projects 
The BioRuby toolkit has much of the needed framework. The NeXML project provides a documented specification and test files.
Degree of difficulty and needed skills 
  • Ruby and XML experience is required,
  • BioRuby experience is preferred.
The following skills would be wonderful to have, but if no applicant has these coming in we will teach you as the project progresses:
  • understanding of RDF and the semantic web;
  • understanding of collaborative software development practices.
Rutger Vos (go-to person for NeXML, primary mentor), Jan Aerts (BioRuby guru).

Google maps-like multi-genome browsing in Jalview

The amount of genomic data has been increasing exponentially since the advent of next generation sequencing techniques. New visualization tools are needed to compare genomic data at multiple resolutions in a natural way. One way to do this would be to adopt user interaction principles from successful multi-resolution interfaces, such as Google Maps.
Most problems with Jalview's interface become obvious when working with large alignments. An initial examination of the UI issues should first be made, and some solutions proposed to improve the user's experience when working with large alignments (and trees). These solutions should then be prototyped using a snapshot of the current Jalview 2.X codebase so their effectiveness can be assessed by expert users.
  • (easy/medium) adapting Jalview's existing multi-window visualizations implemented in AWT and Swing
  • (hard) efficient multiscale rendering of very large multiple sequence alignments
  • (medium/hard) solving the memory and rendering issues that arise when interactively visualizing and editing very large alignments.
Degree of difficulty and needed skills 
Medium to hard. Java. Experience with AWT, Swing, or both. Some familiarity with low level file handling or, alternatively, experience with handling datasets with many thousands of sequences (e.g. with Picard).
Other Topics
There are plenty of other areas to work on in Jalview - topics include AJAX and DAS (extending the JalviewLite for working with DAS annotation servers), Phylogenetic visualization (wider tree format support and better interactive visualization) and extending support for RNA (linked secondary structure visualization, alignment and analysis services). Please contact the Mentors for further information.
Jim Procter and Albert Vilella

Biopython and PyCogent interoperability

PyCogent and Biopython are two widely used toolkits for performing computational biology and bioinformatics work in Python. The libraries have traditionally had different focuses, with Biopython focusing on sequence parsing and retrieval and PyCogent on evolutionary and phylogenetic processing. Both user communities would benefit from increased interoperability between the code bases, easing the development of complex workflows.
The student would focus on soliciting use case scenarios from developers and the larger communities associated with both projects, and use these as the basis for adding glue code and documentation to both libraries. Some use cases of immediate interest as a starting point are:
  • Allow round-trip conversion between Biopython and PyCogent core objects (sequence, alignment, tree, etc.).
  • Building workflows using Codon Usage analyses in PyCogent with clustering code in Biopython.
  • Connecting Biopython acquired sequences to PyCogent's alignment, phylogenetic tree preparation and tree visualization code.
  • Integrate Biopython's PhyloXML support, developed during GSoC 2009, with PyCogent.
  • Develop a standardised controller architecture for interrogation of genome databases by extending PyCogent's Ensembl code, including export to Biopython objects.
This project provides the student with a lot of freedom to create useful interoperability between two feature rich libraries. As opposed to projects which might require churning out more lines of code, the major challenge here will be defining useful APIs and interfaces for existing code. High level inventiveness and coding skill will be required for generating glue code; we feel library integration is an extremely beneficial skill. We also value clear use case based documentation to support the new interfaces.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Medium to Hard. At a minimum, the student will need to be highly competent in Python and become familiar with core objects in PyCogent and Biopython. Sub-projects will require additional expertise, for instance: familiarity with concepts in phylogenetics and genome biology; understanding SQL dialects.
Gavin Huttley, Rob Knight, Brad Chapman, Eric Talevich

Galaxy phylogenetics pipeline development

Galaxy is a popular web based interface for integrating biological tools and analysis pipelines. It is widely used by bench biologists for their analysis work, and by computational biologists for building interfaces to developed tools. HyPhy provides a popular package for molecular evolution and sequence statistical analysis, and the server provides web based workflows to perform a number of common tasks with HyPhy. This project bridges these two complementary projects by bringing HyPhy workflows into the Galaxy system, standardizing these analyses on a widely used platform.
The student would bring existing workflows from to Galaxy. The general approach would be to pick a workflow, wrap the relevant tools using Galaxy's XML tool definition language, and implement a shared pipeline with Galaxy's workflow system. Functional tests will be developed for tools and workflows, along with high level documentation for end users.
This project requires the student to become comfortable working in the existing Galaxy framework. This is a useful practical skill as Galaxy is widely used in the biological community. Similarly, the student should become familiar with the statistical evolutionary methods in HyPhy to feel comfortable wrapping and testing them in Galaxy. Since the tools would be widely used from the main Galaxy website and installed instances, we place a strong emphasis on students who feel comfortable building tests and examples that would ensure the developed workflows function as expected.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Medium to Hard. As envisioned, the project would involve implementing full phylogenetic pipelines with the Galaxy toolkits. This would require becoming familiar with the Galaxy tool integration framework as well as being comfortable with HyPhy tools and current pipelines. This would involve comfort with XML for developing the tool interfaces, and Python for integrating scripts and tests with Galaxy and HyPhy.
Sergei L Kosakovsky Pond, Brad Chapman, Anton Nekrutenko

Accessing R phylogenetic tools from Python

The R statistical language is a powerful open-source environment for statistical computation and visualization. Python serves as an excellent complement to R since it has a wide variety of available libraries to make data processing, analysis, and web presentation easier. The two can be smoothly interfaced using Rpy2, allowing programmers to leverage the best features of each language. Here we propose to build Rpy2 library components to help ease access to phylogenetic and biogeographical libraries in R.
Rpy2 contains higher level interfaces to popular R libraries. For instance, the ggplot2 interface allows Python users to access powerful plotting functionality in R with an intuitive API. Providing similar high level APIs for biological toolkits available in R would help expose these toolkits to a wider audience of Python programmers. A nice introduction to phylogenetic analysis in R is available from Rich Glor at the Bodega Bay Marine Lab wiki. Some examples of R libraries for which integration would be welcomed are:
  • ape (Analysis of Phylogenetics and Evolution) -- an interactive library environment for phylogenetic and evolutionary analyses
  • ade4 -- Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods
  • geiger -- Running macroevolutionary simulation, and estimating parameters related to diversification from comparative phylogenetic data.
  • picante -- R tools for integrating phylogenies and ecology
  • mefa -- multivariate data handling for ecological and biogeographical data
The student would have the opportunity to learn an available R toolkit, and then code in Python and R to make this available via an intuitive API. This will involve digging into the R code examples to discover the most useful parts for analysis, and then projecting this into a library that is intuitive to Python coders. Beyond the coding and design aspects, the student should feel comfortable writing up use case documentation to support the API and encourage its adoption.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Moderate. The project requires familiarity with coding in Python and R, and knowledge of phylogeny or biogeography. The student has plenty of flexibility to define the project based on their biological interests; there is also the possibility to venture far into data visualization once access to analysis methods is made. GenGIS can give ideas about what is possible.
Laurent Gautier, Brad Chapman

GeoPhyloBuilder in Java (GeoPhyloBuilder4J)

Geophylobuilder for ArcGIS (Kidd and Liu, 2008) creates geophylogenies from input evolutionary trees and associated distribution data. Geophylogenies are spatiotemporally georeferenced phylogenies; the same data structure can also be used to represent reticulate spatiotemporal relationships between any type of geographical entities including islands, continents, watersheds and areas of endemism. Geophylobuilder for ArcGIS, which itself is open-source, was developed in .Net using the ESRI ArcObjects COM library, hence an (expensive) software license is required for its use. Moreover, this host software dependency restricts users to Windows, so many non-PC users cannot benefit from this tool.
This project will develop a Java version of Geophylobuilder (Geophylobuilder4J). Geophylobuilder4J should support input of a tree in NEWICK format and sample locations as points, lines, or polygons encoded as a .csv or shapefile. It should also support n:m relationships between tree tips and observations. Allowing user-defined node positioning would allow geophylogenies to be built using the output of ecological niche models or other historical biogeographical reconstruction approaches. Geophylogenies should be output as KML files for viewing in earth browsers or in the shapefile format that can be read by many GIS software tools such as ESRI ArcGIS. A user friendly GUI should be built to interact with the system functions and guide the user through the whole process of building the geophylogeny.
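As a rough illustration of the tree-plus-locations-to-KML step, the sketch below takes an already-parsed tree and point locations for the tips, and writes one KML LineString per branch. It is in Python for brevity, although the project itself calls for Java. Placing each internal node at the mean of its children's coordinates is an assumption made only for this example and is not the method used by Geophylobuilder. The function names and the input layout are hypothetical.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def place_internal_nodes(children, tip_coords):
    """Return (lon, lat) for every node.

    children:   dict mapping internal node name -> list of child names
    tip_coords: dict mapping tip name -> (lon, lat)
    Internal nodes are placed at the mean of their children (an assumption).
    """
    coords = dict(tip_coords)

    def visit(node):
        if node not in coords:
            points = [visit(child) for child in children[node]]
            lon = sum(p[0] for p in points) / len(points)
            lat = sum(p[1] for p in points) / len(points)
            coords[node] = (lon, lat)
        return coords[node]

    for node in children:
        visit(node)
    return coords


def tree_to_kml(children, tip_coords):
    """Serialize every parent-child branch as a KML LineString placemark."""
    coords = place_internal_nodes(children, tip_coords)
    kml = ET.Element("kml", xmlns=KML_NS)
    document = ET.SubElement(kml, "Document")
    for parent, kids in children.items():
        for kid in kids:
            placemark = ET.SubElement(document, "Placemark")
            ET.SubElement(placemark, "name").text = f"{parent}-{kid}"
            line = ET.SubElement(placemark, "LineString")
            (plon, plat), (klon, klat) = coords[parent], coords[kid]
            # KML coordinate order is longitude,latitude,altitude.
            ET.SubElement(line, "coordinates").text = (
                f"{plon},{plat},0 {klon},{klat},0"
            )
    return ET.tostring(kml, encoding="unicode")


example_children = {"root": ["AB", "C"], "AB": ["A", "B"]}
example_tips = {"A": (0.0, 10.0), "B": (20.0, 30.0), "C": (40.0, 50.0)}
print(tree_to_kml(example_children, example_tips))
```

A fuller version would give internal nodes an altitude derived from node age, so that the tree rises above the map in 3D, and would accept user-defined node positions as described above.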
The main challenges are in designing and developing the GUI interface and designing the KML structure. Java programming skills, especially Swing skills will be needed and knowledge in computational geometry and GIS will help.
Involved toolkits or projects
  1. Current GeoPhyloBuilder as ArcGIS Extension
  2. Previous Work in GeoPhyloBuilder4J
  3. Jump Shapefile API
  4. Google Earth
David Kidd, Xianhua Liu

Interoperability of evolutionary bioinformatics tools on the desktop

The desktop user of bioinformatics tools has access to many specialised tools. Often the user needs to combine these tools to get results. On the command line one can use Unix pipes; on the desktop, however, the basic clipboard is too simplistic and cannot compete. D-Bus allows desktop tools to communicate with each other, but also with tools running elsewhere, using a message based paradigm through a central daemon. It is like using Unix pipes, but in a more advanced way. We want to create a workflow where database queries and algorithmic searches are combined using D-Bus. We will take two existing applications and adapt them for D-Bus. This will act as a reference implementation for other bioinformatics tools.
An example would be to adapt dotter (dotplot) and a DAS source like Ensembl, to investigate gene fusion and gene fission events. Another example would be to adapt PHYLIP to allow cutting and pasting sequence information through D-Bus in an interactive fashion.
Understanding D-Bus. C programming for dotter and PHYLIP, Perl or Ruby for Ensembl.
Involved toolkits or projects 
Dotter, Ensembl-API, PHYLIP
Degree of difficulty and needed skills 
Steffen Möller

Phylogenetic XGAP support

XGAP is an open XML based eXtensible Genotype And Phenotype platform for all QTL, GWL, GWA and mutagenesis data, both for natural and experimental crosses, currently with support for R (R/QTL) and JAVA (Molgenis); see Genome Research paper. XGAP currently lacks support for inheritance trees and needs to be extended with phylogenetic information to accommodate storage and querying of inherited information from pedigrees and experimental crosses. We can test using a complex example, the new collaborative mouse cross - an experimental cross from multiple founder populations (a similar set is the Arabidopsis 'Magic' cross). These crosses are starting to be used in experiments, and an XGAP implementation will help dissection of complex (evolutionary) traits and can serve as a starting point for adding generic tree structures into XGAP.
Extend the XGAP standard by writing a parser that matches individual genotypes with founder genotypes, building a phylogenetic tree of the relations between the individuals in a certain experiment. This can be accomplished by extending the existing JAVA XML parsers to accommodate tree-like structures and using this parser to incorporate an existing format, like PhyloXML or Newick, into XGAP XML. If the student is interested, further extensions of XGAP are possible, for example for tracking the phylogenetic tree and dealing with lethality in crosses.
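To give a feel for what reading a tree format such as Newick involves, here is a minimal recursive-descent parser sketch. It is written in Python for compactness, although the project itself calls for JAVA. It handles only plain labels and branch lengths; quoted labels and comments are not supported. The function names and the nested-dictionary layout are hypothetical, and nothing here reflects the actual XGAP XML schema.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    """
    text = text.strip()
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted.
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if peek() == ",":
                    pos += 1
                elif peek() == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"expected ',' or ')' at position {pos}")
        # Optional node label.
        start = pos
        while peek() and peek() not in ",():;":
            pos += 1
        name = text[start:pos].strip()
        # Optional branch length after ":".
        length = None
        if peek() == ":":
            pos += 1
            start = pos
            while peek() and peek() not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if peek() != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def tip_names(node):
    """Collect leaf names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in tip_names(child)]


tree = parse_newick("((A:1.0,B:2.0)AB:0.5,C:3.0)root;")
print(tip_names(tree))
```

The nested structure maps naturally onto nested XML elements, which is one way a tree could be embedded in an XML-based format.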
The challenge is to master the XGAP and PhyloXML/Newick standards, and XML parsing in JAVA, as well as building up an understanding of experimental populations and basic genetics.
Involved toolkits or projects 
XGAP, R/qtl, Molgenis
Degree of difficulty and needed skills 
Medium. The student should have a general interest in biology and genetics. Basic XML and JAVA skills are required. No prior understanding of breeding/pedigrees is required.
Danny Arends and Pjotr Prins

Jump-start MIAPA protocol annotation with a user-accessible demo

We anticipate that this project will jump-start a larger community process of evaluating, improving, and using a reporting standard that will enhance the value of phylogenetic information. The value of implementing a minimum reporting standard for phylogenetic analyses is discussed in "Taking the first steps towards a standard for reporting on phylogenies: Minimum Information About a Phylogenetic Analysis (MIAPA)", and in a NESCent white paper. Several journals have announced coming changes in data archiving policies that will create a "market" for such a MIAPA standard, and for tools that compose MIAPA-compliant reports, validate reports, and query collections of reports. The idea with this project is 1) to create a demo that will show phylogenetics users the cool things that they can do given such a standard and 2) to provide a straw-man version of MIAPA that can serve as the basis for future development.
Build a web application that supports:
  • Interactive generation of a compliant report through a form
  • Serializing the report as a single downloadable file -- preferably NeXML format
  • Validating an uploaded report for compliance with MIAPA, defined as an ontology (like this)
  • Structured querying of a collection of reports (e.g., find trees by algorithm, as in this CDAO triplestore)
The MIAPA standard has not yet been specified, and the draft version used in this project may evolve, perhaps rapidly. The code that supports it must be flexible enough to accommodate quick changes, preferably decoupled from the specification it supports, but still simple enough to serve as a reference implementation for other developers who wish to support MIAPA. Part of the code to support this project would be the ontology that provides "semantic" or "language" support.
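One way to keep the validation code decoupled from a changing specification is to treat the checklist as data and keep the validator generic. The sketch below illustrates that idea only. The field names in `DRAFT_CHECKLIST` are invented for the example and are not the MIAPA standard, which has not been specified. A real implementation would more likely derive the checklist from the ontology.

```python
# Hypothetical draft checklist: required field -> human-readable description.
# These entries are placeholders, NOT the actual MIAPA specification.
DRAFT_CHECKLIST = {
    "tree": "the phylogenetic tree itself",
    "alignment": "the character matrix or alignment used",
    "method": "the inference method and its settings",
}


def validate_report(report, checklist=DRAFT_CHECKLIST):
    """Return the sorted checklist fields that are missing or empty in report.

    An empty list means the report satisfies the supplied checklist.
    The checklist is passed in as data, so the draft can change without
    touching this function.
    """
    return sorted(field for field in checklist if not report.get(field))


incomplete = {"tree": "(A,B);", "method": ""}
for field in validate_report(incomplete):
    print(f"missing: {field} ({DRAFT_CHECKLIST[field]})")
```

Because the checklist is an argument, a revised draft can be swapped in without any code change, which is the kind of flexibility the paragraph above asks for.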
Involved toolkits or projects 
  • CDAO (an OWL ontology that can be edited using Protege)
  • RDF/OWL library for the chosen programming language
  • NeXML library for the chosen programming language
  • Google App Engine to host the implementation (suggested)
  • Python or Java
  • perhaps ISA-Tab
Degree of difficulty and needed skills 
Medium, at least, but with numerous ways to make the project more challenging. An understanding of the basics of phylogenetics is essential. The difficulty of linking the MIAPA ontology to a NeXML writer will depend on the quality of the NeXML library used and the current state of the MIAPA specification; some familiarity with ontologies, as well as the NeXML or Nexus formats, will help here too. The basic aspects of the web app component are not difficult. A range of challenges are possible for more advanced programmers: validation and error reporting; uses of reasoning; implementing a triplestore; and so on.
Eric Talevich, Arlin Stoltzfus, Jim Leebens-Mack (MIAPA project lead), Rutger Vos

Phylomovies: interactive animations of gene tree evolution

This project would produce a tool useful for both science and outreach, and which could ultimately become linked with the Ensembl database and be used by thousands of biologists worldwide every day.

Evolving DNA is usually represented by static pictures, like trees, that are sometimes difficult to decipher. A movie is a more natural medium for presenting the evolutionary processes that shape the genomes of all living species. Comparative genomics databases such as Ensembl, Pfam, and HomoloGene use large amounts of data and powerful inference methods to construct highly-resolved evolutionary histories of individual gene families, including the characteristic patterns of duplication and speciation leading to the gene content observed today. However, the static nature of image-based visualizations can make the interpretation of such "gene family trees" a bit difficult.
In the same way that the Google-acquired Trendalyzer technology uses animation to clarify and expose trends in economic data, an evolutionary movie would help scientists and the public better understand the meaning and importance of evolutionary data.
The goal of this project would be to create an online widget that loads gene tree data in standard formats (XML / NHX format trees) and generates an interactive movie, allowing users to temporally navigate through the evolution of a given gene family. This would involve the design, implementation and exhibition of a novel visualization interface.
In terms of implementation, one option would be to use the PhyloWidget codebase as a starting point, although other options (such as Flash) should be considered, depending on the student's expertise. A mock-up animation of the evolution of the Tropomyosin gene can be found here. Animations relating a gene tree can also be found here [1]. Some more eye-candy for tree-related animations can be viewed in a VC visualization tool here [2].
Possible extensions to the tool would include the incorporation / correlation of species tree information and divergence time estimates, and exporting animations to non-interactive movie formats.
  • (medium) Identify the most appropriate libraries and tools to work with
  • (hard) Define and implement a system for converting static phylogenetic trees (and associated metadata) into animations
  • (medium-to-hard) Create an interface for visualizing / exploring / exporting the animations
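As one possible starting point for turning a static tree into an animation, the sketch below slices a dated tree into frames: at each time step a branch is drawn only as far as it has "grown". This is an illustrative assumption about how frames could be derived, not a design from the project idea. The `(name, start, end)` branch layout and the function name are hypothetical.

```python
def animation_frames(branches, n_frames):
    """Split a dated tree into growth frames.

    branches: list of (name, start, end) tuples, with times measured
              forward from the root (root at 0).
    Returns a list of n_frames dicts mapping each branch that has started
    to its (start, visible_end) extent at that frame's time.
    """
    t_max = max(end for _, _, end in branches)
    frames = []
    for i in range(1, n_frames + 1):
        t = t_max * i / n_frames
        frame = {
            name: (start, min(end, t))
            for name, start, end in branches
            if start < t  # branches that have not started yet are hidden
        }
        frames.append(frame)
    return frames


example_branches = [
    ("root-AB", 0.0, 1.0),
    ("AB-A", 1.0, 2.0),
    ("AB-B", 1.0, 2.0),
    ("root-C", 0.0, 2.0),
]
for number, frame in enumerate(animation_frames(example_branches, 2), start=1):
    print(number, frame)
```

Rendering each frame, interpolating smoothly between them, and marking duplication versus speciation events would be the substantial part of the actual project.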
Involved toolkits or projects 
Degree of difficulty and needed skills 
Medium. Will require creative thinking, interface design, and solid background in a client-side browser language (Java, JavaScript or Flash).
Other topics
Any alternative proposals aimed towards improving the "accessibility" of phylogenetic trees to non-specialist audiences are encouraged. Also, alternative proposals would include implementing new methods for visualizing biological data on trees: e.g., bootstrap values / tree uncertainty, population size estimates, or uncertainty in divergence times.
Gregory Jordan (PhyloWidget), Albert Vilella (Ensembl Compara)

Interactive phylogeny browser for touch-sensitive tablet computing devices

Tablet computing devices have tremendous potential for visualizing large data sets. These devices are likely to be more and more common in the coming years. Implementing a viewer for large phylogenetic trees would harness the power of these devices and help us make sense of the tree of life.
Develop an application for interacting with large phylogenetic datasets for the Apple iPad. Credit goes to Rod Page for his blog post encouraging this idea!
Creating an intuitive interface that is fast and responsive.
Degree of difficulty and needed skills 
Hard. This project involves programming a new interface for a new device. The interface should be very fast, responsive, and work in an intuitive way to change the way people can interact with their data.
Luke Harmon and Erica Bree Rosenblum

Mapping the BEAGLE GPU library to Perl/Python/Ruby/JAVA/R using Biolib

Beagle-lib is a general purpose library for evaluating the likelihood of sequence evolution on trees. BEAST provides a flexible modelling library for theoretical biologists to evaluate population models in a Bayesian framework. A variety of models are supported: single stage / age models, stage-structured population models, and age-structured population models. BEAST allows MCMC parameter estimation, population projections, and formal Bayesian decision analyses, both in C and JAVA. The C version of the libraries allows the use of GPU optimizations. These can be made available to Perl, Python, Ruby, JAVA and R using Biolib/SWIG mappings. Biolib has the infrastructure for mapping C libraries to these languages. Support for GPU compilers in Biolib is missing, as well as SWIG build support for R.
Learn to understand the Biolib build system and create a version that can handle mappings of existing modules using SWIG. Next modify the build system to allow for compiling GPU code and linking against one of the languages Perl, Python or Ruby.
The Biolib build system is based on a combination of CMake and SWIG. Gaining a full understanding of these tools is necessary to complete the project.
Involved toolkits or projects 
Beagle-lib, BEAST, Biolib, SWIG, CMake
Degree of difficulty and needed skills 
High. Requires a deeper understanding of Perl (or Python/Ruby) with C and C compilers. BEAST passes data around in XML, so there is a balance of mapping methods and adapting XML parsing.
Pjotr Prins, Aaron Darling, Chris Fields

Visualization of Protein 3D Structure Evolution

The evolution of protein structures (as opposed to DNA or protein sequences) is an important, but relatively poorly studied subject. Studying it helps rationalize functional differences of proteins at the molecular level and enables functional inferences about extant proteins, as well as ancestral ones. In order to analyze protein 3D structures in an evolutionary context, we propose to develop a visualization application that interactively overlays images of phylogenetic trees with images of 3D protein structures. One intriguing use of such an application is the display and analysis of ancestral, inferred protein structures. An example of a phylogenetic tree overlaid with protein domain architectures can be seen here. A mock-up image is available here.
To aid the study of 3D protein structure evolution, this project aims to extend the Archaeopteryx tree explorer so that it will be capable of: (i) displaying static images of protein structures for each node of the displayed tree that has a 3D structure associated with it, and (ii) launching an open source Java-based structural viewer (such as Jmol) when such a static image is selected or clicked, to display the protein structure interactively (zooming, rotation, etc.). Ideally, changing the view of a structure in the dynamic viewer (e.g. Jmol) will update the corresponding static image, giving the user a way to view all structures in a similar manner (e.g. zoomed into a binding pocket). To describe and store phylogenetic trees overlaid with protein structures, the phyloXML format will be used.
Evaluation and study of a suitable 3D structure viewer (this will most likely end up being Jmol). Need to understand parts of the Archaeopteryx tree explorer source code, as well as of the chosen 3D structure viewer.
Involved toolkits or projects 
Jmol (probably), Archaeopteryx (forester), phyloXML
Degree of difficulty and needed skills 
Medium-Hard. Proficiency in the Java language and experience with the Java Swing and 2D Graphics APIs are necessary. Experience in structural biology is useful. Experience with Jmol is a big plus.
Christian Zmasek, Kyle Ellrott, Egon Willighagen

Google Earth Browser and Query Interface for TreeBASE

TreeBASE is currently released in a new open-source version, including numerous new features, such as (1) a mapping between OTU labels in trees and a controlled taxonomy, (2) the ability to attach georeference metadata to data associated with OTUs, and (3) a rich RESTful PhyloWS API for programmatic access to the data. Additionally, we are hiring a number of students to upgrade metadata in TreeBASE so that by the end of the summer a significant number of records will be annotated with georeferences. A preliminary effort exists for a prototype phylogeny browser in Google Earth, so the rationale here is to redevelop a Google Earth interface into TreeBASE data, either by making use of TreeBASE georeferencing metadata or by making use of other resources (e.g. GBIF) for inferring distributions of taxa.
Improve TreeBASE's API and develop software so that Google Earth can display georeferenced 3D trees from TreeBASE.
Creating an interface browser that is fast and responsive, yet does not overcrowd the earth with trees. Extra challenges include implementing additional query parameters (other than the geographic area box parameters provided by Google Earth); incorporating geological time in the altitude of nodes over the earth; curving horizontal branches to deal effectively with the earth's curvature; designing a tool that provides real phylogeographic insights (aside from fun eye candy). The student may need to assemble an initial set of phyloreferenced trees for prototyping and testing -- e.g. by extracting geo metadata from publications in TreeBASE, or BLASTing against GenBank and extracting lat_lon tags.
Degree of difficulty and needed skills 
Moderate. Understanding KML and PhyloWS; modification of TreeBASE's API requires knowledge of Java and Hibernate.
William Piel, Karen Cranston
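
As a rough illustration of the "geological time in the altitude of nodes" challenge in the TreeBASE Google Earth idea, the following is a minimal sketch that writes tree branches as 3D KML lines. It is not part of TreeBASE or of any existing prototype. The tuple layout (latitude, longitude, node age in million years), the function names, and the scaling constant are all assumptions made for this example. The only KML facts relied on are that coordinates are written as lon,lat,altitude and that altitudeMode "absolute" interprets the altitude in metres above sea level.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical scaling: metres of altitude per million years of node age.
METRES_PER_MYR = 10_000


def branch_placemark(name, parent, child):
    """Build a KML Placemark that draws one branch as a 3D line.

    parent and child are (lat, lon, age_myr) tuples (assumed layout).
    Older nodes sit higher; tips with age 0 sit at ground level.
    """
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    line = ET.SubElement(placemark, "LineString")
    ET.SubElement(line, "altitudeMode").text = "absolute"
    coords = []
    for lat, lon, age_myr in (parent, child):
        # KML coordinate order is longitude, latitude, altitude.
        coords.append(f"{lon},{lat},{age_myr * METRES_PER_MYR}")
    ET.SubElement(line, "coordinates").text = " ".join(coords)
    return placemark


def tree_to_kml(branches):
    """Serialize (name, parent, child) branch records into a KML string."""
    kml = ET.Element("kml", xmlns=KML_NS)
    document = ET.SubElement(kml, "Document")
    for name, parent, child in branches:
        document.append(branch_placemark(name, parent, child))
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # One branch from a 5-Myr-old internal node down to an extant tip.
    print(tree_to_kml([("node1-tipA", (10.0, 20.0, 5), (11.0, 21.0, 0))]))
```

A real browser would also need to curve long branches and thin out trees per view region, both of which this sketch ignores.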

Extend the NEXUS Class Library to parse nexml, phyloxml, and the NOTES block in NEXUS

nexml and phyloxml are emerging formats for expressing phylogenetic data. The NEXUS Class Library (NCL) is a mature parser for the NEXUS format (and some other simple formats). Currently NCL does not do a rich parse of the NOTES block in NEXUS (a block that stores metadata about taxa, characters, and trees). Nor does it read nexml. By adding a generic interface for associating metadata (as strings) with core objects, NCL can provide an API rich enough to parse the NOTES block of NEXUS and a nexml file. This would make this data available to client code (which includes several projects used throughout the phylogenetics community: GARLI, phycas, Brownie and phylobase, the CIPRES portal, MrBayes v4), and would allow that software to read nexml.
Design an API for returning hierarchical notes for core parts of the phylogenetic data analysis (a group of taxa, a taxon, a set of trees, a tree, a node, a cell of a character matrix...). Implement parsing of a NOTES block so that the content of notes can be returned in this way. Implement parsing of nexml so that content can be put into NCL's data structures (and queried through the extended API).
nexml is quite rich in terms of accommodating metadata. While attaching content to the core data is easy in dynamic languages, it will be a challenge to keep track of the metadata within C++. Fortunately, the core data models of nexml and NEXUS are similar (the core collections are taxa, character matrices, and trees). Adding metadata about these three entities is supported in NCL, but the types of metadata are not terribly rich. Fortunately, storing hierarchical structured strings (like a DOM subtree) connected to the correct object will be sufficient in most cases. Documentation of the newest NCL features is not great (fortunately the author of those extensions, Mark Holder, is a proposed mentor; so questions can be handled efficiently while the docs are improved).
Involved toolkits or projects 
NCL, SWIG, nexml, phyloxml, Xerces (or some other XML parsing library for C/C++)
Degree of difficulty and needed skills 
Medium difficulty for a programmer who is comfortable with C++. NCL uses templates, some multiple inheritance, and a tiny bit of compile-time polymorphism (template-based function dispatching). SWIG wrappers are in place, so the goal will be to expose a few new interfaces and not break the old wrappers (as opposed to building SWIG support from the ground up). nexml parsers in Python, Perl, and Java will make it relatively easy to check for correct behavior.
Mark Holder, Rutger Vos (NeXML)
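
The idea of "hierarchical structured strings (like a DOM subtree) connected to the correct object" from the NCL project can be sketched as follows. This is only an illustration in Python of the shape such an interface might take. NCL itself is C++, and none of the class or method names below (Note, Annotated, Taxon, add, find) come from NCL; they are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """One node of a hierarchical note: a key, a string value, and children."""
    key: str
    value: str = ""
    children: list = field(default_factory=list)

    def add(self, key, value=""):
        """Attach a child note and return it so calls can be chained."""
        child = Note(key, value)
        self.children.append(child)
        return child

    def find(self, *path):
        """Follow a sequence of keys downward; return the note or None."""
        node = self
        for key in path:
            node = next((c for c in node.children if c.key == key), None)
            if node is None:
                return None
        return node


class Annotated:
    """Base class giving any core object a root for its notes."""

    def __init__(self):
        self.notes = Note("notes")


class Taxon(Annotated):
    """Stand-in for a core object; trees, nodes, or cells would look alike."""

    def __init__(self, label):
        super().__init__()
        self.label = label


if __name__ == "__main__":
    taxon = Taxon("Homo sapiens")
    taxon.notes.add("picture").add("source", "voucher_1234.jpg")
    print(taxon.notes.find("picture", "source").value)
```

Keeping every value a string mirrors the "metadata (as strings)" approach described above: client code decides how to interpret a value, and the container only has to preserve structure.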

Do ancestral state reconstruction in R by porting Brownie (C++) (R/C/C++)

Brownie is a command-line program written in C++ that does analyses of continuous trait evolution (different rates, various models), discrete trait evolution (allowing great flexibility in models), species delimitation, and more. Importantly, adding it to R would expand R's abilities in ancestral state reconstruction. It is built using the NCL, a tree library from Rod Page, and the GSL. Porting it to R would provide several advantages, such as easy cross-platform compilation, easy incorporation into workflows, tree visualization, and interaction with the constellation of other approaches available in R.
Approach & Challenges
There are a few possible routes to take. One is to just create a wrapper for Brownie using Rcpp or some similar interface. The difficulty with this would be making the C++ code and its dependencies compile reliably. A different approach is to pull algorithms from Brownie and recode them in R or in C, but this requires detailed understanding of the algorithms. Note that this combines two ideas previously listed separately on this page (look at the wiki history to see them).
Involved toolkits or projects 
Brownie, R, possibly others (i.e., phylobase, NCL, GSL, and Rcpp).
Degree of difficulty and needed skills 
Medium, though where the difficulty lies depends on the approach used. The ability to read C++ and write Makefiles is important, as is experience in R.
Brian O'Meara, Luke Harmon

Related Programming and Cyberinfrastructure Summer Internship Programs


What should prospective students know?

Before you apply

  • Look at our project ideas to get an overall sense for the kinds of projects we support.
  • Pick the idea that appeals most to you in terms of goals, context, and required skills, or you can apply with your own project idea.
  • If you want to apply with your own idea, contact us early on so we can try to find a mentor. Alternatively, you can suggest a candidate mentor to us if you know one.
  • Ask us questions about the project idea you have in mind.
  • Write a project proposal draft, include a project plan (see below), and bounce those off of us.

Have I mentioned yet that you should be in touch with us before you apply?

When you apply

When applying, please provide the following in your application material (aside from the information requested by Google).

  1. Your interests, what makes you excited.
  2. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  3. A summary of your programming experience and skills.
  4. Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
    • Note that the start of the coding period should be precisely that - you start coding. That is, tasks such as understanding a code base, or investigating a tool or a standard, should be done before the start of coding, during the community bonding period. If you can't accomplish that due to conflicting obligations during the community bonding period, make that clear in your application, and state how you will make up the lost time during the coding period.
    • For examples of strong project plans from prior years, look at our Ideas pages from those years, specifically the Accepted Projects section (e.g., 2008, 2009). Visit the listed project pages; many have the initial project plan (which was subsequently updated and revised as the real project unfolded). Good examples are PhyloXML Support in BioRuby, PhyloXML support in Biopython, or Mesquite Package to View NeXML Files, all 2009 projects.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing (see below). Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. Our view is that it should be the same as an internship you take up over the summer, except this one is remote and almost certainly more fun :-). If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Similarly, if you expect any of your conflicting plans to take away from your coding time, please explain when and how you are going to make up the lost time.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.

Get in touch with us

Please send any questions you have and ideas and work plans for projects you would like to propose to This will reach all mentors and our organization administrators.

  • We strongly recommend you do this even if you want to work directly on one of our project ideas above. It gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics.
  • Also, you can bounce your project plan off of us. In the past those students who engaged with us before submitting their application invariably had vastly better applications and project plans.
  • Keep in mind that we are most interested in students who give us evidence that they have already developed or might develop a sustained interest in becoming future contributors to our open source projects. How do you convey that if the first time we see something from you is when we score your application?
  • The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily. If you can't find the time to engage in communication with your prospective organization before you have been accepted, how will you argue that you will be a good communicator once accepted?

If you are interested in hearing from our previous Summer of Code mentors or students, you can also send email to It's a larger list, so in that case try to be specific about what kinds of people you'd like to hear back from.

Other information

How to apply

Reference Facts & Links: Google Summer of Code 2010

  • Mentoring organizations applied between March 8-12, 2010. We were accepted March 18, 2010.
  • Students apply online between March 29-April 9, 2010. The eligibility requirements for students are in the GSoC FAQs.
  • After close of student applications, the mentors of the organizations score the applications, and Google allocates slots to each organization. The top scoring N students for each organization, N being the number of slots allocated, get accepted and announced on April 26, 2010.
  • The coding period is from May 24 to August 16, 2010. Between acceptance and the start of coding, students are expected to acquaint themselves with the project, the code bases involved, and the respective developer community, as well as set up everything needed for developing and committing (code repository access, mailing list subscription, software and dependencies installation, compiling, etc).
  • Development occurs on-line; there is no requirement or expectation to travel, for either students or mentors.