Phyloinformatics Summer of Code 2012


NESCent is a mentoring organization for the 2012 Google Summer of Code. We are a so-called 'umbrella organization', meaning that we administer the program on behalf of many different open-source projects rather than one single project. All of our projects are related to computational evolutionary biology, but use a wide variety of programming languages.

This page contains our possible project ideas, as well as information about how to apply.

For information on previous years, please see our pages of project ideas and accepted projects for our 2007, 2008, 2009, 2010, and 2011 programs.

News

  • For continuous updates, check or follow our Twitter feed or our Google Plus page
  • Join our Phyloinformatics Summer of Code Announcement list for announcements at major milestones of the program (call for mentors and project ideas, success in applying as a mentoring organization, call for student applications, etc.). This is a low-traffic list for announcements only.

Contact us

Our organization administrators this year are Karen Cranston (karen.cranston@nescent.org) and Jim Procter (jprocter@compbio.dundee.ac.uk).

Our Mentors page lists our mentor community.

Mailing lists

If you are a student interested in applying for a Google Summer of Code project in our organization, please send any questions you have, projects you would like to propose, etc., to phylosoc@nescent.org. This will reach all mentors and our Summer of Code organization administrators. Before you apply, please read our guidelines and expectations. We don't have an application format that you need to adhere to, but we do ask that you include specific kinds of information (see "Writing your application").

If you are a potential mentor with a project idea, please add the idea to the list of project ideas. If you would like to discuss your idea with current and past mentors before or after posting it, you can send a message to the phylosoc@nescent.org mailing list.

Social Media

You can follow @PhyloSoc on Twitter, and you may find updates or discussions around project ideas on Google+.

IRC

During the student application period we will hang out regularly on IRC at least on weekdays during working hours (EDT) in #phylosoc on Freenode. You're welcome to join us at any time, though we may not be on-line outside of those times. Alternatively, there's always email. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well, as does Colloquy. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

Accepted projects

A program to compute probabilities of ranked gene tree topologies in species trees

Abstract: A polynomial-time algorithm has been described for computing probabilities of ranked gene tree topologies given species trees. Once implemented, ranked gene tree probabilities could be used to infer species trees, although inferring species trees is beyond the scope of the project. The idea is to consider ranked gene tree topologies, where we distinguish the relative order of times of nodes on gene trees, but not the real-valued branch lengths.

Google Maps-like matrix browser

Abstract: Matrices are basic data stores used throughout evolutionary studies. Large (1000 by 1000), collaboratively built matrices of quantitative/qualitative data are becoming increasingly common on the web, and a generic mechanism to browse (interact with) these data in the browser via a "windowing" mechanism would be broadly useful. The project focuses on building an interface that allows users to browse, view, and analyse large data sets stored in matrices on the web. This requires building a jQuery library that produces a Google-Maps-like, tile-based interface. Users can drag to their area of interest to see the data more clearly. To improve performance, neighbouring cells should be loaded in the background. Although the original idea is to have the plugin load and visualize large matrices, a well-built tool for this task could also be used to display more complex objects (e.g. heat maps, highly detailed images, etc.), as these objects can also be treated as matrices. Hence the aim is to allow an open format for the cell values (e.g. an HTML div) that can be extended to accommodate any kind of data in the future.

Mentors: Chris Baron - secondary mentor - cbaron@fieldmuseum.org. I'm currently working on phylografter ( http://code.google.com/p/phylografter/ )

Project Homepage: https://github.com/pulkit110/jMatrixBrowse


Integrating RNA web services in Jalview using JABAWS

Abstract: Jalview is an alignment editor, originally developed for protein sequences, that is used in a variety of web pages and is also available as a stand-alone project. Jalview also includes a client that interacts with web services via JABAWS. In previous Google Summers of Code, students have worked to extend the functionality of Jalview to handle non-coding RNA sequences. Many web services already exist that are related to RNA, but they are not yet available to the Jalview client. During this Google Summer of Code, I will integrate the Annotate3D and RNAfold web services with Jalview using the JABAWS system.


MASTodon - a Java tool for the summary and visualization of large sets of phylogenetic trees

Abstract: I want to make “MASTodon”, a Java tool that works with already established open-source programs to summarize and visualize a large set of phylogenetic trees. Specifically, I am interested in implementing the tree pruning algorithms presented in Karen Cranston and Bruce Rannala's paper with an emphasis on good scaling with larger input sets.


NeXML to MIAPA Mapping & ISAtab Transformation

Abstract: MIAPA is a proposed minimum information standard for phylogenetic data sharing and reuse. Barriers to its adoption include lack of detail and definition in the standard itself (a draft standard has only recently been produced) and lack of tools supporting it. My project will focus on the latter. This project will identify data elements within NeXML that are MIAPA compliant and seek to extract them via XSL and then transform them into the ISAtab format via Java and XSLT. This will facilitate MIAPA use within ISAtools, a software system for collection, sharing, and repository submission with built-in support for minimum information standards. I also propose extensions for the project if I am ahead of schedule.


Optimizing R code for approximate Bayesian computing

Abstract: Approximate Bayesian computing is becoming a more powerful approach for phylogenetic comparative methods. Unfortunately, there are no truly user-friendly open-source packages to run this often computationally intensive process. The R package TreEvo hopes to fill this void, but it contains a very large amount of code that may or may not function optimally. I propose to go through this code and to test and optimize large loops in order to speed up TreEvo's simulation-based Bayesian architecture.


Apply Machine Learning Algorithm(s) to Ecology Data

Abstract: With the prevalence of 16S sequence data there is a need for ecologists to classify different populations associated with different conditions. To this end, the goal of this project is to create a program that will allow microbial ecologists to apply machine learning algorithms (e.g. random forests, classifiers) to microbial ecology data so they can identify bacterial populations that are associated with differences between health and disease.


Phylowood.js: Browser-based Interactive Animations of Ancestral Dispersal and Diversity Patterns

Abstract: This project will develop an open-source Javascript package to generate interactive browser-based animations from phylogeographical and biogeographical datasets and inferences. The objective of this project is to design a tool that facilitates scientific discourse, both between researchers and between scientists and the public.

Student: mlandis at berkeley dot edu

Mentors: Trevor Bedford, Andrew Rambaut

Source Code: https://github.com/mlandis/phylowood

Project ideas

Note: if there is more than one mentor for a project, the primary mentor is in bold font.

Students: we have listed only our own project ideas, albeit well thought-out ones. You are welcome to propose your own project provided it still falls within the scope of the aims described above. If we like your proposal, we will try to find an appropriate mentor. Better yet, propose a mentor yourself! Regardless of what you decide to do, make sure you read and follow the Before you apply instructions, below.

Prospective mentors: to add a new project idea, sign in, and then click the edit link to the right of this section. Check out the tips and guidelines on creating project ideas.


Gmaps-like Matrix Browsing

Rationale

Matrices are basic data stores used throughout evolutionary studies. Large (1000 by 1000), collaboratively built matrices of quantitative/qualitative data are becoming increasingly common on the web, and a generic mechanism to browse (interact with) these data in the browser via a "windowing" mechanism would be broadly useful. The interface could be used for any data that can be tabulated according to some x/y coordinate system and returned according to the JSON spec (e.g. DNA would also be possible, though this is perhaps not the optimal approach for it), which makes it very broadly useful.

Approach 

Build a generic jQuery library that will produce a Google-Maps-like, tile-based interface. Matrices will be displayed through a window (say 20x20 cells). Users can drag the window around as they would a Google map. Users can increase or decrease the window size. Neighboring cells are loaded in the background. The jQuery library is generalized to request these windowed data (x1, y1, x2, y2) from any API that can return JSON according to the spec we develop. We might not need the zooming capability of a Google map, though it could be explored.
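The windowing logic described above can be sketched independently of jQuery. The Python sketch below is only an illustration of the idea, not the library's design: the names clamp_window, neighbour_windows and WindowCache are hypothetical, the window is taken to be an inclusive (x1, y1, x2, y2) rectangle of cell coordinates, and the fetch function stands in for whatever JSON API is eventually specified.

```python
from typing import Callable, Dict, List, Tuple

# (x1, y1, x2, y2) in inclusive cell coordinates; an assumption of this sketch.
Window = Tuple[int, int, int, int]


def clamp_window(x1: int, y1: int, width: int, height: int,
                 n_cols: int, n_rows: int) -> Window:
    """Return a window of at most width x height cells, clipped to the matrix."""
    x1 = max(0, min(x1, n_cols - 1))
    y1 = max(0, min(y1, n_rows - 1))
    x2 = min(x1 + width, n_cols) - 1
    y2 = min(y1 + height, n_rows) - 1
    return (x1, y1, x2, y2)


def neighbour_windows(window: Window, n_cols: int, n_rows: int) -> List[Window]:
    """Return the (up to eight) same-sized windows adjacent to `window`."""
    x1, y1, x2, y2 = window
    width, height = x2 - x1 + 1, y2 - y1 + 1
    result = []
    for dy in (-height, 0, height):
        for dx in (-width, 0, width):
            if dx == 0 and dy == 0:
                continue  # skip the visible window itself
            nx, ny = x1 + dx, y1 + dy
            if 0 <= nx < n_cols and 0 <= ny < n_rows:
                result.append(clamp_window(nx, ny, width, height, n_cols, n_rows))
    return result


class WindowCache:
    """Fetch each window at most once; `fetch` is a stand-in for the data API."""

    def __init__(self, fetch: Callable[[Window], object]) -> None:
        self._fetch = fetch
        self._cache: Dict[Window, object] = {}

    def get(self, window: Window) -> object:
        if window not in self._cache:
            self._cache[window] = self._fetch(window)
        return self._cache[window]

    def prefetch(self, windows: List[Window]) -> None:
        """Load neighbouring windows ahead of time (synchronously in this sketch)."""
        for window in windows:
            self.get(window)
```

In a browser the prefetch step would run asynchronously; here it is synchronous to keep the sketch short. Dragging the view to a neighbouring window then becomes a cache hit rather than a new request.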

Real life testing/integration of the library will be done in conjunction with one or more of the NESCent sponsored projects presently adopting mx (source at https://github.com/mx3/mx), i.e. the student will help with, but not ultimately be responsible for, this integration.

Challenges 
  • Help to define the scope of the data-serving API (but the student will not implement the API), e.g. what kinds of data the jQuery library can request. A lot of mentor help will be provided here.
  • Providing a seamless user interface by loading data outside the display window in the background; consider caching mechanism (HTML5 WebSQL?).
  • Provide an intuitive interface to clicking through the matrix (e.g. attaching functionality to cell/value clicks)
  • Clearly documenting the library's functionality (including unit tests prior to code being written)
Degree of Difficulty and Needed Skills 

Medium to high. The student would need solid experience in JavaScript, jQuery, AJAX, and DOM. Experience with HTML5, and possibly other javascript-based visualization frameworks (e.g. those that use SVG, canvas, Processing) would be useful. Low-level familiarity with the Google Maps API would definitely be useful.

Selection criteria (not exhaustive list, but might help with guidance) 
  • Creative
  • Highly Organized
  • Familiarity with the inner-workings/approach of Google Maps (not required)
  • Familiar with a wide range of web-page interfaces
Mentors
Matt Yoder (lead)
Sandy Hall

+(?)

Bio::NEXUS integration: Stepping toward BabelPhysh

Rationale 
This project aims to begin building up the back-end functionality for BabelPhysh (a hypothetical translation service) by implementing object compatibility between 3 complementary Perl packages: Bio::NEXUS, Bio::Phylo and BioPerl. For something like BabelPhysh (and more generally, for interoperability in phyloinformatics), one needs a set of operations (possibly distributed among many packages) that collectively allow any correctly formatted tree or alignment to be read, to be modified by some common operations such as OTU-renaming, and to be written out in any other format. Bio::NEXUS provides an excellent parser for NEXUS (very important due to its widespread use, and the existence of different "flavors"); BioPerl has a simple NEXUS implementation, but provides access to a wide range of alignment formats, as well as tools; Bio::Phylo provides a NeXML interface, among other things. Bio::NEXUS has some object-specific 'equals' methods vital for testing, e.g., print "the trees match!\n" if $tree->equals($other_tree). Bio::Phylo and BioPerl are integrated, but Bio::NEXUS is not, which explains the focus on Bio::NEXUS.
Approach 
Ultimately the approach is up to the programmer, but we can suggest some possibilities. One way to simplify the project is to focus on trees first. The following code snippet should work if Bio::NEXUS implements the BioPerl TreeI interfaces:
# read trees with Bio::NEXUS, write with BioPerl
use strict;
use warnings;
use Bio::NEXUS;
use Bio::TreeIO;

# the leading '>' opens the output file for writing
my $treeio = Bio::TreeIO->new( '-format' => 'nhx', '-file' => '>outfile.dnd' );
my $nexus = Bio::NEXUS->new('filename.nex');
for my $treeblock ( $nexus->get_blocks('trees') ) {
    for my $tree ( $treeblock->get_trees ) {
        $treeio->write_tree($tree);
    }
}
To test this, imagine reading in a NEXUS tree using Bio::NEXUS, writing it out using BioPerl, then reading it into a new object, and using Bio::NEXUS to compare the old and new objects to see if they are the same (note that a text diff will not work for trees, because the ordering of child nodes is arbitrary). This kind of test could be used to drive object compatibility. For other examples of code, see this code.
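The sketch below shows why a text diff fails and what an order-independent comparison looks like. It is written in Python rather than Perl and says nothing about how the Bio::NEXUS equals methods are implemented. It assumes a toy representation in which a leaf is a string and an internal node is a tuple of children, and it ignores branch lengths. The function names canonical and same_topology are hypothetical.

```python
def canonical(tree):
    """Return a canonical form of `tree` with children sorted recursively.

    A leaf is a string (the OTU name); an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        return tree
    # Sort on repr() so that leaves and nested tuples can be ordered together.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def same_topology(tree_a, tree_b):
    """True if the two trees differ at most in the order of child nodes."""
    return canonical(tree_a) == canonical(tree_b)


# ((A,B),(C,D)) written with two different child orderings.
original = (("A", "B"), ("C", "D"))
round_tripped = (("D", "C"), ("B", "A"))
different = (("A", "C"), ("B", "D"))

print(same_topology(original, round_tripped))  # same tree, different text
print(same_topology(original, different))      # genuinely different tree
```

A round-trip test would apply a comparison of this kind to the tree read by one package and the tree re-read after being written by another.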
Challenges 
  • learn about BioPerl and Bio::NEXUS interfaces and objects
  • develop tests
  • fixing flaws revealed by tests (e.g., Bio::NEXUS equals methods have not been tested thoroughly)
Involved toolkits or projects
BioPerl
Bio::NEXUS
Bio::Phylo
Degree of difficulty and needed skills 
  • medium difficulty for someone adept at Perl OO programming and familiar with data concepts; harder if no exposure to data concepts
  • test-driven approach requires patience and attention to detail
  • prior familiarity with data concepts (alignments and trees) would be highly beneficial
Mentors 
  • Rutger Vos is the expert programmer who designed and implemented Bio::Phylo. He leads the NeXML project and has been an important contributor to CIPRES and TreeBASE.
  • Arlin Stoltzfus is the Bio::NEXUS project leader, and is involved in the CDAO and NeXML projects.
  • Vivek Gopalan is a professional bioinformatics programmer and a key developer of Bio::NEXUS and Nexplorer, as well as numerous tools to aid researchers at NIAID.

Optimize R code for Approximate Bayesian Computation

Rationale 
Approximate Bayesian Computation is an exciting approach to biology. We have code to do this in R for comparative methods, but it is ripe for optimization. Do you hate "for" loops? Like multithreading? Want to do checkpointing? If you like "apply", please apply.
Approach 
Go through existing code and find inefficiencies. Eliminate them.
Challenges 
  • Making loops run in parallel
  • Optimizing speed
  • Utilizing HPC approaches
Degree of difficulty and needed skills 
  • Low
  • R
  • experience in HPC for R ideal
Selection criteria (not exhaustive list, but might help with guidance) 
  • Quality of proposal (feasible given your skills and available time, useful outcome)
  • Interaction (response to emails, back-and-forth regarding proposal)
  • Code sample: take any "for" loop in the TreEvo code and show how you would speed it up.
Mentors 
Primary (TBC - contact if interested), Co-mentor: Barb Banbury


Phylogenetic plots with ggplot and R

Rationale 
ape is a popular R package for handling phylogenetic trees (its functions are imported by 36 other R packages). It implements a number of basic and advanced phylogeny plots, but these functions use "base" R graphics and can be inflexible or unwieldy when a user wishes to visualize data along the branches of a tree.
Approach 
The popular ggplot package implements a more flexible approach to visualization, where the user specifies the aesthetic mappings between data and visual elements.
To connect ape with ggplot, the main trick is to convert a phylogenetic tree (which is simply a data structure) into a set of visual elements (lines, points, etc.) that ggplot understands. This involves laying out a tree and choosing an appropriate set of visual elements. An early-stage package, ggphylo, performs the necessary conversions and provides a simple API for researchers to plot phylogenetic data using R and ggplot.
The idea of this project is to expand and improve upon the ggphylo package.
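The conversion from a tree to drawable elements can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the ggphylo implementation: the tree is a toy nested tuple with string leaves, every branch has unit length, and the function name layout is hypothetical. The output rows (x, y, xend, yend) have the shape that a segment-drawing layer such as ggplot's geom_segment expects.

```python
def layout(tree):
    """Lay out a rooted tree as line segments plus leaf label positions.

    A leaf is a string; an internal node is a tuple of subtrees.
    Returns (segments, labels), where each segment is (x, y, xend, yend)
    and each label is (name, x, y). Every branch has unit length.
    """
    segments = []
    labels = []

    def place(node, depth):
        if isinstance(node, str):
            y = float(len(labels))  # leaves take consecutive rows
            labels.append((node, depth, y))
            return y
        child_ys = [place(child, depth + 1) for child in node]
        for child_y in child_ys:
            # horizontal branch from this node's depth to the child
            segments.append((depth, child_y, depth + 1, child_y))
        # vertical connector joining the children at this node's depth
        segments.append((depth, min(child_ys), depth, max(child_ys)))
        return sum(child_ys) / len(child_ys)  # parent sits between its children

    place(tree, 0)
    return segments, labels


segments, labels = layout((("A", "B"), "C"))
for row in segments:
    print(row)
```

Because the result is plain tabular data, per-branch values (rates, support, character states) can be added as extra columns and mapped to colour or line width by the plotting layer.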
Tasks 
Possible work includes any of the following, depending on the skills and interests of the applicant:
  • Add support for additional visual elements (bootstrap values, images on nodes, etc.)
  • Optimize functions for visualizing large and/or many trees at once
  • Improve the API functions for manipulating ape tree objects (pruning / grafting tree branches, highlighting nodes or lineages, etc.)
  • Create additional examples for the documentation, ideally involving the use of other R packages. (Use trees from treebase, plot character states / rates using picante, etc.)
  • Add support for plotting sequence alignments.
Challenges 
  • API design -- how can you balance simplicity and flexibility in the exposed plotting functions?
  • Performance considerations -- ggplot is generally less efficient than base graphics, so how can large trees be handled?
  • Documentation and testing -- any work built to last should include complete tests and lucid documentation.
Degree of difficulty and needed skills 
  • Low / Medium. Only requires programming experience in R. May involve a non-negligible amount of writing (documentation and examples).
Mentors 
Greg Jordan

Write a program to compute probabilities of ranked gene tree topologies in species trees

Rationale 
A polynomial-time algorithm has been described for computing probabilities of ranked gene tree topologies given species trees. This algorithm has not yet been implemented. Once implemented, ranked gene tree probabilities could be used to infer species trees, although inferring species trees is beyond the scope of the project. The idea is to consider ranked gene tree topologies, where we distinguish the relative order of times of nodes on gene trees, but not the real-valued branch lengths.
Approach 
Ideally, the program will be written in C++ or Python and will be made available through an open-source environment such as R. Arbitrary-precision or high-precision arithmetic (beyond long double precision) will be important, since the desired probabilities will be very small.
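The precision concern can be made concrete with a short Python sketch. The factors below are made up for illustration and are not taken from the ranked gene tree algorithm. The sketch shows that a product of many small probabilities underflows to zero in double precision, while exact rational arithmetic (Python's standard fractions module) or a sum of logarithms keeps the value. The function names are hypothetical.

```python
import math
from fractions import Fraction


def product_float(factors):
    """Multiply the factors in ordinary double precision."""
    result = 1.0
    for factor in factors:
        result *= float(factor)
    return result


def product_exact(factors):
    """Multiply the factors exactly, as rational numbers."""
    result = Fraction(1)
    for factor in factors:
        result *= Fraction(factor)
    return result


def log10_product(factors):
    """Return log10 of the product by summing logs, avoiding underflow."""
    return sum(math.log10(factor) for factor in factors)


# 400 illustrative factors of 1/1000 each: the true product is 10**-1200.
factors = [Fraction(1, 1000)] * 400

print(product_float(factors))       # underflows in double precision
print(product_exact(factors) > 0)   # the exact value is still positive
print(log10_product(factors))       # about -1200
```

Exact rationals are simple but can grow large. Log-space arithmetic is cheaper, but adding probabilities then needs a log-sum-exp step. Which trade-off suits this project is left open here.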
Degree of difficulty and needed skills 
Hard. The programming will involve setting up tree structures that include "beaded" species trees (look for the paper "A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree" by Stadler and Degnan at arXiv.org for an explanation). The student should have some familiarity with working with tree structures and programming with recursion.
Mentors 
  • James Degnan has developed code for computing probabilities of gene trees given species trees in the unranked case in C. The code is at [1] and his email is jamdeg@gmail.com.
  • Tanja Stadler has developed packages in R (TreePar2.2 and TreeSim1.5) and has worked with James on ranked gene trees.

Graphical User Interface for ape

Rationale 
Over the last 10 years, ape has become a tool of choice for developing new phylogenetic and comparative methods, as witnessed by the circa 50 R packages that depend on it. On the other hand, ape and its companion packages appear difficult to learn and to use for a significant number of biologists. The goal of this project idea is to make ape accessible to a wider range of users by developing a user-friendly, innovative interface that gives access to some of ape's functionality without the need to learn R commands or syntax.
Approach 
It is planned that the GUI will be made of several modules:
  • a 'data mapper' that will display the data sets in memory (trees, DNA sequences, phenotypic data, lists of splits, ...) with links among them, where applicable, showing which object has been derived from which other;
  • a 'tree navigator' to allow tree manipulations with the mouse;
  • a menu for basic analyses (DNA distances, nj, ...); the corresponding commands will be output in a console following the model of Rcommander.

The R package tcltk, part of the standard R installation, will be used. Some sample code is already available from the mentor, as well as some demos in tcltk itself.

Challenges 
  • Making a GUI that satisfies most users is undoubtedly a challenging task
  • The output of this Idea must be a tool to enhance exchanges among different 'populations' of researchers, rather than a pretty software tool for biologists
  • The GUI will be designed to be extensible to other functions and packages, particularly phangorn and diversitree, in a way that makes future additions easy. Future extensions in the field of population genetics (adegenet, pegas) or paleobiology (paleotree, paleoTS, paleoPhylo) should be made possible as well.
Degree of difficulty and needed skills 
  • Low to moderate
  • R
  • Tcl/Tk (preferred)
Selection criteria (not exhaustive list, but might help with guidance) 
  • Creative
  • Interactive
  • Able to combine or synthesize different points of view
Mentor 
Emmanuel Paradis

Apply machine learning algorithm(s) to ecology data

Rationale
With the prevalence of 16S sequence data there is a need for ecologists to classify different populations associated with different conditions. To this end, the goal of this project is to create a program that will allow microbial ecologists to apply machine learning algorithms (e.g. random forests, classifiers) to microbial ecology data so they can identify bacterial populations that are associated with differences between health and disease.
Background
16S sequence data is the standard for identifying and classifying microbial DNA sequence data. The 16S gene can be thought of as a 'name tag' identifying which species are present in the population. Similar 16S sequences are grouped together into operational taxonomic units (OTUs). The community consists of a number of different OTUs. Microbial communities associated with a host, human gut microbial communities for example, change along with the health of the host. It is, therefore, important that we understand how these communities are changing. To get at this information we can look at how the members of the community change. One way to do this is to create a tool that is able to identify what populations are associated with health and disease.
Approach
Pick a machine learning algorithm (probably random forest but this is by no means set in stone) and implement it for classification of microbial communities. An R package implements something similar. Reviewing that code may give a sense of the difficulty of implementation. A similar approach in this paper gives some background on what the program will do.

This program will be included as a command in the program mothur and depending on how far along the code is at the end of GSoC, submitting it to mothur could be one of the project goals. As such, it should have no or minimal dependencies on other libraries and obviously no closed source code. It would be good for the student to become familiar with the filetypes used in mothur, specifically shared files and design files. These files contain information about the communities to be analyzed and will be the input files for the program.

Mothur source code on github.
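To give a sense of what such a classifier does internally, the sketch below shows the single step that a decision tree inside a random forest repeats many times: finding the abundance threshold of one OTU that best separates two groups of samples, scored by Gini impurity. It is a Python illustration only. The OTU names, counts and labels are invented, the function names are hypothetical, and the input is a plain dictionary, not mothur's shared or design file format. A real random forest would additionally build many trees on bootstrap samples with random subsets of OTUs.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(abundances, labels):
    """Return (threshold, weighted_gini) for the best 'abundance <= threshold' split.

    The threshold is None if no split improves on the unsplit impurity.
    """
    pairs = sorted(zip(abundances, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot split between equal abundances
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((low + high) / 2, score)
    return best


def rank_otus(table, labels):
    """Rank OTUs from most to least discriminating by their best split."""
    scored = [(best_split(counts, labels)[1], otu) for otu, counts in table.items()]
    return [otu for _, otu in sorted(scored)]


# Invented example: counts of two OTUs in six samples.
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]
table = {
    "Otu001": [50, 60, 55, 5, 3, 8],    # clearly depleted in disease
    "Otu002": [10, 12, 9, 11, 10, 13],  # uninformative
}
print(rank_otus(table, labels))
```

The code uses only the standard library, in line with the requirement above that the final program have no or minimal dependencies.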

Difficulty and skills
  • Moderate; may be more or less difficult depending on familiarity with machine learning
  • C++
  • An understanding or willingness to quickly learn machine learning algorithms.
Mentor
Kathryn Iverson

PhyloPic Enhancements

Rationale 
PhyloPic is an open database of reusable organism silhouettes. All silhouettes are available under Creative Commons licenses, and each is linked to any number of taxonomic names. Names are arranged in a phylogenetic taxonomy, modified from classifications in ClassificationBank. Thus, images can be searched for phylogenetically.
Approach 
I have one possible project in mind, although there are many other tasks that need to be done, as can be seen on the project's BitBucket page. Feel free to peruse that and suggest another project.
  1. Cladogram Generator - given an input phylogeny or a set of species names, create a diagram illustrated with silhouettes from PhyloPic. This can be accomplished using one of the PhyloPic APIs (currently in development but very close to completion). The generator would be a useful tool for authors and educators seeking to quickly generate visual diagrams that communicate phylogenetic hypotheses.
Challenges 
  • Working with a web tool at an alpha stage; there may be some kinks to work out!
  • Knowing when to use existing API functionality and when to propose new functionality.
  • Creating a tool that yields something useful up front but allows the user to tweak and make further modifications as needed.
  • Finding possible images for nodes that are unlabeled (i.e., hypothetical taxonomic units) or which don't have an exact match.
Involved toolkits or projects 
Degree of difficulty and needed skills 
  • Medium to High Difficulty
  • familiarity with phylogenetic trees
  • familiarity with a graphical client-side technology such as Flash (including AMF remoting) or JavaScript (including JSON, AJAX) with SVG or HTML5 Canvas.
  • If doing a project other than the Cladogram Generation, the following technologies may be important: Python, Django, PostgreSQL, HTML, CSS, JavaScript.
Mentors 

Extend NCL (the NEXUS Class Library) to parse nexml, phyloxml, and the NOTES block in NEXUS

Rationale 
nexml and phyloxml are emerging formats for expressing phylogenetic data. The NEXUS Class Library (NCL) is a mature parser for the NEXUS format (and some other simple formats). Currently NCL does not do a rich parse of the NOTES block in NEXUS (a block that stores metadata about taxa, characters, and trees), nor does it read nexml. The beginnings of a generic interface for associating metadata (as strings) with core objects have been written (in a branch of the svn repository https://ncl.svn.sourceforge.net/svnroot/ncl/branches/xml ). NCL can provide an API rich enough to parse the NOTES block of NEXUS and a nexml file. This would make these data available to client code, which includes several projects used throughout the phylogenetics community (GARLI, phycas, Brownie and phylobase, the CIPRES portal, RevBayes), and would allow those software packages to read nexml.
Approach 
Design an API for returning hierarchical notes for core parts of a phylogenetic data analysis (a group of taxa, a taxon, a set of trees, a tree, a node, a cell of a character matrix...). Implement parsing of a NOTES block so that the content of notes can be returned in this way. Implement parsing of nexml so that content can be put into NCL's data structures (and queried through the extended API).
Challenges 
nexml is quite rich in terms of accommodating metadata. While attaching content to the core data is easy in dynamic languages, it will be a challenge to keep track of the metadata within C++. Fortunately, the core data models of nexml and NEXUS are similar (the core collections are taxa, character matrices, and trees). Adding metadata about these three entities is supported in NCL, but the supported types of metadata are not terribly rich. Storing hierarchical structured strings (like a DOM subtree) connected to the correct object will be sufficient in most cases. Documentation of the newest NCL features is not great (fortunately the author of those extensions, Mark Holder, is a proposed mentor, so questions can be handled efficiently while the docs are improved).
Involved toolkits or projects 
NCL, SWIG, nexml, phyloxml, Xerces (or some other XML parsing library for C/C++)
Degree of difficulty and needed skills 
Medium difficulty for a programmer who is comfortable with C++. NCL uses templates, some multiple inheritance, and a tiny bit of compile-time polymorphism (template-based function dispatching). SWIG wrappers are in place, so the goal will be to expose a few new interfaces without breaking the old wrappers (as opposed to building SWIG support from the ground up). Existing nexml parsers in Python, Perl, and Java will make it relatively easy to check for correct behavior.
Mentors 
Mark Holder

Tools for minimum information reporting in phylogenetics

Rationale 
One of the key impediments to sharing and reusing phylogenetic trees, or more generally phylogenetic analysis results, is that the metadata researchers would need to assess a phylogenetic data record's fitness for reuse is frequently reported in an incomplete and inconsistent manner, or not at all. To address this, a group of researchers several years ago proposed a minimum reporting standard for phylogenetics, called Minimum Information for A Phylogenetic Analysis (MIAPA). MIAPA has recently gained wider traction and its implementation and adoption have been identified as one of the key components to increase the digital archiving and sharing of phylogenetic data and knowledge. As an outcome of a workshop in fall 2011, a first draft of a MIAPA metadata attribute checklist now exists. However, there is no tool support for it yet, which hinders application and subsequent further refinement of the draft so it is best aligned with the needs of data producers and consumers. There are 3 main areas for which having tools would be particularly valuable: 1) Extracting the MIAPA-pertinent information already present in existing file formats; 2) Adding MIAPA-compliant annotation to existing file formats, interactively by users, and by parsing (e.g., tabular) annotation files generated by other tools for managing phylogenetic data; and 3) Determining the extent of MIAPA compliance of a given phylogenetic data record.
Approach 
The ISAtools project provides a suite of software tools and a tabular format definition in support of several minimum reporting standards, and would thus be the first candidate for extension. ISAtools is based on an Investigation-Study-Assay hierarchy for organizing a reporting standard's metadata. Although at present the standards supported by the suite are focused on wetlab (and high-throughput) assays, phylogenetic analyses could be fit into this model as computational "assays". Within this framework, a variety of actual projects and approaches are possible, depending on one's interests and skill level. An actual project could focus on as little as extracting MIAPA-pertinent information from existing formats and writing it out in ISA-TAB and/or adding it back to the file, tagged and easily recognizable as MIAPA-compliant metadata. It could be an extension of ISAconverter that reads MIAPA information from ISA-TAB and adds it to NeXML or NEXUS for submission to TreeBASE. Or it could be an interactive MIAPA annotation tool that combines all 3 of the above areas, or any subset of them.
Challenges 
Although first proposed 6 years ago, the MIAPA standard is still in an early draft stage, and its formalization in the form of schemas or ontologies is just starting. This will require frequent communication not only with the mentors, but also with the communities that have a stake in further development of the standard, including pertinent data archives (TreeBASE, Dryad), emerging interoperability standards in evolutionary informatics (NeXML, CDAO), and ontologies controlling the values for metadata attributes (e.g., EDAM).
Involved toolkits or projects 
MIAPA, ISAtools, NeXML (or possibly PhyloXML, or NEXUS).
Further reading:
Degree of difficulty and needed skills 
Medium or hard. The project idea supports a variety of actual projects and approaches with differing degrees of difficulty and complexity, by aiming to cover more or fewer of the areas outlined above. ISAtools is written mostly in Java, although the ISA-TAB format should be easily parseable in all scripting languages. NeXML's metadata annotation model is based on RDF. Ontologies will use OWL.
Mentors 
Hilmar Lapp, Jim Leebens-Mack (MIAPA), Chris Taylor (MIBBI), Eamonn Maguire & Philippe Rocca-Serra (ISAtools)

A Web front for CONDOR Grid Computing: phyloinformatic analysis on Windows desktops

Rationale 
Grid computing on idle Windows desktops is a technique for making the most of an organization's computing resources. Such grids have been deployed for the purpose of phyloinformatic analysis at a number of institutions including in Oxford, Reading and Leiden, which all use CONDOR. On such systems, users can submit jobs from the command line, which creates a high barrier for non-expert users. Your task is to simplify this by (further) developing or customizing a web platform that allows users to start jobs from their browser instead:
  • Choose a web application (e.g. Galaxy, Mobyle or PISE) for scheduling bioinformatics jobs.
  • Extend the front and back end of the web app so that phyloinformatic analyses can also be submitted through it, specifically RAxML, but with the flexibility to add e.g. MrBayes, PhyML, R, etc. at a later date.
  • Extend the back end so that jobs can be submitted to CONDOR worker nodes, not just locally.
  • Package the web application, CONDOR, and the executables which the web application launches into a linux image that can be booted by coLinux.
  • Success is tested by the ability to run a standard RAxML tree search through this architecture.
Approach 
Pick a web application platform for scheduling analyses and modify it to enable phyloinformatic analyses on CONDOR grids. Package the web app, the tools it launches and CONDOR into a linux image that is easily deployable using coLinux.
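As a rough illustration of the back-end step, the Python sketch below builds the text of a CONDOR submit description for a single RAxML tree search. The function name, file names, default model and seed are assumptions made for this example. The web application would still have to write the text to a file, pass it to condor_submit, and monitor the job.

```python
def raxml_submit_description(alignment, run_name, model="GTRGAMMA",
                             seed=12345, executable="raxmlHPC"):
    """Build a CONDOR submit description for one RAxML tree search.

    The executable name is an assumption; on a Windows worker node it
    would typically be a .exe that is present on (or shipped to) the node.
    """
    # RAxML options: -s alignment file, -n run name, -m model, -p random seed
    arguments = f"-s {alignment} -n {run_name} -m {model} -p {seed}"
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f'arguments = "{arguments}"',
        # Ship the alignment to the worker node and bring results back.
        f"transfer_input_files = {alignment}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"output = {run_name}.out",
        f"error = {run_name}.err",
        f"log = {run_name}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(raxml_submit_description("example.phy", "run1"))
```

Keeping this mapping in one small function would make it easier to add MrBayes or PhyML later, since only the argument string changes.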
Challenges 
You will be given considerable freedom in choosing, extending, configuring and designing the web front, choosing Linux distros to extend and approaches for deploying them, but the challenge is to come up with a workable deliverable. Applications will be judged based on how detailed the implementation plan is, i.e. we have to be able to tell that you know how you're going to do this.
Involved toolkits or projects 
CONDOR, coLinux, RAxML, Galaxy, Mobyle or PISE (or similar), and a Linux distribution of your choosing, though you might consider PhyLIS as a starting point. You will need to be skilled in whichever programming language the web application is programmed in. This could be Perl (PISE) or Python (Galaxy, Mobyle), but proposals that make the case for another language (e.g. Ruby/RoR, Java) will be considered as well.
Degree of difficulty and needed skills 
Moderate. For the programming part, you will need to know how to map from a request that comes into the web application to launching a job on CONDOR and to monitor that job, notifying the user when the job is complete. You will need to be reasonably skilled at virtualization techniques and linux configuration. You will need to make a wise choice for web front end platform (research this!) and customize it so that commonly-used applications in phyloinformatics can be deployed through it.
Mentors 
Rutger Vos (primary), Barbara Gravendeel

Phylontal: Phylogenetically aware ontology alignment

Rationale 
This project aims to implement the remaining phylogenetic components of Phylontal, a tool for constructing multiple alignments of taxon specific ontologies while taking phylogeny into account. By taking phylogeny into account, this tool allows users to make meaningful alignments based on assertions of either homology or evolutionary convergence.
Approach 
The initial development of the tool occurred in 2009, with very basic functionality (ontology and phylogeny loading, some simple lexical-based alignment, and a Swing-based user interface). The focus for the SoC project will be on implementing additional methods for suggesting/automating initial alignments and extending the alignment across the tree, perhaps by walking pair-wise alignments down the tree towards the root. The student will have a fair amount of latitude in how this should be done. In addition, support for saving and exporting the alignment results should be added. The student may choose to re-implement or extend the existing user interface as they see fit. There will also be some updating required (e.g., reflecting updates to the OWLAPI).
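To give a feel for what suggesting an initial alignment from term labels might involve, here is a small Python sketch (the project itself calls for Java, so this is only an illustration). The normalization, the similarity threshold, and the best-match strategy are assumptions for the example, not a description of Phylontal's existing code.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a label and unify underscores and whitespace."""
    return " ".join(label.lower().replace("_", " ").split())


def lexical_alignment(terms_a, terms_b, threshold=0.8):
    """Suggest candidate term pairs whose labels are lexically similar.

    For each term in terms_a, keep the best-scoring term in terms_b if
    its similarity reaches the threshold. Returns (term_a, term_b, score)
    tuples, best score first.
    """
    suggestions = []
    for a in terms_a:
        best = None
        for b in terms_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (a, b, score)
        if best is not None:
            suggestions.append(best)
    return sorted(suggestions, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    fish = ["Pectoral_fin", "eye", "operculum"]
    tetrapod = ["pectoral fin", "Eye", "forelimb"]
    for a, b, score in lexical_alignment(fish, tetrapod):
        print(f"{a} <-> {b} ({score:.2f})")
```

Lexical matches like these are only candidates: whether a pair reflects homology or convergence is exactly the judgment the phylogenetic components are meant to support.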
Challenges 
Learning and applying approaches from the field of ontology alignment, developing methods for extending the alignment down the tree, refining the user interface (or converting it to a web-based interface if preferred).
Degree of difficulty and needed skills 
I expect this will be challenging. You should have familiarity with Java, with homology and related concepts, and with ontologies, preferably biological. Prior exposure to the ontology language OWL would be useful as would some familiarity with the concepts of ontology alignment. Because this tool places ontology alignment in an evolutionary context, the ideal student will have backgrounds in both biology and the use of ontologies. UI design skills are not an emphasis at this point.
Involved toolkits or projects 
Phylontal (disregard earlier repository at GoogleCode); Nexml package for Java; OWLAPI.
Mentor 
Peter Midford

A System for Exchanging Phyloinformatic Data

Rationale 
Many different repositories contain phyloinformatic data. Thus far, there has been no standard method for transmitting data between repositories. The Dryad Digital Repository has developed one-off methods for data transfer with selected repositories, including TreeBASE and GenBank. Dryad needs to transition from these one-off solutions to a more standard architecture. A standards-based architecture will improve the flexibility and generalizability of the data transfer process, allowing integration with a growing number of repositories and tools. The new architecture should build on the SWORD data transfer standard. Some functionality for SWORD already exists in Dryad, but it does not contain all features required to support data transfer with other systems.
Approach 
Modify the SWORD interface to Dryad, enabling it to process structured data packages in BagIt format with OAI-ORE manifests. Build a new export process for Dryad that submits data to external repositories using SWORD.
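For orientation, the Python sketch below writes the skeleton of a BagIt bag: a bagit.txt declaration, a data/ payload directory, and an MD5 payload manifest. It assumes version 0.97 of the BagIt draft and leaves out the OAI-ORE manifest and the SWORD deposit entirely. The function name and the example file are hypothetical, and the Dryad work itself would be done in Java within DSpace.

```python
import hashlib
import os


def write_bag(bag_dir, payload):
    """Write a minimal BagIt bag and return the manifest lines.

    payload maps file names (no subdirectories) to bytes; the files are
    stored under data/ as the bag's payload.
    """
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    manifest_lines = []
    for name in sorted(payload):
        content = payload[name]
        with open(os.path.join(data_dir, name), "wb") as handle:
            handle.write(content)
        checksum = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{checksum}  data/{name}")
    # Bag declaration (version assumed for this sketch)
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as handle:
        handle.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    # Payload manifest: one "checksum  path" line per payload file
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w", encoding="utf-8") as handle:
        handle.write("\n".join(manifest_lines) + "\n")
    return manifest_lines
```

A receiving repository can recompute the checksums listed in manifest-md5.txt to verify that the package arrived intact.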
Challenges 
Since the SWORD protocol and Dryad data types are well defined, the main challenge will be to determine how to make the needed changes within the existing Dryad architecture. You will also need to ensure the new functionality is configurable. The new features should be generalizable for other repositories that run on Dryad's underlying software platform, DSpace.
Involved toolkits or projects 
Dryad codebase, DSpace codebase, DSpace SWORD interface, BagIt standard, OAI-ORE standard
Degree of difficulty and needed skills 
Moderate. You will need to be experienced with Java and XML. You must be able to discuss implementation choices with the Dryad and DSpace communities.
Mentors 
Ryan Scherle (primary), other Dryad and DSpace developers as needed.
Discussion
Link to discussion on Google Plus

Deliver Phylogenetic Knowledge from TreeBASE to the Encyclopedia of Life

Rationale 
The Encyclopedia of Life (EoL) provides species pages aggregating available knowledge about the features of all described species. Phylogenetics is a core unifying concept for biodiversity and the life sciences, and EoL would benefit from using open standards (e.g. PhyloWS, NeXML) to collect and present data from TreeBASE at relevant and informative locations in the EoL site.
Approach 
Write code to incorporate taxonomic (classification) data in TreeBASE and extend TreeBASE's PhyloWS API to respond to requests for content on the basis of higher names, including predicates that can search on the scope (or taxonomic coverage) of a tree. Write code that allows EoL to use this to harvest trees from TreeBASE, and present the results in a meaningful fashion (e.g. HTML4 or SVG tree images).
Challenges 
Working with Hibernate and installing TreeBASE can present a significant learning curve. TreeBASE is a large codebase, using a multiple-layer technology stack, and making changes to it requires getting familiar with how its pieces work together.
Involved toolkits or projects 
TreeBASE, NeXML
Degree of difficulty and needed skills 
Medium to Hard difficulty. Knowledge of Java, Spring, Hibernate, and XML programming; understanding of RDFa. The successful applicant needs to be a proficient Java/Spring/Hibernate programmer who can rapidly jump in on a rather large code base.
Mentors 
Cyndy Parr (EOL) (primary); William Piel; Rutger Vos; Rod Page.

Towards a fully RNA-aware alignment editor

Rationale 

Already one of the leading sequence alignment editors for protein sequences, Jalview (GPL) is currently extending its scope and strives to offer increasing support for RNA sequences. In the last two summers of code, Jan Engelhart built on Lauren Lui's initial work introducing support for handling RNA secondary structure in Jalview. This was essential since, in contrast with proteins, alignment of RNAs is traditionally guided by the secondary structure, a simplified representation of the molecule's conformation. Such a structure-centric refinement of alignments requires specific visualization tools to curate the alignment in the context of the structure and, conversely, refine the structure to ensure its compatibility with the alignment. VARNA (GPL) is a visualization tool for RNA secondary structure that offers coloring/highlighting and annotation facilities, and Jan Engelhart's work in 2011 partially integrated it within Jalview as an external visualization panel. There's still lots to do, however, and plenty can be done in a summer!
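Secondary structure is commonly written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. As an illustration only (in Python for brevity, whereas Jalview and VARNA are Java, and this is not their code), the base pairs can be recovered with a stack:

```python
def base_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string.

    Only '(', ')' and '.' are handled; pseudoknot brackets and
    non-canonical interactions are outside the scope of this sketch.
    """
    stack = []
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    print(base_pairs("((..))"))
```

A pair list of this kind is what lets an editor highlight the partner column when a paired base is selected, or flag alignment columns whose residues cannot pair.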

Approach 

The main objective of the project is to extend Jalview's support for RNA alignments, with a special emphasis on its communication with VARNA. More specifically, the following tasks can be proposed, depending on the student's familiarity with RNA and her/his technical proficiency:

  • Streamline the communication between VARNA and Jalview, allowing a better feedback from the VARNA panel (RNA sequence and structure edits) and richer data projections from the alignment (alignment, covariation...). This might include changes in VARNA's code (maintained by one of the mentors).
  • Integrate RNA-specific tools, such as web services to retrieve 2D structure from 3D, predict 2D from sequence... (as seen in paradise!), or web-based prediction tools using the JABAWS system.
  • Enrich the alignment visualization based on the concept of isosteric base-pairs.
  • Add non-canonical base-base interactions (described in this paper) and multiple secondary structures to Jalview's data model.
  • Add synchronized "side by side" layouts (already implemented in VARNA) to help in the comparison of homologous (yet not necessarily identical) 2D structures.
Challenges 
  • Understand the architecture and design principles underlying two healthy open-source babies.
  • Propose a design for the sequence/2Dstructure/3Dmodel correspondence.
  • Be visually creative, yet concise and fully informative.
Degree of Difficulty and Needed Skills 
  • Medium to hard.
  • Good to excellent Java proficiency required.
  • Familiarity with either RNA computational biology or Web Services concepts and tools (and curiosity for the other).
Selection criteria 
  • Highly Organized
  • Java ninja skills (or keen to develop them :) )
  • Creative
Mentors
Yann Ponty (lead dev of VARNA)
Jim Procter (lead dev of Jalview)


Implement a machine-readable output format for PAML

Rationale 
PAML (Phylogenetic Analysis by Maximum Likelihood) is a popular software suite which implements a number of methods for estimating molecular evolutionary parameters by maximum likelihood. The program is controlled via a number of parameters and run modes, and output is sent to a human-readable text file. There has been significant interest in developing language-specific wrappers around PAML's functionality (e.g. in BioPython and BioPerl). Wrappers allow PAML to be used in automated pipelines and facilitate comparisons with other software for phylogenetic analysis. However, the burden of keeping up with updates to PAML's undocumented and changing text output has caused some language-specific wrappers to become outdated and unusable.
Suggested approach 
This project aims to fix the problem at the source, by altering the PAML source code to output its data in a more easily machine-readable format. This has the potential to greatly reduce the effort required to write robust PAML wrappers in any scripting language.
Flexibility 
Although the approach outlined here involves modifying PAML itself to produce a more easily-parsed format, alternative ideas are welcome. One option would be to update an existing library (in BioPerl, BioPython, or R) and add extensive unit tests to alert developers of any future incompatibilities; another would be to create an external script which translates PAML output into an easily-parsed format.
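The last option, an external translation script, might start out like the Python sketch below. The sample line and the regular expression are assumptions about what a log-likelihood line in codeml-style output could look like. Because the text format is undocumented and changes between releases, any real parser would have to be checked against actual output files.

```python
import json
import re

# Assumed shape of a log-likelihood line; verify against real PAML output.
LNL_PATTERN = re.compile(
    r"lnL\(ntime:\s*(?P<ntime>\d+)\s+np:\s*(?P<np>\d+)\):\s*(?P<lnL>-?\d+\.\d+)"
)


def extract_lnl(text):
    """Return the first log-likelihood record found in text, or None."""
    match = LNL_PATTERN.search(text)
    if match is None:
        return None
    return {
        "ntime": int(match.group("ntime")),
        "np": int(match.group("np")),
        "lnL": float(match.group("lnL")),
    }


def to_json(text):
    """Serialize the extracted record (or null) as JSON."""
    return json.dumps(extract_lnl(text), sort_keys=True)


if __name__ == "__main__":
    sample = "lnL(ntime: 11  np: 13):  -1234.567890      +0.000000"
    print(to_json(sample))
```

Unit tests built around stored example outputs would make a parser like this fail loudly, rather than silently, when a new PAML release changes the layout.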
Challenges 
  • Code organization -- the main challenge would be to implement an additional output format while interfering with the existing PAML codebase as little as possible.
  • Output format complexity -- many decisions will have to be made about how much detail and complexity to include in the output format; for example, options may be required to enable or disable detailed estimates of codon frequencies or ancestral sequences.
  • Testing and documentation -- the best way to ensure that a new output format is widely adopted and future-proof is to document it well. This may involve creating unit tests with example datasets, example parsers in popular scripting languages, and extensive written documentation.
Degree of difficulty and needed skills 
  • Medium. Requires familiarity with PAML (or a willingness to learn its myriad components) and some C programming experience.
Mentors 
Greg Jordan

A richer library for calculating the maximum likelihood value of a tree

Rationale 
Beagle is a high-performance library which implements the pruning algorithm and transition probability calculations that are required to calculate a likelihood on a fully specified tree+model pair. However, using beagle still requires a high level of sophistication on the part of the client code. The beagle API does not provide functions for calculating transition probability matrices from rate matrices (beagle will calculate transition probabilities from the eigen decomposition of the rate matrix). Nor does it include the ability to optimize the parameters of the models. Including these features would make it much easier for client code to implement maximum likelihood techniques. It would also be helpful to have a branch length optimizer that could handle zero-length branches correctly.
Approach 
Write a library that extends the beagle-lib API to allow maximum likelihood scores to be calculated for a fixed tree topology by optimizing model parameters while taking into account possible bounds on those parameters (such as a zero branch length). The pruning portion of the calculation (and many other function calls) can simply be delegated to calls to the beagle-lib. Numerous open-source packages for numerical optimization exist. Thus, much of this project will consist of glueing together existing libraries, and making sure that the numerical optimization routines are robust to the nature of the likelihood surface on trees.
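To make the boundary issue concrete, here is a small Python sketch that does not use beagle-lib at all. It optimizes a single branch length under the Jukes-Cantor model for a pair of sequences, using a golden-section search restricted to a closed interval that includes zero. The function names, the upper bound, and the tolerance are illustrative assumptions. A real library would optimize many branches and model parameters and delegate the likelihood evaluation to beagle.

```python
import math


def jc69_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of branch length t (expected substitutions per site)
    under Jukes-Cantor for two sequences differing at n_diff of n_sites."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(no change to a given base)
    p_diff = 0.25 - 0.25 * decay   # P(change to one specific other base)
    log_l = (n_sites - n_diff) * math.log(0.25 * p_same)
    if n_diff > 0:
        if p_diff <= 0.0:
            return -math.inf       # t == 0 cannot explain any difference
        log_l += n_diff * math.log(0.25 * p_diff)
    return log_l


def optimize_branch_length(n_sites, n_diff, lower=0.0, upper=10.0, tol=1e-8):
    """Maximize the log-likelihood on [lower, upper] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    f_c = jc69_log_likelihood(c, n_sites, n_diff)
    f_d = jc69_log_likelihood(d, n_sites, n_diff)
    while b - a > tol:
        if f_c > f_d:
            # Maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, f_d = d, c, f_c
            c = b - inv_phi * (b - a)
            f_c = jc69_log_likelihood(c, n_sites, n_diff)
        else:
            # Maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, f_c = c, d, f_d
            d = a + inv_phi * (b - a)
            f_d = jc69_log_likelihood(d, n_sites, n_diff)
    return (a + b) / 2.0
```

With identical sequences the likelihood decreases monotonically in t, so the search converges to the lower bound instead of failing, which is the behaviour a zero-length branch requires. Golden-section search assumes a single peak on the interval, which holds here as long as fewer than three quarters of the sites differ.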
Challenges 
Finding and effectively implementing an optimization algorithm that will perform well and deal sensibly with parameter boundary conditions (e.g. zero branch lengths). Deciding on a tree structure to use.
Involved toolkits or projects 
Beagle. A library with numerical optimization capabilities (e.g. Purple or the GSL). It would probably make sense to use a data structure for a tree that is already used in other software (but none seem to be widely used in the phylogenetics community using C/C++).
Degree of difficulty and needed skills 
Medium difficulty for a programmer who is knowledgeable in C++.
Mentors 
Daniel Money and Mark Holder

How to apply

The application period closed on April 6, 2012. We got 45 applications. Thanks to all those who applied, and good luck!

Should I apply?

The students accepted into the program are very diverse in terms of background, geography and interests. We have project ideas that require significant technical skills, and those that do not. Some would benefit from having a biology background, but this is not necessary for most. Our past students have been computer scientists interested in learning more about biological data analysis, bioinformatics students wanting to learn new technical skills, and biologists just starting out in open source software development.

The most important requirements are that you are interested in contributing to open source software development, during and beyond GSoC, and that you demonstrate that you can collaborate and communicate remotely with a distributed group of developers. It is not uncommon to rank an application from a more enthusiastic and engaged potential participant higher than one from a less engaged super-star programmer.

This is, of course, assuming you meet the basic eligibility requirements from Google.

Before you apply

The single most important thing you can do before you apply is to contact us!

  1. Look at our project ideas to get an overall sense for the kinds of projects we support.
  2. Pick the idea that appeals most to you in terms of goals, context, and required skills. If you want to apply with your own idea, contact us early on so we can try to find a mentor. Alternatively, you can suggest a candidate mentor to us if you know one.
  3. Contact us on the phylosoc@nescent.org mailing list about the project idea you have in mind. This will reach all mentors and our organization administrators. Other student applicants will not see your message.
  4. Write a project proposal draft, include a project plan (see below), and send that to the mailing list.

Talking to us in advance of the application deadline is necessary for a strong application. Entrance into the program is highly competitive, and it is extremely rare for us to accept a student who did not contact us in advance.

  • You can ask questions about the project idea, and bounce your project plan off of us. Students who engage with us before submitting their application always have vastly better applications and project plans.
  • We are most interested in students who give us evidence that they have already or might develop a sustained interest in becoming future contributors to our open source projects. How do you convey that if the first time we see something from you is when we score your application?
  • The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily. If you can't find the time to engage in communication with your prospective organization before you have been accepted, how will you argue that you will be a good communicator once accepted?

If you are interested in hearing from our previous Summer of Code mentors or students, you can also send email to wg-phyloinformatics@nescent.org. It's a larger list, so in that case try to be specific about what kinds of people you'd like to hear back from.

Writing your application

When applying, please provide the following in your application material (in addition to the information requested by Google).

  1. Name, email, university and current enrollment, short bio
  2. Your interests, what makes you excited.
  3. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  4. A summary of your programming experience and skills, including programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced skills.
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing phylosoc@nescent.org (see below). Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
    • Note that the start of the coding period should be precisely that - you start coding. That is, tasks such as understanding a code base, or investigating a tool or a standard, should be done before the start of coding, during the community bonding period. If you can't accomplish that due to conflicting obligations during the community bonding period, make that clear in your application, and state how you will make up the lost time during the coding period.
    • For examples of strong project plans from prior years, look at our Ideas pages from those years, specifically the Accepted Projects section (e.g., 2008, 2009, 2010, 2011). Visit the listed project pages; many have the initial project plan (which was subsequently updated and revised as the real project unfolded). Good examples are Automated submission of rich data to TreeBASE, PhyloXML support in Biopython, or Mesquite Package to View NeXML Files.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. Our view is that it should be the same as an internship you take up over the summer, except this one is remote (and almost certainly more fun :-) If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.
  7. (optional) If the project idea you are applying for suggests or requests completion of a coding exercise, provide a pointer to your results (such as code you wrote, output, test results). Not all project ideas have this requirement; if in doubt, ask by emailing the mentors list.

General GSoC information and timelines

Timeline

The full timeline is on the Google FAQs. Highlights:

  • Mentoring organizations (i.e. NESCent) apply between Feb 27 - March 9, 2012.
  • Students apply online between March 26-April 6, 2012.
  • Mentors score the applications, and Google allocates slots to each organization. If Google allocates N slots to an organization, the top N scoring students for that organization get accepted. Successful applications will be announced on April 23, 2012.
  • April 22-May 20 is the Community Bonding Period, when students are expected to acquaint themselves with the project, the code bases involved, and the respective developer community, as well as set up everything needed for developing and committing (code repository access, mailing list subscription, software and dependencies installation, compiling, etc).
  • The coding period is from May 21 to August 20, 2012.

FAQs