Phyloinformatics Summer of Code 2013

From NESCent Informatics Wiki

We just finished another successful year as a mentoring organization for the Google Summer of Code! Rather than mentoring students in a single open-source project, NESCent is an umbrella organization for many projects in evolutionary informatics. This means that NESCent projects use a wide variety of technologies and programming languages. We also participated in the GNOME Outreach Program for Women, which runs in parallel with Google Summer of Code.

You can read a summary of our 2013 program on the Summaries page.


Contact information

How does GSoC work?

Very briefly, Google sponsors students to spend the summer working on open-source programming projects with mentoring organizations. NESCent is one such organization. Google provides tons of documentation about GSoC:

Accepted projects

We have accepted 7 students through GSoC and 1 through the GNOME Outreach Program for Women.

This is the coding period for Summer of Code. Everyone should be busy writing, testing and documenting code. Check the project pages (links below) to keep up with progress.

Phylet: Open Tree of Life Graph Visualization and Navigation

  • Abstract: Phylet is a visualization of the tree of life being constructed by the Open Tree of Life project. Currently in alpha release, Phylet runs off of a database with a small portion of the tree (angiosperms), using the graph database Neo4j to power the information and the d3 JavaScript library to power the visualization. This project will focus on optimizing the visualization, experimenting with different formats and using usability tests to make it approachable to a wide range of audiences with differing levels of programming literacy.
  • Student: Chanda Phelan
  • Email: cdphelan at umich.edu
  • Mentors: Stephen Smith, Gabriel Harp, Stephanie Poppe, Mariano Cecowski
  • link at GNOME
  • Project plan
  • Bitbucket repository
  • Blog about the summer's progress

Extend PartitionFinder to automatically partition DNA and protein alignments

  • Abstract: PartitionFinder is a piece of phylogenetics software that combines similar sets of sites in a DNA or amino acid alignment into a partitioning scheme. This project will expand the utility of the software by implementing a new algorithm to automatically split either user-defined subsets or entire alignments into one or more new subsets using site-specific likelihoods under different rate categories.
  • Student: Paul Frandsen
  • email: paulbfrandsen@gmail.com
  • link to project home page
  • link to project source code repository
  • link to Melange project page
  • Mentors: Rob Lanfear and Brett Calcott

Identifying problems with gene predictions

  • Abstract: The goal of the project is to validate predicted genes by computing a confidence score and flagging possible errors or untrusted regions in the sequence. The validation results will provide evidence to guide sequence curation and can be useful for improving gene prediction tools or trying new approaches. The main target users of this tool are biologists who want to validate data obtained in their own laboratories.
  • Student: Monica-Andreea Dragan (email)
  • Mentors: Anurag Priyam (email) and Yannick Wurm (email)
  • GSoC Proposal: link to Melange project page
  • Project Homepage: homepage
  • Watch project progress: project diary
  • Blog: blog
  • Source code: github

Extend PhyML to use the BEAGLE library

  • Abstract: This project aims to use CUDA/OpenCL in PhyML (via the BEAGLE library) for its likelihood calculations. This is the core computation performed when determining the log-likelihood score of an inferred phylogeny. First, the parameter space is explored using these likelihoods; second, partial likelihoods are aggregated over sub-trees to yield the final score. The challenge is in interpreting the various phylogenetic parameters (tree topology, evolutionary rate categories, eigendecomposition of the substitution matrix, etc.) and making the appropriate BEAGLE calls (which are subsequently pipelined to the CPU/GPU). I look forward to contributing in a meaningful way. Moreover, this project will also help me to align my graduate research direction.
  • Student: Imran Fanaswala
  • link to Melange project page
  • Source Code
  • Mentors: Andrew Rambaut, Marc Suchard & Stéphane Guindon

Implementing Machine Learning Algorithms for Classification and Feature Selection in mothur

Codon Alignment and Analysis in Biopython

  • Student: Zheng Ruan
  • Email: zruan1991 at gmail.com
  • Abstract: A codon alignment is an alignment of nucleotide sequences in which trinucleotides correspond directly to amino acids in the translated product. Codon alignment is useful because it distinguishes different types of nucleotide substitution, and it is frequently used to test neutrality and calculate selection pressure. My job in the project is to develop codon alignment support for Biopython. This will facilitate quick, large-scale evolutionary analysis in the Biopython environment.
  • link to Melange project page
  • Project Source Code
  • Project Home Page
  • Mentors: Eric Talevich, Peter Cock
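
To make the idea of a codon alignment concrete, here is a minimal sketch in plain Python. It is not the Biopython API developed in this project; the function name `thread_codons` and the toy sequences are assumptions made for illustration. It threads an unaligned coding sequence onto an already aligned protein row, so that every amino acid column becomes one codon and every protein gap becomes three nucleotide gaps. It does not check that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Build a codon-aligned nucleotide row from an aligned protein row."""
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) % 3 != 0 or len(codons) != residues:
        raise ValueError("nucleotide length must be 3x the ungapped protein length")
    next_codon = iter(codons)
    out = []
    for aa in aligned_protein:
        # A protein gap becomes a three-column nucleotide gap;
        # a residue consumes the next codon.
        out.append(gap * 3 if aa == gap else next(next_codon))
    return "".join(out)


# Toy example (hypothetical sequences).
row1 = thread_codons("MK-L", "ATGAAACTG")
row2 = thread_codons("M-RL", "ATGCGTCTG")
print(row1)  # ATGAAA---CTG
print(row2)  # ATG---CGTCTG
```

Because gaps only ever appear in multiples of three and in frame, substitutions can be compared codon by codon, which is what tests of neutrality and selection pressure rely on.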

Phylogenetics in Biopython: Filling in the gaps

  • Abstract: Biopython is a set of open-source Python packages and modules for bioinformatics work. The Bio.Phylo package already implements some basic phylogenetics tasks: basic tree operations, parsers for Newick, Nexus and PhyloXML, and wrappers for PhyML, RAxML and PAML. However, some important components remain to be implemented to better support phylogenetic workflows. These include simple tree construction algorithms (UPGMA, NJ and MP), consensus tree searching (Strict, Majority-Rule and Adams Consensus), tree comparison and tree visualization. In this project, the first two will be implemented.
  • Student: Yanbo Ye
  • Email: yeyanbo289@gmail.com
  • link to project homepage
  • link to Melange project page
  • link to project sourcecode
  • link to project blogs
  • Mentors: Mark Holder, Jeet Sukumaran, Eric Talevich
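
As an illustration of the simplest tree construction algorithm named in the abstract, the following is a minimal UPGMA sketch in plain Python. It is not the Bio.Phylo implementation; the function name `upgma`, the matrix layout and the toy distances are assumptions. It repeatedly merges the two closest clusters and recomputes distances as size-weighted averages. Branch lengths are omitted to keep it short.

```python
def upgma(names, dist):
    """Cluster taxa with UPGMA and return a Newick string (topology only).

    names: list of taxon labels.
    dist:  symmetric distance matrix as a list of lists.
    """
    # cluster id -> (Newick text, number of leaves in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        (text_a, n_a), (text_b, n_b) = clusters.pop(a), clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Average over all leaf pairs, i.e. weighted by cluster size.
            new_d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ("(%s,%s)" % (text_a, text_b), n_a + n_b)
        next_id += 1
    text, _ = next(iter(clusters.values()))
    return text + ";"


# Toy distance matrix (hypothetical values).
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(["A", "B", "C", "D"], matrix))  # (D,(C,(A,B)));
```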


Project ideas

This is our list of potential project ideas.

If you fancy being a student with NESCent:

  1. Join the Phyloinformatics community on Google Plus.
  2. Post an introduction and some info about which project you are interested in.

If you fancy being a mentor or co-mentor:

  1. Join the Phyloinformatics community on Google Plus.
  2. Add your name to our list of mentors
  3. Add a project idea below: Sign in, and then click the edit link to the right of this section. There is a template in comment tags. Check out the tips and guidelines on creating project ideas. Once you have created an idea here (or before!), please post a note on the Phyloinformatics community page - this is where we will have our discussions (rather than the mailing list from previous years).


Phylet: Open Tree of Life Graph Visualization and Navigation

Rationale 

Evolutionary phylogenetics studies and links all biodiversity through a shared evolutionary history. This project assists in the production and visualization of the first online, comprehensive first-draft tree of all 1.9 million named species, accessible to both public and scientific communities. Assembly of the tree incorporates previously published results, with strong collaborations between computational and empirical biologists to develop, test and improve methods of data synthesis. Visualization of those data is a central task for communicating uncertainties, reporting on relationships, and formulating new, leading hypotheses.

Approach 

Phylet is an alpha-release web application that visualizes the Open Tree of Life using a graph network visualization (Spring Force-Directed Acyclic Graph algorithm using the D3 JavaScript library). Thus far, we have accomplished a partial graph visualization using a subset of available data, an incremental graph API (HTTP JSON requests), a client-server data implementation (neo4j, Python, py2neo, web-cache), database and local searches (JavaScript, Python), and a Smart Undo/Session interface (JavaScript). To see Phylet in action, check out the Phylet Test Implementation.

Phylet Bitbucket Code Repository

Challenges 

We've implemented a variety of rich data navigation features, but we need to continue developing these features and refining them into seamless workflows that match specialized and generalist researcher desires for navigability, narration, context, browse-ability, and search friendliness. The aim is to enhance a user's understanding of the data and its relationships, allowing them to answer specific questions as well as form new questions about the data.

Challenge One: refine and develop the user interface, workflows, and navigability for specialized and generalist users. (HTML, CSS, JavaScript, Bootstrap.js, D3.js)

Challenge Two: develop solutions for automatic information extraction and data mining of the available data, serving it to the UI for better navigability and insight into relationships and features of the dataset. (Python, neo4j, py2neo)

Involved toolkits or projects 
Degree of difficulty and needed skills 

Depending on one's expertise and specialty, the project would be easy-moderate to moderate-difficult. One should understand or be willing to improve one's skills in object-oriented programming, collaborative coding (e.g. Bitbucket, Mercurial), JavaScript libraries, API integration, and the ability to link functionality across different technologies (e.g. HTML5/Python). One should have a good intuition for data analysis and synthesis, the ability to ask questions, and be curious about how data visualization can be abstracted and simplified for use by specialized and generalist users. A smart design sensibility (both behavioral and visual) is critical, as are self-direction and the motivation to define and solve ambiguous problems without direction or clearly stated goals.

Mentor 
  • Stephen Smith: Ann Arbor, MI
Co-mentors 
  • Gabriel Harp: San Francisco, CA
  • Stephanie Poppe: Bloomington, IN
  • Mariano Cecowski: Ljubljana, Slovenia


Extend PartitionFinder to automatically partition DNA and protein alignments.

Rationale 

PartitionFinder is software that co-estimates partitioning schemes and models of molecular evolution for DNA and amino acid alignments. Partitioning allows different sets of sites in an alignment to be modelled independently, and choosing the right partitioning scheme can dramatically improve the estimation of phylogenetic trees, rates of evolution, and dates of divergence. Partitioning is an important step in phylogenetics, especially with large multi-locus datasets, and PartitionFinder is becoming increasingly widely used:

https://github.com/brettc/partitionfinder
http://mbe.oxfordjournals.org/content/29/6/1695

PartitionFinder works by joining together similar sets of sites in an alignment. This is a good start, but it still requires the user to define the initial sets of sites, such as the codon positions of protein-coding genes. There would be considerable benefit in a method that could partition a dataset from scratch, either in the absence of user-defined sets of sites, or in combination with them.

There are many ways that a given set of sites could be divided into more than one set of sites, for example using the rates of evolution of different sites, or the likelihoods of sites under different models of evolution. Implementation of these methods could provide huge improvements in partitioning and phylogenetic inference, especially for large genome-scale datasets that don't have very obvious a-priori partitioning schemes, such as those derived from ultra-conserved elements.
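
One of the ideas above, dividing a set of sites by their rates of evolution, can be sketched as follows. This is only an illustration, not the algorithm implemented in PartitionFinder: the function name `split_by_rate` is hypothetical, the per-site rates are assumed to have been estimated already by some other tool, and equal-sized bins are used purely for simplicity.

```python
def split_by_rate(site_rates, n_subsets):
    """Split alignment columns into n_subsets groups of similar rate.

    site_rates: dict mapping column index -> estimated rate (assumed given).
    Returns a list of sorted column-index lists, slowest group first.
    """
    if not 1 <= n_subsets <= len(site_rates):
        raise ValueError("n_subsets must be between 1 and the number of sites")
    # Order columns from slowest to fastest; ties are broken by column index.
    ordered = sorted(site_rates, key=lambda col: (site_rates[col], col))
    size, extra = divmod(len(ordered), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        end = start + size + (1 if i < extra else 0)
        subsets.append(sorted(ordered[start:end]))
        start = end
    return subsets


# Hypothetical rates for six columns; columns 2 and 5 mimic fast third codon positions.
rates = {0: 0.1, 1: 0.2, 2: 2.5, 3: 0.1, 4: 0.3, 5: 3.0}
subsets = split_by_rate(rates, 2)
print(subsets)  # [[0, 1, 3], [2, 4, 5]]
```

Each resulting list of columns could then be treated as a candidate subset and scored in the same way as a user-defined one.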

Approach 

This project would involve implementing and testing methods to divide up sets of sites in DNA and amino acid alignments in PartitionFinder.

Challenges 

There would be four challenges:

  • Writing methods to split sets of sites into two or more smaller sets;
  • Implementing those methods in PartitionFinder;
  • Incorporating the methods into existing algorithms in PartitionFinder;
  • Testing the methods using real and simulated data.
Involved toolkits or projects 

PartitionFinder, Python

Degree of difficulty and needed skills 

Moderate. You would need to be familiar with Python, and familiar with (or willing to learn) a bit of molecular phylogenetics.

Mentors 

The mentors are a combination of a molecular-evolutionary biologist and Python programmer (Rob Lanfear), and a former professional Python programmer turned philosopher and molecular biology software programmer (Brett Calcott).

  • Rob Lanfear
  • Brett Calcott


Identifying problems with gene predictions

Rationale 

Genomes of emerging model organisms are now being sequenced at almost no cost. However, obtaining accurate gene predictions remains challenging. Even the best gene prediction algorithms make substantial errors, which lead to misleading results in genome wide analyses of single species (e.g. for studying gene expression using RNAseq) and in multi-species comparisons (e.g. for studying gene family evolution), or even make such analyses impossible.

Goal 

The aim of this project is to build a software tool that takes as input a set of gene predictions from a FASTA or a GFF3 file and identifies potential problems with each gene sequence, so that problematic genes can be fixed or eliminated before analysis.

Approach 

For each predicted gene identify significantly similar sequences in the Uniprot/Swissprot reference database using BLAST. Using the information provided in the BLAST results, propose and implement algorithms to answer the following questions:

  • Is the length of the focal sequence within the distribution of the lengths of similar sequences?
  • Is there evidence that the focal sequence is missing a domain present in similar sequences?
  • Is there evidence that the focal sequence actually represents two or more genes that were incorrectly merged?
  • Is there evidence that a single real gene was incorrectly split into two or more gene predictions?
  • Does the focal sequence contain a frameshift that would lead to a nonfunctional protein?

Significance scores for each answer for each gene prediction should be output to an easily readable file.
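
The first question in the list above (is the focal length within the distribution of hit lengths?) could be answered along these lines. This is a rough sketch under stated assumptions, not the validation method the project delivered: the function name `length_check`, the 5% tail threshold and the example lengths are all made up for illustration, and the hit lengths are assumed to have been extracted from BLAST results beforehand. The project itself targets Ruby; Python is used here only to keep the example short.

```python
def length_check(focal_length, hit_lengths, tail=0.05):
    """Flag a predicted protein whose length is extreme among its BLAST hits.

    Returns (fraction of hits shorter than the focal sequence, verdict).
    """
    if not hit_lengths:
        return None, "no hits"
    n = len(hit_lengths)
    frac_shorter = sum(1 for h in hit_lengths if h < focal_length) / n
    frac_longer = sum(1 for h in hit_lengths if h > focal_length) / n
    if frac_longer >= 1 - tail:
        verdict = "too short"   # e.g. a truncated or split prediction
    elif frac_shorter >= 1 - tail:
        verdict = "too long"    # e.g. two genes merged into one prediction
    else:
        verdict = "ok"
    return frac_shorter, verdict


# Hypothetical lengths (in amino acids) of ten similar reference proteins.
hits = [310, 305, 298, 320, 315, 300, 312, 308, 303, 318]
print(length_check(309, hits))  # (0.5, 'ok')
print(length_check(150, hits))  # (0.0, 'too short')
print(length_check(640, hits))  # (1.0, 'too long')
```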

The formats of input files, locations of BLAST binaries and reference protein databases should be configurable and / or automagically detected.

It is strongly recommended to follow Behavior Driven Development (BDD). Proposals with specifications (in minitest/spec format) together with a rough timeline will be considered.

You can learn about BLAST from Wikipedia, the BLAST publication or SequenceServer. Feel free to suggest complementary approaches to help identify potential issues in a gene prediction.

Deliverables 

Documented source code with revision history available on Github, with supporting documentation / simple website on how to use it, expected input, and the output format. The program should be easily installable as a RubyGem. If successful this work may result in co-authorship on a scientific publication.

Mentors 
Links 
  • PDF of the project idea.


Extend PhyML to use the BEAGLE library

Rationale 
PhyML is software that estimates maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences. The main strength of PhyML lies in the large number of substitution models coupled to various options to search the space of phylogenetic tree topologies, ranging from very fast and efficient methods to slower but generally more accurate approaches. PhyML was designed to process moderate to large data sets. Much of the speed comes from heuristic search techniques that efficiently search the space of possible trees. Parallelization is limited to MPI-based distribution of bootstrap replicates. Like most such software, it uses the Felsenstein pruning algorithm to compute the likelihood of individual trees using continuous-time Markov models. Recently, considerable advances have been made in the fine-scale parallelization of this algorithm, in particular targeting massively parallel hardware such as NVIDIA GPGPUs (general-purpose graphics processing units). There are considerable improvements in overall speed to be gained for PhyML by combining efficient search strategies with high-speed likelihood computation.
BEAGLE is a cross-platform library that implements Felsenstein's algorithm on a range of parallel and vector hardware, including CUDA-based GPGPUs and SSE instructions on Intel chips. This project would involve extending PhyML to make calls to the BEAGLE library in place of the internal likelihood calculations.
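
For readers unfamiliar with the pruning algorithm, here is a minimal single-site sketch in plain Python. It does not use the BEAGLE or PhyML APIs; the tree encoding, the function names and the choice of the Jukes-Cantor model are assumptions made to keep the example small. It computes partial likelihoods at each node from those of its children, which is the step that a library such as BEAGLE parallelizes across sites and states.

```python
import math

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partials(node):
    """Return L[x] = P(observed tips below node | node is in state x).

    A tip is a single base character; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        p, child_partials = jc69(t), partials(child)
        for x in range(4):
            result[x] *= sum(p[x][y] * child_partials[y] for y in range(4))
    return result


def site_likelihood(root):
    """Sum the root partials over the uniform Jukes-Cantor base frequencies."""
    return sum(0.25 * lx for lx in partials(root))


# Hypothetical three-taxon tree ((A:0.1, A:0.1):0.05, C:0.2) for one alignment column.
tree = [([("A", 0.1), ("A", 0.1)], 0.05), ("C", 0.2)]
print(math.log(site_likelihood(tree)))
```

The log-likelihood of a whole alignment is the sum of such per-site log-likelihoods, so the same calculation is repeated for every column and every candidate tree.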
Approach 
  • Become familiar with the API of BEAGLE and linking of client software to the library (an example simple client in C is available in the BEAGLE package).
  • Understand the likelihood calculations in PhyML.
  • Replace likelihood calculation calls in PhyML with homologous calls to BEAGLE.
  • Test and validate.
  • Performance testing with different hardware and data sets.
Challenges 
  • Initializing BEAGLE with appropriate data from PhyML (data representation transformations may be required).
  • Creating appropriate tree update messages to pass to BEAGLE for PhyML's search strategies.
  • Optimization to provide significant speed ups.
Involved toolkits or projects 
PhyML and BEAGLE library.
Degree of difficulty and needed skills 
Medium-hard. Needed skills: C/C++, command line build tools, molecular evolutionary models and maximum likelihood phylogenetic algorithms.
Mentors 

Extending Jalview for bacterial genomic sequence alignments

Rationale 
Jalview is primarily designed for handling alignments of RNA and protein where all the sequences are arranged in the same orientation, e.g. N-C or 5-prime to 3-prime. For certain types of bacterial sequence alignment, however, the homologous regions are only alignable after permuting or rotating one or more sequences to put all the homologous regions in the same order. The aim of this project is to introduce new features in Jalview that allow cyclic permutations to be created and visualized and, potentially, new web services to be accessed for computing optimal cyclically permuted alignments.
Approach 
The main task is to enhance Jalview's sequence alignment model so that circular permutations can be created and edited interactively. A second objective is to allow Jalview to access web services that return alignments involving circular permutations (such as CSA, linked to below). There are also minor visualization and user interface enhancements needed to support users working with bacterial genomic sequence data.
Challenges 
Adapt Jalview's editing and undo-system to introduce new datamodel operations and key-strokes for rotating a sequence in the alignment view. Create emitters/parsers for flat-file representations of cyclically permuted sequences that enable Jalview to import files generated by command line and web-based tools such as CSA.
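
The core data-model operation, rotating a circular sequence, is simple to state in code. The sketch below is in Python rather than Java and is not Jalview or CSA code; the function names are hypothetical, and the brute-force search is only a stand-in for a proper cyclic alignment algorithm.

```python
def rotate(sequence, offset):
    """Cyclically permute a circular sequence so that position `offset` comes first."""
    if not sequence:
        return sequence
    k = offset % len(sequence)
    return sequence[k:] + sequence[:k]


def best_rotation(reference, query):
    """Return the rotation offset of query with the most positional matches.

    Brute force over all offsets; ties go to the smallest offset.
    """
    def matches(candidate):
        return sum(1 for a, b in zip(reference, candidate) if a == b)
    return max(range(len(query)), key=lambda k: matches(rotate(query, k)))


# Toy circular sequence (hypothetical) whose origin was chosen differently.
reference = "ACGTTGCA"
query = rotate(reference, 3)
print(query)                                     # TTGCAACG
offset = best_rotation(reference, query)
print(offset, rotate(query, offset))             # 5 ACGTTGCA
```

An undoable edit in an alignment editor would only need to store the offset, since `rotate(rotate(s, k), -k)` restores the original sequence.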
Involved toolkits or projects 
Jalview and cyclic-permutation and alignment tools such as CSA.
Degree of difficulty and needed skills 
Java. Easy to Medium.
Mentors 
  • Primary: David Booth - lecturer at University of Dundee[1]
  • co-mentor: Jim Procter - primary developer of Jalview.

Implementing semantically rich NeXML I/O in R

Rationale 
The R programming language is seeing increasingly broad adoption in evolutionary informatics, for example in the popular ape package for phylogenetics and evolution, the ouch package for comparative analysis, the back-end for the DateLife service and client code for querying treebase. The NeXML format enables richly annotated phylogenetic data, such as is produced by TreeBASE and such as might be embedded in the output of web services such as DateLife and others. Although I/O libraries for NeXML exist in perl, python, ruby, java and javascript, no such library exists for R. We believe that a lightweight API for reading and writing NeXML such that the data are compatible with commonly-used objects (such as those from the "ape" package), and the metadata can be handled as simple dictionary-like structures would have considerable impact for R users.
Approach 
The main task is to build a library that reads and writes NeXML data using one of the popular XML parsers for R. When reading the data, it should be possible to fetch trees, alignments and taxa as objects with a familiar "feel", e.g. by making them compatible with "ape" objects. Likewise, when writing, users should be able to pass in objects that come from other libraries. In addition, metadata annotations should be exposed as simple key/value pairs (with some extra syntactic sugar) so that users can take advantage of the full richness that the NeXML standard enables.
Challenges 
It is important that considerable time is allocated for implementing the metadata system as this needs to work well to make it worth using NeXML as opposed to other standards.
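
To show what "metadata as simple key/value pairs" might look like, here is a small sketch. It is written in Python with the standard library rather than R, purely for illustration. The sample document is hand-written and simplified, not validated against the NeXML schema, and the function name `otu_metadata` is an assumption. It collects the `property` and `content` attributes of literal `meta` elements attached to each OTU into a dictionary.

```python
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"

# Hand-written, simplified sample (assumption: not a complete or validated NeXML file).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dc="http://purl.org/dc/elements/1.1/" version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens">
      <meta id="m1" xsi:type="nex:LiteralMeta" property="dc:source" content="example study"/>
    </otu>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
</nexml>"""


def otu_metadata(nexml_text):
    """Map each OTU label to a dict of its literal metadata annotations."""
    root = ET.fromstring(nexml_text)
    result = {}
    for otu in root.iter(NEX + "otu"):
        annotations = {}
        for meta in otu.findall(NEX + "meta"):
            key, value = meta.get("property"), meta.get("content")
            # Only literal annotations carry both attributes.
            if key is not None and value is not None:
                annotations[key] = value
        result[otu.get("label")] = annotations
    return result


print(otu_metadata(SAMPLE))
# {'Homo sapiens': {'dc:source': 'example study'}, 'Pan troglodytes': {}}
```

An R library would expose the same information through R data structures, for example a named list per taxon.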
Involved toolkits or projects 
NeXML and R.
Degree of difficulty and needed skills 
Difficulty is Medium: competence in R programming (i.e. being able to create a package for CRAN) and XML processing is required, as well as the ability and willingness to learn concepts related to semantic annotation (RDF, triples).
Mentors


Automated Tree Reconciliation using Phylotastic trees

Rationale 
Reconciliation of gene trees and species trees, a major use case in bioinformatics that is commonly used in genome annotation pipelines, represents an excellent case for leveraging phylotastic services, which provide species trees on the fly. A proof-of-concept application called reconcili-o-tastic illustrates the basic idea: (1) start with a tree of gene (or protein) sequences, (2) use the sequence IDs to query for species names, (3) use the list of species to get a phylotastic species tree, and (4) reconcile the gene tree and the species tree, displaying the results for the user. The proof-of-concept application is very limited: it only works on a small set of pre-loaded files, it is hard-coded to use only the "mammals" tree accessed via Rutger Vos's phylotastic pruner, and it only uses Chris Zmasek's GSDI engine for reconciliation. The aim of this GSoC project is to move toward a more full-featured application that is well designed to satisfy user needs, and that will attract attention from potential users and sponsors.
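
Step (4) can be illustrated with the classic least-common-ancestor mapping, sketched below in plain Python. This is not the GSDI engine or the forester library; the tree encodings, function names and toy trees are assumptions. Each gene-tree node is mapped to the lowest species-tree node that contains all species below it, and a node is called a duplication when it maps to the same species node as one of its children.

```python
def ancestors(node, parent):
    """Path from node up to the root of the species tree, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(a, b, parent):
    """Least common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_node, parent, events):
    """Map a gene-tree node onto the species tree and record internal events.

    Gene-tree leaves are species names (steps 1-3 are assumed done);
    internal nodes are 2-tuples of children.
    """
    if isinstance(gene_node, str):
        return gene_node
    left, right = (reconcile(child, parent, events) for child in gene_node)
    here = lca(left, right, parent)
    kind = "duplication" if here in (left, right) else "speciation"
    events.append((here, kind))
    return here


# Hypothetical species tree as a child -> parent map.
species_parent = {"human": "primates", "chimp": "primates",
                  "primates": "mammals", "mouse": "mammals", "mammals": None}
# Hypothetical gene tree with two human paralogs.
gene_tree = (("human", "chimp"), ("human", "mouse"))
events = []
reconcile(gene_tree, species_parent, events)
print(events)
# [('primates', 'speciation'), ('mammals', 'speciation'), ('mammals', 'duplication')]
```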
Approach 
The approach will vary depending on how the student formulates the project in consultation with the mentors. Some possible directions are:
  • make it a more flexible phylotastic client, able to leverage a greater range of phylotastic services.
  • make it a more flexible reconciliation workbench, able to use different reconciliation engines (and parameters).
  • make it a more flexible user-oriented informatics tool, able to offer different upload, processing, download and visualization options. For instance, additional tools, including interactive ones, may be needed to determine the set of species implicated by a gene or protein phylogeny.
Challenges 
  • if this isn't re-implemented, multiple languages are involved— Javascript front end, embedded Java viewer, web2py Python controller, back-end Forester library in Java with business logic (handling trees, parsing names, and doing reconciliation). However, this could be re-implemented in Java-only with GWT for front-end, Forester for back end, J2EE for authentication/REST services. In fact, most of it could be done in Archaeopteryx (forester), as suggested by this demo.
  • allowing the user to upload files is essential, but this opens up a lot of issues (authentication, sessions, format validation, logging, etc). Web2py has some built-in support for that.
  • making this a better phylotastic client would be useful, but the phylotastic architecture is still in flux, and there aren't many services to choose from.
  • developing a generic API for plugging in different reconciliation engines would be great, but different reconciliation algorithms have different requirements (e.g., some require a fully resolved tree), and implementations may use different input and output formats (as shown in this comparison table).
Involved toolkits or projects 
Christian Zmasek's "forester" Java library is central to the proof-of-concept implementation. Other parts can be changed. The proof-of-concept code on github uses web2py and JavaScript (JQuery).
Degree of difficulty and needed skills 
  • Required skills. Java is a must. To work with the current implementation, one also needs familiarity with Javascript and Python, but that isn't necessary if some parts are re-implemented in Java.
  • Medium difficulty. The applicant needs to be comfortable working with several different packages and languages in order to integrate components (a back-end library with business logic for managing trees, a reconciliation engine, and a flexible tree viewer) into a web-based program. Furthermore, the applicant must be able to work with the mentors to become thoroughly familiar with the code, the reconciliation problem, and relevant resources, so that design decisions can be made in the first two weeks.
Mentors 

Implementing machine learning algorithms for classification and feature selection in mothur

Rationale

Microbial communities are found in almost every environment and have been implicated in many biological processes from climate change to human health and disease. The free and open source program mothur provides a toolkit for microbial ecologists to analyze DNA sequence data. Most microbial ecologists rely on metagenomics techniques to study communities, as most bacteria cannot be cultivated in the lab. Classification is one of the most important statistical tools in the ecologist's toolkit. We are interested in developing and implementing machine learning algorithms for classification and feature selection in mothur, including SVM and ENET.

One of our main interests in machine learning is feature selection. Researchers will almost always know where their data is coming from (healthy patient, diseased patient), but what they are interested in is which features of these communities are important in classifying the data as such. However, there are some specific cases where an unsupervised clustering method would be helpful, so we've included a bonus objective of implementing an unsupervised method. In this case an unsupervised method might be helpful for researchers who are trying to look for patterns within a classification. For example, within diseased patients, are there clusters that correspond with disease severity? These types of classification are more suited to an unsupervised approach.

Bonus Objectives
  • We're also interested in implementing an unsupervised learning method. If you're interested in doing this include it in your proposal and timeline. However, this is a bonus objective for this project. The primary focus of your proposal should be on implementing the supervised learning algorithms mentioned above. A well thought out, narrowly focused proposal will be more successful than one that tries to take on too much over the course of GSoC.
Approach
  • This project will involve writing a command for mothur that takes in shared and design files and will output the classification and list of important features.
  • We are equally interested in classifications as well as feature selections/rank. Some ML algorithms (e.g. RandomForest) can inherently do a feature ranking as well as classification. For others, these need to be explicitly handled. Please keep this in mind when submitting your applications.
Involved toolkits or projects
Degree of difficulty
Medium to difficult depending on familiarity with the above algorithms.
Needed skills
  • Knowledge of C++ is a must.
  • Familiarity with machine learning.
  • Git for version control (can be picked up fairly quickly during the bonding period).
Challenges
  • Learning the mothur codebase and understanding how commands work is vital for a successful project. To help students become familiar with the code we ask applicants to download and compile the code and that the application include a "hello world" command. A command template and instructions are available. Please email Kathryn if you need help getting started.
  • The data we normally get have a high number of dimensions/features (up to 4K), a low number of training samples (only up to 200-300 instances) and sometimes sparse features/dimensions. This is the classical case of p >> n, as mentioned in various ML books. These characteristics of the data make it a little challenging to work with and to get the desired results.
Coding Challenge
Please visit the page for more details.
Resources
Mentors

Create an educational interface to query the Tree of Life geospatially

Rationale
PhyloGeoTastic is an open-source web application designed to make it easy to extract a tree of life for species within a geographic range. A prototype that draws species occurrence data from iNaturalist and public portions of GBIF was built during the recent Phylotastic hackathon and is available online here. We are enthusiastic about the educational and scientific possibilities presented by PhyloGeoTastic, but the prototype is still quite basic. This project involves fleshing out and expanding the PhyloGeoTastic functionality, documentation, and liaising with additional data sources and scientific partners.
Approach
The goal of this project is to turn the current demo interface into a stable project ready for use by educators, scientists, and open to contributions by future volunteer programmers or educators.
Challenges
Half of the challenges will be technical: use Javascript, HTML, and some simple Perl scripting to flesh out the existing demo. Specific challenges include adding example geographic areas / datasets, ensuring the app works reliably with all possible geographic data sources, and creating a useful visualization / printable output of the extracted tree of life. The other half of the challenges will be non-technical, involving documenting the interface, communicating with Phylotastic team members to ensure the tree-extraction stays functional, and working with educators and data-source administrators to identify areas of new development.
Involved toolkits or projects
The website currently runs as a simple HTML/Javascript application, with the server-side components (which query the external data sources and call the tree-extraction web service) written in Perl. The visualization module needs development; a client-side tree visualization toolkit such as jsPhyloSVG might be used.
Degree of difficulty and needed skills
The degree of technical difficulty is fairly low, as the major components are already built and in a somewhat-working state. More important is an ability to effectively communicate with many disparate parties involved in providing data for, or using, PhyloGeoTastic. Successful engagement with these stakeholders will ensure the longer-term success of this project.
Mentors
  • Greg Jordan (primary)
  • Julie Allen (primary)
  • Brian Sidlauskas (advisory)

Discovering links to ToLWeb content from a tree in the Open Tree of Life's software system

Rationale 
The Tree of Life Web Project is a mature and extremely valuable resource for summarizing phylogenetic knowledge. Its particular strength is its large number of well-written "taxon pages", which reflect an expert's synthetic assessment of the systematics of a clade. These typically include images of the species, a discussion of the biology of the group and relevant references. ToLWeb also hosts educational outreach materials in its treehouses. The Open Tree of Life project also aims to summarize phylogenetic information, but that effort is focusing on producing synthetic estimates of the full tree of life from a large (and ever-growing) set of input datasets. The Open Tree of Life's main site for web services is still under development (see opentree), but the system for ingesting source trees and producing synthetic supertrees is functioning (see treemachine). The ToLWeb and Open Tree of Life projects have complementary goals. Providing a mechanism for visitors who are browsing a phylogeny on the Open Tree of Life site to discover relevant links to content in ToLWeb would make a wealth of systematic knowledge more accessible. In addition to making the content accessible to users of the Open Tree of Life system, this project would also provide code that could act as a client of the Tree of Life Web Project's web services. This client code could be used by other projects that present phylogenetic information to the public.
Approach 
In order to suggest an appropriate link to users, the software rendering trees in the opentree system will need to infer whether or not an internal node in a tree has a taxonomic name and whether the taxon has a taxon page in ToLWeb. ToLWeb provides dumps of its phylogenetic backbone in XML. Taxonomic names in the Open Tree of Life system are matched to an internal taxonomy, which is exposed via web services (see taxomachine for the code associated with maintaining the taxonomy and providing RESTful web services associated with the taxonomy). The first step in this project will consist of writing software to align the taxonomic names/ids used by ToLWeb to the names/ids in the Open Tree taxonomy. A web service could be implemented in taxomachine which takes an Open Tree name/id and returns a (possibly empty) list of taxonomic names and ids in the ToLWeb system which correspond to that name. Trees stored in the Open Tree of Life's tree database have links from the leaves of the tree to the taxa they represent. But internal nodes in the trees do not (necessarily) have taxonomic names. Code will have to be written which "decorates" internal nodes of a tree in the Open Tree of Life system with a set of possible taxonomic names. Given a service to label internal nodes of trees with taxonomic names and a service to look up a ToLWeb name/id for an Open Tree name, it will be possible to provide link-outs to relevant ToLWeb pages.
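As a rough illustration of the name-alignment step, here is a minimal Python sketch. It assumes both taxonomies have already been reduced to plain id-to-name mappings; the function names, the ids and the exact-match strategy are all illustrative assumptions, not part of taxomachine or the ToLWeb services, and synonyms and homonyms are not handled.

```python
from collections import defaultdict


def normalize(name):
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def align_taxonomies(opentree_names, tolweb_names):
    """Map each Open Tree id to a (possibly empty) list of ToLWeb (id, name) pairs.

    Both arguments are dicts of id -> taxonomic name. Matching is exact on the
    normalized name only; this is a simplification for illustration.
    """
    tolweb_by_name = defaultdict(list)
    for tol_id, name in tolweb_names.items():
        tolweb_by_name[normalize(name)].append((tol_id, name))

    return {
        ot_id: sorted(tolweb_by_name.get(normalize(name), []))
        for ot_id, name in opentree_names.items()
    }


# Made-up ids, for illustration only.
opentree = {101: "Hominidae", 102: "Pan  troglodytes", 103: "Angiosperms"}
tolweb = {9001: "hominidae", 9002: "Pan troglodytes"}
links = align_taxonomies(opentree, tolweb)
```

A name with no counterpart (here id 103) maps to an empty list, which is the "possibly empty" result the proposed web service would return.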
Challenges 
  • While the alignment of the Open Tree taxonomy and the ToLWeb could be written in almost any language, the need to extend the taxomachine and opentree code bases will require knowledge of both Java and Javascript.
  • The webservices for serving trees from treemachine are under active development, and are not well-tested.
  • The JavaScript tools for rendering trees from the Open Tree of Life system are still under active development.
Involved toolkits or projects 
The NeXML dump of ToLWeb will be used. The taxomachine and opentree repositories mentioned above will be extended.
Degree of difficulty and needed skills 
Moderate. Knowledge of Java, Javascript, the git version control system, XML parsing, and the use of web services will be required to complete the project.
Mentors 
  • Mark Holder
  • Karen Cranston

Automated labeling of higher-order taxa in a tree from a taxonomy

Rationale 

In many published phylogenies, only the terminal nodes are labeled with the most fine-scale taxonomic unit contained in the tree. It would be useful to be able to label the ancestors of these nodes according to higher-level taxonomic groups: e.g., for each genus found in the tree, find the common ancestor of all species in that genus, and label it with the name of the genus, and repeat for families, orders, etc. This would allow creating subsets of large trees that contain only a desired group (e.g. find only the family "Hominidae" from the Bininda-Emonds mammal tree) or to quickly find the distance between two higher-level groups.

Approach 

We need to develop a Python script that accepts a tree and a taxonomy as inputs and recursively walks through unlabeled nodes in the tree, using the taxonomy as a reference to label them where appropriate. BioPython would be an ideal library to use for this project as it would allow us to support many input/output formats.
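The labeling idea can be sketched without any library, using a tiny hand-rolled tree class. Everything below (the `Node` class, the function names, the example taxonomy) is hypothetical and only meant to show the recursion; a real script would more likely read trees with Bio.Phylo and could use its `common_ancestor` method instead of the leaf-set search used here.

```python
class Node:
    """Minimal tree node: leaves have a name, internal nodes may be unlabeled."""

    def __init__(self, name=None, children=()):
        self.name = name
        self.children = list(children)


def leaf_sets(node, out):
    """Post-order walk: record the set of leaf names below every node."""
    if not node.children:
        leaves = frozenset([node.name])
    else:
        leaves = frozenset().union(*(leaf_sets(c, out) for c in node.children))
    out.append((node, leaves))
    return leaves


def label_higher_taxa(root, taxonomy):
    """Label the common ancestor of each group's members with the group name.

    taxonomy maps a leaf name to its higher-level group (e.g. species -> genus).
    Nodes that already carry a name are left untouched.
    """
    nodes = []
    all_leaves = leaf_sets(root, nodes)
    groups = {}
    for leaf, group in taxonomy.items():
        groups.setdefault(group, set()).add(leaf)
    for group, members in groups.items():
        present = members & all_leaves
        if len(present) < 2:
            continue  # a single leaf is its own ancestor and is already named
        # The smallest node containing all members is their common ancestor.
        candidates = [(n, ls) for n, ls in nodes if present <= ls]
        mrca = min(candidates, key=lambda pair: len(pair[1]))[0]
        if mrca.name is None:
            mrca.name = group
    return root


pan = Node(children=[Node("Pan troglodytes"), Node("Pan paniscus")])
root = Node(children=[pan, Node("Homo sapiens")])
genus_of = {"Pan troglodytes": "Pan", "Pan paniscus": "Pan", "Homo sapiens": "Homo"}
family_of = {name: "Hominidae" for name in genus_of}
label_higher_taxa(root, genus_of)   # labels the Pan clade
label_higher_taxa(root, family_of)  # labels the root
```

Running one pass per rank (genus, then family, and so on) gives the "repeat for families, orders, etc." behaviour. Note that this sketch labels the common ancestor even when the group is not monophyletic in the tree, which a real tool would probably want to detect and report.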

Challenges 

Taxonomic synonymy could pose a challenge.

Involved toolkits or projects 

BioPython's Bio.Phylo package, which can read and write trees in most standard formats.

Degree of difficulty and needed skills 

Intermediate Python scripting ability and a solid understanding of recursion. Some basic understanding of phylogenies would be helpful.

Mentors 
  • Ben Morris
  • Hilmar Lapp

Use a Taxonomic Name Resolution Service to annotate an RDF tree with taxonomic synonyms

Rationale 

Sometimes different sources will refer to the same species in multiple ways. For example, the three names "Crypturellus erythropus," "Crypturellus columbianus," and "Crypturellus saltuarius" look like three distinct species, but they're actually three subspecies of a single species of bird. A difficult problem is to take user input (which could be any of these three names) and try to match it to a single node on a phylogenetic tree (which could be labelled with any of these names.) This is the job of a Taxonomic Name Resolution Service (see for example http://tnrs.iplantcollaborative.org/).

The Phylotastic RDF Treestore (https://github.com/bendmorris/rdf-treestore) uses BioPython to convert phylogenetic trees from various formats into RDF and store them in a triple store (currently Virtuoso.) This triple store can then be queried in various ways. It needs a way to match a node to any name that a user might use to refer to it. Therefore, we need to use TNRS to label the nodes in an RDF tree with any of the potential names that might be used to refer to them.

Approach 

We need to develop a Python script that works on RDF representations of trees. The script would query a Taxonomic Name Resolution Service to find potential matches for each node and store those matches within the triple store. We would also need some method of disambiguating queries that might have several possible results due to names that might match multiple nodes.
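A minimal sketch of that loop is shown below. It does not talk to a real Taxonomic Name Resolution Service or to Virtuoso: the resolver is a stand-in function, the predicate URI and node URI are made up, the match scores are invented, and triples are plain Python tuples rather than objects from an RDF library.

```python
def annotate_with_names(node_labels, resolve, predicate, min_score=0.9):
    """Return (subject, predicate, object) triples linking nodes to alternative names.

    node_labels maps a node URI to its current label. resolve is any callable
    that takes a name and returns a list of (matched_name, score) pairs; in a
    real script it would wrap a TNRS query. Low-scoring matches are dropped as
    a crude form of disambiguation.
    """
    triples = []
    for node_uri, label in sorted(node_labels.items()):
        for name, score in resolve(label):
            if score >= min_score and name != label:
                triples.append((node_uri, predicate, name))
    return triples


# Hypothetical stand-in for a TNRS lookup; the scores are invented.
FAKE_TNRS = {
    "Crypturellus erythropus": [
        ("Crypturellus erythropus", 1.0),
        ("Crypturellus columbianus", 0.95),
        ("Crypturellus saltuarius", 0.95),
        ("Crypturellus sp.", 0.4),
    ],
}


def fake_resolve(name):
    return FAKE_TNRS.get(name, [])


HAS_NAME = "http://example.org/terms#alternativeName"  # made-up predicate
labels = {"http://example.org/tree1#node7": "Crypturellus erythropus"}
triples = annotate_with_names(labels, fake_resolve, HAS_NAME)
```

The resulting triples could then be inserted into the triple store so that a query for any of the alternative names finds the same node.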

Challenges 

Taxonomic synonymy could pose a challenge.

Involved toolkits or projects 
  • BioPython's Bio.Phylo package, which can read and write trees in most standard formats.
  • The RDF treestore (linked above)
Degree of difficulty and needed skills 

Python scripting ability; understanding of semantic web technologies, specifically RDF and SPARQL.

Mentors 
  • Ben Morris
  • Hilmar Lapp
  • Jim Balhoff



A Multiple Sequence Alignment (MSA) Editor for iPad

Rationale 

Base-By-Base is a whole-genome pairwise and multiple alignment editor. The program highlights differences between pairs of alignments and allows the user to easily navigate large alignments of similar sequences. Although Base-By-Base was intended as an editor and viewer for alignments of highly similar sequences, it also provides many of the functions of other generic alignment editors.


Two papers describing features of Base-By-Base 

The iPad is a natural fit for a mobile MSA editor, and perfect for taking the problem to puzzle over during coffee or the commute home. The finger-flicks etc mimic how a user navigates the information in an MSA. The simple zooming and rotation from horizontal to vertical views are further bonuses.

Approach 

This project would involve selecting the features of Base-By-Base that would be useful on a mobile device, coding for iOS and providing an import/export of data connection.

Challenges 

There would be three challenges:

  • Feature selection/representation in iOS;
  • Implementation;
  • Connection to data sharing interface.


Degree of difficulty and needed skills 

Moderate.

Mentors 

The mentor has been developing the tool for many years and, most importantly, is a prime user of the software.

Codon alignment and analysis in Biopython

Rationale
A codon alignment is an alignment of nucleotide sequences in which the trinucleotides correspond directly to amino acids in the translated protein product. This carries important information which can be used for several analyses, notably analysis of selection pressures. This project extends Biopython to support this data type and these analyses.
Approach & Goals
Useful features include:
  • Conversion of a set of unaligned nucleic acid sequences and a corresponding protein sequence alignment to a codon alignment.
  • Calculation of selection pressure from the ratio of nonsynonymous to synonymous site replacements, and related functions.
  • Model selection.
  • Possible extension of AlignIO and the MultipleSeqAlignment class to take full advantage of codon alignments, including validation (testing for frame shifts, etc.)
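To make the first feature concrete, here is a small plain-Python sketch of threading an unaligned nucleotide sequence onto its aligned protein sequence. The function name is hypothetical and this is not Biopython API; it assumes the nucleotide sequence has no stop codon and that its length is exactly three times the number of residues, and it does not check that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Build one row of a codon alignment.

    Each residue in aligned_protein consumes the next three nucleotides;
    each gap column becomes three gap characters.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError(
            "expected %d nucleotides, got %d" % (3 * residues, len(nucleotides))
        )
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(codons)


row1 = thread_codons("MK-L", "ATGAAACTG")
row2 = thread_codons("M-QL", "ATGCAGTTA")
```

The length check is the simplest form of the validation mentioned in the last bullet; a frame shift in the input would usually show up there or as a codon that does not match its residue.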
Difficulty and needed skills
Medium, depending on ambition. Familiarity with Biopython's existing alignment classes and functions, or their equivalents in, e.g., BioPerl, BioJava or BioRuby, will be helpful. Understanding of the practical uses of codon alignments, or at least a basic understanding of molecular biology, is important. Some basic math is involved, essentially reading a few equations and converting them to code. One useful book to have on hand is Computational Molecular Evolution by Ziheng Yang.
Mentors
Eric Talevich, Peter Cock, others welcome

Phylogenetics in Biopython: Filling in the gaps

Rationale
While the Phylo module in Biopython supports I/O and basic tree operations, there are some important components that remain to be implemented to better support phylogenetic workflows.
Approach & Goals
This "idea" is intentionally left open-ended -- you can propose any features you think would be useful in Biopython. Some suggestions of ours:
  • A "Phylo.consensus" module with functions for the consensus of multiple trees. E.g.: Strict consensus, as Bio.Nexus already implements; perhaps other methods like Adams consensus.
  • A function to calculate bootstrap support for branches given a target or "master" tree and a series of bootstrap replicate trees (usually read incrementally from a Newick file).
  • Functions for comparison of two or more trees.
  • Simple algorithms for tree inference, i.e. neighbor-joining and parsimony tree estimation. For small alignments (and perhaps medium-sized ones with PyPy), it would be nice to run these without an external program, e.g. to construct a guide tree for another algorithm or quickly view a phylogenetic clustering of sequences.
  • Tree visualizations: A proper draw_unrooted function to perform radial layout, with an optional "iterations" argument to use Felsenstein's Equal Daylight algorithm. Circular radial diagrams are also popular these days. Any new function should support the same arguments as the existing Phylo.draw.
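As an example of the kind of function suggested above, bootstrap support can be sketched in a few lines of plain Python. The sketch below represents trees as nested tuples of leaf names rather than Bio.Phylo objects, treats trees as rooted (it compares clades, not bipartitions), and uses hypothetical function names; it is an illustration of the counting logic, not a proposed API.

```python
def collect_clades(tree, found):
    """Add the leaf set of every internal node to `found`; return this node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(leaves)
    return leaves


def bootstrap_support(target, replicates):
    """Map each clade of `target` to the fraction of replicate trees containing it.

    `replicates` can be any iterable, so trees may be read incrementally.
    """
    target_clades = set()
    collect_clades(target, target_clades)
    counts = dict.fromkeys(target_clades, 0)
    total = 0
    for rep in replicates:
        rep_clades = set()
        collect_clades(rep, rep_clades)
        for clade in target_clades & rep_clades:
            counts[clade] += 1
        total += 1
    if total == 0:
        raise ValueError("no replicate trees given")
    return {clade: n / total for clade, n in counts.items()}


target = ((("A", "B"), "C"), "D")
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    (("A", "B"), ("C", "D")),
]
support = bootstrap_support(target, replicates)
```

A strict consensus could reuse `collect_clades` by keeping only the clades whose support is exactly 1.0.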
Difficulty and needed skills
Medium-to-advanced programming skill in Python -- it's important for these implementations to be reasonably efficient, though we don't aim to compete with the fastest stand-alone implementations of these algorithms. Knowledge of phylogenetic methods is critical; for reference, you might like to have a copy of Joe Felsenstein's Inferring Phylogenies. Tree visualizations are done with matplotlib.
Mentors
Eric Talevich, others welcome

Biopython: Indexing & lazy-loading sequence parsers

Rationale

Bio.SeqIO's indexing offers parse-on-demand access to any sequence in a large file (or collection of files on disk) as a SeqRecord object. This works well when you have many small to medium-sized sequences/genomes. However, this is not ideal for large genomes or chromosomes where only a sub-region may be needed. A lazy-loading parser would delay reading the record until requested. For example, if region record[3000:4000] is requested, then only those 1000 bases need to be loaded from disk into memory, plus any features in that region. This is how Biopython's BioSQL interface works. Tools like tabix and samtools have demonstrated efficient co-ordinate indexing which could be useful here.
Aside from being used via an index for random access, lazy-loading parsers could be used when iterating over a file as well. This can potentially offer speed ups for tasks where only a fraction of the data is used. For example, if calculating the GC content of a collection of genomes from GenBank, using Bio.SeqIO.parse(...) would currently needlessly load and parse all the annotation and features. A lazy-parser would only parse the sequence information.
Approach & Goals
Useful features include:
  • Internal indexing of multiple file formats, including FASTA and richly annotated sequence formats like GenBank/EMBL and GTF/GFF/GFF3.
  • Full compatibility with existing SeqIO parsers which load everything into memory as a `SeqRecord` object.
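For FASTA, the lazy-loading idea can be illustrated with a byte-offset index, similar in spirit to a samtools .fai file. The sketch below is plain Python with hypothetical function names, not Biopython code, and it assumes every sequence line in a record except the last has the same width.

```python
import os
import tempfile


def build_index(path):
    """Scan a FASTA file once, recording where each record's sequence starts.

    Returns name -> (offset, length, bases_per_line, bytes_per_line).
    """
    index = {}
    name = None
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = [offset + len(line), 0, 0, 0]
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[1] += bases
            offset += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Read only bases [start, end) of one record, Python-slice style."""
    offset, length, bases_per_line, bytes_per_line = index[name]
    start, end = max(0, start), min(end, length)
    if start >= end:
        return ""
    newline = bytes_per_line - bases_per_line
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    # Over-read by the line terminators the slice spans, then drop them.
    breaks = (end - 1) // bases_per_line - start // bases_per_line
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read((end - start) + newline * breaks)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">chr1 demo\nACGTACGTAC\nGGGGCCCCTT\nAA\n>chr2\nTTTT\n")
index = build_index(tmp.name)
region = fetch(tmp.name, index, "chr1", 8, 14)
chr2 = fetch(tmp.name, index, "chr2", 0, 100)
os.remove(tmp.name)
```

Only the requested bytes are read from disk in `fetch`. Annotated formats such as GenBank or GFF would additionally need an index of feature positions, which this sketch does not attempt.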
Difficulty and needed skills
Hard. Familiarity with Biopython's existing sequence parsing is essential. Understanding of indexing large files will be vital.
Mentors
Peter Cock, others welcome

For prospective students

The application period does not open until April 23, but you can definitely get started well before then!

Should I apply?

The students accepted into the program are very diverse in terms of background, geography and interests. We have project ideas that require significant technical skills, and those that do not. Some would benefit from having a biology background, but this is not necessary for most. Our past students have been computer scientists interested in learning more about biological data analysis, bioinformatics students wanting to learn new technical skills, and biologists just starting out in open source software development.

The most important requirements are that you are interested in contributing to open source software development, during and beyond GSoC, and that you demonstrate that you can collaborate and communicate remotely with a distributed group of developers. It is not uncommon to rank an application from a more enthusiastic and engaged potential participant higher than one from a less engaged super-star programmer.

This is, of course, assuming you meet the basic eligibility requirements from Google.

Before you apply

The single most important thing you can do before you apply is to contact us! Please do read through the list of project ideas. Then, join our Google Plus community and post about the project that you have in mind. We can help you refine your idea to make it competitive for the program and fun for you.

Writing your application

When applying, (aside from the information requested by Google) please provide the following information:

  1. Name, email, university and current enrollment, short bio
  2. Your interests - what makes you excited?
  3. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  4. A summary of your programming experience and skills, including programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement. Links to code repositories online are extremely helpful here.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced skills.
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors on the G+ page. Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
    • Note that the start of the coding period should be precisely that - you start coding. That is, tasks such as understanding a code base, or investigating a tool or a standard, should be done before the start of coding, during the community bonding period. If you can't accomplish that due to conflicting obligations during the community bonding period, make that clear in your application, and state how you will make up the lost time during the coding period.
    • For examples of strong project plans from prior years, look at our Ideas pages from those years, specifically the Accepted Projects section (e.g., 2008, 2009, 2010, 2011, 2012). Visit the listed project pages; many have the initial project plan (which was subsequently updated and revised as the real project unfolded). Good examples are Automated submission of rich data to TreeBASE, PhyloXML support in Biopython, or Mesquite Package to View NeXML Files.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. Our view is that it should be the same as an internship you take up over the summer, except this one is remote (and almost certainly more fun :-) If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.
  7. (optional) If the project idea you are applying for suggests or requests completion of a coding exercise, provide a pointer to your results (such as code you wrote, output, test results). Not all project ideas have this requirement; if in doubt, ask on our G+ page.

How does the review process work?

After the application deadline, all of the mentors will review your application. Our review criteria include both the quality of the project proposal (Is there enough detail in the project plan? Is the timeline reasonable? What are the plans for testing?) and your enthusiasm, communication ability and how likely we think it is that you will continue in open-source development. During the review process, mentors will leave public comments in Melange - do respond to those. Students who don't communicate throughout the review process are unlikely to be highly ranked, and it is important that this communication be in Melange, not simply between you and your mentors. Remember that ALL mentors review proposals and the process is extremely competitive!