Phyloinformatics Summer of Code 2011

From NESCent Informatics Wiki

We were accepted as a mentoring organization for the 2011 Google Summer of Code (GSoC) program for the fifth time in a row.

On this page we first collected ideas, possible projects, prerequisites, possible solution approaches, mentors, and other people or channels to contact for more information or to bounce ideas off of. You can now also find the projects that were accepted, with brief summaries of the goals, who the student and mentors were for each project, and where the project page, source code repository, and, if the student kept one, development blog are located. After completion of the program in August 2011 the students wrote up short summaries for each project.

For information on previous years, please see our pages of project ideas and accepted projects for our 2007, 2008, 2009, and 2010 program participations.

News

  • For continuous updates, please check (and possibly follow) our Twitter feed.
  • Occasionally, announcements at major milestone points of the program (call for mentors and project ideas, success in applying as mentoring organization, call for student applications, etc.) will also be posted to our Phyloinformatics Summer of Code Announcement list. Please subscribe to the group if you want to stay in the loop with our program on a low-traffic basis.

Contact

Our organization administrators are Hilmar Lapp (hlapp@nescent.org) and Karen Cranston (karen.cranston@nescent.org).

If you are a student interested in applying for a Google Summer of Code project in our organization, please send any questions you have, projects you would like to propose, etc. to phylosoc@nescent.org. This will reach all mentors and our Summer of Code organization administrators.

During the student application period we will hang out regularly on IRC at least on weekdays during working hours (EDT) in #phylosoc on Freenode. You're welcome to join us at any time, though we may not be on-line outside of those times. Alternatively, there's always email. (If you do not have an IRC client installed, you might find the comparison on Wikipedia, the Google directory, or the IRC Reviews helpful. For Macs, X-Chat Aqua works pretty well. If you have never used IRC, try the IRC Primer at IRC Help, which also has links to lots of other material.)

Before you apply, please read our guidelines and expectations. We don't have an application format that you need to adhere to, but we do ask that you include specific kinds of information (see "When you apply").

Why

Our aim is to cultivate the next generation of open-source software developers in evolutionary biology and promote the open, collaborative development of reusable, interoperable, standards-supporting informatics tools. We aim to increase the awareness and understanding of the value of open-source and collaborative development, and also impart the programming and remote collaboration skills needed to successfully contribute to such projects.

The disciplinary focus for our participation is phyloinformatics and evolutionary biology, and the members of our community of mentors are leaders in this field. The code that students write will allow new, and increasingly complex, scientific questions to be answered about the patterns and processes underlying the history of life. Work on projects along the lines of our suggested project ideas is bound to make an impact. The range of problems to which students can make meaningful contributions is diverse, enabling us to accommodate students with varying interests. Similarly, the project ideas cover a variety of programming languages and skill levels.

Accepted projects

Export ontology-based phenotype descriptions to the Encyclopedia of Life

This is an application for the Export ontology-based phenotype descriptions to the Encyclopedia of Life project idea.

Abstract: This project involves developing a system that will map the phenotypic data from an OBD database to the EOL transfer schema. This implies determining what phenotypic information can be used and creating human-readable segments of text that can be integrated in an Encyclopedia of Life page.

Student: Alexandru Lucian Ginsca

Mentors: Jim Balhoff (primary), Chris Mungall, Matt Yoder, Cyndy Parr

Project Homepage: PhyloSoC:Export ontology-based phenotype descriptions to EOL

Project blog: Ontophenotype

Source Code: github repository

PhyloGeoRef: A geo referencing library implemented in Java

This is an application for the PhyloGeoRef: A Java library for mapping phylogenetic trees and geographical information in KML project idea.

Abstract: The PhyloGeoRef library enables representation of phylogenetic data in a spatial format by constructing a phylogenetic tree. A phylogenetic tree, as the name suggests, is a tree diagram that shows the relationships among biological species based on similarities and differences in their genetic characteristics. The aim of this project is to develop a Java library that generates rich KML files that can be rendered in Google Maps.


Student: Apurv Verma

Mentors: Kathryn Iverson, David Kidd

Project Homepage: PhyloSoC: PhyloGeoRef Java library for mapping phylogenetic trees in KML

Project blog: PhyloGeoRef-2011 blog

Source Code: github

Interoperable exchange of gene tree reconciliation maps

Abstract:

The goals of this project are:

  • Standardize the XML encoding of gene tree reconciliation data by extending an existing standard
  • Implement this encoding in a standard bioinformatics library (BioPerl)
  • Modify the iPlant GTR database to support import and export of GTR trees using the new encoding
  • If time permits, modify the iPlant tree visualization tool so that it internally uses the new encoding
  • If time permits, use iPlant's taxa resolution to populate taxonomic name strings

Student: Daniel Packer

Mentors: Jamie Estill (primary), Jim Leebens-Mack, Todd Vision, Bill Piel

Project Homepage: PhyloSoC: Interoperable exchange of gene tree reconciliation maps

Project blog: Daniel Packer's blog GSOC posts on DP's blog

Source Code: github repository

Extending Jalview’s support for handling RNA

This is an application for the Extending Jalview's support for handling RNA project idea.

Abstract:

Student: Jan Engelhardt

Mentors: Jim Procter (primary), Peter Troshin

Project Homepage: PhyloSoC: Extending Jalview support for handling RNA

Source Code - General repository: http://source.jalview.org/gitweb/

Source Code - Current branch: http://source.jalview.org/gitweb/?p=jalview.git;a=shortlog;h=refs/heads/2_5_1_rna_merge

Automated submission of rich data to TreeBASE

This is an application for the Automated submission of rich data to TreeBASE project idea.

Abstract: This project will enable TreeBASE to accept NeXML files, so that metadata can be submitted easily and new annotations of the metadata can be displayed in a user-friendly manner.

Student: Laurel Yohe

Mentors: Rutger Vos (primary), William Piel

Project Homepage: PhyloSoC: Automated submission of rich data to TreeBASE

Project blog: PhyloSoC 2011: TreeBASE Project Blog

Source Code: TreeBASE subversion repository

Extending APE to handle incomplete distances

Abstract: This project aims to extend the APE (Analysis of Phylogenetics and Evolution) R package to handle phylogenetic reconstruction from incomplete distance matrices, by implementing agglomerative methods such as NJ* and Bio-NJ*, and also non-agglomerative ones such as the triangles method.

Student: Andrei-Alin Popescu

Mentors: Emmanuel Paradis (primary), Katarina Huber

Project Homepage: APE incomplete distance extension

Source Code: ape incomplete distance extension source code on r-forge

DIM SUM 2: GPU computing for an individual-based simulator of movement and demography

This is an application for the DIM SUM 2: GPU computing for an individual-based simulator of movement and demography project idea.

Abstract:

Student: Peter Hoffman

Mentors: Kevin Savidge (primary), Jeremy Brown

Project Homepage: PhyloSoC: DIM SUM 2 GPU computing for individual based simulator of movement and demography

Source Code: http://code.google.com/p/bio-dimsum/source/checkout

Manipulating NGS data for population genetic analysis

Abstract: This stand-alone GUI will format large multi-locus datasets with no reference genome. It will allow researchers with genome-enriched next-generation sequencing data to easily view and modify any subset of their data with a single click (which is currently achieved with tedious formatting by hand and not aided by currently available genomic tools). This program will output four commonly used population genetics analysis formats and focus on user-friendliness.

Student: Sarah Hird

Mentors: Jeremy Brown (primary), Brad Chapman

Project Homepage: PhyloSoC: Manipulating NGS data for population genetic analysis

Project Blog: Sarah's(G)SoC 2011

Source Code: lociNGS repository on GitHub

Ideas

Note: if there is more than one mentor for a project, the primary mentor is in bold font. Additional information about the mentors is linked to in the Mentors section.

Students: we have listed only our own project ideas, albeit well thought-out ones. You are welcome to propose your own project provided it still falls within the scope of the aims described above. If we like your proposal, we will try to find an appropriate mentor. Better yet, propose a mentor yourself! Regardless of what you decide to do, make sure you read and follow the guidelines for students below.


Visualize ancestral state reconstructions in web-based phylogenetic tree viewers

Rationale 
Ancestral state reconstructions are fundamental to nearly all evolutionary biologists who use phylogenetic trees. Early packages like MacClade and WinClada, and more modern ones like R and Mesquite, provide simple to sophisticated mechanisms for reconstructing ancestral states on trees. While there have been great advances in phylogeny visualization on the web (see [1]), there is still much room for extending the utility of these approaches. Parsimony-based ancestral state reconstruction should be calculable in the browser, and represents a straightforward starting point towards tackling this challenge within a web-based platform. In an ideal world, the effects of matrix manipulations (i.e. coding character-by-OTU matrices) could be visualized in real time. A logical starting point is to attempt this with small morphological matrices before moving to large sequence-based matrices.
Approach 
Several different levels are possible: 1) Assume that branches in NeXML are already annotated, and display this annotation on the tree in the necessary format; 2) Assume that only a categorical state matrix in NeXML format is available, and from this compute the ancestral state reconstructions according to some algorithm of choice (at minimum start with parsimony and differentiate between ambiguous and unambiguous state changes); 3) Calculate an ancestral state reconstruction using existing software (e.g. TNT, R, Mesquite) or a software library, map the output to NeXML via a middle layer, then visualize this data in the tree viewer; or 4) The problem in reverse: allow the user to annotate a tree, and translate this annotation to a character matrix which is persisted in NeXML or perhaps in an intermediate-level format like JSON which is consumable by other web APIs.
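As a rough illustration of level 2, the bottom-up pass of Fitch parsimony for a single character can be sketched in a few lines. This is only a sketch under stated assumptions: the tree is given as nested tuples rather than NeXML, the function name fitch is made up for this example, and a complete reconstruction would add a second (top-down) pass to finalize the state sets. A node whose set holds more than one state is ambiguous at this stage.

```python
def fitch(tree, tip_states):
    """Bottom-up Fitch pass for one character on a rooted binary tree.

    tree: nested tuples of tip names, e.g. (("A", "B"), ("C", "D")).
    tip_states: dict mapping each tip name to its observed state.
    Returns (state_sets, changes): the preliminary state set of every
    subtree and the minimum number of state changes on the tree.
    """
    state_sets = {}
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):            # a tip: its state is observed
            states = {tip_states[node]}
        else:                                # an internal node with two children
            left, right = (visit(child) for child in node)
            common = left & right
            if common:                       # children agree: keep shared states
                states = common
            else:                            # children disagree: one change needed
                states = left | right
                changes += 1
        state_sets[node] = states
        return states

    visit(tree)
    return state_sets, changes


# Hypothetical four-taxon example with a binary character.
example_tree = (("A", "B"), ("C", "D"))
example_states = {"A": 0, "B": 0, "C": 1, "D": 0}
sets, n_changes = fitch(example_tree, example_states)
print(n_changes, sets[example_tree], sets[("C", "D")])
```

The same logic ports directly to JavaScript for in-browser use; the nested-tuple input is just the simplest stand-in for a parsed tree.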
Challenges 
Extending where necessary annotation levels to NeXML's tree model; extending tree-based visualization to plot characters in clearly legible formats.
Involved toolkits or projects 
See NeXML, and examples in Winclada, R, MacClade, Mesquite. Persisting data for proof of concept work could be done in a simple BioSQL mockup. Proof of concept for web-based workflow in a platform like mx. Visualization could be done using the iPlant viz tool (or others)
Degree of difficulty and needed skills 
Simple to Hard. A simple implementation would focus on only expressing the algorithm results in NeXML, while adding the visual or algorithmic components would increase difficulty. Algorithms of use in reconstruction are well understood and should range from simple to complicated, implementing them in javascript or some real-time web-based framework could be challenging. A student could focus completely on the web-based interactions or visualization, modeling (e.g. in NeXML), or high-performance ancestral state-calculations.
Mentors 
Karen Cranston; Matt Yoder; Kris Urie

Convert tree file to 3D physical object

image of 3D tree on screen and 3D physical object (Clematis from Shapeways Gallery)
Rationale 
Depiction of phylogenetic trees is a growing problem in biology (see article in NY Times). There are dozens of approaches. Several authors have pursued ways to show trees in 3D (i.e., Sanderson's paloverde, Kim & Lee's 3DPE), but, while useful, these suffer from limited monitor resolution and the lack of true stereoscopic displays by most potential users. At the same time, computer fabrication of 3D objects has become rapidly cheaper and more accessible, with complex objects costing $10 or less and various options available (Shapeways, RepRap, $4,995 desktop printer). 3D printing may avoid some of the resolution issues associated with display on monitors. It also allows the possibility of representing more information in a tree: z-axis for time, x and y-axes for values of two different continuous characters reconstructed on the tree, branch diameters being proportional to species diversity, and so forth. Finally, and most importantly, there are immense broader impacts possible: imagine students around the world touching and learning from physical representations of the complex history of life costing just $15 per classroom and created with the software you write this summer.
Approach 
There are at least two possible approaches. The first is to take existing 2D tree software and extrude and export solid shapes from the graphics created from those, modifying the output so that one object is created (making sure labels are attached to the tree, for example). A more interesting approach is to either modify existing 3D software or write your own (perhaps starting with R and the RGL library or Processing and the superCAD library) to take a tree file (in Newick and perhaps other formats) and export it to a 3D file. The output from either approach should be a file in a standard CAD format (STL, Collada, X3D, etc.) suitable for uploading and printing, with no errors, to a site such as Shapeways.
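To give a feel for the output side, here is a minimal sketch that writes one rectangular prism (for instance, a single extruded branch) as ASCII STL. The function names are invented for this example, and it deliberately ignores the hard part: a printable tree needs all branches and labels merged into a single watertight object, which this sketch does not attempt.

```python
import math


def box_triangles(x0, y0, z0, x1, y1, z1):
    """Return the 12 triangles of an axis-aligned box (needs x0<x1, y0<y1, z0<z1)."""
    # Corner i has index 4*xi + 2*yi + zi, where each bit picks the low or high value.
    v = [(x, y, z) for x in (x0, x1) for y in (y0, y1) for z in (z0, z1)]
    # Each face is listed counter-clockwise as seen from outside the box.
    quads = [
        (0, 1, 3, 2),  # x = x0
        (4, 6, 7, 5),  # x = x1
        (0, 4, 5, 1),  # y = y0
        (2, 3, 7, 6),  # y = y1
        (0, 2, 6, 4),  # z = z0
        (1, 5, 7, 3),  # z = z1
    ]
    triangles = []
    for a, b, c, d in quads:
        triangles.append((v[a], v[b], v[c]))
        triangles.append((v[a], v[c], v[d]))
    return triangles


def facet_normal(a, b, c):
    """Unit normal of triangle abc, following the right-hand rule."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    n = (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)
    length = math.sqrt(sum(k * k for k in n))
    return tuple(k / length for k in n)


def to_ascii_stl(name, triangles):
    """Serialize triangles in the ASCII STL layout."""
    lines = [f"solid {name}"]
    for a, b, c in triangles:
        nx, ny, nz = facet_normal(a, b, c)
        lines.append(f"  facet normal {nx:e} {ny:e} {nz:e}")
        lines.append("    outer loop")
        for x, y, z in (a, b, c):
            lines.append(f"      vertex {x:e} {y:e} {z:e}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append(f"endsolid {name}")
    return "\n".join(lines) + "\n"


# A 1 x 1 x 5 prism standing on the origin, e.g. one vertical branch.
print(to_ascii_stl("branch", box_triangles(0, 0, 0, 1, 1, 5))[:60])
```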
Challenges 
  • Conversion of string to 3D object
  • Design decisions about means to include taxon or other labels on the tree
  • Assurance of compliance with chosen output format regardless of (valid) input tree
  • Design decisions about ways to draw trees (traditional square or triangle tree, bushlike, disc, cone, etc.)
  • Allowing users to adjust parameters of output (size of final object, presence of time scale, etc.)
Degree of difficulty and needed skills 
  • Medium
  • good knowledge of at least one CAD format
  • experience with 3D modeling (ideally at a low level, not just manipulating objects in SketchUp)
  • knowledge of relevant language (I can mentor in R, C, C++, Processing, PHP, and Perl, but not in Java or Python, though contact me if interested. For example, Python scripting for Blender might allow easy creation of 3D trees in Python while allowing users to then modify the trees in a friendly GUI (thanks to Kevin Lam of A*STAR Singapore for the suggestion))
Selection criteria (not exhaustive list, but might help with guidance) 
  • Quality of proposal (feasible given your skills and available time, useful outcome)
  • Interaction (response to emails, back-and-forth regarding proposal)
  • A code sample is not strictly required. However, it would strengthen the application to have code to take a string, say "Drosophila", and render it as a single 3D object file suitable for 3D printing. If you decide to do this, feel free to do it as raised letters (but connected as one object), as letters pushed into cylinder or rectangular prism, etc.
Mentors 
Brian O'Meara, David Kidd

Optimize R code for Approximate Bayesian Computation

Rationale 
Approximate Bayesian Computation is an exciting approach to biology. We have code to do this in R for comparative methods, but it is ripe for optimization. Do you hate "for" loops? Like multithreading? Want to do checkpointing? If you like "apply", please apply.
Approach 
Go through existing code and find inefficiencies. Eliminate them.
Challenges 
  • Making loops run in parallel
  • Optimizing speed
  • Utilizing HPC approaches
Degree of difficulty and needed skills 
  • Low
  • R
  • experience in HPC for R ideal
Selection criteria (not exhaustive list, but might help with guidance) 
  • Quality of proposal (feasible given your skills and available time, useful outcome)
  • Interaction (response to emails, back-and-forth regarding proposal)
  • Code sample: take any "for" loop in the TreEvo code and show how you would speed it up.
Mentors 
Brian O'Meara, Barb Banbury

Visualizing viral epidemics: integrating evolutionary and epidemiological information

Rationale 
Spatial reconstruction of the spread of West Nile virus in the US from 1998-present
RNA viruses are the causative agents for many of the world’s most important infectious diseases both in terms of human and animal health and economic burden. For example, influenza is one of the most common, and serious, respiratory infections of humans, with 3 to 5 million cases occurring annually worldwide, resulting in 250,000 to 500,000 deaths. In addition, there is an ongoing threat of large-scale global pandemics, characterized by substantial increases in morbidity and mortality. Most human and animal RNA viruses are characterized by an extremely rapid rate of evolutionary change, such that mutations are routinely fixed as viral epidemics proceed in real time. Although such evolutionary rapidity often poses a serious challenge to vaccination and antiviral therapy, it also means that the epidemiological history of the virus can be tracked by the mutations that accumulate in the virus genome. The 2009 influenza A/H1N1/09 ‘swine flu’ pandemic was the first major pandemic to occur in the modern era of genomic sequencing and high-performance computing. These technologies, concomitant with the extremely rapid rate of evolution of RNA viruses, allow us to track the spread of these infectious agents in ‘real time’ as samples are collected, sequenced and uploaded onto databases.
The goal of this project is to produce a system that can be used to present innovative visualizations of the current epidemic situation. The source data sets for these visualizations would be the output of our Bayesian evolutionary analysis package, BEAST, and would include phylogenetic trees, epidemiological parameter estimates and evolutionary information such as evolutionary rates. These interactive visualizations within a strong underlying statistical framework will provide a useful resource to inform the wider public health community.
Approach 
All molecular phylogenetic/epidemiological analyses will have been done using BEAST, which can incorporate date of sampling and spatial information and can infer epidemiological reconstructions, spatial movement and evolutionary dynamics. Results from a number of BEAST analyses can be provided at the start of the project. These results would then be presented in an innovative, compelling and interactive way to allow the current knowledge of the epidemic to be intuitively communicated. Spatial distributions will be displayed using mapping software such as GoogleMaps/GoogleEarth or OpenStreetMap. Software for converting temporal/spatial phylogenies into rich KML files already exists in the BEAST package (see figures above). The primary aim of the software would be to implement a timeline (in days, months or years) over which the epidemic is presented as unfolding. A control allows the user to consider positions within this timeline to examine the various epidemiological parameter estimates, spatial reconstructions and phylogenetic data simultaneously. The timeline can also be animated to see trends through time.
We propose developing a modular system that easily allows new sources of data to be integrated into the visualization. The minimum set of modules would be the evolutionary tree (phylogeny), a spatial reconstruction on a map, plots of epidemiological data (estimates of number of infected individuals). If time allows, data harvested from other sources could be integrated such as WHO FluNet or Google Flu Trends. The initial project would take the 2009 H1N1/09 'swine flu' pandemic as a case study.
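One way to picture the timeline idea is a set of KML placemarks that each carry a TimeSpan, plus a helper that selects what is visible at a chosen date. The sketch below uses only the Python standard library; the sample records, their field names, and the functions build_kml and visible_at are placeholders invented for illustration, not BEAST output.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(samples):
    """Return a KML document with one time-stamped placemark per sample."""
    ET.register_namespace("", KML_NS)

    def tag(name):
        return f"{{{KML_NS}}}{name}"

    kml = ET.Element(tag("kml"))
    doc = ET.SubElement(kml, tag("Document"))
    for s in samples:
        placemark = ET.SubElement(doc, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = s["name"]
        span = ET.SubElement(placemark, tag("TimeSpan"))
        ET.SubElement(span, tag("begin")).text = s["begin"]
        ET.SubElement(span, tag("end")).text = s["end"]
        point = ET.SubElement(placemark, tag("Point"))
        # KML coordinates are written as longitude,latitude,altitude.
        ET.SubElement(point, tag("coordinates")).text = f'{s["lon"]},{s["lat"]},0'
    return ET.tostring(kml, encoding="unicode")


def visible_at(samples, date):
    """Samples whose time span contains the ISO date string `date`."""
    return [s for s in samples if s["begin"] <= date <= s["end"]]


# Placeholder records; names, dates and coordinates are made up.
samples = [
    {"name": "sample-1", "begin": "2009-05-01", "end": "2009-05-31", "lon": -99.1, "lat": 19.4},
    {"name": "sample-2", "begin": "2009-06-01", "end": "2009-06-30", "lon": -74.0, "lat": 40.7},
]
print(build_kml(samples)[:80])
print([s["name"] for s in visible_at(samples, "2009-05-15")])
```

Comparing ISO 8601 date strings lexicographically keeps the filter simple; a real implementation would more likely parse dates and share one time axis across the tree, map and plot modules.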
Challenges 
  • Collate and synthesize data from a variety of sources.
  • Visualize data using a unified and adjustable timescale controlled by the user.
  • Adaptation of existing visualizations such as GIS and tree drawing to allow for an animatable time scale.
  • Implementation of a modular system that can be easily extended/adapted for future epidemics
Degree of difficulty and needed skills 
  • Medium
  • Skills would include programming in a web application framework such as GWT (Google Web Toolkit) or production of dynamical webpages using HTML5/SVG/JavaScript.
  • Useful (but not crucial) skills, interests or enthusiasms would include infectious disease biology, evolutionary biology, statistical analysis, database querying, graphical presentation of data and GIS systems.
Mentors 
Andrew Rambaut, Philippe Lemey


Automated submission of rich data to TreeBASE

Rationale 
Currently the submission of a phylogenetic record to the main archival repository, TreeBASE, requires the combination of 1) uploading data (matrix and trees) via a legacy file format (NEXUS) and 2) supplying critical metadata -- accessions, species links, authors, methods -- via an interactive session. Subsequently, these data are accessible within the TreeBASE interface, but metadata are not exportable in any standard way. Thus, the community's key repository of richly annotated phylogenetic records is walled off by custom, interactive interfaces. However, the next-generation phylogenetics data format, NeXML, allows for robust representation of both data and metadata. The aim of this project is to implement key aspects of a NeXML import-export interface to TreeBASE. Implementing submission-via-NeXML, in particular, would reduce the amount of user interaction with TreeBASE and ultimately allow third-party submission tools or phylo inference packages to upload to TreeBASE (which ultimately reduces the burden on TreeBASE of managing the whole user experience).
Approach 
NeXML data can be annotated in a variety of ways, many of which are irrelevant to TreeBASE. In this project you will be working with NeXML data that are annotated correctly with as many annotations that apply to TreeBASE as feasible, e.g. species links, accessions, methods, study metadata such as authors, journal, etc. (You might imagine that such a file is prepared using a yet-to-be-developed external editor such as this one.) The main task is to consume such a file using an off-the-shelf library, traverse the annotations, map them correctly onto TreeBASE's relational model and persist them in the database.
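The traversal step can be sketched as follows. The actual project would use Java and a NeXML library; this Python version with the standard-library XML parser is only a stand-in to show the shape of the task. The embedded document is a simplified, made-up fragment, and the (subject, predicate, value) row layout is an assumption for illustration, not TreeBASE's relational model.

```python
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"

# Simplified, hypothetical NeXML fragment with two annotations.
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:dc="http://purl.org/dc/elements/1.1/" version="0.9">
  <meta xsi:type="LiteralMeta" property="dc:creator" content="Example Author"/>
  <otus id="otus1">
    <otu id="otu1" label="Taxon one">
      <meta xsi:type="ResourceMeta" rel="dc:source" href="http://example.org/taxon/1"/>
    </otu>
  </otus>
</nexml>"""


def collect_meta(xml_text):
    """Return (subject_id, predicate, value) rows for every meta element."""
    rows = []

    def walk(elem):
        for child in elem:
            if child.tag == NEX + "meta":
                # Literal annotations use property/content,
                # resource annotations use rel/href.
                predicate = child.get("property") or child.get("rel")
                value = child.get("content") or child.get("href")
                rows.append((elem.get("id", "document"), predicate, value))
            walk(child)

    walk(ET.fromstring(xml_text))
    return rows


for row in collect_meta(SAMPLE):
    print(row)
```

Each row names the annotated element, the predicate and the value, which is roughly the information needed before deciding which table and column an annotation belongs to.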
Challenges 
  • The vocabulary and mechanism for representing the respective metadata within NeXML are only just emerging, so gaps in those specifications may surface during the project, and will require consultation with other stakeholders to identify the best way forward. This includes lat/long geo-coordinates, and Genbank accession numbers.
  • TreeBASE is a large codebase, using a multiple-layer technology stack, and making changes to it requires getting familiar with how its pieces work together.
Degree of difficulty and needed skills 
Medium to Hard. The successful applicant needs to be a proficient Java/Spring/Hibernate programmer who can rapidly jump in on a rather large code base. However, the metadata annotations listed above are already supported with the database schema, and so in theory schema changes should not be needed.
Mentors 
Rutger Vos, William Piel

Graphical UI for Composing Workflow Descriptions

Rationale 
The goal of this project is to enable a GUI for creating retrospective phylogenetics workflow descriptions (as opposed to executable workflow action plans). Thousands of phylogenetic trees will be archived this year, along with associated character data. This would be an enormous benefit to researchers, but the data are unlikely to be re-used: potential re-users demand quality, which they assess primarily by examining methods, but the methods annotations do not exist. The goal of this project is to make it easy for users to generate rigorous, computable (though not executable) descriptions of workflows. The resulting tool will become a useful annotation tool as soon as archives can process its output. The basic design requirements are:
  • easy for user to access (web-based) or easy to install and run on any system
  • provides graphical UI to compose workflow description using icons (e.g., drag and drop pipes)
  • accesses wide range of generic and domain-specific workflow concepts (e.g., "input", "multiple seq alignment", "PAUP 4.0", "Newick")
  • generates a formal description of the workflow using ontology terms (e.g., a CDAO instance file in OWL RDF-XML)
Approach 
  • CDAO (Comparative Data Analysis Ontology) will provide a rich source of terms (imported from myGrid ontology and other sources), including generic terms like "multiple sequence alignment" (process) and "hand-editing" (process), as well as terms for algorithms ("progressive pairwise alignment"), software ("clustalw"), and so on.
  • various GUI design interfaces are possible. Open-source workflow systems like Taverna and Galaxy have GUIs for workflow design. It might be possible to adapt one of these. Or see http://exon.niaid.nih.gov/mobyleWorkflow/ for Vivek's example using a flow-chart metaphor as implemented in Google Web Toolkit.
  • Mentors will supply use-cases that represent examples of workflows to be described.
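To make "computable though not executable" concrete, a composed workflow could be reduced to subject-predicate-object statements and written out as N-Triples. The sketch below is an assumption-laden illustration: the example.org namespace, the term names and the has_input_from predicate are hypothetical placeholders, not CDAO terms, and a real tool would more likely use an RDF library and emit OWL RDF-XML.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/workflow#"  # hypothetical namespace, not CDAO


def describe(steps):
    """Turn (step_id, term, input_step_ids) records into RDF-style triples."""
    triples = []
    for step_id, term, inputs in steps:
        # What kind of process or artifact this step is.
        triples.append((BASE + step_id, RDF_TYPE, BASE + term))
        # Which earlier steps feed into it.
        for source in inputs:
            triples.append((BASE + step_id, BASE + "has_input_from", BASE + source))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# A three-step retrospective description, as a GUI might hand it over.
workflow = [
    ("step1", "sequence_input", []),
    ("step2", "multiple_sequence_alignment", ["step1"]),
    ("step3", "tree_inference", ["step2"]),
]
print(to_ntriples(describe(workflow)))
```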
Challenges 
  • develop dynamic interface (Drag-n-drop operation and graphical workflow interfaces)
  • understand ontologies and manage them from code
  • design and implement services and data-models to exchange between server and browser
  • ensure cross-browser compatibility among popular browsers
Involved toolkits or projects 
Degree of difficulty and needed skills 
Experience with the ontology API (like Jena, RDFlib or http://librdf.org/) and with the GUI API (whichever one is chosen for the project) would be helpful. Experience or knowledge with bioinformatics workflows would be helpful.
Mentors 
Vivek Gopalan (main), Arlin Stoltzfus

Interoperable exchange of gene tree reconciliation maps

Rationale 
Example of reconciled tree visualization
In gene tree reconciliation (GTR), one maps the nodes of a gene tree to the nodes of a species tree, and in so doing tells the story of gene duplication and loss, lineage sorting, lateral transfer, and other events in the evolution of the gene family (Page 1997). Despite a recent flowering of analytic approaches to GTR, there are limited tools for storing GTR maps, exchanging them between different software applications, visually exploring them, and comparing the locations of events on the species tree among different gene trees. As part of the iPlant Tree of Life project, a generic gene tree reconciliation database schema and GUI have been developed to allow rich exploration of GTR maps across many gene trees. However, the GUI application (written in Perl) and the database are integrated using ad-hoc interfaces, such that there is not a standard way to drive the GUI with a GTR map that is not from the database, or to output GTR maps from the database for downstream analyses, and this limits the interoperability of these important tools. Thus, a valuable project would be to develop the encoding for a GTR map within a standard phylogenetic exchange format, build support for this object representation within a standard bioinformatics data library, and extend the iPlant Perl API for I/O of these objects.
Approach 
  • Object representation of GTR maps within, e.g., BioPerl
  • Serialization of a GTR map derived from (1) TreeBeST output, e.g. by extension of PhyloXML and its parser in Bio::TreeIO and/or (2) the iPlant GTR database API.
  • A mechanism for resolution of the taxa used by the gene and species trees. A simple approach would be to use numeric NCBI TaxIDs for input labels. A more sophisticated approach could take advantage of iPlant’s taxonomic name resolution service to perform phyloreferencing with taxonomic name strings.
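For readers new to reconciliation, the kind of map being exchanged can be illustrated with the simple last-common-ancestor (LCA) mapping: each gene tree node is assigned to the smallest species tree clade that contains all species found below it, and a node is called a duplication when it maps to the same clade as one of its children. The sketch below is in Python rather than Perl, uses nested tuples instead of BioPerl objects, ignores losses and lateral transfer, and all names in it are made up for the example.

```python
def tips(tree):
    """Set of tip labels below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def nodes(tree):
    """All subtrees in post-order (children before parents)."""
    if not isinstance(tree, str):
        for child in tree:
            yield from nodes(child)
    yield tree


def lca(species_tree, species_set):
    """Smallest clade of species_tree that contains every species in the set."""
    best = species_tree
    for node in nodes(species_tree):
        if species_set <= tips(node) and len(tips(node)) < len(tips(best)):
            best = node
    return best


def reconcile(gene_tree, species_tree, gene_to_species):
    """Return (mapping, events) for an LCA reconciliation."""
    mapping, events = {}, {}
    for g in nodes(gene_tree):
        species = frozenset(gene_to_species[t] for t in tips(g))
        mapping[g] = lca(species_tree, species)
        if not isinstance(g, str):
            duplicated = any(mapping[child] == mapping[g] for child in g)
            events[g] = "duplication" if duplicated else "speciation"
    return mapping, events


# Hypothetical gene family with one duplication before the human/mouse split.
species_tree = (("human", "mouse"), "fish")
gene_tree = ((("h1", "m1"), ("h2", "m2")), "f1")
gene_to_species = {"h1": "human", "h2": "human", "m1": "mouse", "m2": "mouse", "f1": "fish"}
mapping, events = reconcile(gene_tree, species_tree, gene_to_species)
for node, event in events.items():
    print(event, "at", mapping[node])
```

The resulting mapping (gene node to species node, plus an event label) is essentially the content that a standard encoding would need to carry.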
Involved toolkits
Degree of difficulty and needed skills 
  • Medium difficulty.
  • Knowledge of Perl, MySQL, and XML required for API development.
  • Familiarity with web services
  • Familiarity with SQL for topological queries on nested set indexed trees would be an asset.
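As background on the last point, nested set indexing gives every node a left and a right number from a depth-first walk, so that "is X a descendant of Y" becomes a pair of numeric comparisons, which SQL handles well. The following sketch shows the indexing in Python on a nested-tuple tree; the function names are invented, and nothing here reflects the actual iPlant schema.

```python
def nested_set(tree):
    """Assign (left, right) indices to every node by depth-first numbering."""
    rows = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter                      # number taken on the way down
        if not isinstance(node, str):
            for child in node:
                visit(child)
        counter += 1
        rows[node] = (left, counter)        # number taken on the way back up

    visit(tree)
    return rows


def is_descendant(rows, node, ancestor):
    """True if node lies strictly inside ancestor's (left, right) interval."""
    left, right = rows[node]
    a_left, a_right = rows[ancestor]
    return a_left < left and right < a_right


tree = (("A", "B"), "C")
rows = nested_set(tree)
print(rows)
print(is_descendant(rows, "A", ("A", "B")), is_descendant(rows, "C", ("A", "B")))
```

In a database the same test is a WHERE clause comparing the left and right columns of two rows, which avoids recursive queries for clade membership.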
Mentors 

Export ontology-based phenotype descriptions to the Encyclopedia of Life

Rationale 
Ontology-based phenotype descriptions provide an explicit semantic reference for the anatomical entities and phenotypic qualities being described. This allows phenotypic data to be the subject of complex queries, logical inference, and integration with other data sources. The Encyclopedia of Life provides species pages aggregating available knowledge about the features of all described species. Exposing the contents of ontology-based phenotype database to EOL would allow these semantically rich data to be integrated and browsed with many other data sources.
Approach 
Write code to export suitable ontology-based data stored in OBD, a generic knowledge management system for descriptive biological data into EOL's transfer schema. OBD is the back-end supporting the Phenoscape Knowledgebase, and therefore an ample supply of appropriately encoded phenotype descriptions is readily available.
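The text-generation part of such an export might look roughly like this: entity-quality annotations are grouped by taxon and joined into one readable sentence per taxon. The project itself calls for Java and real OBD queries; this Python sketch only shows the idea, and the record layout, the function name and the sample values are all invented placeholders.

```python
from collections import defaultdict


def describe_phenotypes(annotations):
    """Build one human-readable sentence of phenotypes per taxon.

    annotations: iterable of dicts with "taxon", "entity" and "quality" keys
    (a hypothetical flattening of ontology-based phenotype statements).
    """
    by_taxon = defaultdict(list)
    for row in annotations:
        by_taxon[row["taxon"]].append(f'{row["entity"]} {row["quality"]}')
    return {
        taxon: "Phenotypes: " + "; ".join(phrases) + "."
        for taxon, phrases in by_taxon.items()
    }


# Placeholder annotations; not real data.
annotations = [
    {"taxon": "Taxon A", "entity": "dorsal fin", "quality": "absent"},
    {"taxon": "Taxon A", "entity": "pelvic fin", "quality": "elongated"},
    {"taxon": "Taxon B", "entity": "dorsal fin", "quality": "present"},
]
for taxon, text in describe_phenotypes(annotations).items():
    print(taxon, "->", text)
```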
Challenges 
  • Working with the logical representation in OBD can present a small learning curve.
  • Generalizing the approach to work with other ontology-based descriptive databases would add complexity to the project.
Degree of difficulty and needed skills 
  • Medium difficulty
  • Knowledge of Java and XML programming
  • Understanding of RDF and RDFa would be helpful but could be learned during the project.
Mentors 
  • Jim Balhoff, Chris Mungall, Matt Yoder
  • Cyndy Parr (EOL)

Visualization of Protein 3D Structure Evolution

Visualization of protein 3D structure evolution
Rationale
The evolution of 3D protein structures (as opposed to DNA or protein sequences) is an important (but still relatively poorly studied) subject, for it allows functional differences of proteins to be rationalized at the molecular level and enables functional inferences about extant proteins, as well as ancestral ones. In order to aid the usage and analysis of molecular structures in an evolutionary context, we propose to develop a visualization application which allows images of phylogenetic trees to be interactively overlaid with images of 3D protein structures. One intriguing use of such an application is the display and analysis of ancestral, inferred protein structures (for more information on ancestral sequence reconstruction, see here, for example). Similar, but significantly simpler, applications are the overlaying of phylogenetic trees with protein domain architectures (example) or static images (example).
Approach
To aid the study of 3D protein structure evolution, this project aims to extend the Archaeopteryx tree explorer in such a way that it will be capable of: (i) displaying static images of protein structures for each node (which has a 3D structure associated with it) of the displayed tree, and (ii) by selecting/clicking such a static image, a structural viewer (such as Jmol) will be employed to display the protein structure interactively (zooming, rotation, etc.). Ideally, changing the view of a structure in the dynamic viewer (Jmol) will update the corresponding static image, giving the user a way to view all structures in a similar manner (e.g. zoomed into a binding pocket). To describe and store phylogenetic trees overlaid with protein structures, the phyloXML format will be used. Optionally (depending on time and student interest) it might be worthwhile to explore approaches for illustrating (by animation, possibly) the evolution of a given protein structure from its inferred ancestral shape to its present day form.
Challenges
  • The communication between Archaeopteryx and Jmol (or other 3D structure viewer) programs is expected to be the most challenging part of this project.
Involved toolkits or projects 
Degree of difficulty and needed skills 
  • Expected difficulty: Medium-Hard.
  • Proficiency in the Java language, and especially in the Java 2D Graphics API, is necessary.
  • Experience/interest in structural/molecular biology or biochemistry is required.
  • Experience with Jmol is a big plus.
Mentors 

PhyloPic Enhancements

Rationale 
PhyloPic is an open database of reusable organism silhouettes. All silhouettes are available under Creative Commons licenses, and each is linked to any number of taxonomic names. Names are arranged in a phylogenetic taxonomy, modified from classifications in ClassificationBank. Thus, images can be searched for phylogenetically.
Approach 
We have three possible projects in mind (though other / similar improvements may be suggested):
  1. Cladogram Generator - given an input phylogeny or a set of species names, create a diagram illustrated with silhouettes from PhyloPic. This can be accomplished using one of the PhyloPic APIs. The generator would be a useful tool for authors and educators seeking to quickly generate visual diagrams that communicate phylogenetic hypotheses.
  2. Annotation Support - allow taxonomic names and images to be "marked up" with additional data (age, size, etc.)
  3. Taxonomy Admin Tool - create a web-based tool for editing the taxonomy underlying PhyloPic
Depending on the student's abilities and progress, one or more of these sub-projects could be undertaken for the summer, with other possible extensions. For example, divergence time estimates could be added to the cladogram generator, or annotations could be imported from the Encyclopedia of Life.
Challenges 
  • Creating simple interfaces based on a complex web of relations between images and taxonomic identifiers.
  • Working with a web tool at an early alpha stage; there may be some kinks to work out!
Degree of difficulty and needed skills 
  • Medium to High Difficulty, depending on task
  • Cladogram Generator: familiarity with phylogenetic trees, familiarity with a graphical client-side technology such as Flash (including AMF remoting) or JavaScript (including JSON, AJAX) with SVG or HTML5 Canvas.
  • Annotation Support: HTML/JS/CSS, Python, SQL
  • Taxonomy Admin: HTML/JS/CSS, Python, Flex framework
Mentors 

Automated Application of Phylogenetic Nomenclature

Rationale 

By tying a name to a phylogenetic definition, the taxon indicated by the name is unambiguous under a given phylogenetic context. (See the PhyloCode for more information.) Phylogenetic definitions may be modelled mathematically, which means that they can be stored as MathML. This means that tools can be created that automatically label nodes in phylogenies.

Approach 

Create a command-line tool that reads phylogenetic data and applies MathML expressions, outputting data in a specified format. For example, given these inputs:

  • A phylogenetic hypothesis, encoded as a Newick file. (E.g., (Lacerta agilis, (Ornithorhynchus anatinus, (Didelphis marsupialis, Homo sapiens))))
  • A series of phylogenetic definitions, encoded as a MathML file. (E.g., indicating that Mammalia := CLADE(Ornithorhynchus anatinus, Homo sapiens), Theria := CLADE(Didelphis marsupialis, Homo sapiens), and Amniota := CLADE(Lacerta agilis, Homo sapiens)).
  • A flag indicating that the output format should be Newick.

... the tool should output a Newick file showing the phylogeny with names applied. (E.g., (Lacerta agilis, (Ornithorhynchus anatinus, (Didelphis marsupialis, Homo sapiens)Theria)Mammalia)Amniota)
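The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PyMathema's API: the definitions are given as a plain dictionary rather than read from MathML, every definition is assumed to name the smallest clade containing its two specifiers, and the function names (parse_newick, apply_definitions, to_newick) are hypothetical. The parser handles only bare names, with no branch lengths, quoting, or comments.

```python
def parse_newick(text):
    """Parse a simple Newick string (bare names only) into nested dicts."""
    pos = 0

    def skip_spaces():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def read_node():
        nonlocal pos
        skip_spaces()
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(read_node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        return {"name": text[start:pos].strip(), "children": children}

    return read_node()


def tips(node):
    """Return the set of tip names below (or at) a node."""
    if not node["children"]:
        return {node["name"]}
    return set().union(*(tips(child) for child in node["children"]))


def smallest_clade(node, specifiers):
    """Return the deepest node whose tips include all specifiers."""
    for child in node["children"]:
        if specifiers <= tips(child):
            return smallest_clade(child, specifiers)
    return node


def apply_definitions(tree, definitions):
    """Label nodes in place; definitions maps a clade name to its specifiers."""
    all_tips = tips(tree)
    for name, specifiers in definitions.items():
        wanted = frozenset(specifiers)
        if not wanted <= all_tips:
            raise ValueError(f"{name}: specifier missing from the tree")
        smallest_clade(tree, wanted)["name"] = name
    return tree


def to_newick(node):
    """Serialize the nested dicts back to a Newick string."""
    if node["children"]:
        inner = ", ".join(to_newick(child) for child in node["children"])
        return "(" + inner + ")" + node["name"]
    return node["name"]


tree = parse_newick(
    "(Lacerta agilis, (Ornithorhynchus anatinus, "
    "(Didelphis marsupialis, Homo sapiens)))"
)
definitions = {
    "Mammalia": ("Ornithorhynchus anatinus", "Homo sapiens"),
    "Theria": ("Didelphis marsupialis", "Homo sapiens"),
    "Amniota": ("Lacerta agilis", "Homo sapiens"),
}
print(to_newick(apply_definitions(tree, definitions)))
```

If two definitions resolve to the same node, this sketch silently keeps the last name; a real tool would need a policy for such synonyms.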

The logic for evaluating MathML expressions under phylogenetic hypotheses has been created as part of PyMathema. This tool should be built as part of that library.

This project may be extended by adding further file formats (NeXML, RDF, etc.).

More information is available online at Names on Nodes.

Challenges 
  • Working with a partially existing code base.
  • Creating or co-opting parsing functionality for various file formats.
  • Determining a useful and usable interface.
  • Some phylogenetic definitions require information about apomorphies or whether taxa are extinct or extant. These pose special problems which need to be addressed.
Degree of difficulty and needed skills 
  • Medium Difficulty
  • XML
  • Python
  • understanding of graph theory
Mentors 

Web services for Jalview

Rationale 
Jalview is a multiple alignment editor written in Java. It is widely used in a variety of web pages (e.g. the EBI ClustalW server and the Pfam protein domain database) but is also available as a general-purpose alignment editor. The aim of this project is to enhance Jalview's capabilities with either a phylogenetic analysis package such as MrBayes or a profile searching package such as HMMER 3. Either should be developed as a JAva Bioinformatics Analysis Web Service (JABAWS) to make it accessible from Jalview or other software systems. See also background for the project.
Approach 
In this project the student will first create a JABA wrapper for either the MrBayes or the HMMER 3 software package (complete with all necessary tests), and then expose it as a web service, making it available to clients like Jalview. Finally, the code should be incorporated into the JABAWS server and virtual machine packages.
Challenges 
The student will need to build their solution within the existing JABAWS framework and so will need to learn this framework and its programming paradigm. A prerequisite is a broad understanding of technologies ranging from core Java programming to the principles of SOAP web services. The student will also need to create the data structures for handling data appropriate to the task. Finally, both software suites, MrBayes and HMMER 3 (especially the former), are complex software packages, and wrapping their functionality will be a challenge in its own right.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Moderate to Difficult. Interested students should have general knowledge of SOAP web services, experience in Java-based software development, and some understanding of either phylogenetic analysis or profile-based similarity searches.
Selection criteria (not exhaustive list, but will give you some guidance) 
  • Quality of proposal (feasible given your skills and available time, useful outcome)
  • Interaction (response to emails, back-and-forth regarding proposal)
  • The coding exercise is optional, but it would help your application if you wrote a parser for HMMER 3 hmmbuild output.
Mentors 
Peter Troshin, Jim Procter

PhyloGeoRef: A Java library for mapping phylogenetic trees and geographical information in KML

Rationale
Many phylogenetic studies include globe-spanning sample sites. It is useful for researchers to be able to visualize the relatedness of their organisms of interest both evolutionarily and geographically. The goal of this project would be to extend a geo-referencing library started in 2010 for parsing and displaying geographical and phylogenetic information. The main aim would be to bring together geographical and phylogenetic information in a way that is usable and useful to the user. Given a tree file and geographical information as input, the library will return a tree with phylogenetic and geographical information in a format such as KML, shapefile, or a geo-referenced phylogenetic format such as NeXML.
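To give a feel for the KML side of this, here is a minimal Python sketch that writes one KML placemark per tree tip. It is not the PhyloGeoRef API (the library itself is written in Java and uses JAXB); the function name tips_to_kml and the sample coordinates are hypothetical, and branches between nodes are deliberately left out.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def tips_to_kml(tip_coords):
    """Build a KML document with one Placemark per tip.

    tip_coords maps a tip label to (latitude, longitude) in decimal degrees.
    """
    ET.register_namespace("", KML_NS)  # serialize with a default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for label, (lat, lon) in sorted(tip_coords.items()):
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = label
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude[,altitude].
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical sampling sites for two tips of a tree.
print(tips_to_kml({"Taxon A": (35.9, -78.9), "Taxon B": (-1.3, 36.8)}))
```

Drawing the tree itself would additionally need coordinates for internal nodes and LineString elements for the branches, which is where most of the visualization decisions lie.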
Approach 
The PhyloGeoRef project began in 2010 as a Google Summer of Code project. Among the most needed but still missing pieces are support for different file types, including NeXML, and a way to integrate with databases. Also, several parts need further improvement to increase their utility, such as visualization of very large, globally spanning trees; additional statistics about the phylogenetic trees and the organisms on them; and updates to file format parsers to reflect changes in those formats. Some possible approaches to solving these issues are outlined in a blog post.
Challenges 
The end product of this library is a very visual KML file for Google Earth. The more information that can be cleanly and clearly displayed the better, and thus someone with a good eye for data visualization would be ideal.
Involved toolkits
  • The library is written in Java. There's nothing too complex at the moment so if you don't know Java but are already familiar with an object oriented language, you could still hit the ground running with enough advance preparation.
  • JAXB is used for writing the KML.
  • Experience with KML is a big plus!
Degree of difficulty
Easy to medium. At a minimum, knowledge of an object-oriented programming language, preferably Java, is required. At least basic knowledge of phylogenetics will help significantly.
Mentors
Kathryn Iverson, David Kidd

Improving ape

Rationale 
Since its first release in 2002, ape has become widely used by evolutionary biologists for data analysis, and by developers who write their packages in R. Currently, 33 packages on CRAN depend on or suggest ape (plus a dozen others on R-Forge and an unknown number of R scripts available on personal webpages). Over the years, the data structures introduced in ape have become implicit standards, particularly the class "phylo" for phylogenetic trees. The goal of this project idea is to help stabilize and optimize internal functions and data structures in ape.
Approach 
A project proposal for this idea would best focus on two areas:
  • Input/output of phylogenies from/to Newick and NEXUS files. One goal is to improve speed through the use of C code: this has already been done, but it works only in some cases, and the objective here is to extend the C code to the remaining situations. A second goal would be to support parsing additional information that is often placed within square brackets. Square brackets are in fact used for comments in the Newick syntax, but some programs use them to output supplementary information (e.g., posterior probability parameters). A related target for improvement is better diagnostics.
  • Plotting. The main objective would be to improve the practical functionalities. This will mainly involve a reorganization of the existing R and C code used to plot trees in ape. The specific targets are:
    • computing the coordinates of nodes without plotting;
    • plotting tree(s) on top of an existing graph;
    • plotting trees with the grid graphical package;
    • exploring how to rotate/scale the whole tree;
    • improving cophyloplot().
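For the square-bracket annotations mentioned under input/output above, the following Python sketch shows one possible first step: separating the bracketed text from the tree string while remembering where each comment was attached. It is an illustration only, not ape code (which is R and C); the function name split_comments and the sample annotation are hypothetical, and nested brackets or brackets inside quoted labels are not handled.

```python
import re

COMMENT = re.compile(r"\[([^\]]*)\]")


def split_comments(newick):
    """Return (newick_without_comments, [(offset, comment_text), ...]).

    Each offset is the position in the cleaned string at which the comment
    originally appeared, so it can later be matched to the preceding node.
    """
    cleaned_parts = []
    comments = []
    last_end = 0
    removed = 0  # number of characters stripped so far
    for match in COMMENT.finditer(newick):
        cleaned_parts.append(newick[last_end:match.start()])
        comments.append((match.start() - removed, match.group(1)))
        removed += match.end() - match.start()
        last_end = match.end()
    cleaned_parts.append(newick[last_end:])
    return "".join(cleaned_parts), comments


cleaned, comments = split_comments("((A:1,B:1)[&prob=0.98]:2,C:3);")
print(cleaned)   # ((A:1,B:1):2,C:3);
print(comments)  # [(10, '&prob=0.98')]
```

The cleaned string can then go through the existing fast parser, and the recorded offsets identify the node each annotation belongs to.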
Challenges 
Because ape is used in a wide range of applications, it needs to be stable and fast. Two main challenges result from this: thinking about the objectives in terms of applicability for a wide range of users, and (re)coding efficiently where needed. The "vision" of the proposed project will be as important as the programming skills to implement it, and determining the right implementation will require synthesizing the points of view of different users and other stakeholders (such as package developers).
Involved toolkits or projects 
ape, and of course R
Degree of difficulty and needed skills 
  • medium to hard
  • programming skills: R and C (or C++ instead of C)
Mentor 
Emmanuel Paradis


Putting order in the ape edge matrix

Rationale 
ape is a widely used R package for phylogenetic and comparative methods. Phylogenetic trees in ape are coded as objects of class "phylo". The 'edge' matrix is the basic element of this data structure: it is a two-column matrix of integers where each row represents a branch of the tree. The ordering of these rows is important for efficient computations and manipulations. The issue of this ordering has not yet been fully explored, and this could explain some of the problems and limitations encountered with ape. This project idea proposes to address the issue more exhaustively in order to arrive at a complete scheme for the ordering of edges.
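Why the row order matters can be illustrated with a small Python sketch (ape itself is R and C, and the function names here are hypothetical). Edges are (parent, child) pairs, with tips numbered 1..n and internal nodes from n+1 upward, following the convention of the "phylo" class. Once the edges are in postorder, a quantity such as the number of tips below each node can be accumulated in a single pass over the rows.

```python
def postorder_edges(edges, root):
    """Reorder (parent, child) edges so each edge follows all edges below it."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    ordered = []

    def visit(node):
        for child in children.get(node, []):
            visit(child)
            ordered.append((node, child))

    visit(root)
    return ordered


def count_tips(edges_in_postorder, n_tips):
    """Single pass over postordered edges: number of tips below every node."""
    counts = {tip: 1 for tip in range(1, n_tips + 1)}
    for parent, child in edges_in_postorder:
        counts[parent] = counts.get(parent, 0) + counts[child]
    return counts


# The tree ((1,2),3): tips 1-3, root 4, internal node 5, edges in preorder.
edges = [(4, 5), (5, 1), (5, 2), (4, 3)]
ordered = postorder_edges(edges, root=4)
print(ordered)                 # [(5, 1), (5, 2), (4, 5), (4, 3)]
print(count_tips(ordered, 3))  # {1: 1, 2: 1, 3: 1, 5: 2, 4: 3}
```

Feeding the preorder list straight into count_tips fails, because node 5 is used as a child before its own tip count has been accumulated; that dependence on row order is the kind of issue a complete ordering scheme would make explicit.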
Approach 
  • Valiente [1] describes formally the different orderings of edges in a tree, as well as various algorithms. This would serve as the theoretical basis for this project.
  • Schemes already exist for ordering of edges in ape and phangorn (also an R package for phylogenetic analysis). The emphasis should be on efficient computations in these two packages.
[1] Valiente, G. (2009) Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R. Chapman & Hall/CRC.
Challenges 
The project involves formal algorithmic concepts, and a main challenge will be to link them with practical applications and programming in R and C. Because ape has become a widely used package, it is challenging to design functions and data structures that will be usable by a wide spectrum of users.
Involved toolkits or projects 
ape, phangorn, and of course R
Degree of difficulty and needed skills 
  • medium to hard
  • programming skills: R and C
Mentors 
Klaus Schliep and Emmanuel Paradis

DIM SUM 2: GPU computing for an individual-based simulator of movement and demography

Example of DIM SUM simulation on the Hawaiian islands
Rationale 
Phylogeographic inference methods have been changing and growing rapidly in recent years. However, simulation-based testing of these methods has lagged behind their development due to the lack of flexible simulators. DIM SUM is a recently released simulator of phylogeographic histories based on the movement and reproduction of individuals on a continuous landscape. While very general and flexible, the current implementation of DIM SUM can be quite slow as the number of individuals increases or the resolution of the landscape becomes finer. The nature of the simulations lends itself quite naturally to massive parallelization through the use of GPUs, potentially leading to huge gains in speed.
Approach 
Simulating individual reproduction and movement involves two time-consuming steps that could be parallelized. First, individuals draw a random number of offspring, who then pick a random dispersal direction and distance. Next, local population density is maintained at a carrying capacity by removing individuals from parts of the landscape where the population density has become too high. Both of these steps involve tens to tens of thousands of operations that can be carried out independently. The program is currently written in Java, so GPU-based parallelization would involve use of the JCuda library. A particularly ambitious student could port the code to C++ and use the native CUDA libraries, although the amount of work involved in a complete port might not be feasible in 12 weeks.
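The two steps can be sketched serially in Python to show where the independent work lies. This is not DIM SUM's code or its model: the offspring and dispersal distributions, the square grid used to measure density, and all parameter names are placeholder assumptions chosen only to keep the example short.

```python
import math
import random


def step(parents, rng, max_offspring=3, mean_dispersal=1.0,
         cell_size=1.0, capacity=5):
    """Advance one generation; parents is a list of (x, y) positions."""
    # Step 1: reproduction and dispersal. Each parent is independent of
    # all others, so this loop is a natural candidate for one thread each.
    offspring = []
    for x, y in parents:
        for _ in range(rng.randint(0, max_offspring)):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            distance = rng.expovariate(1.0 / mean_dispersal)
            offspring.append((x + distance * math.cos(angle),
                              y + distance * math.sin(angle)))

    # Step 2: density regulation. Individuals are binned into grid cells,
    # and each cell is thinned to the carrying capacity independently.
    cells = {}
    for individual in offspring:
        key = (math.floor(individual[0] / cell_size),
               math.floor(individual[1] / cell_size))
        cells.setdefault(key, []).append(individual)
    survivors = []
    for members in cells.values():
        rng.shuffle(members)
        survivors.extend(members[:capacity])
    return survivors


rng = random.Random(1)
population = [(0.0, 0.0)] * 200
population = step(population, rng)
print(len(population))
```

In a GPU version the per-parent loop and the per-cell thinning would each become a kernel; the binning in between is the part that needs care, since many threads write to the same cell.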
Challenges 
  • Streamlining code that is not extensively commented
  • Efficiently transforming serial operations to a parallel framework
  • Making optimal use of the CUDA and JCuda libraries.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Medium to Difficult. Previous Java experience is required, and previous experience with GPU programming would be helpful.
Required coding exercise to be included with student applications
To demonstrate a working knowledge of Java and CUDA (through JCuda), please write a simple program that takes two vectors of 100,000 random integers between 0 and 10,000 and multiplies their entries pairwise, distributing the computations with JCuda.
Mentors
Jeremy Brown and Kevin Savidge

BioRuby forester

Rationale 
Forester is a collection of software libraries, mostly written in Java, for comparative genomics and evolutionary biology research. A prominent example of a tool based on forester is the phylogenetic tree explorer Archaeopteryx. Most of forester's use cases are associated with the use of evolutionary trees as tools for establishing (functional) relations between genes or proteins (for example protein function prediction with RIO) and comparing genome-based features between different species. Therefore, it implements objects representing evolutionary trees overlaid with biological data from other sources (e.g. protein domain architectures), as well as algorithms operating on these, such as the automated inference of ancestral taxonomies on gene trees, which has proven useful in the functional interpretation of large gene trees.
Most of these methods are currently only accessible via the command line or through the GUI of Archaeopteryx and are therefore difficult or impossible to use from other computer programs or toolkits (such as BioRuby). Although forester is mostly written in Java, it also contains components in Ruby ("evoruby"). These implement operations on multiple sequence alignments (MSAs) that are crucial in the development of workflows for automated, large-scale phylogenetic inference, including I/O and efficient MSA manipulation (such as deletion of all columns with a gap proportion larger than a given threshold, and removal of short and/or redundant sequences).
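As an illustration of the kind of MSA manipulation meant here, the following Python sketch deletes every alignment column whose proportion of gaps exceeds a threshold. It is not the evoruby implementation (which is Ruby); the function name and the toy alignment are hypothetical, and "-" is assumed to be the only gap character.

```python
def drop_gappy_columns(alignment, max_gap_fraction):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment maps a sequence name to its aligned sequence; all sequences
    must have the same length.
    """
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences in an alignment must have equal length")
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in sequences) / len(sequences)
        <= max_gap_fraction
    ]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in alignment.items()}


msa = {"a": "AC-GT", "b": "A--GT", "c": "ACTG-"}
print(drop_gappy_columns(msa, 0.5))
# {'a': 'ACGT', 'b': 'A-GT', 'c': 'ACG-'}
```

Only the third column (two gaps out of three sequences) exceeds the 0.5 threshold and is removed.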
Approach 
The goal would be to develop a framework for accessing forester's central algorithms and applications from within BioRuby. It is expected that this project will be implemented in the form of a BioRuby plugin in order to avoid creating additional dependencies for the main BioRuby distribution. Full two-way access between the Java and Ruby languages can be accomplished by using JRuby as the underlying platform.
Depending on the level of experience and skills of a student, a project proposal could also include either or both of the following additional goals.
  1. BioRuby and the "evoruby" components of forester partially overlap in functionality. You could incorporate MSA management functionality present in "evoruby" but missing in BioRuby into the BioRuby distribution. This would not only make that functionality immediately accessible to all BioRuby users, but would also allow a larger community of developers to participate in maintenance and future development of these components.
  2. Display gene conversions. This would entail developing a parser for GENECONV output and using the newly developed BioRuby-forester link to directly display gene conversions within Archaeopteryx.
Challenges
  • The student needs to learn two disparate toolkits, BioRuby and forester.
  • The project involves two programming languages, Ruby and Java.
  • Need to understand the BioRuby plugin system.
Involved toolkits or projects 
Degree of difficulty and needed skills 
Expected difficulty: Medium. Proficiency in at least one of the two involved programming languages, Ruby and Java, is necessary. Experience/interest in molecular evolution or comparative genomics is required, and experience with BioRuby or forester will help.
Mentors 
Christian Zmasek, Pjotr Prins, Raoul J.P. Bonnal

Extending Jalview's support for handling RNA

Rationale 
RNA is a hot topic in bioinformatics, and during the 2010 Summer of Code, Lauren Lui added new support for RNA to Jalview. Her work enabled it to import and visualize RNA family alignments, colour them by their helix/stem structure, and properly display the WUSS secondary structure annotation for an RNA family. However, Jalview still doesn't have any way of visualising consensus correlated base pair motifs, working with RNA sequence database search tools, or visualizing RNA secondary structure diagrams.
Approach 
Build on the work of Lauren Lui during her 2010 Google Summer of Code project (see her blog for details) and extend it to allow Jalview to calculate RNA base pair conservation scores using secondary structure annotation on an RNA alignment. Then, either look at implementing web service clients for accessing RNA secondary structure prediction services, or incorporating the VARNA secondary structure visualization program.
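To make the base pair conservation idea concrete, here is a small Python sketch (Jalview itself is Java). It reads the paired columns from a WUSS-style annotation line, in which nested pairs are written with matching <>, (), [] or {} symbols, and then scores each pair. The scoring rule used here, the fraction of sequences showing a Watson-Crick or G-U pair, is an assumption made for illustration and not a measure defined by Jalview; pseudoknot letters are ignored, and the function names are hypothetical.

```python
CLOSERS = {">": "<", ")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSERS.values())
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}


def paired_columns(structure):
    """Return sorted (i, j) column pairs from a WUSS-style annotation line."""
    stacks = {opener: [] for opener in OPENERS}
    pairs = []
    for column, symbol in enumerate(structure):
        if symbol in OPENERS:
            stacks[symbol].append(column)
        elif symbol in CLOSERS:
            pairs.append((stacks[CLOSERS[symbol]].pop(), column))
    return sorted(pairs)


def pair_conservation(sequences, structure):
    """Fraction of sequences with a canonical pair at each paired column."""
    scores = {}
    for i, j in paired_columns(structure):
        hits = sum((seq[i].upper(), seq[j].upper()) in CANONICAL
                   for seq in sequences)
        scores[(i, j)] = hits / len(sequences)
    return scores


structure = "<<<..>>>"
sequences = ["GGCAAGCC", "GACAAGUC", "GGAAAGCC"]
print(paired_columns(structure))  # [(0, 7), (1, 6), (2, 5)]
print(pair_conservation(sequences, structure))
```

In this toy alignment the two outer pairs are canonical in every sequence, while the innermost pair is broken in the third sequence and scores 2/3.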
Challenges 
  • Extending Jalview's data model with efficient structures for handling secondary structure data
  • Implementing new interactive alignment analysis routines
  • Developing SOAP or REST web service clients
  • Integrating one Java-based interactive visualization tool into another
Degree of difficulty and needed skills 
Medium to Hard, depending on the path taken. Requires Java programming experience (file handling and parsing, Swing GUI development, web services, data models); familiarity with or interest in RNA sequence analysis will certainly help.
Alternative ideas 
There are other desirable areas of functionality that are presently lacking in Jalview. If RNA isn't to your taste, take a look at the list of potential Jalview projects.
Mentors 
Jim Procter

Mentors

What should prospective students know?

Before you apply

  • Look at our project ideas to get an overall sense for the kinds of projects we support.
  • Pick the idea that appeals most to you in terms of goals, context, and required skills, or you can apply with your own project idea.
  • If you want to apply with your own idea, contact us early on so we can try to find a mentor. Alternatively, you can suggest a candidate mentor to us if you know one.
  • Ask us questions about the project idea you have in mind.
  • Write a project proposal draft, include a project plan (see below), and bounce those off of us.

Have I mentioned yet that you should be in touch with us before you apply?

When you apply

When applying, (aside from the information requested by Google) please provide the following in your application material.

  1. Your interests, what makes you excited.
  2. Why you are interested in the project, why you are uniquely suited to undertake it, and what you anticipate gaining from it.
  3. A summary of your programming experience and skills.
  4. Programs or projects you have previously authored or contributed to, in particular those available as open-source, including, if applicable, any past Summer of Code involvement.
  5. A project plan for the project you are proposing, even if your proposed project is directly based on one of the ideas above.
    • A project plan in principle divides up the whole project into a series of manageable milestones and timelines that, when all accomplished, logically lead to the end goal(s) of the project. Put in another way, a project plan explains what you expect you will need to be doing, and what you expect you need to have accomplished, at which time, so that at the end you reach the goals of the project.
    • Do not take this part lightly. A compelling plan takes a significant amount of work. Empirically, applications with no or a hastily composed project plan have not been competitive, and a more thorough project plan can easily make an applicant outcompete another with more advanced skills.
    • A good plan will require you to thoroughly think about the project itself and how one might want to go about the work.
    • We strongly recommend that you bounce your proposed project and your project plan draft off our mentors by emailing phylosoc@nescent.org (see below). Through the project plan exercise you will inevitably discover that you are missing a lot of the pieces - we are there to help you fill those in as best as we can.
    • Note that the start of the coding period should be precisely that - you start coding. That is, tasks such as understanding a code base, or investigating a tool or a standard, should be done before the start of coding, during the community bonding period. If you can't accomplish that due to conflicting obligations during the community bonding period, make that clear in your application, and state how you will make up the lost time during the coding period.
    • For examples of strong project plans from prior years, look at our Ideas pages from those years, specifically the Accepted Projects section (e.g., 2008, 2009). Visit the listed project pages; many have the initial project plan (which was subsequently updated and revised as the real project unfolded). Good examples are PhyloXML Support in BioRuby, PhyloXML support in Biopython, or Mesquite Package to View NeXML Files, all 2009 projects.
    • We don't expect you to have all the experience, background, and knowledge to come up with the final, real work plan on your own at the time you apply. We do expect your plan to demonstrate, however, that you have made the effort and thoroughly dissected the goals into tasks and successive accomplishments that make sense.
  6. Your possibly conflicting obligations or plans for the summer during the coding period.
    • Although there are no hard and fast rules about how much you can do in parallel to your Summer of Code project, we do expect the project to be your primary focus of attention over the summer. Our view is that it should be the same as an internship you take up over the summer, except this one is remote (and almost certainly more fun :-) If you look at your Summer of Code project as a part-time occupation, please don't apply for our organization.
    • That notwithstanding, if you have the time-management skills to manage other work obligations concurrent with your Summer of Code project, feel encouraged to make your case and support it with evidence.
    • Most important of all, be upfront. If it turns out later that you weren't clear about other obligations, at best (i.e., if your accomplishment record at that point is spotless) it destroys our trust. Also, if you are accepted, don't take on additional obligations before discussing those with your mentor.
    • One of the most common reasons for students to struggle or fail is being overstretched. Don't set yourself up for that - at best it would greatly diminish the amount of fun you'll have with your Summer of Code project.
  7. If the project idea you are applying for suggests or requests completion of a coding exercise, provide a pointer to your results (such as code you wrote, output, test results). Not all project ideas have this requirement; if in doubt, ask by emailing the mentors list.

Get in touch with us

Please send any questions you have and ideas and work plans for projects you would like to propose to phylosoc@nescent.org. This will reach all mentors and our organization administrators.

  • We strongly recommend you do this even if you want to work directly on one of our project ideas above. It gives you an opportunity to get feedback on what our expectations might be, and you might want to ask for more specifics.
  • Also, you can bounce your project plan off of us. In the past those students who engaged with us before submitting their application invariably had vastly better applications and project plans.
  • Keep in mind that we are most interested in students who give us evidence that they already have, or might develop, a sustained interest in becoming future contributors to our open source projects. How do you convey that if the first time we see something from you is when we score your application?
  • The value of frequent and early communication in contributing to a distributed and collaboratively developed project can hardly be overemphasized. The same is true for becoming part of a community, even if only temporarily. If you can't find the time to engage in communication with your prospective organization before you have been accepted, how will you argue that you will be a good communicator once accepted?

If you are interested in hearing from our previous Summer of Code mentors or students, you can also send email to wg-phyloinformatics@nescent.org. It's a larger list, so in that case try to be specific about what kinds of people you'd like to hear back from.

Other information

Reference Facts & Links: Google Summer of Code 2011

  • Mentoring organizations applied between February 28 and March 11, 2011.
  • Students apply online between March 28 and April 8, 2011. The eligibility requirements for students are in the GSoC FAQs.
  • After close of student applications, the mentors of the organizations score the applications, and Google allocates slots to each organization. The top scoring N students for each organization, N being the number of slots allocated, get accepted and announced on April 25, 2011.
  • The coding period is from May 23 to August 22, 2011. Between acceptance and the start of coding, students are expected to acquaint themselves with the project, the code bases involved, and the respective developer community, as well as set up everything needed for developing and committing (code repository access, mailing list subscription, software and dependencies installation, compiling, etc).
  • Development occurs on-line; there is no requirement or expectation to travel, either for students or for mentors.