Phyloinformatics Summer of Code 2014


Unfortunately, NESCent was not accepted as a mentoring organization for GSoC 2014.

If you are a student interested in working on one of the project ideas below, or any of the others on this wiki, then you should contact the mentor, and submit your application to one of the bioinformatics-related organisations in this year's summer of code. Links to their melange pages are given below:

  • Open Bioinformatics Foundation - umbrella organisation for the Bio-* (python, java, perl, ruby, lib, sql, emboss) projects, and sister organisation to NESCent.
  • Ruby Science Foundation - ideas currently include matrix analysis, cross-language interfacing, D3-based visual analytics, and a Ruby Notebook (thanks to Pjotr Prins for the recommendation!)
  • bio4j - graph-based bioinformatics data integration platform
  • BioJS - jQuery-based framework for interactive biological data visualisation components in Javascript
  • National Resource For Network Biology - all things biological network/systems biology related (Cytoscape(.js)?, WikiPathways, pathvisio(.js)?, etc.)
  • Shogun Machine Learning Toolbox - core interest is Machine Learning algorithms, but applications include computer vision and bioinformatics
  • Centre for Computational Medicine at SickKids, Toronto - primary developers of Savant/MedSavant and Phenotips. Project ideas include visualizing the evolution of tumors.

There are a number of science-related organisations in GSoC 2014. Here are some more that might interest you:

Medical Informatics, Imaging and Records

Math-oriented orgs:

  • lmonade - umbrella org for a host of symbolic math, image analysis, and other tools and libraries
  • Octave - numerical computation package with MatLab-style features
  • SymPy - symbolic math in Python
  • PRISM Model Checker - a platform for probabilistic modelling of systems (ideas include systems biology applications)

Visualization related orgs:

  • The Visualization Toolkit (VTK) - ideas range from supporting basic biochemistry data visualization through to high-performance renderer porting/implementation


Background on NESCent

Rather than mentoring students in a single open-source project, NESCent is an umbrella organization for many projects in evolutionary informatics. This means that NESCent projects use a wide variety of technologies and programming languages. We also participate in the GNOME Outreach Program for Women, which runs in parallel with Google Summer of Code.

How does GSoC work?

Very briefly, Google sponsors students to spend the summer working on open-source programming projects with mentoring organizations. NESCent is one such organization. Google provides tons of documentation about GSoC:

For mentors

If you fancy being a mentor or co-mentor:

  1. Sign up for the phyloinformatics mailing list, if you aren't already on it
  2. Join the Phyloinformatics community on Google Plus.
  3. Add your name to our list of mentors
  4. Pitch your project idea on the mailing list
  5. Add your idea to the wiki, in the Project Ideas section of this page

For prospective students

It is too early in the program to submit project proposals, but it is not too early to start contacting us and thinking about project ideas. Here are some things you can do now:

  • sign up for the phyloinformatics mailing list and introduce yourself
  • post on the G+ community page
  • read through the Project Ideas as they get posted below to see what projects you might be interested in
  • check out the source code of a project that interests you
  • make a contribution to an open-source project
  • read the GSoC FAQ and student guide

Contact information

Project Ideas

To add a project idea: sign in, and then click the edit link to the right of this section. There is a template in comment tags. Check out the tips and guidelines on creating project ideas. Please also pitch your idea on the mailing list, so that we can get some discussion going about the various project ideas. The aim is to finish discussing project ideas by the org-application phase of GSoC, so that when we review student proposals, we can evaluate each proposal without debating the merits of the underlying idea.


Computer vision of phylogenetic trees

Rationale 

More than 10,000 phylogenetic trees are published each year, but very few (ca. 4%) are supported by machine-readable data. Most are published as images, either in PDFs or as bitmaps (pixels). Because of the standardized representation, mainly graphics strokes and characters, it is possible to recreate the trees from the images, thereby providing a huge amount of valuable reusable data. Typical examples of trees are (Open access):

The primary task is to interpret these images as trees consisting of nodes and edges. The leaf nodes are normally labelled with the species (e.g. Homo sapiens); the branch nodes may or may not be labelled with weights or some other annotations. The edges are represented by (straight) lines, sometimes with labels. The length of a line is often important. Many trees are relatively simple (lines and labels), but some have extra annotations such as pictures of species.

Computer vision is a very well researched field, and this task is relatively easy because the images are born-digital, with a small range of colours and no noise. However, all CV requires heuristics and parameterisation specific to the field. It may be useful to develop scripts or more complex frameworks for trying different approaches automatically.

Our primary approach is Java; a set of image processing algorithms is in http://www.bitbucket.org/petermr/imageanalysis. This is part of a wider framework for reading PDFs and extracting all components; http://www.bitbucket.org/petermr/ami2 is a summary of the (Java) libraries. The result of this extraction is either a set of images (PNGs) or SVG diagrams, depending on the original document; these are then analysed.


Approach 

Either manually or using a classifier on the captions we extract a set of images that have trees. We analyze these for standard geometrical patterns such as right-angled lines and aligned text. Alternatively we can create trees with floodfill of pixels. The characters are recognised with OCR or already present and reassembled into words (such as species names). The tree and its tips are then assembled from the components.

The subtasks may include but are not limited to standard image processing tasks:

  • binarization and equalisation
  • line detection (Hough/Canny/Douglas-Peucker)
  • OCR or lookup
  • floodfill/convex_hull
  • thinning and fattening
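
These subtasks can be prototyped quickly. As an illustration (a minimal pure-Python sketch, not the ami2/imageanalysis Java code), thresholding a greyscale image and flood-filling its connected components is often the first step in separating tree strokes and labels:

```python
from collections import deque

def binarize(gray, threshold=128):
    """Threshold a greyscale image (rows of 0-255 ints) so that dark
    pixels (ink) become foreground 1 and light pixels become 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def flood_fill_components(img):
    """Label 4-connected components of foreground (1) pixels via BFS.
    Returns a list of components, each a list of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if img[r][c] == 1 and not seen[r][c]:
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components
```

On a born-digital tree image, large elongated components are candidate branch strokes while small compact components near line ends are candidate label characters for OCR.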


Challenges 

There is no single way to solve this, and different images may require different tools. Since the images are born-digital, the primitives are clean (no noise). The project should fit well with an introductory course on Computer Vision. Relatively few programs are used to generate the images, so it should be possible to recognize them and create program-specific heuristics. Challenges include:

  • creating a simple workflow allowing different approaches
  • font and character recognition
  • abstracting characters or creating/integrating simple learning machines for OCR. We have a novel approach which does not use machine learning, which students may like to explore
  • metrics for recall and precision

There is flexibility in the challenge. Initially we provide some clear tasks such as line segment detection or OCR. Then the student may either wish to tackle other clear tasks or develop their own extensions to the algorithms and heuristics.

Involved toolkits or projects 

PDF extraction and manipulation uses http://www.bitbucket.org/petermr/ami2. We prefer a 100% pure Java solution (the most problematic area is character OCR, and this would be a useful isolated subtask). The target output is modelled with SVG and XHTML, for which we provide Java DOMs/XOMs. All projects are mavenised and use Jenkins for continuous integration.

Degree of difficulty and needed skills 

Little specific bioscience knowledge is required, so this is a good project for a computer scientist wanting to work in this area. Java is essential; Maven/JUnit are very useful. Knowledge of simple 2D computer vision and image manipulation is needed, along with SVG and XML.

Mentors 

Both mentors are active members of the Open Knowledge Foundation and its community. They are starting work on a project (PLUTo) to extract trees from the literature of which image analysis is an important part.

  • Peter Murray-Rust is a scientist with many virtual projects including http://www.blueobelisk.org and has 20 years of Java, XML and SVG experience and projects in natural language processing for science.
  • Ross Mounce is an evolutionary biologist from the University of Bath, in a group creating and reconstructing phylogenies.

Automated partitioning for morphological phylogenetics

Rationale 

PartitionFinder is software that co-estimates partitioning schemes and models of molecular evolution for phylogenetics. Partitioning allows different sets of sites in an alignment to be modelled independently, and choosing the right partitioning scheme can improve the estimation of phylogenetic trees, rates of evolution, and dates of divergence. Partitioning is an important step in phylogenetics, especially with large multi-locus datasets, and PartitionFinder is becoming increasingly widely used:

https://github.com/brettc/partitionfinder http://mbe.oxfordjournals.org/content/29/6/1695

Partitioning is routine for DNA and amino acid datasets, but is rarely used for morphological datasets. This is probably because it's hard to think of ways to divide a morphological matrix into sensible partitions. This project aims to address this, by providing methods to automatically partition morphological alignments.

In last year's GSoC, we implemented a method that automatically partitions alignments based on the features of the alignment itself, without the need for users to define partitions a priori. The aim of this project is to build on this approach to partition morphological data matrices. This would provide an objective and user friendly method of partitioning morphological matrices, and the hope is that it would then be simple for researchers to estimate and implement user-friendly partitioned models for morphological analyses.

Approach 

This project would involve implementing and testing methods to partition morphological alignments in PartitionFinder. The basic idea is to use RAxML's implementation of the MK model. This has some benefits (e.g. it's fast), and some drawbacks (e.g. it can't deal with polymorphic/uncertain sites, and it requires the character states in an alignment to include zero and not skip any integers). The details of the implementation are, of course, flexible. For example, it might be better to use a different implementation of the MK models. This would be for the student to choose.

Challenges 

There would be five challenges:

  • Writing and testing a parser and alignment writer for morphological data matrices;
  • Writing and testing code to re-factor morphological matrices (necessary because PartitionFinder will need to analyse matrices that are subsets of the input matrix, and RAxML has strict requirements for the properties of each matrix);
  • Implementing morphological models in PartitionFinder and testing the code;
  • Integrating morphological models into the iterative k-means method in PartitionFinder;
  • Testing the method on a series of real and simulated datasets.
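
The re-factoring step can be illustrated concretely. A hedged sketch (plain Python, not PartitionFinder's actual code): after taking a subset of columns, the observed states in each character must be remapped to contiguous integers starting at zero, as strict likelihood implementations such as RAxML's MK model require:

```python
def recode_matrix(matrix):
    """Recode each column (character) of a morphological matrix so the
    observed states become contiguous integers 0..k-1.
    matrix: list of rows (taxa), each a list of integer state codes."""
    recoded_cols = []
    for col in zip(*matrix):  # iterate over characters
        mapping = {s: i for i, s in enumerate(sorted(set(col)))}
        recoded_cols.append([mapping[s] for s in col])
    return [list(row) for row in zip(*recoded_cols)]
```

A real implementation would also need to handle missing data and polymorphic states, which this sketch ignores.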
Involved toolkits or projects 

PartitionFinder Python

Degree of difficulty and needed skills 

Moderate. You would need to be familiar with Python, and familiar with (or willing to learn) a bit of morphological phylogenetics (i.e., you need to know/learn exactly how maximum likelihood models of morphological evolution like the MK model work).

Mentors 

The mentors are a combination of a molecular-evolutionary biologist and Python programmer (Rob Lanfear), and a former professional Python programmer turned philosopher and molecular biology software programmer (Brett Calcott).

  • Rob Lanfear
  • Brett Calcott


Implement Haskell Parsing/Desugaring/Optimizing in C++

Rationale 

BAli-Phy is evolutionary biology software for estimating alignments, phylogenies, and other evolutionary parameters (github). It contains an embedded Haskell interpreter with the special property that it tracks the execution of the program. This allows it to cache results for expensive computations and recalculate them when certain "modifiable" variables change. Evolutionary models are represented as Haskell programs, allowing extraordinary expressive power. The implementation could benefit from a better parser, better translation into the "kernel", type inference, and optimizations such as inlining. This is a great opportunity to learn about the implementation of functional languages.

Approach

Students can select pieces of the 4 sub-projects and combine them to create a project that they would like to work on. One odd feature of this project is that we are implementing Haskell in C++, not in Haskell. This is similar to the Javascript Haskell Compiler (JSHC).

Challenges

1. Haskell Parser: Add layout mode.

Haskell uses whitespace, such as indentation, to communicate block boundaries. It does this by translating indentation into "{", ";", and "}" tokens during lexing and parsing. As a result, Haskell has a multi-stage lexing/parsing procedure:

  1. Lexer: find tokens
  2. Lexer: modify the token stream to add indentation tokens
  3. Lexer: modify the token stream to add "{", ";", and "}" tokens
  4. Parser: while parsing, if a block does not parse, try adding a "}" token.

This last stage allows the statement let a=2;b=3 in a+b to be translated into let {a=2;b=3} in a+b. BAli-Phy currently contains a lexer and parser written in Boost.Spirit, but it does not insert any "{", ";", or "}" tokens and thus ignores indentation.
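
The indentation-to-brace translation can be sketched in a few lines. The following is a deliberately simplified Python model of stages 2-3 (it handles only blocks opened by let/where/do/of, and omits the stage-4 parse-error rule that closes blocks on demand):

```python
def layout(lines):
    """Simplified Haskell-style layout: convert indentation into
    explicit '{', ';', '}' tokens. A sketch, not the full algorithm."""
    out = []
    stack = []        # columns of currently open implicit blocks
    open_next = False  # the next line opens a new block at its column
    for line in lines:
        text = line.strip()
        if not text:
            continue
        col = len(line) - len(line.lstrip())
        if open_next:
            stack.append(col)
            out.append("{")
            open_next = False
        else:
            while stack and col < stack[-1]:   # dedent closes blocks
                out.append("}")
                stack.pop()
            if stack and col == stack[-1]:     # same column: new item
                out.append(";")
        out.append(text)
        if text.split()[-1] in ("let", "where", "do", "of"):
            open_next = True
    while stack:
        out.append("}")
        stack.pop()
    return " ".join(out)
```

For example, a `do` block whose two statements share a column becomes an explicit brace-and-semicolon block.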


2. Translation into Haskell kernel (Removing syntactic sugar)

After parsing, Haskell syntax is translated into a subset of the original language called the "kernel". The kernel contains a much smaller number of operations and is easier to execute. For example, [1..] (indicating an infinite list of integers starting at 1) is translated into enumFrom 1.

Challenges here include the translation of complex case statements into simple case statements. For example, case x of (a,(b,c)) -> a+b+c would be replaced by case x of (a,z) -> case z of (b,c) -> a+b+c. Translation of complex case statements would include the translation of "as patterns" (@), "irrefutable patterns" (~), and "pattern guards" ((x:y) | even x) -- none of which are currently implemented. The translation should be accurate, but should also be efficient.
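
The nested-pattern translation described above can be modelled directly. A sketch (tuple-encoded expressions in Python, ignoring variable capture and handling only pair patterns):

```python
def desugar(scrut, pat, body, fresh=None):
    """Desugar a nested pair pattern into nested simple (two-variable)
    case expressions. Patterns are variable names (str) or 2-tuples of
    sub-patterns; expressions are nested tuples."""
    if fresh is None:
        fresh = ('z%d' % i for i in range(100))  # fresh variable supply
    if isinstance(pat, str):
        return ('let', pat, scrut, body)  # a bare variable binds directly
    left, right = pat
    lv = left if isinstance(left, str) else next(fresh)
    rv = right if isinstance(right, str) else next(fresh)
    inner = body
    if not isinstance(right, str):   # recurse into nested components
        inner = desugar(rv, right, inner, fresh)
    if not isinstance(left, str):
        inner = desugar(lv, left, inner, fresh)
    return ('case', scrut, (lv, rv), inner)
```

The document's example, case x of (a,(b,c)) -> a+b+c, desugars to case x of (a,z) -> case z of (b,c) -> a+b+c, with z a fresh variable.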

3. Implement Hindley-Milner type inference.

Haskell is a statically typed language. However, unlike C and many other statically typed languages, Haskell is able to infer the types of variables from how they are used. For example, for the function f x = x 1, the argument x must be a function that takes an integer as an argument.

Haskell uses the Hindley-Milner type system. The HM type system is able to represent functions whose arguments are also functions. For example, the x mentioned above would have type Int->a. Here a is a type variable, and this indicates a function from integers to any other type a.
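
The core of Hindley-Milner inference is unification of type terms. A minimal sketch (Python, with lowercase strings as type variables, capitalised strings as concrete types, and ('fun', arg, result) encoding arg -> result; no occurs check and no type classes):

```python
def is_var(t):
    """Type variables are lowercase names; 'Int' etc. are constants."""
    return isinstance(t, str) and t[0].islower()

def walk(t, subst):
    """Follow the substitution chain until t is not a bound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst=None):
    """Robinson-style unification: extend subst so t1 and t2 agree."""
    subst = dict(subst or {})
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        subst[t1] = t2
        return subst
    if is_var(t2):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) \
            and len(t1) == len(t2) and t1[0] == t2[0]:
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
        return subst
    raise TypeError('cannot unify %r with %r' % (t1, t2))
```

Unifying the type of the x above with a fresh variable yields exactly the Int -> a binding described in the text.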

The student may or may not wish to incorporate "type classes" into the algorithm. Either way is acceptable, although if type classes could be incorporated, that would be great.

4. Optimization of Haskell programs

The efficiency of Haskell code can be greatly improved by a combination of inlining and constant propagation. Inlining functions can also be helpful. For example, if we have let f x y = (x,y) in case (f 1 z) of (a,b)->a+b, then inlining allows us to replace f 1 z with (1,z), yielding case (1,z) of (a,b)->a+b. We are then able to replace this expression with 1+z. This increases efficiency, since we have eliminated both a let operation and a case operation.
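
The inline-then-fold idea can be demonstrated on a toy expression language. This sketch (not GHC's algorithm, and deliberately ignoring name capture) inlines known functions at call sites and then folds constant additions:

```python
def subst(expr, env):
    """Replace variables (strings) in a tuple-encoded expression."""
    if isinstance(expr, str):
        return env.get(expr, expr)
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(subst(e, env) for e in expr[1:])
    return expr

def simplify(expr, fenv=None):
    """Expressions: ints, variable names, ('add', a, b),
    ('letfun', name, params, fbody, body), ('call', name, arg, ...)."""
    fenv = fenv or {}
    if not isinstance(expr, tuple):
        return expr
    tag = expr[0]
    if tag == 'letfun':                      # record the definition
        _, name, params, fbody, body = expr
        return simplify(body, {**fenv, name: (params, fbody)})
    if tag == 'call':
        name, args = expr[1], [simplify(a, fenv) for a in expr[2:]]
        if name in fenv:                     # inline at the call site
            params, fbody = fenv[name]
            return simplify(subst(fbody, dict(zip(params, args))), fenv)
        return ('call', name, *args)
    if tag == 'add':
        a, b = simplify(expr[1], fenv), simplify(expr[2], fenv)
        if isinstance(a, int) and isinstance(b, int):
            return a + b                     # constant folding
        return ('add', a, b)
    return expr
```

Inlining f at its call site exposes the constant arguments, which the folding pass then eliminates, mirroring the let/case elimination in the example above.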

A method of performing these optimizations is described in the paper "Secrets of the Glasgow Haskell Compiler inliner". This would speed up code that uses the $ operator, and allow inference to run faster.

Involved Toolkits or projects
Degree of Difficulty and needed skills
  • The programming project does not require biological knowledge.
  • Medium - Hard
  • Some prior knowledge of functional languages and/or parsing/lexing is probably necessary.
Involved Developer Communities

Mainly this involves BAli-Phy, and perhaps the Haskell community.

Mentors

Anchoring Biodiversity Collections to Phylogenetic Data

Rationale 

Anchoring biodiversity museum collections to a framework of known phylogenetic relations would allow for searching a museum collection from a phylogenetically informed perspective as well as provide a mechanism to make direct use of phylogenetic information for a given set of museum specimens. This platform would be useful to curators working with museum collections as well as provide a mechanism to generate on-the-fly tree visualizations for education and outreach in museum exhibits.

Approach 

The goal would be to develop a command line implementation or web API that allows users to query a database of museum objects that have been mapped to a phylogeny. If time allows, this framework could be wrapped in a web GUI that provides an interface to the underlying query environment. The supporting framework would need to make use of a data store to link museum objects to their phylogenetic context. This framework could be based on an RDBMS data store such as BioSQL or Chado that includes a mapping of museum objects to trees stored within the database, or it could make use of a suite of RDF data stores that can be accessed in federated queries. Coding the interface would ideally make use of Perl, but could be developed in an alternative language with the help of additional mentors.

Challenges 

Anticipated challenges include:

  • Developing the architecture for storing and accessing museum objects mapped to nodes in a phylogeny
  • Reconciling species names as used in biocollections to species names used in available phylogenies
Involved toolkits or projects 

The direction of the student led proposal would set the scope of the specific toolkits that would be required. Available toolkits that could contribute to the project include SQL/SPARQL, BioSQL or Chado, BioPerl/BioPhylo, and phylotastic. Relevant projects include:

Degree of difficulty and needed skills 
  • Medium Difficulty
  • Knowledge of SQL/SPARQL and Perl (or other language for developing the interface)
  • Understanding how to store trees within SQL data stores or RDF stores would be required. This could be learned as part of the onboarding process between the time of student acceptance and the start of coding
Involved developer communities

The developer communities that would be helpful over the course of project development include BioSQL or Chado, BioPerl and/or Bio-Phylo. The final repository of the project would depend in part on design decisions led by the student coder.

Mentors 

Developing Javascript tools for teaching phylogenetics

Rationale 

Several concepts in phylogenetic inference benefit from interactive software demonstrations. Many software technologies could be used to provide such tools, but a set of tools in javascript seems likely to reach the widest audience (as opposed to an iOS app, an Android app, or something written for desktop machines).

The goal of this project would be to build a small javascript library for drawing small phylogenetic trees and character matrices. The key feature would be a simple API so that other developers could add widgets that respond to user actions by changing the tree or data.

This would allow other people to write demos that show how phylogenetic trees affect a wide variety of comparative analyses. Users of the library would just need to write a few callback functions for handling changes to the tree and data, and they would not have to learn a lot about JS rendering of these objects or user interactions with those inputs.

Approach 

Mark Holder (MTH) has a couple of use-cases that could serve as the focal "applications" for the project, but it would be great if most of the code were written in a modular way that would allow people who can write a little JS to add new tree- or data- responsive widgets.

Use case demos:

1. Pattern probability and likelihood visualizer. The tool displays the probability of each of the 8 data patterns for a 4-taxon tree under the simplest phylogenetics model (the CFN model). As the user drags the nodes of the tree, the branches change length, which causes the probabilities to change. Users can upload a data set and visualize the frequency of patterns in the data set. Pattern frequencies (from the tree and from the data) are displayed as a grouped histogram. Maximizing the likelihood is a process of getting the patterns predicted by the tree to match the patterns in the data as closely as possible. Students can see the likelihood change, and see how certain combinations of branch lengths can lead to potentially misleading signals in the data.
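
For reference, the pattern probabilities such a widget would display are cheap to compute. A sketch (in Python rather than JS, and parameterised directly by per-branch change probabilities rather than branch lengths, to avoid committing to a particular rate parameterisation):

```python
from itertools import product

def p_change(p, a, b):
    """Transition probability across a branch whose total change
    probability is p (the CFN model is two-state and symmetric)."""
    return p if a != b else 1.0 - p

def cfn_pattern_probs(branch_p):
    """Site-pattern probabilities for the unrooted 4-taxon tree
    ((A,B),(C,D)). branch_p maps the five branches 'A', 'B', 'C', 'D',
    'internal' to change probabilities. Sums over the two internal
    node states with a uniform stationary distribution."""
    probs = {}
    for pattern in product((0, 1), repeat=4):
        total = 0.0
        for x, y in product((0, 1), repeat=2):  # internal node states
            pr = 0.5  # stationary frequency of state x
            pr *= p_change(branch_p['A'], x, pattern[0])
            pr *= p_change(branch_p['B'], x, pattern[1])
            pr *= p_change(branch_p['internal'], x, y)
            pr *= p_change(branch_p['C'], y, pattern[2])
            pr *= p_change(branch_p['D'], y, pattern[3])
            total += pr
        probs[pattern] = total
    return probs
```

The 16 raw 0/1 patterns collapse into 8 classes under the model's 0/1 symmetry, matching the 8 patterns the demo displays.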

2. Mark Holder's lecture at Woods Hole on phylogenetic topology testing has a series of slides (see slides 70-83 of HolderTestingLectureWoodsHole2013.pdf on [1]) that several students have told him were useful in visualizing how different resampling and topology tests work. These would be much more effective as an interactive tool (and with better design in general). The slides depict a slice of the same pattern frequency space as discussed in example 1, but displayed in barycentric coordinates. Ideally, a widget would show how trees map to the space (slides 72 and 73) interactively as the user dragged the branches of the tree. The slides depict the effect of bootstrapping (slides 79 and 80) ineffectively as a contour plot; generating bootstrapped samples and plotting the points would be much more effective.

Challenges 

Writing an intuitive API (even a small one) is always tough. Ideally, the library should assume that the user knows very little javascript. There are several people who teach phylogenetics and do some programming who would be willing to learn how to draw a plot (and refresh the plot when the inputs change), but do not want to learn anything about how to write an interactive tree widget.

The mentor (MTH) can code in JS a little, but is not an expert at all (in that sense he is a good naive, test user).

Involved toolkits or projects 

MTH is not up-to-date with the prior art in JS for interactive tree widgets.

For use case 1, there is existing code. It should not be used directly, but could serve as inspiration. Mark Holder (MTH) uses a Python app to demonstrate how likelihoods change in response to branch-length changes. But that app is hard to distribute, so students cannot interact with it (they just watch MTH). He briefly tried to port the tool to JS, but his JS skills are limited, and he regards the nodes-ed.js library as a failure.


Degree of difficulty and needed skills 

Moderate, the student would need to be relatively experienced in JS

Involved developer communities

?

Mentors 

Expanding an Open Tree of Life client library in Python

Rationale 

The Open Tree of Life project has a large set of curated phylogenetic hypotheses, and a large supertree with the goal of representing all species. The project is implemented using a web-service-oriented architecture.

A client-side python library (called peyotl) for interacting with these web services has been started, but it is in its infancy. Ideally this tool would allow interoperability with a more widely used bioinformatics library in python (such as DendroPy, BioPython, or PyCogent).

Approach 

Add more python functions that wrap calls to the Open Tree APIs, and more adaptors to other python toolkits.

Challenges 

The Open Tree APIs are in some flux and not fully documented (but see this page)

Involved toolkits or projects 

1. Open Tree of Life and peyotl

2-N. Whichever python libraries the student chooses to adapt.

Degree of difficulty and needed skills 

Low-Moderate

Involved developer communities

Open Tree of Life uses #opentreeoflife IRC channel on freenode.

The other communities depend on which libraries the student focuses on. The Arbor project may be relevant.

Mentors 


Add taxonomic name resolution to the EcoData Retriever to facilitate data science approaches to ecology

Rationale 

The EcoData Retriever is a Python based tool for automatically downloading, cleaning up, and restructuring ecological data. It does the hard work of data munging so that scientists can focus on doing science.

One of the challenges of ecological (and evolutionary) data is that the names of species are constantly being redefined. This makes it difficult to combine datasets to do interesting science. By automating reconciliation of different species names as part of the process of accessing the data in the first place, it will become much easier to combine diverse datasets in new and interesting ways.

Approach 

This project would extend the EcoData Retriever using Python to access one or more of the existing web services that reconcile species names (e.g., iPlant's Taxonomic Name Resolution Service) and automatically replace the species' scientific names in the generated databases. Specifically this would involve:

  • Object oriented programming in Python
  • Using Python to query web service APIs
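
The web-service call itself depends on which resolution service is chosen, so the sketch below assumes the resolution step has already produced a raw-name-to-accepted-name mapping (the hypothetical `resolved` dict), and shows only the replacement step on tabular records:

```python
def apply_resolved_names(rows, name_column, resolved):
    """Replace raw scientific names in tabular records with their
    resolved equivalents. rows: list of dicts; resolved: dict mapping
    raw name -> accepted name (building that dict from a TNRS web API
    is the other half of the project). Unknown names are kept as-is,
    and the input rows are left unmodified."""
    return [{**row, name_column: resolved.get(row[name_column],
                                              row[name_column])}
            for row in rows]
```

Returning new rows rather than mutating in place keeps the original download available for provenance checks.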
Challenges 

Scientific names are stored inconsistently across datasets, so it will be necessary either to modify the scripts that hold information on each dataset to indicate the location of the species information, or to use an existing ontology to identify that location automatically.

Involved toolkits or projects 
  • The EcoData Retriever
  • Python
  • The APIs for the taxonomic name resolution services.
  • Relational database management systems (RDBMS) including MySQL, PostgreSQL, SQLite
Degree of difficulty and needed skills 
  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of interacting with web services via APIs is a plus, but could be learned during the project
Involved developer communities

The EcoData Retriever primarily interacts via issues and pull requests on GitHub.

Mentors 

Improving reproducibility in science by adding provenance tracking to the EcoData Retriever

Rationale 

The EcoData Retriever is a Python based tool for automatically downloading, cleaning up, and restructuring ecological data. It does the hard work of data munging so that scientists can focus on doing science.

Science is experiencing a reproducibility crisis: it is becoming clear that published results often cannot be reproduced. Part of the challenge of reproducibility in data-science-style research is that most of the steps related to data (downloading it, cleaning it up, restructuring it) are either done manually or with one-off scripts, so this phase of the scientific process is not reproducible. The EcoData Retriever already solves many of these problems, but it doesn't currently keep track of exactly what has been done, and therefore fails to support fully reproducible workflows.

Approach 

This project would extend the EcoData Retriever using Python to store all of the metadata necessary for full reproduction in an associated SQLite database.

Specifically this would involve:

  • Object oriented programming in Python
  • Design an SQLite database for storing provenance information or use an existing framework (e.g., http://code.google.com/p/core-provenance-library/)
  • Implement checks to make sure that the data is in the same form created by the Retriever when retrieving provenance information
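
A minimal sketch of what such provenance tracking could look like, using only the standard-library sqlite3 module (the table schema here is illustrative, not a proposed design):

```python
import hashlib
import sqlite3
import time

def init_provenance(conn):
    """Create a minimal provenance table; a real schema would also
    track script versions, source URLs, and cleaning steps."""
    conn.execute("""CREATE TABLE IF NOT EXISTS provenance (
        dataset TEXT, retrieved_at REAL, content_sha1 TEXT)""")

def record_retrieval(conn, dataset, raw_bytes):
    """Log when a dataset was fetched and a checksum of its contents,
    so a later run can verify the data is unchanged."""
    sha1 = hashlib.sha1(raw_bytes).hexdigest()
    conn.execute("INSERT INTO provenance VALUES (?, ?, ?)",
                 (dataset, time.time(), sha1))
    return sha1

def verify_retrieval(conn, dataset, raw_bytes):
    """Check current data against the most recent recorded checksum."""
    row = conn.execute(
        "SELECT content_sha1 FROM provenance WHERE dataset = ? "
        "ORDER BY retrieved_at DESC LIMIT 1", (dataset,)).fetchone()
    return row is not None and \
        row[0] == hashlib.sha1(raw_bytes).hexdigest()
```

This covers the third bullet above: retrieving the provenance record lets the Retriever confirm the data on disk is in the same form it originally created.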
Involved toolkits or projects 
  • The EcoData Retriever
  • Python
  • Relational database management systems (RDBMS) including MySQL, PostgreSQL, SQLite
  • Potentially an existing provenance library
Degree of difficulty and needed skills 
  • Moderate Difficulty
  • Knowledge of Python
  • Some experience with SQL
Involved developer communities

The EcoData Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

Creating an rOpenTree package as part of rOpenSci

Rationale 

Open Tree of Life is a project that has a large collection of phylogenetic trees, a reference taxonomy, and a draft of a complete tree of life. rOpenSci [2] builds R packages to access online data repositories. Having an rOpenSci package for accessing OpenTree data would benefit R users who need phylogenetic data as part of their analyses.

Approach 

This project would involve creating an rOpenTree R package to access the OpenTree APIs through R.

Challenges 

Anticipated challenges include:

  • It is not terribly hard to make an R client for a web API, but creating a good user interface can be difficult.
  • Creating robust functions that account for as many user errors as possible.
  • Integrating with the large suite of phylogenetics R packages already out there. A good start would be to convert data to commonly used R classes, especially the "phylo" class
  • Using phylogeny metadata that comes down from OpenTree is probably easiest with the NeXML format. We are working on a NeXML client for R here. Phylogeny classes like "phylo" don't allow for rich metadata.
Involved toolkits or projects 

R, Open Tree of Life, git, Github

Degree of difficulty and needed skills 

Some familiarity with R would be ideal. Likewise, an understanding of the structure of phylogenetic data would be helpful.

Involved developer communities

Open Tree of Life communicates through a mailing list and on IRC (#opentreeoflife on freenode). rOpenSci developers communicate through GitHub Issues pages for each respective software project (rOpenSci on GitHub).

Mentors 

Calculating selection in metagenomic datasets

Rationale 

Microbial communities are found in almost every environment on the planet and have been implicated in many biological processes, from climate change to human health and disease. Most microbial ecologists rely on metagenomics techniques to study communities, as most bacteria cannot be cultivated in the lab. Most metagenomic studies have focused on cataloguing what organisms are present and what they are capable of doing (functional analysis). Many researchers are now interested in finding out whether, and what kind of, evolutionary selective pressures are found within these communities. One way to determine selective pressures is to look at dN/dS (Ka/Ks) ratios. Traditionally, dN/dS ratios are calculated against a reference genome, but these are not available for the vast majority of microbes, so new techniques have been developed to address this. We would like to develop a tool that calculates dN/dS ratios for metagenomic datasets that do not have reference genomes.

Approach 

This project would involve developing an application that can take metagenomic sequence data and calculate the dN/dS ratio to determine if and what kind of selection is taking place within a population. One way to do this would be to align metagenomic reads to a consensus sequence generated from the reads and look for DNA bases that differ from the consensus sequence.

The general algorithm to be implemented is: align reads to a consensus sequence, detect where there are differences and determine if these differences change the protein encoded by the DNA.

The dN/dS ratio is: (# of observed non-synonymous mutations/# of non-synonymous sites) / (# of observed synonymous mutations/# of synonymous sites)

Synonymous and non-synonymous mutations are determined by aligning DNA reads to a consensus sequence, generated by a majority vote, and detecting where they mismatch. One fast algorithm for aligning DNA reads uses the Burrows-Wheeler transform. The algorithm itself does not account for mismatches in the alignment, so the student will have to implement a way to handle that. Several solutions have been suggested in the literature, so the student will need to evaluate which would be best for this application.
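
The majority-vote consensus and the ratio itself are simple to state in code. A sketch assuming pre-aligned, equal-length, gap-free reads (classifying each mismatch as synonymous or non-synonymous additionally requires the reading frame and codon table, omitted here):

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus from aligned, equal-length, gap-free
    reads; real data would need alignment and gap handling first."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*reads))

def dn_ds(nonsyn_obs, nonsyn_sites, syn_obs, syn_sites):
    """dN/dS as defined above: (observed non-synonymous mutations per
    non-synonymous site) / (observed synonymous mutations per
    synonymous site)."""
    return (nonsyn_obs / nonsyn_sites) / (syn_obs / syn_sites)
```

A ratio near 1 suggests neutral evolution, below 1 purifying selection, and above 1 positive selection; the engineering challenge is running this over billions of reads, not the arithmetic.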

A schematic of the overall approach can be found here.

Challenges 

The algorithm for calculating dN/dS is fairly straightforward. The biggest challenge will be doing this efficiently over millions of genes with billions of reads.

Involved toolkits or projects 
  • C++
  • Git
Degree of difficulty and needed skills 
  • Experience with C++
  • An interest in working with large datasets (100s of GB) and parallel processing
  • Knowledge or willingness to learn about methods for detecting selection
Involved developer communities

The long-term goal is to fold this application into the free and open-source program mothur, which provides a toolkit for microbial ecologists to analyze DNA sequence data. Right now mothur has little infrastructure for metagenomic data, but once that is in place we can merge in this application. We would greatly appreciate someone who will stick around in the community to help out with this later on.

Mentors 


Jalview and Archaeopteryx: Interactive visual analytics for multiple alignment and phylogenetic data

Rationale 

Jalview includes basic support for phylogenetic analysis and tree visualization, but it could do better. Archaeopteryx is a powerful Java-based phylogeny visualization tool built on the forester package, but it does not include any support for working with alignments or character matrices. Luckily, both are implemented with similar technology (Java with a Swing user interface), so there is no reason they cannot be integrated into a single application!

Approach 

This project would involve creating a prototype version of the Jalview Desktop that integrates Archaeopteryx with Jalview's existing alignment and PCA visualizations, replacing Jalview's built-in tree viewer. Once the integration is achieved, new directions might be profitably explored, including the ability to analyse gene-family alignments and trees in the context of species trees using Archaeopteryx's support for the NCBI species taxonomy.

Challenges 

Understanding the Jalview and Archaeopteryx data models and mapping between them to allow data exchange. Understanding and adapting both applications' GUI models to replace Jalview's existing tree viewer with an Archaeopteryx window. Extending Jalview to take advantage of Archaeopteryx's own analysis capabilities (e.g. taxon name analysis) and its support for phylogenetic formats such as phyloXML and NEXUS.

Involved toolkits or projects 
Degree of difficulty and needed skills 

This is most likely a medium to difficult project. You will need to know how phylogenetic trees relate to multiple sequence alignments, and have the confidence to hack Java Swing applications. There are lots of easy milestones in this project, but also some challenges, such as inferring taxon assignments with Archaeopteryx's routines using the annotation that Jalview retrieves from online databases.

Involved developer communities

The student would work with the primary developers of Archaeopteryx and Jalview, and interact via their respective email discussion lists and issue trackers as well as on the NESCent community mailing list.

Mentors 

Mind the Gaps: exploring the divergent regions of multiple sequence alignments

Rationale 

It is common practice to remove or ignore certain columns of an MSA when calculating phylogenies. However, these excluded regions are informative: they carry information about the physical constraints needed to support the evolving organisms, or about the introduction of novel function.

This project is about exploring these 'phylogenetically uninformative' regions in the context of an inferred phylogenetic history, by creating interactive alignment+phylogenetic tree visualizations that allow the functional and structural variations to be explored and annotated in-depth.

Approach 

Create a prototype version of Jalview with a visualization mode that allows the functional and structural relationships between the evolutionarily diverse regions of sequences to be explored. At its simplest, this could be new view modes that hide or summarise the regions used to calculate robust trees, and allow the dissimilar or gapped regions to be analysed by realignment or comparison with known phenotype. More advanced functionality could include processing the alignment using annotation or external services to automatically subdivide it into informative and uninformative regions - e.g. according to putative or known protein or RNA domain structure. In order to test the new functionality, you will need some accurate phylogenetic trees and their associated multiple alignments - these could be obtained by creating a client to import alignment data and phylogenies from online databases such as Treebase or TreeFam.
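One of the simpler milestones above, automatically flagging divergent or heavily gapped alignment regions, can be sketched roughly as below. This is an illustrative heuristic only; the thresholds and measures are assumptions, not Jalview's actual criteria, and the real implementation would be Java code inside Jalview's data model:

```python
# Rough sketch: flag alignment columns as 'divergent' when the gap
# fraction is high or residue identity is low. Thresholds and the
# measure itself are illustrative assumptions, not Jalview's criteria.

def divergent_columns(msa, max_gap_frac=0.5, min_identity=0.5):
    """msa: list of equal-length aligned sequences, '-' for gaps."""
    flagged = []
    for i in range(len(msa[0])):
        col = [seq[i] for seq in msa]
        gap_frac = col.count("-") / len(col)
        residues = [c for c in col if c != "-"]
        identity = (max(residues.count(r) for r in set(residues)) / len(residues)
                    if residues else 0.0)
        if gap_frac > max_gap_frac or identity < min_identity:
            flagged.append(i)
    return flagged

msa = ["MK-LV", "MKALV", "MR-IV", "MK-LV"]
print(divergent_columns(msa))  # [2]: column 2 is mostly gaps
```

Flagged columns could then be hidden from the tree-calculation view and opened separately for realignment or annotation.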

Challenges 

Learning to work with Jalview's data model, web service client architecture. Creating and implementing an effective user interaction model for visual analysis of the divergent regions in multiple alignments. Semantic analysis of biological sequence annotation to interpret existing alignments and associated phylogenetic trees, allowing the divergent regions to be analysed with the new divergent region analysis mode.

Involved toolkits or projects 

Java 1.7. Jalview.

Degree of difficulty and needed skills 

This is most likely a medium to difficult project. You will need a reasonable understanding of how phylogenetic trees are calculated from multiple sequence alignments, and the confidence to learn enough Java to hack Java Swing applications. There are lots of easy milestones in this project, such as creating data retrieval clients and basic functionality to detect and automatically hide regions in sequence alignment views according to measures of divergence or available sequence annotation. However, there are also more challenging aspects, including understanding how a biologist might use the new functionality to quickly identify new functional elements from phylogenetic trees, sequence alignments and any available sequence annotation.

Involved developer communities

The student would work with the primary developer of Jalview and active members of the Jalview user community. Communication will be via Jalview's developer and user email discussion lists, the Jalview issue tracker and the NESCent community mailing list.

Mentors 
  • Jim Procter
  • Looking for a co-mentor for this project !

Build REST API for importing, searching, and downloading data in TraitDB to facilitate automated data collection and analysis

Rationale 

TraitDB is a Ruby on Rails web application designed to store a database of phenotypic traits. It is used at NESCent to support aggregation of trait data collected from literature by working groups.

TraitDB ingests data from CSV files, validates and parses them, and stores the data in a relational database. The focus so far has been on a customizable import engine and structure validator, available through a web interface.

One of the challenges of integrating an application into a scientific workflow is the reproducibility of its tasks. By providing external, scriptable interfaces for:

  • Validating CSV spreadsheets prior to import
  • Loading CSV data
  • Searching for trait data
  • Downloading subsets of data across projects
  • Accessing statistics for traits in the database

we can improve the value of the application and facilitate its use in data workflows.
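The first of those interfaces, validating CSV spreadsheets prior to import, can be sketched as follows. TraitDB itself is Ruby on Rails; Python is used here only for brevity, and the required column names are hypothetical, since TraitDB's import templates are customizable:

```python
# Minimal sketch of pre-import CSV structure validation, in the spirit
# of the scriptable validation interface described above. The required
# columns are hypothetical; TraitDB's import templates are configurable.
import csv
import io

REQUIRED = {"taxon", "trait", "value"}

def validate_csv(text):
    """Return a list of human-readable structural errors (empty if valid)."""
    reader = csv.DictReader(io.StringIO(text))
    errors = []
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    for n, row in enumerate(reader, start=2):  # header is line 1
        if any(not (row[c] or "").strip() for c in REQUIRED - missing):
            errors.append(f"line {n}: empty required field")
    return errors

good = "taxon,trait,value\nHomo sapiens,mass,62\n"
bad = "taxon,trait\nHomo sapiens,mass\n"
print(validate_csv(good))  # []
print(validate_csv(bad))   # ["missing columns: ['value']"]
```

Exposed over REST, such a validator lets scripts check spreadsheets and receive structured error reports before committing an import.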

Approach 

This project would extend the TraitDB Rails codebase to provide access to its data following best practices for RESTful web services. The interfaces above would be among the required services, along with a system for generating API keys for access control and other security best practices.

Challenges 

TraitDB is heavily focused on validating incoming data, creating order from potentially freeform CSV files, and any new interface for loading data must preserve this focus. Additionally, it will be necessary to relay useful information about data structure issues and background job progress. As datasets come to be loaded more quickly, performance must scale accordingly.

Involved toolkits or projects 
  • TraitDB
  • Ruby on Rails, and several open source gems
  • Relational database management systems (RDBMS) including MySQL, PostgreSQL, SQLite
Degree of difficulty and needed skills 
  • Moderate Difficulty
  • Knowledge of Ruby on Rails
  • Knowledge of RESTful web services and designing APIs is desired, but this could be a focus of your learning
Involved developer communities

TraitDB is developed on GitHub, using issues and pull requests.

Mentors 

Extend functionality of owlet query processor

Rationale 

owlet is a query expansion preprocessor for SPARQL. It parses embedded OWL class expressions (in Manchester Syntax) and uses an OWL reasoner to replace them with FILTER statements containing the URIs of subclasses of the given class expression (or superclasses, equivalent classes, or instances).

OWL ontologies are widely used within the biosciences to provide shared semantics for annotated data, such as categorization of gene function, anatomical structures, or cell types. Data annotated using these terms from bio-ontologies may be conveniently stored in an RDF triplestore as linked data. However, bio-ontologies are often highly expressive, representing complex relationships using much of the OWL feature set. The query-time reasoning capabilities provided by triplestore SPARQL endpoints typically do not support arbitrary OWL expressions such as can be handled by in-memory reasoners working with the Java-based OWL API.

owlet is currently in use by the Phenoscape project, but could be made more feature-complete.

Approach 

This project would implement needed features within the owlet query processor. These include:

  • Complete a partial implementation of a parser for OWL Manchester Syntax. The unimplemented expressions include most of the OWL "data literal" features. The parser is implemented using Scala's parser combinator API.
  • Extend owlet to interpret SPARQL 1.1 property path syntax for increased control over the terms returned by an OWL query. For example:
    • rdfs:subClassOf --> direct subclasses
    • rdfs:subClassOf? --> direct subclasses including self
    • rdfs:subClassOf+ --> all subclasses
    • rdfs:subClassOf* --> all subclasses including self
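The expansion step itself, including the property-path distinctions above, can be illustrated with a toy sketch. Everything here is a stand-in: the `SUBCLASSES` map replaces a real OWL reasoner, the `owlet:expr` placeholder replaces owlet's actual embedded-expression syntax, and owlet itself is Scala, not Python:

```python
# Toy sketch of owlet-style query expansion: an embedded class
# expression is replaced by a FILTER listing the URIs a reasoner
# would return. The 'reasoner' is a hardcoded subclass map, and
# the owlet:expr placeholder syntax is hypothetical.

SUBCLASSES = {  # stand-in for an OWL reasoner's inferred subclasses
    "ex:Fin": ["ex:DorsalFin", "ex:PectoralFin"],
}

def expand(query, var, expression, path="rdfs:subClassOf+"):
    terms = list(SUBCLASSES[expression])
    if path.endswith(("?", "*")):  # '?' and '*' include the class itself
        terms.insert(0, expression)
    values = ", ".join(terms)
    return query.replace(f'?{var} owlet:expr "{expression}"',
                         f"FILTER(?{var} IN ({values}))")

q = 'SELECT ?s WHERE { ?s ex:part ?p . ?p owlet:expr "ex:Fin" }'
print(expand(q, "p", "ex:Fin", "rdfs:subClassOf*"))
```

The expanded query contains only plain SPARQL, so it can be sent to any standard triplestore endpoint with no reasoning support required.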
Challenges 

Some understanding of the OWL ontology language and the Java OWL API will be needed.

owlet is implemented in Scala, a succinct language that runs on the JVM and provides powerful functional and object-oriented programming features. Scala itself is less a challenge than an improvement over Java, but applicants are less likely to be familiar with it; this project could be an opportunity to try Scala.

Involved toolkits or projects 
Degree of difficulty and needed skills 
  • Moderate or possibly low difficulty, depending on experience
  • Some knowledge of semantic web technologies
  • Scala programming experience, or willingness to learn, particularly with prior Java experience
Involved developer communities

owlet is developed on GitHub and used by the Phenoscape project.

Mentors 

Tool to convert phylogenies to OWL ontologies

Synopsis 

Create a tool that takes as input a phylogeny in one or more commonly used data exchange formats (Newick, NEXUS, NeXML) and outputs an OWL ontology representing the phylogeny. The resulting ontology will allow standard OWL-DL reasoners to infer a full set of relationships between elements of the phylogeny, and will use an encoding that allows a reasoner to query for clades defined by most recent common ancestor, as well as their descendants and ancestors.

Rationale 

Although phylogenetics is rich in tools for reconstructing, analyzing, and visualizing phylogenies, many problems in data integration, mapping, and interoperability across phylogenies are only poorly addressed. However, domain-neutral solutions to such problems have been developed in the field of knowledge management. Specifically, OWL (the Web Ontology Language), OWL ontologies, and OWL reasoners promise to readily address some of the data integration and interoperability challenges, and are part of a thriving ecosystem of active research and supporting tools. Examples to which OWL and its tooling could be applied include the following.

  • Defining a fully computable, unambiguous expression for arbitrary clades that is independent of a specific phylogeny.
  • Applying such clade definitions to particular trees to identify elements (nodes, edges) that belong to the clade (if any), and those that do not.
  • Using such clade definitions to map associated data (characters, traits, habitats, ecology, geographic range, etc) to nodes or branches in a tree, in a way that preserves logical consistency across trees, and across changes to a tree.

The Phyloref project gives some examples and small proof-of-concept phylogeny-as-ontology examples demonstrating how this can work.

In addition to applications in inter-phylogeny interoperability, most biological data are linked directly or indirectly to biological taxa. Thus, to integrate data across organisms, efforts that use ontologies to open otherwise opaque descriptive data to computational mining have had to rely on taxonomy ontologies, created from biological taxonomies, to include relationships of evolutionary ancestry in the computation, even though taxonomies are only a crude, and often outdated, representation of current phylogenetic knowledge. With the first release of a full synthetic Tree of Life imminent from the Open Tree of Life project, an OWL ontology version of the tree could be readily used by the various endeavors integrating and analyzing ontological data annotations across organisms.

Approach 

The tool to be developed would be a command-line tool. In principle, any of several widely used programming languages may be chosen. However, languages with existing libraries for phylogenetic data I/O (Perl/BioPerl, Python/Biopython, Java/BioJava, etc.) and for OWL I/O (the OWL API, which is in Java but can be readily used from Scala, Clojure, and other languages running on the JVM) will have obvious advantages.

Supported phylogeny input formats should include at least Newick. NEXUS would be the obvious second choice. If time allows, the emerging standards NeXML and/or phyloXML would be a big plus. The scope would be limited to rooted phylogenies; this corresponds to biological taxonomies, as well as to the nature of a Tree of Life.

The output ontologies would use the CDAO ontology to represent the elements of the phylogenies and their relationships. They will probably also use the Phyloref ontology, and its approach for encoding the topology. Output ontologies should be no more expressive than OWL-DL, and preferably be within OWL-EL. Output ontologies are likely straightforward enough to be serialized in Turtle format, but this is not a requirement.

Challenges 

The challenges lie mostly in engineering this tool well. Most of the needed components (ontologies, encoding scheme, I/O and parsing libraries) exist, but not necessarily in the same programming language. The goal should be to create a tool that is reasonably cross-platform, and which is designed and documented in a straightforward enough manner such that others can readily understand and contribute to its continued development.

Involved toolkits or projects 
Degree of difficulty and needed skills 

Easy to moderately difficult, depending on the student's prior experience and skills. The scope of the project is well defined, as are the components (ontologies, topology encoding, OWL API for I/O, etc) that need to be put together. However, putting them together correctly and efficiently will require either having, or acquiring familiarity with phylogenetic trees, phylogenetic clade definitions, OWL ontologies, and OWL serialization.

Involved developer communities

This work would be part of the Phyloref project, which is developed on Github, using pull requests and issue tracker.

Mentors 
