Phyloinformatics Summer of Code 2010/Summaries

Revision as of 18:54, 7 September 2010 by Ras10

Student software developers showcase their work

For the fourth summer in a row, NESCent offered a number of internships aimed at introducing students to open-source software development. This summer, five interns from the 2010 Google Summer of Code™ program worked remotely on an evoinformatics project of their own choosing, each under the guidance of an experienced mentor.

NESCent’s 2010 Summer of Code students included Filip Balejko from the University of Warsaw, Kathryn Iverson from the University of Michigan, Lauren Lui from UC Santa Cruz, Anurag Priyam from the Indian Institute of Technology Kharagpur, and Conrad Stack from Pennsylvania State University. Their projects ranged from adding phylogenetics analysis steps to a popular data processing workflow system, to adding support for emerging interoperability standards to programming libraries, to creating a programming toolkit that can create Google Earth-compatible geophylogenies. As their profiles demonstrate, the students put their summers to very good use! To meet the students and learn more about their projects, read below.



Student: Filip Balejko (University of Warsaw)

Mentor(s): Sergei Kosakovsky Pond, Brad Chapman, Anton Nekrutenko

Project: Galaxy phylogenetics pipeline development

Galaxy is a popular web-based interface for integrating biological tools and analysis pipelines. HyPhy is a popular package for molecular evolution and statistical sequence analysis. This project brings HyPhy workflows into the Galaxy system, standardizing these analyses on a widely used platform. The goal of my project was to design an implementation pattern and integrate the SLAC and PARRIS analyses. To meet the requirements of the SLAC analysis, my implementation creates jobs for generating additional output and provides a simple and intuitive AJAX interface.


Student: Nick Matzke

Mentor(s): Stephen Smith, Brad Chapman, David Kidd

Project: Biogeographical Phylogenetics for BioPython

I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes.
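
As a rough illustration of the kind of record processing described above, the following sketch groups occurrence records by species, discards records with missing or impossible coordinates, and computes a simple bounding box for each species. It does not use Bio.Geography or the GBIF service. The record layout, the function names `is_valid` and `bounding_boxes`, and the coordinates are all hypothetical placeholders chosen for the example.

```python
from collections import defaultdict

# Hypothetical occurrence records as (species, latitude, longitude).
# These are made-up placeholder values, not real GBIF data.
RECORDS = [
    ("Species a", 10.0, 20.0),
    ("Species a", 12.5, 18.0),
    ("Species b", -5.0, 100.0),
    ("Species a", None, None),    # record with no coordinates
    ("Species b", -7.5, 102.0),
    ("Species b", 95.0, 101.0),   # impossible latitude, treated as an error
]


def is_valid(lat, lon):
    """Return True if both coordinates are present and within legal ranges."""
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def bounding_boxes(records):
    """Map each species to (min_lat, min_lon, max_lat, max_lon).

    Records failing the coordinate check are skipped. The box is naive:
    it does not handle ranges that cross the 180th meridian.
    """
    grouped = defaultdict(list)
    for species, lat, lon in records:
        if is_valid(lat, lon):
            grouped[species].append((lat, lon))

    boxes = {}
    for species, points in grouped.items():
        lats = [p[0] for p in points]
        lons = [p[1] for p in points]
        boxes[species] = (min(lats), min(lons), max(lats), max(lons))
    return boxes


boxes = bounding_boxes(RECORDS)
for species in sorted(boxes):
    print(species, boxes[species])
```

A real analysis would likely replace the bounding box with named biogeographic areas or polygons, but the filter-then-summarize structure would stay much the same.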



Student: Eric Talevich

Mentor(s): Brad Chapman, Christian Zmasek

Project: Biopython support for parsing and writing phyloXML

The phyloXML data exchange format provides a consistent way to store and share information about richly annotated phylogenetic trees, including geographic, taxonomic and sequence-level data. However, researchers can only benefit from this if existing libraries and toolkits support this format. To support phyloXML in the bioinformatics programming toolkit Biopython, I created a pair of new modules, Tree and TreeIO, that offer a common interface for reading and writing phylogenetic trees in several file formats. These modules generalized the underlying tree model to provide a foundation for enhanced phylogenetics support in Biopython in the future.
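
To give a feel for what reading a phyloXML tree involves, here is a minimal sketch that uses only the Python standard library rather than the Biopython modules described above. It parses a small hand-written phyloXML document and converts it to a Newick string, which is one way of moving a tree between file formats. The sample document and the helper names `to_newick` and `tip_names` are assumptions made for this example, and the sketch ignores the richer annotations (taxonomy, sequences, geography) that phyloXML can carry.

```python
import xml.etree.ElementTree as ET

# XML namespace used by phyloXML documents.
NS = {"phy": "http://www.phyloxml.org"}

# A tiny hand-written phyloXML document, for illustration only.
PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example tree</name>
    <clade>
      <clade>
        <branch_length>0.06</branch_length>
        <clade><name>A</name><branch_length>0.102</branch_length></clade>
        <clade><name>B</name><branch_length>0.23</branch_length></clade>
      </clade>
      <clade><name>C</name><branch_length>0.4</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def to_newick(clade):
    """Recursively convert a phyloXML <clade> element to Newick text."""
    children = clade.findall("phy:clade", NS)
    name = clade.findtext("phy:name", default="", namespaces=NS)
    length = clade.findtext("phy:branch_length", default=None, namespaces=NS)

    if children:
        text = "(" + ",".join(to_newick(c) for c in children) + ")" + name
    else:
        text = name
    if length is not None:
        text += ":" + length
    return text


def tip_names(clade):
    """Return the names of the terminal clades, in document order."""
    children = clade.findall("phy:clade", NS)
    if not children:
        return [clade.findtext("phy:name", default="", namespaces=NS)]
    names = []
    for child in children:
        names.extend(tip_names(child))
    return names


root = ET.fromstring(PHYLOXML)
phylogeny = root.find("phy:phylogeny", NS)
top_clade = phylogeny.find("phy:clade", NS)

newick = to_newick(top_clade) + ";"
tips = tip_names(top_clade)
print(newick)
print(tips)
```

A common tree interface, as described in the summary, would hide this per-format parsing behind a shared read and write API so that calling code does not need to know which file format it is handling.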



Student: Diana Jaunzeikare

Mentor(s): Christian Zmasek, Pjotr Prins, Naohisa Goto

Project: Implementing phyloXML support in BioRuby

The phyloXML data exchange format facilitates analysis, exchange, storage and reuse of phylogenetic trees and associated data. My goal was to implement reading and writing capabilities for phyloXML within the bioinformatics programming toolkit BioRuby. My code takes advantage of specialized XML libraries to boost the speed at which extremely large phylogenetic datasets can be processed.



Student: Chase Miller

Mentor(s): Mark Jensen, Rutger Vos

Project: BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

NeXML is an XML-based data exchange standard for phylogenetic data that represents the next generation of the popular NEXUS standard. My project consisted of integrating the NeXML standard into the BioPerl programming toolkit. To accomplish this, I developed several modules that provided BioPerl native access to the reference NeXML parser in Perl (Bio::Phylo), while still allowing Bio::Phylo and NeXML to co-evolve independently of the much larger BioPerl project.



Student: Adam Smith

Mentor(s): Pjotr Prins, Chris Fields

Project: Mapping the Bio++ Phylogenetics toolkit to R/BioConductor and BioJAVA using BioLib

Bio++ is a programming toolkit written in C++ for sequence analysis, phylogenetics, molecular evolution and population genetics. My goal was to make the Bio++ functionality accessible from high-level scripting languages such as Perl, Python, and Ruby. I accomplished this with a tool called SWIG (Simplified Wrapper and Interface Generator), which generalized the task of building programming interfaces to the different scripting languages. The end result is that biologists can spend more time concentrating on solving their biological analysis questions with the most powerful tools available, and less time worrying about interfacing with a toolkit written in an unfamiliar low-level programming language.



Student: Xin (David) Shuai

Mentor(s): Chris Fields, Mark A. Jensen and Pjotr Prins

Project: A BioLib mapping for the libsequence population genetic libraries

BioLib brings together a set of open-source libraries written in C/C++, and makes them available to the higher-level Bio* scripting language toolkits, such as BioPerl and BioPython. Libsequence is a C++ library for population genetic simulation and the evolutionary analysis of molecular data. The goal of my project was to add libsequence to BioLib and build mappings to Perl and Python using a standard framework (SWIG).



Student: Kasia Hayden

Mentor(s): Peter Midford, Jim Balhoff

Project: Build a Mesquite package to view Phenex-generated NeXML files

Phenex is a stand-alone tool that assists biological experts in transforming phenotypic characters and character states from phylogenetic data matrices into formal phenotype assertions typed by ontology terms. The goal of my project was to take the annotated character matrices produced by Phenex (exported in NeXML format) and make them viewable in Mesquite, one of the most widely used desktop tools for exploring, visualizing, and analyzing phylogenetic data. The annotations can be viewed both as text, in a footnote box, and in a graphical display. For the latter, I used Graphviz, a general-purpose package for drawing graphs in Java.



Student: Dazhi Jiao

Mentor(s): Ryan Scherle, Lucie Chan

Project: PhyloSoC: Enhance the searching functionality of Phylr

Phylr is an implementation of the search functionality described in the emerging PhyloWS “web service” standard for accessing online phylogenetic data resources. My project consisted of enhancing the Phylr service for searching content in a collection of XML files, using a fast text indexing system called Lucene, as well as creating new services for searching relational databases. Specifically, I focused on searching a BioSQL relational database with the PhyloDB extension, because this database is highly flexible in terms of supported metadata and is tightly integrated with the Bio* bioinformatics scripting toolkits. As an initial prototype, I demonstrated how Phylr could search TreeBASE data that has been stored in a BioSQL database.
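
The core idea behind a text indexing system can be sketched with a toy inverted index: each term maps to the set of documents that contain it, so a query only has to intersect a few small sets instead of scanning every file. The sketch below is not Lucene and not Phylr's code. The class `InvertedIndex`, the `tokenize` helper, and the three document descriptions are hypothetical and exist only to illustrate the technique.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy inverted index: term -> set of document identifiers."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index every term of the text under the given document id."""
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents containing all query terms."""
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)


# Hypothetical document descriptions, made up for this example.
index = InvertedIndex()
index.add("tree1", "Phylogeny of Drosophila based on mitochondrial genes")
index.add("tree2", "Mitochondrial phylogeny of cichlid fishes")
index.add("tree3", "Drosophila wing morphology matrix")

print(index.search("mitochondrial phylogeny"))
print(index.search("drosophila matrix"))
```

A production indexer adds much more on top of this, such as relevance ranking, stemming, and fielded queries, but the term-to-documents lookup is the part that makes searching large collections fast.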