Phyloinformatics Summer of Code 2009/Summaries

Revision as of 11:15, 16 September 2009 by Ras10

Student software developers showcase their work

For the third summer in a row, NESCent offered a number of student internships aimed at expanding participation in collaborative open-source software development projects. Interns from around the world get the opportunity to work remotely on an evoinformatics project of their own choosing, each under the guidance of an experienced mentor. This summer, NESCent received funding for 13 students: nine from the Google Summer of Code™ program, and four from the NSF Virtual Data Center project (DBI-0753138). As their profiles demonstrate, the students put their summers to very good use! Meet the students and learn more about their projects. __________________________

Student: Daniel Ayres Mentor(s): Aaron Darling, Marc Suchard Project: GPU acceleration for phylogenetic inference using OpenCL

My project was to expand upon BEAGLE, an open-source computing library that accelerates phylogenetic tree inference by using the powerful Graphics Processing Units (GPUs) that are found within modern high-end desktop and laptop computers. My main goal was for BEAGLE to work across a wide range of GPUs by using OpenCL, a vendor-neutral standard. I also expanded the library so that it could be used with a broader set of evolutionary tools. __________________________

Student: Nick Matzke Mentor(s): Stephen Smith, Brad Chapman, David Kidd Project: Biogeographical Phylogenetics for BioPython

I developed Bio.Geography, a new module for the bioinformatics programming toolkit Biopython. Bio.Geography expands upon Biopython's traditional capabilities for accessing gene and protein sequences from online databases by allowing automated searching, downloading, and parsing of geographic location records from GBIF, the authoritative aggregator of specimen information from natural history collections worldwide. This will enable analyses of evolutionary biogeography that require the areas inhabited by the species at the tips of the phylogeny, particularly for large-scale analyses where it is necessary to process thousands of specimen occurrence records. The module will also facilitate applications such as species mapping, niche modeling, error-checking of museum records, and monitoring range changes. __________________________

Student: Eric Talevich Mentor(s): Brad Chapman, Christian Zmasek Project: Biopython support for parsing and writing phyloXML

The phyloXML data exchange format provides a consistent way to store and share information about richly annotated phylogenetic trees, including geographic, taxonomic and sequence-level data. However, researchers can only benefit from this if existing libraries and toolkits support this format. To support phyloXML in the bioinformatics programming toolkit Biopython, I created a pair of new modules, Tree and TreeIO, that offer a common interface for reading and writing phylogenetic trees in several file formats. These modules generalized the underlying tree model to provide a foundation for enhanced phylogenetics support in Biopython in the future. __________________________
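To make the phyloXML structure described in the Biopython project above more concrete, here is a minimal sketch that reads a small phyloXML document with Python's standard library and converts it to a Newick string. It does not use the Tree and TreeIO modules from the project, whose interface is not shown in this summary. The sample tree, the function names, and the choice of Newick as output are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by phyloXML documents.
NS = {"phy": "http://www.phyloxml.org"}

# A tiny hand-written phyloXML document (illustrative data only).
EXAMPLE = """
<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example tree</name>
    <clade>
      <clade>
        <name>A</name>
        <branch_length>0.1</branch_length>
      </clade>
      <clade>
        <branch_length>0.2</branch_length>
        <clade><name>B</name><branch_length>0.3</branch_length></clade>
        <clade><name>C</name><branch_length>0.4</branch_length></clade>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def to_newick(clade):
    """Recursively convert a phyloXML <clade> element to Newick text."""
    children = clade.findall("phy:clade", NS)
    name = clade.findtext("phy:name", default="", namespaces=NS).strip()
    length = clade.findtext("phy:branch_length", default=None, namespaces=NS)
    label = name
    if children:
        # Internal node: nested child clades go inside parentheses.
        label = "(" + ",".join(to_newick(c) for c in children) + ")" + name
    if length is not None:
        label += ":" + length.strip()
    return label


def phyloxml_to_newick(text):
    """Convert the first phylogeny in a phyloXML string to Newick."""
    root = ET.fromstring(text.strip())
    top_clade = root.find("phy:phylogeny/phy:clade", NS)
    if top_clade is None:
        raise ValueError("no <phylogeny>/<clade> element found")
    return to_newick(top_clade) + ";"


if __name__ == "__main__":
    print(phyloxml_to_newick(EXAMPLE))
```

The sketch only handles clade names and branch lengths. The richer annotations mentioned above (geographic, taxonomic, and sequence-level data) are carried in further child elements of each clade, and this example ignores them.
__________________________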

Student: Diana Jaunzeikare Mentor(s): Christian Zmasek, Pjotr Prins, Naohisa Goto Project: Implementing phyloXML support in BioRuby

The phyloXML data exchange format facilitates analysis, exchange, storage and reuse of phylogenetic trees and associated data. My goal was to implement reading and writing capabilities for phyloXML within the bioinformatics programming toolkit BioRuby. My code takes advantage of specialized XML libraries to boost the speed at which extremely large phylogenetic datasets can be processed. __________________________

Student: Chase Miller Mentor(s): Mark Jensen, Rutger Vos Project: BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

NeXML is an XML-based data exchange standard for phylogenetic data that represents the next generation of the popular NEXUS standard. My project consisted of integrating the NeXML standard into the BioPerl programming toolkit. To accomplish this, I developed several modules that provided BioPerl native access to the reference NeXML parser in Perl (Bio::Phylo), while still allowing Bio::Phylo and NeXML to co-evolve independently of the much larger BioPerl project. __________________________

Student: Adam Smith Mentor(s): Pjotr Prins, Chris Fields Project: Mapping the Bio++ Phylogenetics toolkit to R/BioConductor and BioJAVA using BioLib

Bio++ is a programming toolkit written in C++ for sequence analysis, phylogenetics, molecular evolution and population genetics. My goal was to make the Bio++ functionality accessible from high-level scripting languages such as Perl, Python, and Ruby. I accomplished this with a tool called SWIG (Simplified Wrapper and Interface Generator), which generalized the task of programming interfaces to the different scripting languages. The end result is that biologists can spend more time concentrating on solving their biological analysis questions with the most powerful tools available, and less time worrying about interfacing with a toolkit written in an unfamiliar low-level programming language. __________________________

Student: Xin (David) Shuai Mentor(s): Chris Fields, Mark A. Jensen and Pjotr Prins Project: A BioLib mapping for the libsequence population genetic libraries

BioLib brings together a set of open-source libraries written in C/C++, and makes them available to the higher-level Bio* scripting language toolkits, such as BioPerl and Biopython. Libsequence is a C++ library for population genetic simulation and the evolutionary analysis of molecular data. The goal of my project was to add libsequence into BioLib and build mappings to Perl and Python using a standard framework (SWIG).


Student: Kasia Hayden Mentor(s): Peter Midford, Jim Balhoff Project: Build a Mesquite package to view Phenex-generated Nexml files

Phenex is a stand-alone tool that assists biological experts in transforming phenotypic characters and character states from phylogenetic data matrices into formal phenotype assertions typed by ontology terms. The goal of my project was to take the annotated character matrices produced by Phenex (exported in NeXML format) and make them viewable in Mesquite, one of the most widely used desktop tools for exploring, visualizing, and analyzing phylogenetic data. The annotations can be viewed both as text, in a footnote box, and in a graphical display. For the latter, I used Graphviz, a general purpose package for drawing graphs in Java. __________________________

Student: Dazhi Jiao Mentor(s): Ryan Scherle, Lucie Chan Project: PhyloSoC:Enhance the searching functionality of Phylr

Phylr is an implementation of the search functionality described in the emerging PhyloWS “web service” standard for accessing online phylogenetic data resources. My project consisted of enhancing the Phylr service for searching content in a collection of XML files, using a fast text indexing system called Lucene, as well as creating new services for searching relational databases. Specifically, I focused on searching a BioSQL relational database with the PhyloDB extension, because this database is highly flexible in terms of supported metadata and is tightly integrated with the Bio* bioinformatics scripting toolkits. As an initial prototype, I demonstrated how Phylr could search TreeBASE data that has been stored in a BioSQL database.
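The text-indexing approach described for Phylr above can be illustrated with a small inverted index. This is a sketch of the general technique only, written in plain Python rather than with Lucene, and it is not Phylr's code. The record identifiers, the sample descriptions, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of records that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical study descriptions standing in for indexed XML records.
DOCS = {
    "S1": "Phylogeny of cichlid fishes from mitochondrial DNA",
    "S2": "Mitochondrial phylogeny of passerine birds",
    "S3": "Morphological characters of cichlid fishes",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(search(INDEX, "cichlid phylogeny"))
    print(search(INDEX, "mitochondrial"))
```

Building the index once and answering each query by set intersection is what makes this kind of search fast. A full indexing system such as Lucene adds relevance scoring, stemming, and on-disk storage on top of the same basic idea.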


The next four students were funded through the NSF Virtual Data Center (VDC) project.

Student: Christine Dumoulin Mentor(s): Bruce Wilson, Dave Vieglais, Giri Palanisamy Project: Generating accurate ranking algorithms via machine learning

Metadata (data that describe other data) accompany every scientific data set, and their richness, quality, and interoperability determine how well data sets can be found, understood, and re-used. A number of standards are used to organize metadata, and their structural differences create difficulties for data sharing and other collaborative efforts. Metadata crosswalks remove the barrier to interoperability by providing a map between one standard and another. Crosswalks are difficult to generate because they often consist of too many terms to map by hand, and are written in natural language that cannot be effectively processed by software alone. I have written a program with which a user can construct and evaluate processing algorithms without the need to write new code. This tool allows users to quickly compare different algorithms that are suited to building crosswalks between the metadata standards that are relevant to their work. __________________________

Student: Namrata Lele Mentor(s): Dave Vieglais, Bruce Wilson Project: Vocabulary Term Mapping

My project consisted of implementing a tool that helps with mapping semantically similar terms between metadata vocabularies, which is important when integrating or linking data sets annotated by different and typically incompatible metadata standards. The tool I created takes two sets of terms and their descriptions as input, and returns a ranked list of matching candidate terms from the two sets. For implementation, I took advantage of the Lucene indexing system, and the General Text Matcher libraries, and I wrote parsers for different metadata standards, including Dublin Core, Darwin Core, and the Ecological Metadata Language. __________________________
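The input and output of the vocabulary term mapping tool described above (two sets of terms with descriptions in, a ranked list of candidate matches out) can be sketched as follows. The scoring here is plain Jaccard overlap of description words, a deliberately simple stand-in chosen for illustration. It is not the Lucene or General Text Matcher scoring used in the project. The function names are hypothetical, and the term descriptions are made up for the example rather than taken from any standard.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_matches(source_terms, target_terms, top_n=3):
    """For each source term, rank target terms by description similarity."""
    ranked = {}
    for s_name, s_desc in source_terms.items():
        s_tok = tokens(s_desc)
        scored = [
            (jaccard(s_tok, tokens(t_desc)), t_name)
            for t_name, t_desc in target_terms.items()
        ]
        # Highest score first; ties are broken alphabetically by term name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        ranked[s_name] = [(name, round(score, 2)) for score, name in scored[:top_n]]
    return ranked


# Illustrative vocabularies; the descriptions are not official definitions.
VOCAB_A = {
    "creator": "person or organization responsible for making the resource",
    "date": "date the resource was created",
}
VOCAB_B = {
    "recordedBy": "person or organization responsible for recording the occurrence",
    "eventDate": "date the event occurred",
    "locality": "description of the place",
}

if __name__ == "__main__":
    for term, candidates in rank_matches(VOCAB_A, VOCAB_B).items():
        print(term, "->", candidates)
```

A ranked list, rather than a single best match, leaves the final mapping decision to a person who knows both vocabularies.
__________________________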

Student: Serhan Akin Mentor(s): Matt Jones, Mark Servilla Project: Refactoring the EarthGrid SOAP API to REST style and an implementation for Metacat

EarthGrid is a lightweight Application Programming Interface (API) for web-based communication with diverse environmental data systems. Until now, it was only supported through the SOAP protocol, which involves a significant overhead for client programmers and suffers from uneven support across programming languages. This project involved refactoring the current SOAP-based EarthGrid API to a REST-based architecture. REST exploits the existing technology and protocols of the Web, while decoupling the implementation of a system from its interface. I implemented a prototype within the Metacat data management system to demonstrate the effectiveness of this architecture. The results of this project will serve as a basis for the first Virtual Data Center REST API, which is currently being designed. __________________________
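The difference between the SOAP and REST styles discussed in the EarthGrid project above can be sketched by building the same "fetch one document" request both ways. Everything specific here is an assumption for illustration: the base URL, the paths, the `get` operation, and the `identifier` element are hypothetical and are not the actual EarthGrid or Metacat interface. The requests are only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request
from xml.sax.saxutils import escape

# Hypothetical service root; not a real EarthGrid or Metacat endpoint.
BASE_URL = "https://example.org/earthgrid"


def soap_get_request(doc_id):
    """SOAP style: POST an XML envelope that names the operation."""
    envelope = (
        '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
        "<soap:Body>"
        # Hypothetical operation and parameter names.
        f"<get><identifier>{escape(doc_id)}</identifier></get>"
        "</soap:Body>"
        "</soap:Envelope>"
    )
    return Request(
        BASE_URL + "/services",
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def rest_get_request(doc_id):
    """REST style: the document has its own URL and HTTP GET reads it."""
    return Request(BASE_URL + "/object/" + quote(doc_id, safe=""), method="GET")


if __name__ == "__main__":
    for req in (soap_get_request("doc.1.1"), rest_get_request("doc.1.1")):
        print(req.get_method(), req.full_url)
```

In the REST form the client needs nothing beyond an HTTP library, which is the reduction in client-side overhead that the paragraph above describes.
__________________________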

Student: John Harney Mentor(s): Hilmar Lapp, Rutger Vos Project: Semantic phyloinformatic web services using the EvoInfo stack

Scientific web services produce and consume data in many different, loosely defined formats and messaging protocols. These heterogeneities make it especially difficult for services to be discovered and used in applications. This project aimed to address this problem by mapping components of a web service interface, such as input and output data elements and operations, to concepts in a well-defined and structured semantic model (i.e., an ontology). This gives prospective consumers of a web service access to the service’s functionality even if they are unfamiliar with the specific syntax of its service interfaces. To demonstrate the advantages of attaching semantic meaning to scientific web services, I implemented a rudimentary proof-of-concept service by utilizing the latest versions of the Web Services Description Language (WSDL 2.0) and Semantic Annotations for WSDL (SAWSDL) standard recommendations, using the data exchange (NeXML), data semantics (CDAO), and data access (phyloWS) standards for phylogenetics recently developed by the NESCent EvoInfo Working Group.