Cyberinfrastructure Summer Traineeships 2009/Summaries

From Phyloinformatics

Student software developers showcase their work

Congratulations to the following student software developers on the successful completion of their summer projects. Funded by the NSF Virtual Data Center project (DBI-0753138), graduate students and postdoctoral researchers were given the opportunity to work remotely on an informatics project of their choosing, under the guidance of an experienced mentor. Collectively, their projects will help to build a Virtual Data Center that includes major data repositories in biodiversity, earth and environmental science, ecology, and evolutionary biology. As their profiles demonstrate, the students put their summers to very good use! To meet the students and learn more about their projects, read on.



Student: Christine Dumoulin

Mentor(s): Bruce Wilson, Dave Vieglais, Giri Palanisamy

Project: Generating accurate ranking algorithms via machine learning

Metadata (data that describe other data) accompany every scientific data set, and their richness, quality, and interoperability determine how well data sets can be found, understood, and re-used. A number of standards are used to organize metadata, and their structural differences create difficulties for data sharing and other collaborative efforts. Metadata crosswalks remove this barrier to interoperability by providing a map between one standard and another. Crosswalks are difficult to generate because the standards often contain too many terms to map by hand, and the terms are described in natural language that software alone cannot process effectively. I have written a program with which a user can construct and evaluate processing algorithms without the need to write new code. This tool allows users to quickly compare different algorithms and identify those suited to building crosswalks between the metadata standards that are relevant to their work.
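The idea of assembling a processing algorithm from existing parts, rather than writing new code, can be illustrated with a minimal Python sketch. This is not the program described above: the step names, the `STEPS` registry, and `build_pipeline` are all hypothetical, chosen only to show how a list of step names could be turned into a runnable text-processing algorithm.

```python
import re

# Hypothetical registry of reusable text-processing steps (illustrative names).
STEPS = {
    "lowercase": lambda text: text.lower(),
    "strip_punctuation": lambda text: re.sub(r"[^\w\s]", " ", text),
    "collapse_whitespace": lambda text: " ".join(text.split()),
}


def build_pipeline(step_names):
    """Return a function that applies the named steps in the given order."""
    steps = [STEPS[name] for name in step_names]  # KeyError on an unknown step

    def run(text):
        for step in steps:
            text = step(text)
        return text

    return run


# Two candidate algorithms, each defined purely by a list of step names.
minimal = build_pipeline(["lowercase"])
thorough = build_pipeline(["lowercase", "strip_punctuation", "collapse_whitespace"])

sample = "Date,  Created!"
for label, pipeline in (("minimal", minimal), ("thorough", thorough)):
    print(label, "->", repr(pipeline(sample)))
```

Because each algorithm is just a list of names, a user could try a different combination by editing a configuration rather than the code, which is the kind of comparison the paragraph above describes.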



Student: Namrata Lele

Mentor(s): Dave Vieglais, Bruce Wilson

Project: Vocabulary Term Mapping

My project consisted of implementing a tool that helps map semantically similar terms between metadata vocabularies, which is important when integrating or linking data sets annotated using different and typically incompatible metadata standards. The tool I created takes two sets of terms and their descriptions as input, and returns a ranked list of candidate matches between the two sets. For the implementation, I took advantage of the Lucene indexing system and the General Text Matcher libraries, and I wrote parsers for different metadata standards, including Dublin Core, Darwin Core, and the Ecological Metadata Language.
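The input and output described above (two sets of terms with descriptions in, a ranked list of candidate pairs out) can be sketched in a few lines of Python. This sketch swaps in plain Jaccard token overlap as the similarity measure; it does not use Lucene or the General Text Matcher, and the function names and the sample descriptions are illustrative assumptions, not official definitions from either standard.

```python
import re


def tokens(text):
    """Split a description into a set of lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(desc_a, desc_b):
    """Jaccard overlap between the token sets of two descriptions."""
    a, b = tokens(desc_a), tokens(desc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_terms, target_terms):
    """Score every source/target pair and return them best-first."""
    pairs = [
        (similarity(s_desc, t_desc), s_name, t_name)
        for s_name, s_desc in source_terms.items()
        for t_name, t_desc in target_terms.items()
    ]
    # Sort by descending score; break ties alphabetically for stable output.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return pairs


# Illustrative descriptions only (paraphrased, not the official wording).
vocab_a = {
    "creator": "person or organization responsible for making the resource",
    "date": "date of an event in the life of the resource",
}
vocab_b = {
    "recordedBy": "person or organization responsible for recording the occurrence",
    "eventDate": "date or interval during which an event occurred",
}

for score, a_term, b_term in rank_candidates(vocab_a, vocab_b):
    print(f"{score:.2f}  {a_term} <-> {b_term}")
```

A real matcher would need stemming, stop-word handling, and a stronger scoring model, which is where an indexing library becomes useful.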



Student: Serhan Akin

Mentor(s): Matt Jones, Mark Servilla

Project: Refactoring the EarthGrid SOAP API to REST style and an implementation for Metacat

EarthGrid is a lightweight Application Programming Interface (API) for web-based communication with diverse environmental data systems. Until now, it was only supported through the SOAP protocol, which imposes significant overhead on client programmers and suffers from uneven support across programming languages. This project involved refactoring the current SOAP-based EarthGrid API into a REST-based architecture. REST exploits the existing technology and protocols of the Web, while decoupling the implementation of a system from its interface. I implemented a prototype within the Metacat data management system to demonstrate the effectiveness of this architecture. The results of this project will serve as a basis for the first Virtual Data Center REST API, which is currently being designed.
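To illustrate why a REST style lowers the barrier for client programmers, the following Python sketch builds requests using only the standard library: a resource is addressed by a URL and acted on with a plain HTTP verb, with no SOAP envelope or generated stubs. The base URL, the `/object` path, the `query` parameter, and the sample identifier are hypothetical placeholders, not the actual EarthGrid or Metacat interface, and the requests are constructed but never sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical service root; not a real EarthGrid or Metacat endpoint.
BASE_URL = "https://example.org/earthgrid"


def get_object_request(doc_id):
    """Build (but do not send) a GET request for a single object by identifier."""
    url = f"{BASE_URL}/object/{quote(doc_id, safe='')}"
    return Request(url, method="GET", headers={"Accept": "application/xml"})


def query_request(keyword):
    """Build (but do not send) a GET request for a keyword search."""
    url = f"{BASE_URL}/object?{urlencode({'query': keyword})}"
    return Request(url, method="GET")


req = get_object_request("doc.1.1")  # made-up identifier
print(req.get_method(), req.full_url)
```

Any language with an HTTP client can issue the same requests, which addresses the uneven language support noted for SOAP.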



Student: John Harney

Mentor(s): Hilmar Lapp, Rutger Vos

Project: Semantic phyloinformatic web services using the EvoInfo stack

Scientific web services produce and consume data in many different, loosely defined formats and messaging protocols. This heterogeneity makes it especially difficult for services to be discovered and used in applications. This project aimed to address this problem by mapping components of a web service interface, such as input and output data elements and operations, to concepts in a well-defined and structured semantic model (i.e., an ontology). This gives prospective consumers of a web service access to the service’s functionality even if they are unfamiliar with the specific syntax of its interfaces. To demonstrate the advantages of attaching semantic meaning to scientific web services, I implemented a rudimentary proof-of-concept service using the latest versions of the Web Services Description Language (WSDL 2.0) and Semantic Annotations for WSDL (SAWSDL) recommendations, together with the data exchange (NeXML), data semantics (CDAO), and data access (phyloWS) standards for phylogenetics recently developed by the NESCent EvoInfo Working Group.
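The benefit of annotating interface components with ontology concepts can be shown with a small Python sketch. It does not parse WSDL or SAWSDL; it only models the outcome of such annotation, where each operation carries concept identifiers for its input and output. The service names, operation names, and `concept:` identifiers are invented for illustration and are not actual CDAO terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One service operation, annotated with concepts for its input and output."""
    service: str
    name: str
    input_concept: str
    output_concept: str


# Hypothetical registry: operation names differ, but the annotations
# use a shared vocabulary of concepts.
REGISTRY = [
    Operation("treeStore", "fetchTreeById", "concept:Identifier", "concept:Tree"),
    Operation("alignDb", "get_aln", "concept:Identifier", "concept:Alignment"),
    Operation("phyloFind", "lookup", "concept:TaxonName", "concept:Tree"),
]


def find_operations(registry, produces, consumes=None):
    """Find operations by what they produce (and optionally what they consume)."""
    return [
        op
        for op in registry
        if op.output_concept == produces
        and (consumes is None or op.input_concept == consumes)
    ]


for op in find_operations(REGISTRY, produces="concept:Tree"):
    print(f"{op.service}.{op.name}")
```

A client searching by concept finds both tree-producing operations without knowing that one is called `fetchTreeById` and the other `lookup`, which is the discovery advantage described above.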