PhyloSoC: Automated submission of rich data to TreeBASE

From Phyloinformatics
Revision as of 16:13, 25 May 2011 by Loloyohe (talk) (Timeline for Project Plan)

Project Author and Mentors

Check out the GSoC 2011 TreeBASE project blog!

Also, check out TreeBASE!

Abstract

TreeBASE acts as an archive for phylogenetic analyses. Data are currently submitted to TreeBASE as NEXUS files. However, this format results in a clunky user interface and does not allow metadata or additional annotations to be submitted automatically. This project will enable TreeBASE to accept NeXML files, so that metadata can be submitted easily and new annotations of the metadata can be displayed in a user-friendly manner.

Background and Purpose

At present, TreeBASE serves as an archive for phylogenetic data. The user is able to search the database for studies by author, study ID, and other keywords found throughout the work, and can then browse the metadata of a study. The user can search for taxa of interest using several identifiers, and the results link to NCBI’s taxonomy browser or the Universal Biological Indexer and Organizer. One can also view the matrices used in the analyses; this view provides a link to the original NEXUS file and a list of the sequences, but only the first 30 characters are visible. The trees displayed for a particular study can be further refined by topology type, and all trees can be viewed using PhyloWidget. Although TreeBASE is useful as it stands, numerous additional features could be included that would maximize its usefulness and minimize the number of clicks the user has to make to navigate the site.

Examples of annotations that would be useful to the TreeBASE user community include linking sequence data to GenBank accession numbers, so that users could go directly to NCBI to retrieve the sequences for future analyses, and geocoding the locality coordinates of the organisms included in the study. These are only a few of the many annotations that could be incorporated into TreeBASE to improve the utility of this resource. To expand the annotations of the data in TreeBASE, the submission process must be refined to be more user-friendly. My project for Google Summer of Code 2011 will allow for the submission of phylogenetic data to TreeBASE so that both the data and metadata are exported in a way that simplifies the user’s interaction with the website.

Project Goals

By the end of the term, we would like to accomplish the following:

  1. Build core code base locally.

Timeline for Project Plan

4/25-5/22: Community Bonding Period
Goal 1: Build core code base locally. Completed
Comments: See blog for more details on building the code base. Familiarized myself with areas of code that I would be working on. This includes the NexmlObjectConverter class, along with its subclasses and superclasses. Took a look at the NexmlSerializationTest JUnit class, as well. Read about topics such as object-relational mapping with Hibernate and how Spring beans are configured to have other objects assigned to them automatically.
5/23-5/29: Week One
Focus on the org.nexml.model.* sections of the code base. Work out a NeXML solution both for expressing CHARSET free text and for expressing row-segment metadata such as the GenBank accession number; this currently works only for NEXUS. Develop a unit test to get it working. Think about ways to add charsets and figure out how to implement them. Get back to Rutger about whether or not any changes to the NeXML interface are necessary.
Comments: Studied the code and started to figure out how the pieces fit together. In order to express row-segment metadata, I learned that I will be adding code to the NexmlMatrixConverter class.
5/30-6/05: Week Two
If the unit tests are not finished, complete them. Once they pass, commit the code to TreeBASE. If there are issues representing the metadata, consult outside stakeholders.
6/06-6/12: Week Three
Figure out how to traverse annotations and parse the NeXML file. Write JUnit tests to traverse annotations.
6/13-6/19: Week Four
Prepare for conference and find out if there are people I want to contact while there. Continue working on JUnit tests.
June 17-22: Attending Evolution & iEvoBio 2011 in Norman, OK
6/20-6/26: Week Five
Set up relational database portion of the code base. Start figuring out how it works and think about ways to map annotations onto TreeBASE database.
6/27-7/03: Week Six
Unit testing of functionality of mapping the annotations.
7/04-7/10: Week Seven
Write code to map the annotations and commit it if working.
7/11-7/17: Week Eight
Mid-term Evaluations.
7/18-7/24: Week Nine
7/25-7/31: Week Ten
8/01-8/07: Week Eleven
8/08-8/14: Week Twelve
8/15: Pencils Down
Submit Final Evaluations.
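
As a rough illustration of the Week Three goal (parsing a NeXML file and traversing its annotations), the following is a minimal sketch that uses only the standard Java XML parser rather than the org.nexml.model classes the project will actually work with. It assumes annotations appear as nested meta elements carrying either property/content or rel/href attributes. The class name NexmlMetaWalker, the predicate ex:genbankAccession, and the accession value XX000000 are made up for the example and are not taken from TreeBASE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the TreeBASE or NeXML code bases.
public class NexmlMetaWalker {

    /** Parses NeXML text and returns one line per meta annotation found. */
    public static List<String> collectMeta(String nexml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // so getLocalName() ignores prefixes
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(nexml)));
        List<String> found = new ArrayList<>();
        walk(doc.getDocumentElement(), found);
        return found;
    }

    /** Depth-first walk that records every meta child with its parent element. */
    private static void walk(Element parent, List<String> found) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace and comments
            }
            Element e = (Element) child;
            if ("meta".equals(e.getLocalName())) {
                // Assumption: literal annotations use property/content,
                // resource annotations use rel/href.
                String subject = parent.getLocalName() + "#" + parent.getAttribute("id");
                String predicate = e.hasAttribute("property")
                        ? e.getAttribute("property") : e.getAttribute("rel");
                String value = e.hasAttribute("content")
                        ? e.getAttribute("content") : e.getAttribute("href");
                found.add(subject + " " + predicate + " = " + value);
            }
            walk(e, found); // meta elements may themselves be nested
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written sample; the predicate and value are placeholders.
        String sample =
              "<nex:nexml xmlns:nex='http://www.nexml.org/2009'"
            + " xmlns='http://www.nexml.org/2009'"
            + " xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='0.9'>"
            + "<otus id='otus1'><otu id='otu1' label='Homo sapiens'>"
            + "<meta id='m1' xsi:type='nex:LiteralMeta'"
            + " property='ex:genbankAccession' content='XX000000'/>"
            + "</otu></otus></nex:nexml>";
        for (String line : collectMeta(sample)) {
            System.out.println(line);
        }
    }
}
```

Pairing each annotation with its parent element keeps track of which object (an OTU, a matrix row, a tree) the annotation describes, which is the information a later mapping onto the TreeBASE database would need.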