PhyloSoC: Automated submission of rich data to TreeBASE

Project Author and Mentors

Check out the GSoC 2011 TreeBASE project blog!

Also, check out TreeBASE!

Abstract

TreeBASE acts as an archive for phylogenetic analyses. Data is currently submitted to TreeBASE as NEXUS files. However, this format results in a clunky user interface and does not allow metadata or additional annotations to be submitted automatically. This project will enable TreeBASE to accept NeXML files so that metadata can be submitted easily and so that new annotations of the metadata can be displayed in a user-friendly manner.

Background and Purpose

At present, TreeBASE serves as an archive for phylogenetic data. Users can search the database for studies by author, study ID, and other keywords found throughout the work, and then browse the metadata of a study. Users can also search for taxa of interest using several identifiers, and the results can link to NCBI’s taxonomy browser or the Universal Biological Indexer and Organizer. One can also view the matrices used in the analyses; the matrix view provides a link to the original NEXUS file and a list of the sequences, but only the first 30 characters are visible. The trees displayed for a particular study can be further refined by topology type, and all trees can be viewed using PhyloWidget. Although these features are useful, TreeBASE could include numerous additional features that would maximize its usefulness and minimize the number of clicks needed to navigate the site.

One example of an annotation that would be useful to the TreeBASE user community is a link from sequence data to its GenBank accession number, so that users could be directed to NCBI to access the sequence data directly and use it in future analyses. Another is geocoding of the locality coordinates of the organisms included in the study. These are only a few of many annotations that could be incorporated into TreeBASE to improve the utility of this resource. In order to expand the annotations of the data in TreeBASE, the submission process must be refined to be more user-friendly. My project for Google Summer of Code 2011 will allow for the submission of phylogenetic data to TreeBASE so that both the data and metadata are exported in a way that simplifies the user’s interaction with the website.

Project Goals

By the end of the term, we would like to accomplish the following:

  1. Build core code base locally.
  2. Add CHARSET functionality to NeXML.
  3. Include row-segment metadata in NeXML output.
Timeline for Project Plan

4/25-5/22: Community Bonding Period
Goal 1: Build core code base locally. Completed
Comments: See blog for more details on building the code base. Familiarized myself with areas of code that I would be working on. This includes the NexmlObjectConverter class, along with its subclasses and superclasses. Took a look at the NexmlSerializationTest JUnit class, as well. Read about topics such as object-relational mapping with Hibernate and how Spring beans are configured to have other objects assigned to them automatically.
5/23-5/29: Week One
Focus on the org.nexml.model.* sections of the code base. Work out a NeXML solution both for expressing CHARSET free text and for expressing row-segment metadata such as the GenBank accession number. This currently works only for NEXUS. Develop a unit test to get it working. Think about ways to add charsets and figure out how to implement them. Get back to Rutger about whether or not any changes to the NeXML interface are necessary.
Comments: Studied the code and started to figure out how the pieces fit together. In order to express row-segment metadata, I learned that I am going to be adding code to the NexmlMatrixConverter class. Ran into some setbacks: the code base had problems, so I spent a day and a half trying to get it working again. It is working now :) Many of my goals for Week One are going to be moved to Week Two. I have been reading tutorials on JUnit tests.
5/30-6/05: Week Two
Write a JUnit test to represent row-segment metadata in the NexmlMatrixConverter class. This is my first time doing so, so it will be challenging. Looking at tests that are already written for the code is a good place to start. After getting this implemented and working, I will commit the code to TreeBASE. If there are issues representing the metadata, consult outside stakeholders. For example, one major annotation that needs to be included is the GenBank accession number. If we have issues with this, it may be useful to contact an outside party associated with GenBank.
Comments: I wrote a very short and incorrect JUnit test. Had a tough time figuring out exactly what I was doing. It turns out I did not know as much about how the metadata worked in TreeBASE, or as much about CHARSETs, as I thought. I had lengthy discussions over e-mail with Bill and Rutger to gain a clearer picture of what I was trying to do. They helped lay out the problem and got me back on my feet. See my blog post for more information on this discussion.
6/06-6/12: Week Three
Rutger wrote a very well-commented JUnit test within the NexmlMatrixConverterTest class that verified that studies and matrices from TreeBASE matched up with the NeXML matrices. This set the stage for me to expand upon it and add a functional unit test that would verify that the TreeBASE character sets matched up with the names and coordinates of the NeXML character sets. I was then to apply the logic from these tests to the actual NexmlMatrixConverter class (particularly the fromTreebaseToXML() methods) to implement it.
Comments: I wrote a much better, working JUnit test and committed it. I was asked to expand upon it going forward to help prevent any NullPointerException issues. I added my code logic to the NexmlMatrixConverter class but have not committed it. I need to wait until it passes more of our tests first.
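
As a rough illustration of the kind of check described for this week, the sketch below compares two character sets by name and coordinates and treats a missing set as a mismatch rather than letting it raise a NullPointerException. It is plain Java rather than JUnit, and NamedSet and matches() are hypothetical stand-ins; the real TreeBASE and NeXML classes and the actual NexmlMatrixConverterTest are different.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in types for illustration; the real TreeBASE and NeXML classes differ. */
final class CharSetMatchSketch {

    /** A named character set with 1-based column positions. */
    static final class NamedSet {
        final String name;
        final int[] positions;

        NamedSet(String name, int[] positions) {
            this.name = name;
            this.positions = positions;
        }
    }

    /**
     * Returns true only when both sets are present and agree on name and positions.
     * Objects.equals and Arrays.equals both tolerate null arguments, so a missing
     * name or position array cannot cause a NullPointerException here.
     */
    static boolean matches(NamedSet treebaseSet, NamedSet nexmlSet) {
        if (treebaseSet == null || nexmlSet == null) {
            return false;
        }
        return Objects.equals(treebaseSet.name, nexmlSet.name)
                && Arrays.equals(treebaseSet.positions, nexmlSet.positions);
    }

    public static void main(String[] args) {
        NamedSet fromTreebase = new NamedSet("codon1", new int[] {1, 4, 7});
        NamedSet fromNexml = new NamedSet("codon1", new int[] {1, 4, 7});
        System.out.println(matches(fromTreebase, fromNexml));
    }
}
```

In a real unit test the boolean result would be wrapped in an assertion, so that a mismatch in either the name or the coordinates fails the test with a clear message.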
6/13-6/19: Week Four
Prepare for conference and find out if there are people I want to contact while there. Continued working on JUnit tests and Rutger added more detail to my portion of it to ensure that everything is working. Once it passes these, I will be able to commit it.
June 17-22: Attending Evolution & iEvoBio 2011 in Norman, OK
6/20-6/26: Week Five
Goal 2: Add CHARSET functionality to NeXML. Completed
Work on correcting issues in the NexmlMatrixConverter class. Right now, my code does not match the TreeBASE character label to the NeXML character label.
Comments: The NexmlMatrixConverter class issues are now worked out and the code for expressing charsets in NeXML has been committed. See blog for more details. Next, we will be working on trying to include row-segment metadata in NeXML output. I received some further detail about this from Bill Piel and will be working on figuring out where to start.
6/27-7/03: Week Six
Research how to include row-segment metadata. Communicate with Bill and Rutger for more information. Work on unit tests for row-segment metadata.
Comments:
7/04-7/10: Week Seven
Implement the logic from the unit tests in the actual code. See if it passes all of the unit tests and make any corrections necessary.
7/11-7/17: Week Eight
Mid-term Evaluations.
7/18-7/24: Week Nine
Unit testing of persisting annotations.
7/25-7/31: Week Ten
Write code to persist annotations and commit if working.
8/01-8/07: Week Eleven
Consider adding additional features if time permits.
8/08-8/14: Week Twelve
Documentation. Write final evaluation of results.
8/15: Pencils Down
Submit Final Evaluations.