PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

Revision as of 16:13, 26 May 2009

2009 Google Summer of Code Project

Title

BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

Authors

Student: Chase Miller

Mentors: Mark A. Jensen (primary), Rutger Vos

Project Blog

http://gsocnexml.blogspot.com/

Abstract

This project will integrate the NeXML exchange standard into BioPerl, facilitating adoption of the standard and easing the transition away from the overworked NEXUS standard. A wrapper will give BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl. Additionally, test cases and example sets will be developed that target real-world uses.

Background

Phylogenetic trees, like all visual data representations, require special consideration when their complete structure must be represented in a linear, non-visual data format. The currently accepted standard is the NEXUS file format, which is highly expressive and popular. However, it has become increasingly cumbersome to use for three main reasons: “(i) deficiencies in the standard itself, (ii) variable level of support for standard feature in current software, and (iii) an extremely high incidence of illegal files (a byproduct of the fact that most parsers are ‘forgiving’ of errors)” [1]. These problems have sparked a transition away from NEXUS toward a more flexible, extensible, and strictly defined standard. One candidate is the NeXML exchange standard, which, in addition to being more extensible and more strictly defined, leverages pre-existing XML tools and capabilities (such as parser libraries, web service toolkits, and serialization) to make a more robust product [2].

One current barrier to wider adoption of NeXML is the lack of support for it in popular programming tools. One way to lower that barrier is to incorporate the preferred NeXML parser (Bio::Phylo) into BioPerl, a popular bioinformatics framework. However, the NeXML standard is evolving rapidly, and its development would be hampered by having to interact directly with the large BioPerl project. To avoid this pitfall, a lightweight dynamic wrapper will be created around Bio::Phylo. In this way BioPerl users will have access to the full suite of functionality provided by Bio::Phylo without encumbering future development of the NeXML standard or the Bio::Phylo module. To facilitate use of the new wrapper and the NeXML standard, test cases and example sets will be developed that target real-world uses.

This project will be an important step in the transition away from an overworked standard to the more powerful and extensible NeXML exchange standard, and should improve productivity along the way.

Project Plan

May 23rd - June 5th

NeXML to BioPerl Objects

Write modules for reading NeXML files into BioPerl objects by creating a thin wrapper around Bio::Phylo. Bio::Phylo will parse each NeXML file into Bio::Phylo objects, and the Bio::Phylo attributes will then be mapped to BioPerl attributes.
  • Bio::Phylo::Matrices::Datum to Bio::Seq
    • Implement the Bio::SeqIO::nexml::next_seq() method, which will override the Bio::SeqIO::next_seq() method, giving native support for reading NeXML files
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions raised during validation
  • Bio::Phylo::Matrices::Matrix to Bio::Align
    • Implement the Bio::AlignIO::nexml::next_aln() method, which will override the Bio::AlignIO::next_aln() method, giving native support for reading NeXML files
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions raised during validation
  • Bio::Phylo::Forest::Tree to Bio::Tree
    • Implement the Bio::TreeIO::nexml::next_tree() method, which will override the Bio::TreeIO::next_tree() method, giving native support for reading NeXML files
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions raised during validation
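The read path described above (parse into an intermediate object, then map its attributes onto the object the caller expects) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: it is written in Python using only the standard library so that it runs on its own, whereas the planned modules are Perl. The names `Datum`, `SeqRecord`, `SeqReader`, `NexmlParseError`, and `parse_datums` are hypothetical stand-ins, and the sample input is a simplified, namespace-free fragment in the spirit of NeXML rather than a schema-valid file.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Simplified, namespace-free fragment in the spirit of NeXML (assumption:
# real NeXML files carry namespaces and more structure than this).
SAMPLE = """
<nexml>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <characters id="m1" otus="taxa1">
    <matrix>
      <row id="r1" otu="t1"><seq>ACGT</seq></row>
      <row id="r2" otu="t2"><seq>ACGA</seq></row>
    </matrix>
  </characters>
</nexml>
"""


@dataclass
class Datum:
    """Hypothetical parser-side row object (role of Bio::Phylo::Matrices::Datum)."""
    row_id: str
    taxon_label: str
    chars: str


@dataclass
class SeqRecord:
    """Hypothetical caller-side sequence object (role of Bio::Seq)."""
    display_id: str
    desc: str
    seq: str


class NexmlParseError(Exception):
    """Raised when the input cannot be parsed or is internally inconsistent."""


def parse_datums(text: str) -> List[Datum]:
    """Parse the fragment into intermediate Datum objects, checking references."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        raise NexmlParseError(f"malformed input: {exc}") from exc
    labels = {otu.get("id"): otu.get("label", "") for otu in root.iter("otu")}
    datums = []
    for row in root.iter("row"):
        otu_ref = row.get("otu")
        if otu_ref not in labels:
            raise NexmlParseError(f"row {row.get('id')!r} refers to unknown otu {otu_ref!r}")
        chars = (row.findtext("seq") or "").strip()
        datums.append(Datum(row.get("id"), labels[otu_ref], chars))
    return datums


class SeqReader:
    """Thin wrapper: exposes a next_seq()-style iterator over parsed rows."""

    def __init__(self, text: str) -> None:
        self._pending = iter(parse_datums(text))

    def next_seq(self) -> Optional[SeqRecord]:
        """Return the next record, or None when the input is exhausted."""
        datum = next(self._pending, None)
        if datum is None:
            return None
        # Attribute mapping from the parser-side object to the caller-side one.
        return SeqRecord(display_id=datum.row_id, desc=datum.taxon_label, seq=datum.chars)


if __name__ == "__main__":
    reader = SeqReader(SAMPLE)
    while (record := reader.next_seq()) is not None:
        print(record.display_id, record.desc, record.seq)
```

The point of the design is that the caller only ever sees `next_seq()` and the caller-side record type, so the parser behind the wrapper can change without affecting calling code.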

June 6th - June 12th

BioPerl Objects to NeXML

Write modules that allow BioPerl objects to write their data to a NeXML file. The BioPerl attributes will be mapped to Bio::Phylo attributes, and Bio::Phylo will then generate the NeXML file.
  • Bio::Seq to Bio::Phylo::Matrices::Datum
    • Create a Bio::SeqIO::nexml.pm module that overrides the write_seq() method, using the code from the Bio::Phylo::Matrices::Datum->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid output
  • Bio::Align to Bio::Phylo::Matrices::Matrix
    • Create a Bio::AlignIO::nexml.pm module that overrides the write_aln() method, using the code from the Bio::Phylo::Matrices::Matrix->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid output
  • Bio::Tree to Bio::Phylo::Forest::Tree
    • Create a Bio::TreeIO::nexml.pm module that overrides the write_tree() method, using the code from the Bio::Phylo::Forest::Tree->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid output
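The write path runs the same mapping in reverse: convert each caller-side object into the serializer's own object, emit XML, and validate the result before handing it back. The sketch below illustrates that flow under stated assumptions. It is Python with the standard library only, not the planned Perl modules; `SeqRecord`, `Datum`, `Datum.new_from_record` (playing the role the plan assigns to new_from_bioperl()), `SeqWriter`, and `validate` are hypothetical names; and the two checks in `validate` are illustrative rules, not the NeXML schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class SeqRecord:
    """Hypothetical caller-side sequence object (role of Bio::Seq)."""
    display_id: str
    seq: str


@dataclass
class Datum:
    """Hypothetical serializer-side row object (role of Bio::Phylo::Matrices::Datum)."""
    row_id: str
    chars: str

    @classmethod
    def new_from_record(cls, record: SeqRecord) -> "Datum":
        # Attribute mapping from the caller-side object to the serializer-side one.
        return cls(row_id=record.display_id, chars=record.seq)


def validate(text: str) -> None:
    """Illustrative checks only: well-formed XML, unique row ids, non-empty rows."""
    root = ET.fromstring(text)  # raises ET.ParseError on malformed XML
    ids = [row.get("id") for row in root.iter("row")]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate row id in output")
    for row in root.iter("row"):
        if not row.findtext("seq"):
            raise ValueError(f"row {row.get('id')!r} has no sequence data")


class SeqWriter:
    """Thin wrapper: collects records via write_seq() and serializes on demand."""

    def __init__(self) -> None:
        self._rows: List[Datum] = []

    def write_seq(self, record: SeqRecord) -> None:
        self._rows.append(Datum.new_from_record(record))

    def to_xml(self) -> str:
        # Simplified, namespace-free output in the spirit of NeXML (assumption).
        root = ET.Element("nexml")
        characters = ET.SubElement(root, "characters", id="m1")
        matrix = ET.SubElement(characters, "matrix")
        for datum in self._rows:
            row = ET.SubElement(matrix, "row", id=datum.row_id)
            ET.SubElement(row, "seq").text = datum.chars
        text = ET.tostring(root, encoding="unicode")
        validate(text)  # fail before returning output that breaks the checks
        return text


if __name__ == "__main__":
    writer = SeqWriter()
    writer.write_seq(SeqRecord("r1", "ACGT"))
    writer.write_seq(SeqRecord("r2", "ACGA"))
    print(writer.to_xml())
```

Validating inside `to_xml()` means invalid output surfaces as an exception at write time instead of as a broken file discovered later by a downstream reader.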

June 13th - June 19th

Finish Modules

  • Finish coding the nexml.pm modules (Bio::SeqIO::nexml, Bio::AlignIO::nexml, and Bio::TreeIO::nexml), adding any extra wrapper code that might be necessary
  • Milestone - Testable nexml.pm modules

June 20th - June 26th

Regression testing

June 27th - July 3rd

Test Data Sets

  • Generate test data sets that will be used for both benchmarking and test use cases

July 4th - July 10th

Benchmark

  • Use BioPerl to query large data sets from GenBank

Validation

  • Use the online NeXML validator at http://www.nexml.org to check the NeXML produced by the new modules

Potential Cases

Start Midterm evaluation

July 11th - July 17th

Use Cases

  • Generate real-world examples from the test data sets, with detailed documentation that walks the reader step by step through the new wrapper and the functionality of Bio::Phylo from a BioPerl perspective

July 18th - July 24th

Code Optimization

  • Optimize code based on the benchmark results

July 25th - July 31st

Incorporate new module into BioPerl

  • BioPerl-dev – the repository for experimental packages
  • BioPerl-live – the repository for modules slated for core releases (the module will go here only if the BioPerl core developers think it works well and is useful enough to be packaged as part of the core)

August 1st - August 7th

Debugging and Testing

August 8th - August 10th

Finish documentation and wrap up

August 17th - August 24th

Begin and finish final evaluations

Deliverables

  • Native NeXML integration into BioPerl
  • Full Bio::Phylo functionality from within BioPerl
  • Example files and test cases to facilitate adoption of NeXML