PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

From Phyloinformatics
Revision as of 12:37, 26 May 2009 by Chmille4

2009 Google Summer of Code Project


BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit


Student: Chase Miller

Mentors: Mark A. Jensen (primary), Rutger Vos

Project Blog


This project will integrate the NeXML exchange standard into BioPerl, facilitating adoption of the standard and easing the transition away from the overworked NEXUS standard. A wrapper will give BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl. Additionally, test cases and example sets will be developed that target real-world uses.


Phylogenetic trees, like all visual data representations, require special consideration when the complete structure must be represented in a linear, non-visual data format. The currently accepted standard is the NEXUS file format, which is highly expressive and popular. However, it has become increasingly cumbersome to use for three main reasons: “(i) deficiencies in the standard itself, (ii) variable level of support for standard feature in current software, and (iii) an extremely high incidence of illegal files (a byproduct of the fact that most parsers are ‘forgiving’ of errors)” [1]. These problems have sparked a transition away from the NEXUS standard toward a more flexible, extensible, and strictly defined standard. One candidate is the NeXML exchange standard, which, in addition to being more extensible and more strictly defined, leverages pre-existing XML tools and capabilities (such as parser libraries, web service toolkits, and serialization) to make a more robust product [2].

One current barrier to wider adoption of NeXML is the lack of support for it in popular programming tools. One way to lower this barrier is to incorporate the preferred NeXML parser (Bio::Phylo) into BioPerl, a popular bioinformatics framework. However, the NeXML standard is evolving rapidly and would be hampered by having to interact directly with the large BioPerl project. To avoid this pitfall, a lightweight dynamic wrapper will be created around Bio::Phylo. In this way BioPerl users will have access to the full suite of functionality provided by Bio::Phylo without encumbering the future development of the NeXML standard or the Bio::Phylo module. To facilitate use of the new wrapper and the NeXML standard, test cases and example sets will be developed that target real-world uses.

This project will be an important step in the transition away from an overworked standard to the more powerful and extensible NeXML exchange standard, increasing productivity along the way.
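
To illustrate the point about pre-existing XML tooling, here is a minimal sketch in Python (chosen only for brevity; the project itself targets Perl) that reads a small NeXML-like fragment with a standard-library XML parser. The fragment is simplified and hypothetical: namespace declarations are omitted, and the element and attribute names (otu, node, edge, source, target, length) follow the general shape of NeXML rather than a validated instance. The helper names read_tree_edges and is_well_formed are invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical NeXML-like fragment (no namespaces, not schema-validated).
NEXML_LIKE = """\
<nexml version="0.9">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="taxa1">
    <tree id="tree1">
      <node id="n1" root="true"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.5"/>
      <edge id="e2" source="n1" target="n3" length="0.7"/>
    </tree>
  </trees>
</nexml>
"""


def read_tree_edges(xml_text):
    """Return (source_name, target_name, length) for every edge element."""
    root = ET.fromstring(xml_text)
    # Map taxon ids to human-readable labels.
    labels = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}
    # Name each node by its taxon label, falling back to the node id.
    names = {}
    for node in root.iter("node"):
        names[node.get("id")] = labels.get(node.get("otu"), node.get("id"))
    return [
        (names[e.get("source")], names[e.get("target")], float(e.get("length")))
        for e in root.iter("edge")
    ]


def is_well_formed(xml_text):
    """True if a generic XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Unlike a forgiving NEXUS parser, a generic XML parser rejects a document with an unclosed element outright, so is_well_formed returns False as soon as the closing trees tag is removed from the fragment.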

Project Plan

May 23rd - June 5th

BioPerl Objects to NeXML

  • Write a module for converting BioPerl data into the NeXML standard by implementing the BioPerl object interfaces using the Bio::Phylo APIs.

Bio::Seq to Bio::Phylo::Matrices::Datum

  • Create a module that overrides the write_seq() method using the Bio::Phylo::Matrices::Datum->new_from_bioperl() method

Bio::Align to Bio::Phylo::Matrices::Matrix

  • Create a module that overrides the write_aln() method using the Bio::Phylo::Matrices::Matrix->new_from_bioperl() method

Bio::Tree to Bio::Phylo::Forest::Tree

  • Create a module that overrides the write_tree() method using the Bio::Phylo::Forest::Tree->new_from_bioperl() method
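
The three writer modules above share one pattern: a BioPerl-facing write method converts the native object into its Bio::Phylo counterpart and lets that object serialize itself. A minimal sketch of that delegation follows, in Python for brevity. Every class and method here (HostSeq, ToolkitDatum, new_from_host, NexmlSeqWriter) is a hypothetical stand-in, not the BioPerl or Bio::Phylo API, and the row/seq XML shape is a simplification.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class HostSeq:
    """Hypothetical stand-in for a host-framework sequence object."""
    display_id: str
    seq: str


class ToolkitDatum:
    """Hypothetical stand-in for the toolkit-side object that knows XML."""

    def __init__(self, name, chars):
        self.name = name
        self.chars = chars

    @classmethod
    def new_from_host(cls, host_seq):
        # Conversion lives on the toolkit side, so the wrapper stays thin.
        return cls(host_seq.display_id, host_seq.seq)

    def to_xml(self):
        row = ET.Element("row", {"id": self.name, "label": self.name})
        ET.SubElement(row, "seq").text = self.chars
        return ET.tostring(row, encoding="unicode")


class NexmlSeqWriter:
    """Hypothetical wrapper: exposes write_seq() and delegates the real work."""

    def __init__(self):
        self.chunks = []  # serialized rows, collected in memory for the sketch

    def write_seq(self, host_seq):
        datum = ToolkitDatum.new_from_host(host_seq)
        self.chunks.append(datum.to_xml())
        return True


writer = NexmlSeqWriter()
writer.write_seq(HostSeq(display_id="seq_a", seq="ACGT"))
```

Because the wrapper only converts and delegates, changes to the serialization format stay inside the toolkit class, which is the decoupling the proposal is aiming for.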

June 6th - June 12th

NeXML to BioPerl Objects

  • Continue to write the modules, focusing on implementing the methods from the BioPerl object interfaces needed for reading a NeXML file.

Bio::Phylo::Matrices::Datum to Bio::Seq

  • Implement the Bio::SeqIO::nexml::next_seq() method, which will override the Bio::SeqIO::next_seq() method, giving native support for reading NeXML files

Bio::Phylo::Matrices::Matrix to Bio::Align

  • Implement the Bio::AlignIO::nexml::next_aln() method, which will override the Bio::AlignIO::next_aln() method, giving native support for reading NeXML files

Bio::Phylo::Forest::Tree to Bio::Tree

  • Implement the Bio::TreeIO::nexml::next_tree() method, which will override the Bio::TreeIO::next_tree() method, giving native support for reading NeXML files
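
The reader side follows the usual BioPerl stream convention: each call to next_seq() returns one object, and an undefined value signals the end of the stream. A hedged Python sketch of that convention is given here; NexmlSeqReader, read_all, and the sample document are hypothetical, None plays the role of Perl's undef, and the real modules would obtain their objects from Bio::Phylo rather than from a hand-rolled XML walk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified input; not a schema-valid NeXML document.
SAMPLE = """\
<matrix>
  <row id="r1" label="seq_a"><seq>ACGT</seq></row>
  <row id="r2" label="seq_b"><seq>ACGA</seq></row>
</matrix>
"""


class NexmlSeqReader:
    """Hypothetical reader that hands out one (label, sequence) pair per call."""

    def __init__(self, xml_text):
        root = ET.fromstring(xml_text)
        # Parse once up front, then iterate lazily over the rows.
        self._rows = iter(root.findall("row"))

    def next_seq(self):
        row = next(self._rows, None)
        if row is None:
            return None  # end of stream, like undef in Perl
        return (row.get("label"), row.findtext("seq"))


def read_all(reader):
    """Drain a reader using the while-next_seq idiom."""
    seqs = []
    while True:
        seq = reader.next_seq()
        if seq is None:
            break
        seqs.append(seq)
    return seqs
```

The same shape would apply to next_aln() and next_tree(); only the kind of object returned changes.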

June 13th - June 19th

Finish Modules

  • Finish coding the modules (Bio::SeqIO::nexml, Bio::AlignIO::nexml, and Bio::TreeIO::nexml), adding any extra wrapper code that might be necessary
  • Milestone - Testable modules

June 20th - June 26th

Regression testing

June 27th - July 3rd

Test Data Sets

  • Generate test data sets that will be used for both benchmarking and test use cases

July 4th - July 10th


  • Use BioPerl to query large data sets from GenBank


  • Use the online NeXML validator to check the NeXML produced by the new modules

Potential Cases

Start Midterm evaluation

July 11th - July 17th

Use Cases

  • Generate real-world examples based on the test data sets, with detailed documentation that walks the reader step by step through the new wrapper and the functionality of Bio::Phylo from a BioPerl perspective

July 18th - July 24th

Code Optimization

  • Optimize code based on the benchmark results

July 25th - July 31st

Incorporate new module into BioPerl

  • BioPerl-dev – the repository for experimental packages
  • BioPerl-live – the repository for modules slated for core releases (the module will only go here if the BioPerl core developers think that it works well and is useful enough to be packaged as part of the core)

August 1st - August 7th

Debugging and Testing

August 8th - August 10th

Finish documentation and wrap up

August 17th - August 24th

Begin and finish final evaluations


  • Native NeXML integration into BioPerl
  • Full Bio::Phylo functionality from within BioPerl
  • Example files and test cases to facilitate adoption of NeXML