PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

From Phyloinformatics
Revision as of 09:48, 8 June 2009 by Chmille4 (talk) ('''June 6th - June 12th''')

2009 Google Summer of Code Project

Title

BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

Authors

Student: Chase Miller

Mentors: Mark A. Jensen (primary), Rutger Vos

Project Blog

http://gsocnexml.blogspot.com/

Abstract

This project will integrate the NeXML exchange standard into BioPerl, facilitating the adoption of this standard and easing the transition from the overworked NEXUS standard. A wrapper will be used to allow BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl. Additionally, test cases and example sets will be developed that target real world uses.

Background

Phylogenetic trees, like all visual data representations, require special consideration to represent the complete structure in a linear non-visual data format. The currently accepted standard is the NEXUS file format, which is highly expressive and popular. However, it has become increasingly cumbersome to use for three main reasons: “(i) deficiencies in the standard itself, (ii) variable level of support for standard feature in current software, and (iii) an extremely high incidence of illegal files (a byproduct of the fact that most parsers are ‘forgiving’ of errors)” [1]. These problems have sparked a transition away from the NEXUS standard to a more flexible, extensible, and strictly defined standard. One candidate is the NeXML exchange standard, which, in addition to being more extensible and more strictly defined, leverages pre-existing XML tools and capabilities (such as parser libraries, web service toolkits, and serialization) to make a more robust product [2].

One of the current barriers to wider adoption of NeXML is the lack of support for it in popular programming tools. One way to lower this barrier is to incorporate the preferred NeXML parser (Bio::Phylo) into BioPerl, a popular bioinformatics framework. However, the NeXML standard is rapidly evolving, and its development would be hampered by having to interact directly with the large BioPerl project. To avoid this, a lightweight dynamic wrapper will be created around Bio::Phylo. In this way BioPerl users will be given access to the full suite of functionality provided by Bio::Phylo without encumbering the future development of the NeXML standard or the Bio::Phylo module.

To facilitate the use of the newly designed wrapper and the NeXML standard, test cases and example sets will be developed that target real world uses. This project will be an important step in the transition away from an overworked standard to the more powerful and extensible NeXML exchange standard.

Project Plan

May 23rd - June 5th

NeXML to BioPerl Objects

NeXML files can contain three different types of data (trees, alignments, and plain sequences); however, BioPerl does not have a single object that can hold all three data types. Therefore, a NeXML file will be parsed and mapped through three different BioPerl *IO modules (Bio::SeqIO, Bio::AlignIO, and Bio::TreeIO). The *IO modules handle parsing and unparsing, making them a logical place to handle a new file format. This approach also emphasizes that this is a separate effort, with its own utility, from the project to add Bio::Phylo handlers for native BioPerl objects. To add NeXML support to the *IO modules, a nexml.pm module will be created in the namespace of each *IO module (e.g. Bio::SeqIO::nexml) that will consist of a thin wrapper around Bio::Phylo. Bio::Phylo will be used to parse the NeXML file into a Bio::Phylo object, and then the Bio::Phylo attributes will be mapped to BioPerl attributes. This is a preliminary step, and more comprehensive handling of NeXML files will be added in the coming weeks, possibly looking at Bio::Phylo::Project (which holds all three data types) for inspiration.
  • Tasks
    • Bio::Phylo::Matrices::Datum to Bio::Seq
      • Create the Bio::SeqIO::nexml.pm module and implement the Bio::SeqIO::nexml::next_seq() method, which will override the Bio::SeqIO::next_seq() method, giving native support for reading nexml files.
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains (between sequences and tree leaf nodes, for example)
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process
    • Bio::Phylo::Matrices::Matrix to Bio::Align
      • Create the Bio::AlignIO::nexml.pm module and implement the Bio::AlignIO::nexml::next_aln() method, which will override the Bio::AlignIO::next_aln() method, giving native support for reading nexml files
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process (for example, sequences of different lengths, which Bio::Phylo doesn't check for but BioPerl does)
    • Bio::Phylo::Forest::Tree to Bio::Tree
      • Create the Bio::TreeIO::nexml.pm module and implement the Bio::TreeIO::nexml::next_tree() method, which will override the Bio::TreeIO::next_tree() method, giving native support for reading nexml files
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process
    • Design Methods to handle the complete reading of a Nexml file
      • Parsing a NeXML file into BioPerl could produce up to three different kinds of objects (sequences, alignments, and trees, read through SeqIO, AlignIO, and TreeIO respectively). The methods above will handle these separately, but the ultimate goal is to make this easier for the user by allowing a NeXML file to be fully parsed in one pass of the Bio::Phylo parser, rather than forcing the user to parse the file three times (once each for sequences, alignments, and trees). These methods will make use of the methods above (e.g. Bio::SeqIO::next_seq()) to accomplish this.
  • Done 6/1/09
    • Created preliminary version of Bio::SeqIO::nexml->next_seq() method with minimal error handling and preliminary attribute mapping
    • Created preliminary version of Bio::AlignIO::nexml->next_aln() method with minimal error handling and preliminary attribute mapping
    • Started to write (almost done) preliminary version of Bio::TreeIO::nexml->next_tree() method
    • Started discussion on best way to input Nexml data of all three data types (seq, aln, and tree) at once
  • Done 6/5/09
    • Created preliminary version of Bio::TreeIO::nexml->next_tree() method
    • Modified SeqIO::nexml.pm and AlignIO::nexml.pm to better access Bio::Phylo object attributes
    • Determined a preliminary concept for integrating the three nexml.pm modules so that they can handle all three types of data represented in NeXML files at once.
    • Learned about the Test::More module and began thinking about how to write tests for the previously written modules.
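
As a rough illustration of the single-pass goal described in the tasks above, the sketch below parses a document once and then hands out sequences, alignments, and trees through separate iterator-style methods. It is written in Python with only the standard library, purely to show the shape of the design; it is not the planned Perl implementation and does not use Bio::Phylo or BioPerl. The NexmlReader class, its data layout, and the sample document are assumptions made for illustration, and the sample is a simplified NeXML-like fragment (no namespaces or format block), not a schema-valid file. The row-length check stands in for the kind of validation the wrapper would have to add itself before building an alignment.

```python
import xml.etree.ElementTree as ET

# Simplified NeXML-like sample (assumption: namespaces and <format> omitted,
# so this is NOT a schema-valid NeXML file).
SAMPLE = """<nexml>
  <otus id="tax1">
    <otu id="t1" label="Homo"/>
    <otu id="t2" label="Pan"/>
  </otus>
  <characters id="m1" otus="tax1">
    <matrix>
      <row id="r1" otu="t1"><seq>ACGT</seq></row>
      <row id="r2" otu="t2"><seq>ACGA</seq></row>
    </matrix>
  </characters>
  <trees id="f1" otus="tax1">
    <tree id="tree1">
      <node id="n1" root="true"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.1"/>
      <edge id="e2" source="n1" target="n3" length="0.2"/>
    </tree>
  </trees>
</nexml>"""


class NexmlReader:
    """Hypothetical container: parses once, then serves each data type."""

    def __init__(self, text):
        root = ET.fromstring(text)  # the only parse of the document
        labels = {o.get("id"): o.get("label") for o in root.iter("otu")}

        seqs, alns, trees = [], [], []
        for chars in root.iter("characters"):
            rows = [(labels[r.get("otu")], r.findtext("seq"))
                    for r in chars.iter("row")]
            # Validation the wrapper adds itself: all rows must be equal length.
            if len({len(s) for _, s in rows}) > 1:
                raise ValueError(f"matrix {chars.get('id')}: unequal row lengths")
            alns.append({"id": chars.get("id"), "rows": rows})
            seqs.extend(rows)
        for t in root.iter("tree"):
            # Keep the link between leaf nodes and taxon labels.
            leaves = {n.get("id"): labels[n.get("otu")]
                      for n in t.iter("node") if n.get("otu")}
            edges = [(e.get("source"), e.get("target"), float(e.get("length")))
                     for e in t.iter("edge")]
            trees.append({"id": t.get("id"), "leaves": leaves, "edges": edges})

        self._seqs, self._alns, self._trees = iter(seqs), iter(alns), iter(trees)

    # Each method returns the next item, or None when exhausted.
    def next_seq(self):
        return next(self._seqs, None)

    def next_aln(self):
        return next(self._alns, None)

    def next_tree(self):
        return next(self._trees, None)


if __name__ == "__main__":
    reader = NexmlReader(SAMPLE)
    print(reader.next_seq())
    print(reader.next_aln()["id"])
    print(reader.next_tree()["leaves"])
```

Because the taxon labels are resolved once and shared, the association between a sequence and its tree leaf survives the split into three data types, which is the question raised in the "conserve Bio::Phylo data associations" tasks.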

June 6th - June 12th

BioPerl Objects to NeXML

Write modules that allow BioPerl objects to be written to a NeXML file. The BioPerl attributes will be mapped to Bio::Phylo attributes, and then Bio::Phylo will generate the NeXML file. As mentioned above, NeXML files can hold three different data types (alignments, trees, and plain sequences). However, the modules below will only write one type each. This is a preliminary step, and more comprehensive handling of NeXML files will be added in the coming weeks, possibly making use of the same design as Bio::Phylo::Project, which is a container class for the alignment, sequence, and tree data objects in Bio::Phylo.
  • Bio::Seq to Bio::Phylo::Matrices::Datum
    • Create a Bio::SeqIO::nexml.pm module that overrides the write_seq() method using the code from the Bio::Phylo::Matrices::Datum->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions that are thrown
  • Bio::Align to Bio::Phylo::Matrices::Matrix
    • Create a Bio::AlignIO::nexml.pm module that overrides the write_aln() method using the code from the Bio::Phylo::Matrices::Matrix->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions that are thrown
  • Bio::Tree to Bio::Phylo::Forest::Tree
    • Create a Bio::TreeIO::nexml.pm module that overrides the write_tree() method using the code from the Bio::Phylo::Forest::Tree->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions that are thrown
  • Continue designing the nexml module to handle the complete writing of a NeXML file
    • NeXML files can describe sequences, alignments, and trees; therefore, a user should be able to write a NeXML file containing all three from BioPerl (i.e. write data from AlignIO, SeqIO, and TreeIO objects to one NeXML file). The design of this method will use the methods above (e.g. Bio::SeqIO::nexml->write_seq()) to accomplish this.
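
To make the combined-writing idea concrete, here is a minimal sketch of how an alignment and a tree could be serialized into one document that shares a single taxon (OTU) block. Again, this is Python with only the standard library, meant to illustrate the design question rather than the planned Perl code; it does not use Bio::Phylo. The build_nexml function, its argument layout, and the generated ids are all hypothetical, and the output is a simplified NeXML-like fragment, not a schema-valid file.

```python
import xml.etree.ElementTree as ET


def build_nexml(alignment, tree_edges):
    """Serialize one alignment and one tree into a single NeXML-like string.

    alignment:  list of (taxon_label, sequence) pairs
    tree_edges: list of (parent_name, child_name, branch_length) triples
    (Both layouts are assumptions made for this sketch.)
    """
    root = ET.Element("nexml")
    otus = ET.SubElement(root, "otus", id="tax1")
    otu_ids = {}

    def otu_for(label):
        # Create each taxon once so rows and leaves can point at the same id.
        if label not in otu_ids:
            otu_ids[label] = f"t{len(otu_ids) + 1}"
            ET.SubElement(otus, "otu", id=otu_ids[label], label=label)
        return otu_ids[label]

    # Alignment rows reference the shared taxa.
    chars = ET.SubElement(root, "characters", id="m1", otus="tax1")
    matrix = ET.SubElement(chars, "matrix")
    for i, (label, seq) in enumerate(alignment, start=1):
        row = ET.SubElement(matrix, "row", id=f"r{i}", otu=otu_for(label))
        ET.SubElement(row, "seq").text = seq

    # Tree nodes: anything that never appears as a parent is treated as a leaf.
    trees = ET.SubElement(root, "trees", id="f1", otus="tax1")
    tree = ET.SubElement(trees, "tree", id="tree1")
    parents = {parent for parent, _, _ in tree_edges}
    node_ids = {}
    for name in (n for edge in tree_edges for n in edge[:2]):
        if name in node_ids:
            continue
        node_ids[name] = f"n{len(node_ids) + 1}"
        attrs = {"id": node_ids[name]}
        if name not in parents:
            attrs["otu"] = otu_for(name)  # leaf: link to the shared taxon
        ET.SubElement(tree, "node", attrs)
    for i, (parent, child, length) in enumerate(tree_edges, start=1):
        ET.SubElement(tree, "edge", id=f"e{i}", source=node_ids[parent],
                      target=node_ids[child], length=str(length))

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_nexml(
        [("Homo", "ACGT"), ("Pan", "ACGA")],
        [("root", "Homo", 0.1), ("root", "Pan", 0.2)],
    ))
```

The point of routing both data types through one helper is that matrix rows and tree leaves end up referencing the same taxon ids, which three independent writers producing three separate files could not guarantee.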

June 13th - June 19th

Finish Modules

  • Finish coding the nexml.pm modules (Bio::SeqIO::nexml, Bio::AlignIO::nexml, and Bio::TreeIO::nexml), adding any extra wrapper code that might be necessary
  • Make sure all attributes that can be mapped from NeXML are mapped to BioPerl objects
  • Finalize design plan for creating nexml module that will act as a container for the three different data types (sequences, alignments, and trees).
  • Milestone - Testable nexml.pm modules

June 20th - June 26th

Create Method/Module for handling the complete reading of a Nexml file (i.e. reading plain sequences, alignments and trees all at once)

June 27th - July 3rd

Create Method/Module for handling the complete writing of a Nexml file (i.e. write data from AlignIO, SeqIO, and TreeIO objects to one nexml file)

  • Working Plan
  • Milestone - all modules/methods are now testable

July 4th - July 10th

Regression testing

Test Data Sets

  • Generate test data sets that will be used for both benchmarking and test use cases

Start Midterm evaluation

July 11th - July 17th

Benchmark

  • Use BioPerl to query large data sets from GenBank

Validation

  • Use the online NeXML validator at http://www.nexml.org to check the NeXML produced by the new modules

Potential Cases

July 18th - July 24th

Use Cases

  • Generate real world examples based on test data sets with detailed documentation that will walk someone step by step through the new wrapper and the functionality of Bio::Phylo from a BioPerl perspective

July 25th - July 31st

Code Optimization

  • Optimize code based on results from benchmark

August 1st - August 7th

Incorporate new module into BioPerl

  • BioPerl-dev – the repository for experimental packages
  • BioPerl-live – the repository for modules slated for core releases (the module will only go here if the BioPerl core developers think that it works well and is useful enough to be packaged as part of the core)

Debugging and Testing

August 8th - August 10th

Finish documentation and wrap up

August 17th - August 24th

Begin and finish final evaluations

Deliverables

  • Native NeXML integration into BioPerl
  • Full Bio::Phylo functionality from within BioPerl
  • Example files and test cases to facilitate adoption of NeXML