PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

2009 Google Summer of Code Project

Title

BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit

Authors

Student: Chase Miller

Mentors: Mark A. Jensen (primary), Rutger Vos

Abstract

This project will integrate the NeXML exchange standard into BioPerl, facilitating the adoption of this standard and easing the transition from the overworked NEXUS standard. A wrapper will be used to allow BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl. Additionally, test cases and example sets will be developed that target real world uses.

Background

Phylogenetic trees, like all visual data representations, require special consideration to represent the complete structure in a linear non-visual data format. The current accepted standard is the NEXUS file format, which is highly expressive and popular. However, it has become increasingly cumbersome to use for three main reasons: “(i) deficiencies in the standard itself, (ii) variable level of support for standard feature in current software, and (iii) an extremely high incidence of illegal files (a byproduct of the fact that most parsers are ‘forgiving’ of errors)” [1]. These problems have sparked a transition away from the NEXUS standard toward a more flexible, extensible, and strictly defined standard. One candidate is the NeXML exchange standard, which, in addition to being more extensible and more strictly defined, leverages pre-existing XML tools and capabilities (such as parser libraries, web service toolkits, and serialization) to make a more robust product [2].

One of the current barriers to wider adoption of NeXML is the lack of support for it in popular programming tools. One way to lower that barrier is to incorporate the preferred NeXML parser (Bio::Phylo) into BioPerl, a popular bioinformatics framework. However, the NeXML standard is evolving rapidly, and its development would be hampered by having to interact directly with the large BioPerl project. To avoid this, a lightweight dynamic wrapper will be created around Bio::Phylo. In this way BioPerl users will get access to the full functionality of Bio::Phylo without encumbering the future development of the NeXML standard or the Bio::Phylo module. To facilitate use of the wrapper and of NeXML, test cases and example sets will be developed that target real-world uses.

This project will be an important step in the transition away from an overworked standard to the more powerful and extensible NeXML exchange standard, increasing productivity along the way.

Project Plan

May 23rd - June 5th

NeXML to BioPerl Objects

NeXML files can contain three different types of data (trees, alignments, and plain sequences), but BioPerl has no single object that can hold all three. Therefore, a NeXML file will be parsed and mapped to three different BioPerl *IO modules (Bio::SeqIO, Bio::AlignIO, and Bio::TreeIO). The *IO modules handle parsing and unparsing, which makes them a logical place to support a new file format. This approach also emphasizes that this is a separate effort, with its own utility, from the project to add native BioPerl object handlers to Bio::Phylo. To add NeXML support to the *IO modules, a nexml.pm module will be created in the namespace of each *IO module (e.g. Bio::SeqIO::nexml), consisting of a thin wrapper around Bio::Phylo. Bio::Phylo will parse the NeXML file into Bio::Phylo objects, and the Bio::Phylo attributes will then be mapped to BioPerl attributes. This is a preliminary step; more comprehensive handling of NeXML files will be added in the coming weeks, possibly looking to Bio::Phylo::Project (which holds all three data types) for inspiration.
  • Tasks
    • Bio::Phylo::Matrices::Datum to Bio::Seq
      • Create the Bio::SeqIO::nexml.pm module and implement the Bio::SeqIO::nexml::next_seq() method, which will override the Bio::SeqIO::next_seq() method, giving native support for reading nexml files.
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains (between sequences and tree leaf nodes, for example)
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process
    • Bio::Phylo::Matrices::Matrix to Bio::Align
      • Create the Bio::AlignIO::nexml.pm module and implement the Bio::AlignIO::nexml::next_aln() method, which will override the Bio::AlignIO::next_aln() method, giving native support for reading nexml files
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process (for example, sequences of different lengths, which Bio::Phylo does not check for but BioPerl does)
    • Bio::Phylo::Forest::Tree to Bio::Tree
      • Create the Bio::TreeIO::nexml.pm module and implement the Bio::TreeIO::nexml::next_tree() method, which will override the Bio::TreeIO::next_tree() method, giving native support for reading nexml files
      • Determine whether and how to conserve Bio::Phylo data associations that Bio::Phylo explicitly maintains
      • Use Bio::Phylo to validate the nexml file
      • Catch any exceptions during the conversion process
    • Design Methods to handle the complete reading of a Nexml file
      • Parsing a nexml file into Bioperl could produce up to three different object types (SeqIO, AlignIO, TreeIO). The methods above will handle these separately, but the ultimate goal is to make this easier on the user and allow the full parsing of a nexml file with one pass of the Bio::Phylo parser without forcing the user to parse the file three times - once each for sequences, alignments, and trees. These methods will make use of the methods above (e.g. Bio::SeqIO::next_seq()) to accomplish this.
  • Done 6/1/09
    • Created preliminary version of Bio::SeqIO::nexml->next_seq() method with minimal error handling and preliminary attribute mapping
    • Created preliminary version of Bio::AlignIO::nexml->next_aln() method with minimal error handling and preliminary attribute mapping
    • Started writing (almost done) a preliminary version of the Bio::TreeIO::nexml->next_tree() method
    • Started discussion on best way to input Nexml data of all three data types (seq, aln, and tree) at once
  • Done 6/5/09
    • Created preliminary version of Bio::TreeIO::nexml->next_tree() method
    • Modified SeqIO::nexml.pm and AlignIO::nexml.pm to better access Bio::Phylo object attributes
    • Determined preliminary concept of new nexml module (maybe Bio::nexml.pm) that will handle the three different types of data represented in nexml files
    • Learned about the Test::More module and began thinking about how to write tests for the previously written modules.
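
The reading path described in this section (parse once with the underlying parser, map its attributes onto the wrapper's own objects, and hand records out one at a time through a next_seq()-style method) can be sketched in plain Perl. This is only an illustration of the wrapper pattern, not the project's code: the packages My::FakeParser and My::SeqReader, the hash-based records, and the field names are hypothetical stand-ins chosen so that the example runs without BioPerl or Bio::Phylo installed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the underlying parser (the role Bio::Phylo plays).
package My::FakeParser;

sub parse {
    my ($class, $text) = @_;
    my @rows;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;                  # skip blank lines
        my ($name, $chars) = split /\s+/, $line;
        die "malformed row: '$line'\n" unless defined $chars;
        push @rows, { name => $name, chars => $chars };
    }
    return \@rows;
}

# Hypothetical thin wrapper (the role a Bio::SeqIO::nexml-style module plays).
package My::SeqReader;

sub new {
    my ($class, %args) = @_;
    my $self = { text => $args{text}, queue => undef };
    return bless $self, $class;
}

# Parse lazily, exactly once, and map parser fields to the wrapper's fields.
sub _load {
    my ($self) = @_;
    my $rows = eval { My::FakeParser->parse($self->{text}) };
    die "could not convert input: $@" if $@;        # surface parser errors
    $self->{queue} = [
        map { +{ display_id => $_->{name}, seq => $_->{chars} } } @$rows
    ];
    return;
}

# Iterator in the style of next_seq(): one record per call, undef when done.
sub next_seq {
    my ($self) = @_;
    $self->_load unless defined $self->{queue};
    return shift @{ $self->{queue} };
}

package main;

my $reader = My::SeqReader->new(text => "taxonA ACGT\ntaxonB ACGA\n");
while (my $seq = $reader->next_seq) {
    print "$seq->{display_id}\t$seq->{seq}\n";
}
```

Keeping the parse inside a private loader means the caller only ever sees the iterator, so the underlying parser can change without touching calling code. That separation is the point of wrapping rather than reimplementing the parser.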

June 6th - June 12th

BioPerl Objects to NeXML

Write modules that allow BioPerl objects to be written to a NeXML file. The BioPerl attributes will be mapped to Bio::Phylo attributes, and Bio::Phylo will then generate the NeXML file. As mentioned above, NeXML files can hold three different data types (alignments, trees, and plain sequences); however, the modules below will each write only one type. This is a preliminary step; more comprehensive handling of NeXML files will be added in the coming weeks, possibly using the same design as Bio/Phylo/Project.pm, which is a container class for the alignment, sequence, and tree data objects in Bio::Phylo.
  • Bio::Seq to Bio::Phylo::Matrices::Datum
    • Add a write_seq() method to the Bio::SeqIO::nexml.pm module that overrides Bio::SeqIO::write_seq(), using the code from the Bio::Phylo::Matrices::Datum->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid data
  • Bio::Align to Bio::Phylo::Matrices::Matrix
    • Add a write_aln() method to the Bio::AlignIO::nexml.pm module that overrides Bio::AlignIO::write_aln(), using the code from the Bio::Phylo::Matrices::Matrix->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid data
  • Bio::Tree to Bio::Phylo::Forest::Tree
    • Add a write_tree() method to the Bio::TreeIO::nexml.pm module that overrides Bio::TreeIO::write_tree(), using the code from the Bio::Phylo::Forest::Tree->new_from_bioperl() method
    • Use Bio::Phylo to validate the NeXML file and catch any exceptions thrown for invalid data
  • Set up the Bio::Nexml class according to the design
    • Continue designing Bio::Nexml module to handle the complete writing of a nexml file
  • Done 6/12/09
    • Implemented the Bio::SeqIO::write_seq() method using the code from Bio::Phylo::Matrices::Datum::new_from_bioperl(). Updated it to use the SeqFeatures module for the start and strand attributes, since those attributes appear to have been deprecated in Bio::Seq.
    • Implemented Bio::AlignIO::write_aln(). Because of type checking in Bio::Phylo, however, it will not be functional (it currently throws an exception) until the taxa associated with the alignment can be linked in BioPerl and then passed on to the Bio::Phylo unparser.
    • Implemented Bio::TreeIO::write_tree().
    • Set up a module (Bio::Nexml) to handle an entire NeXML document.

June 13th - June 19th

Create a module (Bio::Nexml) that represents an entire NeXML document and acts as a container class for seq, aln, and tree objects. It will make use of the modules and methods written earlier in this project.

  • Determine the best way to maintain data relationships: this can be handled by the container class (Bio::Nexml) or by making use of other BioPerl classes (perhaps Bio::Annotation::Tree)
  • Working Plan
  • Write tests using Test::More
  • Milestone - Testable Bio::*IO::nexml.pm modules

Done June 19th

  • Wrote test scripts for all modules, totaling 56 tests so far
  • Created Bio::Nexml::Util module to increase efficiency
  • Added some sanity checks to make sure incoming data is what I think it is
  • Continued to add to Bio::Nexml module design
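
One sanity check of the kind listed above follows from the earlier note that an alignment's sequences must all have the same length, something BioPerl checks and Bio::Phylo does not. The sketch below is a plain-Perl illustration under that assumption; check_equal_lengths and its hash-of-strings input are hypothetical and are not taken from the project's modules.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a hash reference of row name => sequence string.
# Returns 1 if every row has the same length; dies naming the first row
# (in sorted name order) whose length differs from the first row's length.
sub check_equal_lengths {
    my ($rows) = @_;
    my @names = sort keys %$rows;
    return 1 unless @names;                         # an empty set is trivially fine
    my $expected = length $rows->{ $names[0] };
    for my $name (@names) {
        my $got = length $rows->{$name};
        die "row '$name' has length $got, expected $expected\n"
            if $got != $expected;
    }
    return 1;
}

# Trap the exception with eval so the caller can decide how to report it.
my $ok = eval { check_equal_lengths({ a => 'ACGT', b => 'AC-T' }) };
print "aligned rows accepted\n" if $ok;

my $bad = eval { check_equal_lengths({ a => 'ACGT', b => 'ACG' }) };
print "rejected: $@" unless $bad;
```

Running the check before handing data to the next layer turns a confusing downstream failure into an early, specific error message.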

June 20th - June 26th

Create Method/Module for handling the complete writing of a Nexml file (i.e. write data from AlignIO, SeqIO, and TreeIO objects to one nexml file)

  • Refactor unparsing methods from the Bio::*IO::nexml modules into Bio::Nexml::Util module.
  • Write more tests, focusing on the Bio::Nexml::Util module and the unparsing functionality
  • With a lot of the tests and sanity checks out of the way, we can now focus on relating the various data (sequences, alignments, trees, nodes, and taxa) to each other
  • Here's an in-progress design plan for the Bio::Nexml module
  • Milestone - all modules/methods are now testable

Done 6/28/09

  • Refactored unparsing methods from Bio::*IO::nexml modules to Bio::Nexml::Util module
  • Wrote more than 45 tests for writing files
  • Related taxa to trees and individual taxa to nodes so that they can be parsed into BioPerl and written back to NeXML with the relationships maintained
  • Began writing Bio::Nexml::write_doc method that will be able to handle multiple bioperl objects (including trees, alignments, and sequences) and write them to a single valid nexml document

June 27th - July 3rd

Giving talk at International Society of Developmental and Comparative Immunology (ISDCI) conference in Prague.

July 4th - July 10th

Finishing Bio::Nexml module

  • Continue writing Bio::Nexml module
    • Add more tests
    • Add any sanity checks that have not been added yet
    • Make sure all data are linked if they should be. See if module can handle a roundtrip test of a document.

Start Midterm evaluation

Done 7/10/09

  • Added more tests for writing and reading entire Nexml document (32 total)
  • Wrote the majority of the Bio::Nexml::write_doc method that handles entire Nexml documents
  • Refined the linking of taxa to trees so that multiple BioPerl trees can share identical taxa when converted to Bio::Phylo forests and then written out as NeXML
  • Cleaned up and reworked the Bio::Nexml::Util module
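
The idea of several trees sharing one set of taxa can be sketched as a small container that keys each taxa set on its sorted labels, so that trees with identical leaves are linked to the same set. This is a minimal illustration in plain Perl, not the Bio::Nexml implementation: My::NexmlDoc, add_tree, and the generated ids such as taxa1 are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical container; the names here are illustrative, not the project's API.
package My::NexmlDoc;

sub new {
    my ($class) = @_;
    my $self = { trees => [], taxa_sets => {} };
    return bless $self, $class;
}

# Register a tree (a name plus its leaf labels) and link it to a taxa set.
# Trees whose leaf labels are identical, in any order, reuse the same set.
# Returns the id of the taxa set the tree was linked to.
sub add_tree {
    my ($self, $name, @leaves) = @_;
    my @sorted = sort @leaves;
    my $key    = join "\0", @sorted;                # order-independent identity
    my $sets   = $self->{taxa_sets};
    unless (exists $sets->{$key}) {
        my $id = 'taxa' . (1 + scalar keys %$sets);
        $sets->{$key} = { id => $id, labels => \@sorted };
    }
    push @{ $self->{trees} }, { name => $name, taxa => $sets->{$key} };
    return $sets->{$key}{id};
}

package main;

my $doc = My::NexmlDoc->new;
my $id1 = $doc->add_tree('tree1', qw(A B C));
my $id2 = $doc->add_tree('tree2', qw(C A B));       # same taxa, different order
my $id3 = $doc->add_tree('tree3', qw(A B D));       # different taxa
print "$id1 $id2 $id3\n";
```

Because each tree stores a reference to its taxa set rather than a copy, a writer could emit each set once and have every linked tree refer to it by id.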

July 11th - July 17th

  • Handle final few linking issues with Bio::Nexml module

Benchmark

  • Use BioPerl to query large data sets from GenBank

Validation

  • Use the online NeXML validator at http://www.nexml.org to check the NeXML produced by the new modules

Potential Cases

Done 7/17/09

  • Fixed some small bugs
  • Improved linking of Taxa data to Bio::Tree::Tree objects
  • Improved linking of Taxa data to Bio::SSSSSSSS
  • Benchmarked modules - Results

July 18th - July 24th

Use Cases

  • Generate real world examples based on test data sets with detailed documentation that will walk someone step by step through the new wrapper and the functionality of Bio::Phylo from a BioPerl perspective

Fix some taxa linking bugs

Done 7/24/09

  • Generated use case via Nexml HOWTO
  • Have taxa linking working correctly, although it still needs a little work
  • Worked out final goals with mentor (Mark Jensen)

July 25th - July 31st

  • Finish taxa linking
  • Flesh out POD documentation for each method
  • Change names of some methods to better reflect what they do and to be more consistent
  • Continue to add to the HOWTO

Done

  • Added pod documentation to all complete methods
  • Turned Bio::Nexml::Util into a true factory and changed the name to Bio::Nexml::Factory
  • Partially refactored the Bio::Nexml module to better handle the linking of objects (e.g. taxa and trees)
  • Revised the HOWTO

August 1st - August 7th

Documentation and Code Refactoring

  • Continue to refactor the Bio::NexmlIO module to better handle linking
  • Finish pod documentation on refactored methods

Done

  • Finished all coding including the refactor of the Bio::NexmlIO module
  • Added tests focusing on taxa linking, bringing the total number of tests to more than 200
  • Finished POD documentation for all methods and fleshed out the Description and Synopsis of all modules

August 8th - August 17th

Incorporate into Bioperl-live

August 17th - August 24th

Begin and finish final evaluations

Deliverables

  • Native NeXML integration into BioPerl
  • Full Bio::Phylo functionality from within BioPerl
  • Example files and test cases to facilitate adoption of NeXML

ToDos

  • Implement true streaming - to allow better memory efficiency
  • Support arbitrary genotypes