Phylogenetic XML <--> Object serialization
Phylogenetic data (i.e. evolutionary trees and the data they are based on) is commonly stored in the NEXUS flat text file format. Unfortunately, this format is fairly difficult to parse because incompatible "dialects" have evolved over time, and no formal means exists to validate all flavors of NEXUS. Several XML formats have been proposed to replace NEXUS, but for practical and historical reasons none has been widely adopted. One reason is that these efforts never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. Google Summer of Code student Jason Caravas will remedy this and co-design an XML format that:
- can express (at least the biological data part of) the NEXUS specification,
- can be validated using XML schema,
- is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences, verbose annotation),
- is optimized for minimal memory requirements and limited file size for large character matrices or trees (e.g., no deep nesting, so that stream-based parsers need not hold large chunks in memory),
- uses uids and id references to link annotations to entities, nodes to taxa, etc.,
- is efficiently serialized into objects of one of the commonly used toolkits (see below).
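The design goals above (shallow nesting, id references, stream-friendly parsing) can be illustrated with a minimal sketch. It is written in Python with only the standard library, purely for illustration. The element and attribute names (doc, otus, otu, tree, node, edge, source, target, root) are assumptions made for this example, not the schema the project will define. The tree is stored as a flat list of nodes and edges, so a streaming parser only needs small id-keyed lookup tables rather than a deeply nested element structure.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical flat layout; all element and attribute names are assumptions.
SAMPLE = b"""<doc>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
    <otu id="t3" label="Gorilla gorilla"/>
  </otus>
  <tree id="tree1" otus="taxa1">
    <node id="n1" root="true"/>
    <node id="n2"/>
    <node id="n3" otu="t1"/>
    <node id="n4" otu="t2"/>
    <node id="n5" otu="t3"/>
    <edge source="n1" target="n2"/>
    <edge source="n1" target="n5"/>
    <edge source="n2" target="n3"/>
    <edge source="n2" target="n4"/>
  </tree>
</doc>"""


def read_flat_tree(stream):
    """Stream the document once, keeping only small id-keyed lookup tables."""
    labels = {}                   # otu id -> taxon label
    node_otu = {}                 # node id -> otu id (None for internal nodes)
    children = defaultdict(list)  # parent node id -> child node ids
    root = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "otu":
            labels[elem.get("id")] = elem.get("label")
        elif elem.tag == "node":
            node_otu[elem.get("id")] = elem.get("otu")
            if elem.get("root") == "true":
                root = elem.get("id")
        elif elem.tag == "edge":
            children[elem.get("source")].append(elem.get("target"))
        elem.clear()  # drop attributes and text once the element is processed
    return labels, node_otu, children, root


def to_newick(node, labels, node_otu, children):
    """Rebuild the nested tree from the flat edge list via id references."""
    kids = children.get(node, [])
    if not kids:
        return labels[node_otu[node]].replace(" ", "_")
    inner = ",".join(to_newick(k, labels, node_otu, children) for k in kids)
    return "(" + inner + ")"


labels, node_otu, children, root = read_flat_tree(io.BytesIO(SAMPLE))
newick = to_newick(root, labels, node_otu, children) + ";"
print(newick)  # ((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);
```

The nesting of the tree is recovered only when it is needed, from the edge list. The XML itself stays two levels deep regardless of how deep the tree is.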
There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these, which can serve as a basis to design a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.
- Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition - and an optimized parser implementation that is nevertheless robust.
- Involved toolkits or projects: PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
- Jason Caravas
- Rutger Vos, Hilmar Lapp, Bill Piel, Mike Muratet
- Source code
Set up IDE, project page, code repository, explore APIs for Bio::Phylo, XML::Twig, nexml schema
Write example parsing scripts that process verbose nexml and spawn appropriate Bio::Phylo objects
- otus block -> Bio::Phylo::Taxa object populated with Taxon objects
- characters block -> Bio::Phylo::Matrices object populated with Matrix objects
- trees block -> Bio::Phylo::Forest object populated with Tree objects
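The block-to-object mapping can be sketched as a dispatch table from element names to constructor functions. The sketch below is Python with plain stand-in classes. The Taxa, Matrix and Forest classes here are hypothetical and only loosely mirror the mapping; they are not the Bio::Phylo API. The row and seq element names are likewise assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


# Hypothetical stand-ins for the toolkit's container objects.
@dataclass
class Taxa:
    id: str
    labels: dict = field(default_factory=dict)    # otu id -> taxon label


@dataclass
class Matrix:
    id: str
    taxa_id: str                                  # id reference to a Taxa block
    rows: dict = field(default_factory=dict)      # otu id -> sequence string


@dataclass
class Forest:
    id: str
    taxa_id: str                                  # id reference to a Taxa block
    tree_ids: list = field(default_factory=list)


def parse_otus(elem):
    labels = {otu.get("id"): otu.get("label") for otu in elem.iter("otu")}
    return Taxa(elem.get("id"), labels)


def parse_characters(elem):
    rows = {row.get("otu"): row.findtext("seq") for row in elem.iter("row")}
    return Matrix(elem.get("id"), elem.get("otus"), rows)


def parse_trees(elem):
    tree_ids = [tree.get("id") for tree in elem.iter("tree")]
    return Forest(elem.get("id"), elem.get("otus"), tree_ids)


# One handler per top-level block; unknown blocks are skipped.
HANDLERS = {"otus": parse_otus, "characters": parse_characters, "trees": parse_trees}


def parse_document(xml_text):
    root = ET.fromstring(xml_text)
    return [HANDLERS[block.tag](block) for block in root if block.tag in HANDLERS]


SAMPLE_DOC = """<nexml>
  <otus id="taxa1"><otu id="t1" label="A"/><otu id="t2" label="B"/></otus>
  <characters id="m1" otus="taxa1">
    <row id="r1" otu="t1"><seq>ACGT</seq></row>
    <row id="r2" otu="t2"><seq>ACGA</seq></row>
  </characters>
  <trees id="f1" otus="taxa1"><tree id="tree1"/></trees>
</nexml>"""

parsed = parse_document(SAMPLE_DOC)
for obj in parsed:
    print(type(obj).__name__, obj.id)
```

Keeping one handler per block means a new block type can be supported by adding one function and one dictionary entry.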
Write example XML serializers that convert Bio::Phylo objects to valid XML.
- Bio::Phylo::Taxa object -> otus block populated with otu data
- Bio::Phylo::Matrices object -> characters block populated with matrix data
- Bio::Phylo::Forest object -> trees block populated with trees
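The reverse direction can be sketched the same way: one writer function per object type, each appending its block to a shared root element. As before, this is Python with hypothetical stand-in classes and assumed element names, not the Bio::Phylo API or the final nexml schema.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


# Hypothetical stand-ins for the toolkit's container objects.
class Taxa(NamedTuple):
    id: str
    labels: dict      # otu id -> taxon label


class Matrix(NamedTuple):
    id: str
    taxa_id: str
    rows: dict        # otu id -> sequence string


class Forest(NamedTuple):
    id: str
    taxa_id: str
    tree_ids: list


def write_taxa(parent, taxa):
    block = ET.SubElement(parent, "otus", id=taxa.id)
    for otu_id, label in taxa.labels.items():
        ET.SubElement(block, "otu", id=otu_id, label=label)


def write_matrix(parent, matrix):
    block = ET.SubElement(parent, "characters", id=matrix.id, otus=matrix.taxa_id)
    for otu_id, seq in matrix.rows.items():
        row = ET.SubElement(block, "row", otu=otu_id)
        ET.SubElement(row, "seq").text = seq


def write_forest(parent, forest):
    block = ET.SubElement(parent, "trees", id=forest.id, otus=forest.taxa_id)
    for tree_id in forest.tree_ids:
        ET.SubElement(block, "tree", id=tree_id)


WRITERS = {Taxa: write_taxa, Matrix: write_matrix, Forest: write_forest}


def serialize(objects):
    """Write each object as one top-level block and return the XML string."""
    root = ET.Element("nexml")
    for obj in objects:
        WRITERS[type(obj)](root, obj)
    return ET.tostring(root, encoding="unicode")


xml_out = serialize([
    Taxa("taxa1", {"t1": "A", "t2": "B"}),
    Matrix("m1", "taxa1", {"t1": "ACGT", "t2": "ACGA"}),
    Forest("f1", "taxa1", ["tree1"]),
])
print(xml_out)
```

This sketch only guarantees well-formed output. Whether the output is valid against the schema would have to be checked separately.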
Refactor the parsing and serialization scripts from weeks 2 and 3 into Bio::Phylo::Parsers::Nexml and Bio::Phylo::Unparsers::Nexml, add hooks in Bio::Phylo::IO
Create test data sets containing NEXUS data
Valid and invalid nexml format
Write tests to validate example files
Turn into *.t test suite
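As a rough illustration of what tests over valid and invalid example files might check, the Python sketch below reports documents that are not well-formed, duplicate ids, and id references that point at nothing. The list of reference attributes and the sample documents are assumptions for this example. Validation against the XML schema itself is a separate step that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

# Attributes assumed to hold id references in this hypothetical layout.
REF_ATTRS = ("otu", "otus", "source", "target")


def find_problems(xml_text):
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    seen = set()
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in seen:
            problems.append(f"duplicate id: {ident}")
        seen.add(ident)
    # Second pass, so that forward references are allowed.
    for elem in root.iter():
        for attr in REF_ATTRS:
            ref = elem.get(attr)
            if ref is not None and ref not in seen:
                problems.append(f"dangling {attr} reference: {ref}")
    return problems


VALID = """<nexml>
  <otus id="taxa1"><otu id="t1"/></otus>
  <trees id="f1" otus="taxa1"><tree id="tree1"><node id="n1" otu="t1"/></tree></trees>
</nexml>"""

DANGLING = """<nexml>
  <otus id="taxa1"><otu id="t1"/></otus>
  <trees id="f1" otus="taxa1"><tree id="tree1"><node id="n1" otu="t9"/></tree></trees>
</nexml>"""

BROKEN = "<nexml><otus id='taxa1'>"

for name, doc in (("valid", VALID), ("dangling", DANGLING), ("broken", BROKEN)):
    print(name, find_problems(doc))
```

Each example file then becomes one test case: valid files must yield no problems, and each invalid file must yield the specific problem it was written to trigger.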
Write documentation for parser and unparser
Design additional nexml features: sets
Sets will organize and compartmentalize data, allowing one nexml file to be used for multiple independent analyses
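Since the set feature is still to be designed, the following Python sketch is purely hypothetical. It assumes a set element that names its members by id in a space-separated attribute, and shows how one file could then feed several analyses, each working on its own subset of taxa. The element name set and the attribute name members are inventions for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical syntax: a <set> lists member ids in a space-separated attribute.
SETS_SAMPLE = """<nexml>
  <otus id="taxa1">
    <otu id="t1" label="A"/>
    <otu id="t2" label="B"/>
    <otu id="t3" label="C"/>
    <set id="ingroup" members="t1 t2"/>
    <set id="outgroup" members="t3"/>
  </otus>
</nexml>"""


def read_sets(xml_text):
    """Map each set id to the labels of the taxa it references."""
    root = ET.fromstring(xml_text)
    labels = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}
    sets = {}
    for set_elem in root.iter("set"):
        member_ids = set_elem.get("members").split()
        sets[set_elem.get("id")] = [labels[m] for m in member_ids]
    return sets


taxon_sets = read_sets(SETS_SAMPLE)
print(taxon_sets)  # {'ingroup': ['A', 'B'], 'outgroup': ['C']}
```

Because sets refer to members by id rather than containing them, the same taxon can belong to any number of sets without being duplicated in the file.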
Refactor Bio::Phylo to accommodate sets
Update parsers to store set data
Debugging and testing
Prepare new Bio::Phylo release incorporating new XML features