PhyloSoC:Phylogenetic XML

From Phyloinformatics

Phylogenetic XML <--> Object serialization

Rationale

Phylogenetic data (i.e., evolutionary trees and the data they are based on) are commonly stored in the NEXUS flat text file format. Unfortunately, this format is fairly difficult to parse because incompatible "dialects" have evolved over time, and no formal means exists to validate all flavors of NEXUS. Several XML formats have been proposed to replace NEXUS, but for practical and historical reasons none has been widely adopted. One reason is that these efforts never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. Google Summer of Code student Jason Caravas will remedy this by co-designing an XML format that:

  1. can express (at least the biological data part of) the NEXUS specification,
  2. can be validated using XML schema,
  3. is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences, verbose annotation),
  4. is optimized for minimal memory requirements and limited file size for large character matrices or trees (e.g., by avoiding deep nesting, so that stream-based parsers need not hold large chunks in memory),
  5. uses uids and id references to link annotations to entities, nodes to taxa, etc.,
  6. is efficiently serialized into objects of one of the commonly used toolkits (see below).
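
Goals 4 and 5 can be illustrated with a minimal Python sketch. The element and attribute names below (phylo, otus, otu, tree, node, edge) are assumptions made for illustration, not the schema this project defines. Nodes and edges are stored flat, and each tip node points to its taxon through an id reference that is resolved with a dictionary lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
DOC = """
<phylo>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <tree id="tree1" otus="taxa1">
    <node id="n1"/>
    <node id="n2" otu="t1"/>
    <node id="n3" otu="t2"/>
    <edge source="n1" target="n2" length="0.5"/>
    <edge source="n1" target="n3" length="0.7"/>
  </tree>
</phylo>
"""

def tip_labels(xml_text):
    """Map each node id that carries an otu reference to that taxon's label."""
    root = ET.fromstring(xml_text)
    # Index every otu by its unique id so that references resolve by lookup.
    labels = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("otu")
        if ref is None:
            continue  # internal node: no taxon attached
        if ref not in labels:
            raise ValueError(f"node {node.get('id')} refers to unknown otu {ref}")
        resolved[node.get("id")] = labels[ref]
    return resolved

print(tip_labels(DOC))
```

Because the tree is a flat list of nodes and edges rather than a recursively nested structure, the depth of the XML stays constant no matter how deep the tree is.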

Approach

There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these, which can serve as a basis to design a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.

Challenges 
Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition, and to pair it with an optimized parser implementation that is nevertheless robust.
Involved toolkits or projects 
PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
Student
Jason Caravas
Mentors 
Rutger Vos, Hilmar Lapp, Bill Piel, Mike Muratet
Source code
http://code.google.com/p/nexml07gsoc/source
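
The memory trade-off described under Challenges can be sketched with a stream-based parse. The example below uses iterparse from Python's standard xml.etree.ElementTree module as a stand-in for whichever parser the project adopts. The matrix and row element names are assumptions. Each row is processed as soon as its closing tag is seen and is then emptied, so the full matrix never sits in memory at once.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical matrix layout: one shallow <row> element per taxon.
SAMPLE = b"""<matrix>
  <row otu="t1">ACGT</row>
  <row otu="t2">ACGA</row>
</matrix>"""

def count_states(stream):
    """Count character states row by row without keeping row data around."""
    counts = {}
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "row":
            for state in (elem.text or "").strip():
                counts[state] = counts.get(state, 0) + 1
            # Drop the row's text, attributes and children. The empty
            # element itself stays attached to its parent.
            elem.clear()
    return counts

print(count_states(io.BytesIO(SAMPLE)))
```

This only works well when each unit of data is small and shallow. A deeply nested tree element would not reach its "end" event until all of its descendants had been read.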

Schedule

Week 1

Set up IDE, project page, code repository, explore APIs for Bio::Phylo, XML::Twig, nexml schema

Week 2

Write example parsing scripts that process verbose nexml and spawn appropriate Bio::Phylo objects

  otus block -> Bio::Phylo::Taxa object populated with Taxon objects
  characters block -> Bio::Phylo::Matrices object populated with Matrix objects
  trees block -> Bio::Phylo::Forest object populated with Tree objects
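
The block-to-object mapping above could look roughly like the following Python sketch. The dataclasses are plain stand-ins for the Bio::Phylo containers, not the Perl API itself, and the XML element names are assumptions. A dispatch table maps each top-level block to the handler that builds its container.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Plain Python stand-ins for the Bio::Phylo containers (hypothetical).
@dataclass
class Taxa:
    taxon_ids: list = field(default_factory=list)

@dataclass
class Matrices:
    matrices: list = field(default_factory=list)  # each matrix: {otu id: sequence}

@dataclass
class Forest:
    trees: list = field(default_factory=list)     # each tree: [(source, target), ...]

def parse_otus(block):
    return Taxa([otu.get("id") for otu in block.findall("otu")])

def parse_characters(block):
    return Matrices([
        {row.get("otu"): (row.text or "").strip() for row in matrix.findall("row")}
        for matrix in block.findall("matrix")
    ])

def parse_trees(block):
    return Forest([
        [(edge.get("source"), edge.get("target")) for edge in tree.findall("edge")]
        for tree in block.findall("tree")
    ])

# One handler per top-level block; unknown blocks are skipped.
HANDLERS = {"otus": parse_otus, "characters": parse_characters, "trees": parse_trees}

def parse_document(xml_text):
    """Return {block name: container}; a repeated block name overwrites the earlier one."""
    root = ET.fromstring(xml_text)
    return {child.tag: HANDLERS[child.tag](child) for child in root if child.tag in HANDLERS}

EXAMPLE = """
<phylo>
  <otus id="taxa1"><otu id="t1"/><otu id="t2"/></otus>
  <characters id="chars1" otus="taxa1">
    <matrix><row otu="t1">ACGT</row><row otu="t2">ACGA</row></matrix>
  </characters>
  <trees id="trees1" otus="taxa1">
    <tree id="tree1"><edge source="n1" target="n2"/><edge source="n1" target="n3"/></tree>
  </trees>
</phylo>
"""

objects = parse_document(EXAMPLE)
print(objects["otus"], objects["characters"], objects["trees"], sep="\n")
```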

Week 3

Write example XML serializers to convert Bio::Phylo objects to valid XML.

  Bio::Phylo::Taxa object -> otus block populated with otu data
  Bio::Phylo::Matrices object -> characters block populated with matrix data
  Bio::Phylo::Forest object -> trees block populated with trees
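
The reverse direction can be sketched in the same way. This Python example takes plain dictionaries and lists as stand-ins for the Bio::Phylo objects and writes the three blocks. All element names, attribute names and fixed ids (taxa1, chars1, trees1, tree1) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

def serialize(taxa, matrix, edges):
    """Build otus, characters and trees blocks from plain Python data.

    taxa:   {otu id: label}
    matrix: {otu id: sequence}
    edges:  [(source node id, target node id), ...] for a single tree
    """
    root = ET.Element("phylo")
    otus = ET.SubElement(root, "otus", id="taxa1")
    for otu_id, label in taxa.items():
        ET.SubElement(otus, "otu", id=otu_id, label=label)

    characters = ET.SubElement(root, "characters", id="chars1", otus="taxa1")
    for otu_id, sequence in matrix.items():
        if otu_id not in taxa:
            # Refuse to emit a reference that could not be resolved on reading.
            raise ValueError(f"matrix row refers to unknown otu {otu_id}")
        row = ET.SubElement(characters, "row", otu=otu_id)
        row.text = sequence

    trees = ET.SubElement(root, "trees", id="trees1", otus="taxa1")
    tree = ET.SubElement(trees, "tree", id="tree1")
    for source, target in edges:
        ET.SubElement(tree, "edge", source=source, target=target)

    return ET.tostring(root, encoding="unicode")

xml_text = serialize(
    {"t1": "Homo sapiens", "t2": "Pan troglodytes"},
    {"t1": "ACGT", "t2": "ACGA"},
    [("n1", "n2"), ("n1", "n3")],
)
print(xml_text)
```

Checking references at write time keeps the output consistent, so a parser reading the file back should find every otu id it is asked to resolve.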

Week 4

Refactor the Week 2 and Week 3 scripts into Bio::Phylo::Parsers::Nexml and Bio::Phylo::Unparsers::Nexml, and add hooks in Bio::Phylo::IO

Week 5

Create test data sets containing NEXUS data

  Valid and invalid nexml format

Week 6

Write tests to validate example files
Turn into *.t test suite
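
As a rough sketch of what such tests might check, the Python example below flags documents that are not well-formed, that reuse an id, or that contain an otu reference with no matching otu. It is not a substitute for XML schema validation, which the Python standard library does not provide. The element names and the check_document helper are hypothetical.

```python
import xml.etree.ElementTree as ET

def check_document(xml_text):
    """Return a list of problems; an empty list means all checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    seen = set()
    for elem in root.iter():
        elem_id = elem.get("id")
        if elem_id is None:
            continue
        if elem_id in seen:
            problems.append(f"duplicate id {elem_id}")
        seen.add(elem_id)

    # Every otu attribute must point at the id of an existing <otu> element.
    otu_ids = {otu.get("id") for otu in root.iter("otu")}
    for elem in root.iter():
        ref = elem.get("otu")
        if ref is not None and ref not in otu_ids:
            problems.append(f"dangling otu reference {ref}")
    return problems

VALID = '<phylo><otus id="taxa1"><otu id="t1"/></otus><tree id="tree1"><node id="n1" otu="t1"/></tree></phylo>'
DANGLING = '<phylo><otus id="taxa1"><otu id="t1"/></otus><tree id="tree1"><node id="n1" otu="t9"/></tree></phylo>'
TRUNCATED = '<phylo><otus id="taxa1">'

for name, text in [("valid", VALID), ("dangling", DANGLING), ("truncated", TRUNCATED)]:
    print(name, check_document(text))
```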

Week 7

Write documentation for parser and unparser

Week 8

Design additional nexml features: sets

  Sets will organize and compartmentalize data, allowing one nexml file to be used for multiple independent analyses
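
Since the design of sets is still open at this point in the schedule, the following Python sketch is only one possible reading of the idea. It assumes a hypothetical set element that lists member otu ids in a space-separated attribute. Each analysis then selects the rows belonging to its own set from a shared file.

```python
import xml.etree.ElementTree as ET

# Hypothetical syntax: <set id="..." otu="space-separated otu ids"/>.
DOC = """
<phylo>
  <otus id="taxa1">
    <otu id="t1"/><otu id="t2"/><otu id="t3"/>
    <set id="primates" otu="t1 t2"/>
    <set id="outgroup" otu="t3"/>
  </otus>
  <characters id="chars1" otus="taxa1">
    <row otu="t1">ACGT</row>
    <row otu="t2">ACGA</row>
    <row otu="t3">TCGA</row>
  </characters>
</phylo>
"""

def rows_for_set(xml_text, set_id):
    """Return {otu id: sequence} for the rows whose otu belongs to the named set."""
    root = ET.fromstring(xml_text)
    members = None
    for set_elem in root.iter("set"):
        if set_elem.get("id") == set_id:
            members = set(set_elem.get("otu", "").split())
    if members is None:
        raise KeyError(set_id)
    return {
        row.get("otu"): row.text
        for row in root.iter("row")
        if row.get("otu") in members
    }

print(rows_for_set(DOC, "primates"))
print(rows_for_set(DOC, "outgroup"))
```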

Week 9

Refactor Bio::Phylo to accommodate sets
Update parsers to store set data

Week 10

Debugging and testing

Week 11

Prepare new Bio::Phylo release incorporating new XML features

Week 12

Wrap up