PhyloSoC:Phylogenetic XML

From Phyloinformatics
Revision as of 12:39, 1 June 2007 by RutgerVos (talk) (Phylogenetic XML <--> Object serialization)

Phylogenetic XML <--> Object serialization

Phylogenetic data (i.e. evolutionary trees and the data they are based on) is commonly stored in the NEXUS flat text file format. Unfortunately, this format is fairly difficult to parse because incompatible "dialects" have evolved over time, and no formal means of validating all flavors of NEXUS exists. Several XML formats have been proposed to replace NEXUS, but none of them has, for practical and historical reasons, been widely adopted. One reason why none of these efforts has succeeded is that they never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. We seek a student who will remedy this and design an XML format that:
  1. can express (at least the biological data part of) the NEXUS specification,
  2. can be validated using DTD or XML schema,
  3. is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences),
  4. is optimized for minimal memory requirements and small file size for large character matrices or trees (e.g., by avoiding deep nesting, so that stream-based parsers need not hold large chunks in memory),
  5. uses uids and id references to link annotations to entities, nodes to taxa, etc.,
  6. can be efficiently deserialized into objects of one of the commonly used toolkits (see below).
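To make requirements 3 and 5 concrete, here is a minimal sketch of what a flat, id-linked layout might look like and how a parser could resolve the references. Everything in it is an assumption made for illustration: the element names (`phylo`, `taxon`, `node`, `annotation`), the attribute names, and the plain dictionaries standing in for toolkit objects are all hypothetical, and Python is used only for brevity rather than BioPerl or Biojava.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat layout: every element and attribute name is illustrative only.
DOC = """<phylo>
  <taxon id="t1" label="Homo sapiens"/>
  <taxon id="t2" label="Pan troglodytes"/>
  <node id="n1"/>
  <node id="n2" parent="n1" taxon="t1" length="0.1"/>
  <node id="n3" parent="n1" taxon="t2" length="0.2"/>
  <annotation ref="t1" key="ncbi_taxid" value="9606"/>
</phylo>"""


def load(xml_text):
    """Build plain dictionaries from the flat document and check id references."""
    root = ET.fromstring(xml_text)

    # Index taxa and nodes by their unique ids.
    taxa = {t.get("id"): {"label": t.get("label")} for t in root.findall("taxon")}
    nodes = {}
    for n in root.findall("node"):
        nodes[n.get("id")] = {
            "parent": n.get("parent"),   # id reference to another node, or None
            "taxon": n.get("taxon"),     # id reference to a taxon, or None
            "length": float(n.get("length", "0")),
            "children": [],
        }

    # Resolve node -> parent and node -> taxon references.
    for node_id, node in nodes.items():
        if node["parent"] is not None:
            if node["parent"] not in nodes:
                raise ValueError(f"node {node_id}: unknown parent {node['parent']}")
            nodes[node["parent"]]["children"].append(node_id)
        if node["taxon"] is not None and node["taxon"] not in taxa:
            raise ValueError(f"node {node_id}: unknown taxon {node['taxon']}")

    # Key/value annotations point at any taxon or node by id.
    annotations = {}
    for a in root.findall("annotation"):
        target = a.get("ref")
        if target not in taxa and target not in nodes:
            raise ValueError(f"annotation refers to unknown id {target}")
        annotations.setdefault(target, {})[a.get("key")] = a.get("value")

    return taxa, nodes, annotations


taxa, nodes, annotations = load(DOC)
print(nodes["n1"]["children"])
print(annotations["t1"])
```

Because the tree is expressed through `parent` references rather than nested elements, the nesting depth of the document stays constant no matter how deep the tree is. The cost is that the parser has to keep an id index and check for dangling references itself, unless the schema declares those attributes as ID/IDREF types.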
See also the announcement on PerlMonks.
There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these attempts and can serve as a basis for designing a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.
Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition, and to pair it with an optimized parser implementation that is nevertheless robust.
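As a rough illustration of the memory side of this trade-off, the sketch below reads a character matrix one row at a time with a stream-based parser, so that only the current row's data needs to be held in memory. The `matrix` and `row` element names and the `taxon` attribute are hypothetical, and Python's standard `xml.etree.ElementTree.iterparse` stands in for whatever streaming parser the chosen toolkit would use.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical matrix layout: one shallow <row> element per taxon.
SAMPLE = b"""<matrix>
  <row taxon="t1">ACGT</row>
  <row taxon="t2">ACGA</row>
</matrix>"""


def iter_rows(source):
    """Yield (taxon_id, sequence) pairs one row at a time from a file-like object."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield elem.get("taxon"), (elem.text or "").strip()
            # Drop the row's text and attributes once the consumer is done with it.
            # The emptied element itself stays attached to its parent.
            elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
print(rows)
```

This only works well because each row is a small, self-contained element. If rows were wrapped in several levels of enclosing elements, or if a row could only be interpreted after reading elements that appear later in the file, the parser would have to buffer much more before it could emit anything.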
Involved toolkits or projects 
PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
Rutger Vos, Hilmar Lapp, Bill Piel, Mike Muratet
Source code