Difference between revisions of "PhyloSoC:Phylogenetic XML"

From Phyloinformatics
; Mentors : '''Rutger Vos''', Hilmar Lapp, Bill Piel, Mike Muratet
; Source code: http://code.google.com/p/nexml07gsoc/source

Revision as of 13:31, 1 June 2007

Phylogenetic XML <--> Object serialization

Phylogenetic data (i.e. evolutionary trees and the data they are based on) are commonly stored in the NEXUS flat text file format. Unfortunately, this format is difficult to parse because incompatible "dialects" have evolved over time, and no formal means exists to validate all flavors of NEXUS. Several XML formats have been proposed to replace NEXUS, but none has been widely adopted, for practical and historical reasons. One reason is that these efforts never sought to integrate the proposed format with existing bioinformatics toolkits and the objects they define. Google Summer of Code student Jason Caravas will remedy this by co-designing an XML format that:
  1. can express (at least the biological data part of) the NEXUS specification,
  2. can be validated using XML schema,
  3. is flexible enough to allow adding further annotations (e.g. a key/value system for taxa and sequences, verbose annotation),
  4. is optimized for minimal memory requirements and small file size for large character matrices or trees (e.g., avoiding deep nesting so that stream-based parsers need not hold large chunks in memory),
  5. uses uids and id references to link annotations to entities, nodes to taxa, etc.,
  6. is efficiently serialized into objects of one of the commonly used toolkits (see below).
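Requirements 4 and 5 above can be illustrated with a minimal sketch. The document below is purely hypothetical: the element and attribute names (`phylo`, `otus`, `otu`, `tree`, `node`, `edge`) are assumptions made for illustration and are not taken from the project's schema. It shows a flat layout in which nodes point to taxa by id and the tree topology is stored as a list of edges rather than as nested elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all element and attribute names are illustrative
# assumptions, not part of any published schema.
DOC = """
<phylo>
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <tree id="tree1" otus="taxa1">
    <node id="n1"/>
    <node id="n2" otu="t1"/>
    <node id="n3" otu="t2"/>
    <edge source="n1" target="n2" length="0.12"/>
    <edge source="n1" target="n3" length="0.10"/>
  </tree>
</phylo>
"""


def load_tree(xml_text):
    """Resolve id references: map nodes to taxon labels and parents to children."""
    root = ET.fromstring(xml_text)

    # Index taxa by their unique id so that nodes can refer to them.
    taxa = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}

    tree = root.find("tree")

    # Internal nodes carry no "otu" reference and map to None.
    node_taxon = {}
    for node in tree.findall("node"):
        ref = node.get("otu")
        node_taxon[node.get("id")] = taxa[ref] if ref is not None else None

    # The topology is a flat edge list, so nesting depth does not grow
    # with tree depth.
    children = {}
    for edge in tree.findall("edge"):
        children.setdefault(edge.get("source"), []).append(
            (edge.get("target"), float(edge.get("length")))
        )
    return node_taxon, children


node_taxon, children = load_tree(DOC)
```

Because every cross-link is an id lookup, annotations could be attached to any element in the same way without changing the nesting of the document.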
There have been several attempts to create XML formats for phylogenetic data, for example PhyloXML. Several published and unpublished manuscripts take inventory of these, which can serve as a basis to design a parser as well as an optimized XML format definition. The parser would populate BioPerl/Bio::Phylo or Biojava objects, depending on the programming language preference of the student.
Trees as well as character state matrices can get very large. The challenge is to find the right trade-off between granularity, element and id ordering, and nesting in the XML definition, together with an optimized parser implementation that is nevertheless robust.
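The memory concern can be sketched with a stream-based parser. The example below assumes a hypothetical flat matrix layout in which each `row` element is a direct child of the root `matrix` element; these names are illustrative only. It uses `iterparse` from Python's standard `xml.etree.ElementTree` module and discards each row after it has been processed, so only one row needs to be held in memory at a time.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical character matrix: rows are direct children of the root,
# an assumption made so that processed rows can be dropped cheaply.
DOC = b"""<matrix id="m1">
<row otu="t1">ACGT</row>
<row otu="t2">ACGA</row>
</matrix>"""


def iter_rows(source):
    """Yield (otu_id, sequence) for each <row> while keeping memory bounded."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield elem.get("otu"), (elem.text or "").strip()
            # Drop the rows already handled so the partial tree stays small.
            root.clear()


rows = list(iter_rows(io.BytesIO(DOC)))
```

If rows were nested several levels deep, the parser would have to keep every open ancestor element alive until its end tag arrived, which is the cost that a shallow layout is meant to avoid.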
Involved toolkits or projects: PhyloXML, BioPerl, Bio::Phylo, Biojava, TreeBASE II, pPOD
Student: Jason Caravas
Mentors: Rutger Vos, Hilmar Lapp, Bill Piel, Mike Muratet
Source code: http://code.google.com/p/nexml07gsoc/source