PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit/HOWTO
This is a HOWTO about the Bio::Nexml module and how to use it to read and write complete Nexml documents. It also describes how the Bio::SeqIO::nexml, Bio::AlignIO::nexml, and Bio::TreeIO::nexml modules output individual objects to nexml.
- 1 Abstract
- 2 Authors
- 3 Introduction
- 4 Design
- 5 Example Nexml Documents
- 6 Reading/Writing Entire Nexml Documents
- 7 Reading/Writing Individual Datatypes (e.g. trees)
- 8 Data Degradation from Bio::Phylo
These nexml modules integrate the NeXML exchange standard into BioPerl, facilitating the adoption of this standard and easing the transition from the overworked NEXUS standard. A wrapper gives BioPerl native access to the preferred NeXML parser (Bio::Phylo), allowing Bio::Phylo and NeXML to co-evolve without being encumbered by BioPerl.
Nexml functionality in BioPerl consists of four modules that let the user interact with nexml data in two different ways. Bio::Nexml reads and writes an entire nexml document, whereas Bio::SeqIO::nexml, Bio::AlignIO::nexml, and Bio::TreeIO::nexml each read and write only one data type (seqs, alns, or trees, respectively).
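The split between the single-datatype modules can be summarised as a small lookup table. The following is only an illustrative sketch: the hash %NEXML_IO and the helper nexml_io_for are hypothetical names that are not part of BioPerl, while the class and method names are the standard Bio::SeqIO, Bio::AlignIO, and Bio::TreeIO stream interfaces used in this HOWTO.

```perl
use strict;
use warnings;

# Hypothetical lookup table (not part of BioPerl): for each single data type,
# the stream class to instantiate with -format => 'nexml' and the methods
# used to read and write one object.
my %NEXML_IO = (
    seqs  => { class => 'Bio::SeqIO',   read => 'next_seq',  write => 'write_seq'  },
    alns  => { class => 'Bio::AlignIO', read => 'next_aln',  write => 'write_aln'  },
    trees => { class => 'Bio::TreeIO',  read => 'next_tree', write => 'write_tree' },
);

# Return the table entry for a data type, or die on an unknown type.
sub nexml_io_for {
    my ($type) = @_;
    my $entry = $NEXML_IO{$type}
        or die "Unknown data type '$type' (expected seqs, alns, or trees)\n";
    return $entry;
}

my $io = nexml_io_for('trees');
print "$io->{class}: $io->{read} / $io->{write}\n";
```

When a document mixes several data types and their associations matter, the whole-document Bio::Nexml module is the better fit than any single entry in this table.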
Example Nexml Documents
Nexml documents to use with the example code can be found at http://www.nexml.org/nexml/examples/
Reading/Writing Entire Nexml Documents
Reading and writing a whole nexml document is accomplished with the Bio::Nexml module. The Bio::Nexml module can read a nexml document and maintain many of the data associations allowed by Bio::Phylo (at this point, however, not all data associations are maintained; see Data Degradation from Bio::Phylo). Once read, the data can be converted into BioPerl objects (i.e. Bio::Tree::Tree, Bio::SimpleAlign, and Bio::Seq) and manipulated before being written back to a nexml document.
    use strict;
    use warnings;
    use Bio::Nexml;

    # Instantiate a Bio::Nexml object and link it to a file
    my $in_nexml = Bio::Nexml->new(-file => 'nexml_doc.xml', -format => 'Nexml');

    # Read in some data
    my $bptree1 = $in_nexml->next_tree();
    my $bpaln1  = $in_nexml->next_aln();
    my $bpseq1  = $in_nexml->next_seq();

    # Use/manipulate data
    ...

    # Write data to nexml file
    my $out_nexml = Bio::Nexml->new(-file => '>new_nexml_doc.xml', -format => 'Nexml');
    $out_nexml->to_xml();
Reading/Writing Individual Datatypes (e.g. trees)
Sometimes it is preferable to work with only a single data type. (This also keeps the parsers uniformly located within the Bio::*IO modules.) In these cases, use one of the Bio::*IO::nexml modules (Bio::TreeIO::nexml, Bio::AlignIO::nexml, or Bio::SeqIO::nexml).
Read/Write a tree
    use strict;
    use warnings;
    use Bio::TreeIO;

    # Create stream object
    my $TreeStream = Bio::TreeIO->new(-file => 'trees.xml', -format => 'Nexml');

    # Read and convert first tree to BioPerl Bio::Tree::Tree object
    my $tree_obj = $TreeStream->next_tree();

    # Use/manipulate tree data (e.g.)
    my @nodes = $tree_obj->get_nodes();
    ...

    # Convert and output BioPerl tree object to nexml
    my $outTree = Bio::TreeIO->new(-file => '>trees_out.xml', -format => 'nexml');
    $outTree->write_tree($tree_obj);
Read/Write an alignment
    use strict;
    use warnings;
    use Bio::AlignIO;

    # Create stream object
    my $AlnStream = Bio::AlignIO->new(-file => 'characters.xml', -format => 'Nexml');

    # Read and convert first alignment to BioPerl Bio::SimpleAlign object
    my $aln_obj = $AlnStream->next_aln();

    # Use/manipulate alignment data (e.g.)
    ...

    # Convert and output BioPerl alignment object to nexml
    my $outAln = Bio::AlignIO->new(-file => '>aln_out.xml', -format => 'nexml');
    $outAln->write_aln($aln_obj);
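Sequences follow the same pattern through Bio::SeqIO::nexml. The following is a minimal sketch, assuming the standard Bio::SeqIO stream methods next_seq and write_seq; the sub name copy_first_seq and the file names in the usage comment are hypothetical.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: read the first sequence from a nexml file and write it
# back out in nexml format. Returns the sequence object, or undef if the
# input stream yields no sequence.
sub copy_first_seq {
    my ($in_file, $out_file) = @_;

    # Create stream object
    my $in = Bio::SeqIO->new(-file => $in_file, -format => 'nexml');

    # Read and convert the first sequence to a BioPerl sequence object
    my $seq_obj = $in->next_seq();
    return unless defined $seq_obj;

    # Use/manipulate sequence data (e.g.)
    printf "%s has %d residues\n",
        $seq_obj->display_id // 'unnamed', $seq_obj->length;

    # Convert and output the BioPerl sequence object to nexml
    my $out = Bio::SeqIO->new(-file => ">$out_file", -format => 'nexml');
    $out->write_seq($seq_obj);
    return $seq_obj;
}

# Run only when two file names are given, e.g.
#   perl copy_seq.pl sequences.xml seqs_out.xml
copy_first_seq(@ARGV) if @ARGV == 2;
```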
It is important to note that the XML output is not a complete nexml document; only the tree or alignment itself is written in nexml format. (This could be changed easily. I'm not sure whether it is better to output a whole document or just pieces, which could possibly allow someone to build their own nexml document from BioPerl, although getting the syntax right would be very difficult.)
Data Degradation from Bio::Phylo
Some things in Bio::Phylo were not implemented in BioPerl because doing so was either very difficult or did not make sense.
- Taxa and Bio::Phylo::Tree objects
- Taxa and Bio::Phylo::Matrices::Datum objects (sequences)
- Taxa and Bio::Phylo::Matrices::Matrix objects (alignments)
Associations Not Maintained (although maybe soon)
- Sequences and Nodes
- Alignments and Trees
Alignments and Sequences of odd type
Nexml is a robust standard and can represent wide-ranging types of data. Some types of data did not make sense to convert to BioPerl and are therefore ignored during the parsing process.
Nexml allows Bio::Phylo::Matrices::Matrix objects (i.e. alignments) and Bio::Phylo::Matrices::Datum objects (i.e. sequences) to represent data that is not DNA, RNA, or Protein. These Matrix objects are not converted during the parsing process.
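To anticipate which character blocks will survive conversion, the rule above can be written as a simple check on the data type name. This is a sketch of the rule as stated here, not the parser's own code: the helper name is_convertible_type is hypothetical, and the type names Continuous and Standard are assumed examples of non-molecular types that a Nexml document may declare.

```perl
use strict;
use warnings;

# Types that the text says are converted to BioPerl objects.
my %CONVERTIBLE = map { $_ => 1 } qw(dna rna protein);

# Hypothetical helper: true if a data type name (compared case-insensitively)
# is DNA, RNA, or protein; false for anything else.
sub is_convertible_type {
    my ($type) = @_;
    return 0 unless defined $type;
    return $CONVERTIBLE{ lc $type } ? 1 : 0;
}

for my $type (qw(Dna Rna Protein Continuous Standard)) {
    printf "%-10s %s\n", $type,
        is_convertible_type($type) ? 'converted' : 'ignored';
}
```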