BioRuby PhyloXML HowTo documentation
From Phyloinformatics
Intro
PhyloXML is an XML language for saving, analyzing and exchanging data of annotated phylogenetic trees. In BioRuby, the PhyloXML parser is implemented in Bio::PhyloXML::Parser and the writer in Bio::PhyloXML::Writer. More information is available at www.phyloxml.org.
Requirements
In addition to the BioRuby library, you need the libxml Ruby bindings. To install them:
gem install -r libxml-ruby
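Before running the examples, it can help to confirm that both libraries load. The following is a minimal sketch: `library_available?` is a hypothetical helper, not part of BioRuby, and it assumes that the libxml-ruby gem is loaded with `require 'libxml'`.

```ruby
# Hypothetical helper (not part of BioRuby): returns true if the named
# library can be loaded, false if Ruby raises LoadError.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# 'bio' is BioRuby; 'libxml' is assumed to be the require name of the
# libxml-ruby gem.
%w[bio libxml].each do |lib|
  status = library_available?(lib) ? 'found' : 'MISSING'
  puts "#{lib}: #{status}"
end
```

If either line reports MISSING, install the corresponding gem before continuing.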
How to parse a file
require 'bio'
# Create new phyloxml parser
phyloxml = Bio::PhyloXML::Parser.new('example.xml')
# Print the names of all trees in the file
phyloxml.each do |tree|
  puts tree.name
end
If the file contains several trees, you can access a specific one by its index:
tree = phyloxml[3]
You can use all Bio::Tree methods on the tree. For example:
tree.leaves.each do |node|
  puts node.name
end
PhyloXML files can hold additional information besides phylogenies at the end of the file. This information can be accessed through the 'other' array of the parser object.
phyloxml = Bio::PhyloXML::Parser.new('example.xml')
while tree = phyloxml.next_tree
  # do stuff with trees
end
puts phyloxml.other
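To experiment with the parser and the 'other' array without downloading an example file, a small test file can be generated locally. This is a sketch under stated assumptions: `sample_phyloxml` and `write_sample_phyloxml` are hypothetical helpers, the tree content is made up for illustration, and the document has not been validated against the phyloXML schema.

```ruby
# Hypothetical helper: returns a minimal phyloXML document containing one
# small phylogeny followed by a non-phyloXML element (an alignment in its
# own namespace), which is the kind of data exposed through 'other'.
def sample_phyloxml
  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <phyloxml xmlns="http://www.phyloxml.org">
      <phylogeny rooted="true">
        <name>sample tree</name>
        <clade>
          <clade><name>A</name></clade>
          <clade><name>B</name></clade>
        </clade>
      </phylogeny>
      <align:alignment xmlns:align="http://example.org/align">
        <seq name="A">acgtcgcggcccgtggaagtcctctcct</seq>
        <seq name="B">aggtcgcggcctgtggaagtcctctcct</seq>
      </align:alignment>
    </phyloxml>
  XML
end

# Hypothetical helper: writes the sample document to the given path and
# returns that path.
def write_sample_phyloxml(path)
  File.write(path, sample_phyloxml)
  path
end
```

Calling `write_sample_phyloxml('example.xml')` produces a file that can be passed to `Bio::PhyloXML::Parser.new`. Once all trees have been read, `phyloxml.other` should then hold the alignment element.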
How to write a file
# Create new phyloxml writer
writer = Bio::PhyloXML::Writer.new('tree.xml')
# Write tree to the file tree.xml
writer.write(tree1)
# Add another tree to the file
writer.write(tree2)
How to retrieve data
Here is an example of how to retrieve the scientific name of each clade.
require 'bio'
phyloxml = Bio::PhyloXML::Parser.new('ncbi_taxonomy_mollusca.xml')
phyloxml.each do |tree|
  tree.each_node do |node|
    print "Scientific name: ", node.taxonomies[0].scientific_name, "\n"
  end
end
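The loop above fails with a NoMethodError on any node for which `node.taxonomies[0]` is nil, which may happen in files where not every clade carries taxonomy data. A more defensive variant could look like the sketch below. `scientific_names` is a hypothetical helper, not part of BioRuby; it only assumes that the tree responds to `each_node` and each node to `taxonomies`, as in the example above.

```ruby
# Hypothetical helper (not part of BioRuby): collects the scientific
# names found in a tree, skipping nodes that have no taxonomy entry or
# whose first taxonomy entry has no scientific name.
def scientific_names(tree)
  names = []
  tree.each_node do |node|
    taxonomy = node.taxonomies.first
    next if taxonomy.nil? || taxonomy.scientific_name.nil?
    names << taxonomy.scientific_name
  end
  names
end
```

With a parser object as created above, `phyloxml.each { |tree| puts scientific_names(tree) }` would print one name per line.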
How to retrieve 'other' data
require 'bio'
phyloxml = Bio::PhyloXML::Parser.new('phyloxml_examples.xml')
while tree = phyloxml.next_tree
  # do something with the trees
end
p phyloxml.other
puts "\n"
#=> output is an object representation
# Print in a readable way
puts phyloxml.other[0].to_xml, "\n"
#=>
#
#<align:alignment xmlns:align="http://example.org/align">
#  <seq name="A">acgtcgcggcccgtggaagtcctctcct</seq>
#  <seq name="B">aggtcgcggcctgtggaagtcctctcct</seq>
#  <seq name="C">taaatcgc--cccgtgg-agtccc-cct</seq>
#</align:alignment>
# Once we know what's there, let's output just the sequences
phyloxml.other[0].children.each do |node|
  puts node.value
end
#=>
#
#acgtcgcggcccgtggaagtcctctcct
#aggtcgcggcctgtggaagtcctctcct
#taaatcgc--cccgtgg-agtccc-cct
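When the sequences need to be looked up by name rather than just printed, the children can be collected into a Hash. This is a sketch, not a confirmed BioRuby idiom: `sequences_by_name` is a hypothetical helper, and it assumes that each child exposes its XML attributes through an `attributes` Hash keyed by attribute name, which should be checked against your BioRuby version.

```ruby
# Hypothetical helper (not part of BioRuby): maps the 'name' attribute of
# each child element to that child's text value.
# Assumption: each child responds to `attributes` (a Hash keyed by
# attribute name) and to `value`.
def sequences_by_name(other)
  other.children.each_with_object({}) do |node, result|
    result[node.attributes['name']] = node.value
  end
end
```

For the alignment shown above, `sequences_by_name(phyloxml.other[0])` would then return a Hash with the keys "A", "B" and "C".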