PhyloSoC:PhyloXML support in BioRuby


Author

Diana Jaunzeikare, Smith College, [latvianlinuxgirl] [at] [gmail] [dot] [com]

Abstract

Phylogenetic trees are used in important applications, including phylogenomics, phylogeography, gene function prediction, cladistics and the study of molecular evolution. In order to foster successful analysis, exchange, storage and reuse of phylogenetic trees and associated data, the phyloXML format was developed. It can store all necessary information about the phylogenetic tree, like clade, sequence, name and distance. The goal of this project is to implement support for phyloXML in BioRuby.

Documentation

BioRuby PhyloXML HowTo documentation


Testing

To test the bleeding-edge code, clone the source from GitHub:

git clone git://github.com/latvianlinuxgirl/bioruby.git

Then create a branch named dev:

cd bioruby
git branch dev

Then pull the contents of the remote dev branch:

git pull origin dev

Install bioruby

sudo ruby setup.rb

If you don't have the libxml-ruby library (or you get the error `require': no such file to load -- xml (LoadError)'), install it. (Remember that BioRuby is compatible with Ruby 1.8.)

gem install -r libxml-ruby
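
Before going further, it can help to check whether the library is loadable at all. The following is a minimal sketch, assuming libxml-ruby is loaded with require 'xml' (the name that appears in the error message above); the helper name library_available? is made up for illustration:

```ruby
# Hypothetical helper: returns true if the named library can be loaded,
# false if Ruby raises LoadError for it.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# 'xml' is the library name taken from the error message quoted above.
if library_available?('xml')
  puts 'libxml-ruby is installed'
else
  puts 'libxml-ruby is missing; try: gem install -r libxml-ruby'
end
```

The same kind of guard can be used at the top of a test file to skip tests when an optional dependency is absent.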

If you get a `require': no such file to load -- mkmf (LoadError)' error, run:

sudo apt-get install ruby1.8-dev

If you have other installation problems, read http://libxml.rubyforge.org/install.xml

Then download and extract ncbi_taxonomy_mollusca.xml from phyloxml.org

Then run a ruby script with this sample code:

require 'rubygems' #might be optional on some systems
require 'bio'
phyloxml = Bio::PhyloXML::Parser.new('ncbi_taxonomy_mollusca.xml')
phyloxml.each do |tree|
  tree.each_node do |node|
    print "Scientific name: ", node.taxonomies[0].scientific_name, "\n"
  end
end

and see if it produces output.



Project Plan

Community bonding Period

  • Tasks:
    • set up wiki page and blog
    • set up development environment, code repository, bug tracking, git account
    • detailed study of BioRuby classes (especially Bio::Tree, Bio::FlatFileIndex)
    • read and participate in the BioRuby, Ruby, and phyloXML mailing lists
    • study available XML parsers for Ruby and discuss in the BioRuby community which one is best suited to this project
  • What was done:
    • subscribed to mailing lists
    • created a blog
    • got familiar with Git (this was particularly useful: http://www.gitcasts.com/posts/railsconf-git-talk )
    • created GitHub account and forked bioruby project.
    • made first commit by adding sample phyloxml data files from www.phyloxml.org
    • reviewed BioPerl phyloXML implementation (also http://www.bioperl.org/wiki/HOWTO:Trees )
    • got familiar with libxml-ruby. Wrote simple program using both LibXML::XML::Reader and LibXML::XML::SAXParser to parse a simple xml file.
    • reviewed Ruby classes: Bio::Tree, Bio::Pathway
    • after discussions on the mailing lists, it was agreed to use the libxml-ruby library, specifically the LibXML::XML::Reader class

Week 1 (May 23 - May 29):

  • Tasks:
    • Start writing parser using LibXML::XML::Reader. It should return a Bio::Tree object.
    • Implement function next_tree to parse and return the next phylogeny.
    • Design Tree::Node object for containing phyloxml elements.
    • Start mapping phyloxml elements to Bio::Tree::Node, start with taxonomy, branch_length, scientific_name
    • Write simple unit tests.
  • What was done:
    • discussed on mailing list the design of classes
    • Created site for brainstorming on design of the classes: http://wiki.github.com/latvianlinuxgirl/bioruby
    • Decided that it's better to have a PhyloXMLNode class that closely resembles the phyloXML structure and has a method returning a Bio::Tree::Node object for compatibility, instead of extending the Bio::Tree::Node class
    • wrote method next_tree
    • parsed name, description
    • parsed branch_length both the tag and the attribute
    • parsed clade (correctly making new nodes and edges between them)
    • parsed events
    • refactored a bit to make the code shorter (made private methods to reduce code duplication)
    • wrote unit tests (10 new)

Week 2 (May 30 - June 5):

  • Tasks:
    • Design Taxonomy class (continue discussion on mailing list)
    • Map other phyloXML elements to PhyloXML node objects
      • parse <sequence> tag (map it to Bio::Sequence object)
      • parse <taxonomy>
      • parse id_source, node_id, color
      • parse date (write Date class)
      • parse <confidence> (write Confidence class)
    • Write unit tests for each new element mapped.
  • What was done:
    • wrote Taxonomy, PhyloXMLTaxonomy, Date, Color, Confidence, Sequence, Annotation, Accession, Points, Polygon
    • parsed <taxonomy>, <sequence> (and its elements), <date>, <color>, <confidence>, id_source, node_id, <uri>, <annotation>, <property>
    • parsed <distribution>, <polygon>, <domain>
    • wrote unit tests (9 new tests, 51 new assertions)
    • learned about ruby object send method
    • reading http://sourcemaking.com/refactoring

Week 3 (June 6 - June 12):

  • Tasks:
    • parse the rest of tags for PhyloXMLNode
      • parse distribution tag (create new class for it)
      • parse reference tag
    • parse what pertains to Tree
      • parse attributes of <phylogeny> tag
      • parse clade_relation
      • parse sequence_relation

What was done:

  • parsed reference, clade_relation, sequence_relations elements. Wrote unit tests for them.
  • added unit test for testing polygon element.
  • now parsing the property element when it is a clade subelement.
  • parsed attributes of phylogeny tag.
  • created new phyloxml file with made up data.
  • put all classes in PhyloXML namespace
  • now there are 26 tests, 86 assertions for testing phyloxml

Week 4 (June 13 - June 19):

  • Tasks:
    • Continue writing parser.
      • Extend PhyloXMLSequence class - add constructors/conversion tools to/from Bio::Sequence.
      • parse binary_characters
      • pluralize those attribute names which hold arrays of objects
      • add more unit tests
    • create a constructor for the PhyloXML class to create PhyloXMLNodes from Bio::Tree::Nodes
    • start working on documentation

What was done:

  • Changed those attribute names, which hold arrays of objects from singular to plural as was discussed in mailing lists.
  • Parsed binary characters element
  • Wrote a method to_biosequence to convert Bio::PhyloXML::Sequence to Bio::Sequence. (Still have to finish up details, e.g. deal with taxonomy since Bio::Sequence has attributes which would come from PhyloXML::Taxonomy, instead of PhyloXML::Sequence)
  • Worked on code documentation. Figured out the documentation conventions so that it looks nice when generated with rdoc. For class descriptions I used quite a lot of information from www.phyloxml.org. Still have to work on usage examples.
  • Added method to_biotreenode (can you suggest a better name?) to the Bio::PhyloXML::Node class, which returns a Bio::Tree::Node. (I will leave the constructor for the opposite conversion until I write the writer.)
  • Added some more unit tests, but not as many as I planned; now 28 tests, 91 assertions (from 26 tests, 86 assertions last week)
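
The conversion idea behind to_biotreenode can be sketched without BioRuby. This is only an illustration under stated assumptions: GenericNode stands in for Bio::Tree::Node, RichNode stands in for Bio::PhyloXML::Node, taxonomies are plain hashes, and the fallback to the first taxonomy's scientific name is a hypothetical rule chosen for the example, not necessarily what the real method does:

```ruby
# Stand-in for a generic tree node that only carries a name.
GenericNode = Struct.new(:name)

# Stand-in for a PhyloXML-style node that mirrors the XML structure.
class RichNode
  attr_accessor :name, :taxonomies, :sequences

  def initialize(name = nil)
    @name = name
    @taxonomies = [] # plural attribute names hold arrays of objects
    @sequences = []
  end

  # Hypothetical conversion: use the node's own name, or fall back to the
  # scientific name of the first taxonomy when the node has no name.
  def to_generic_node
    label = name
    label ||= taxonomies.first[:scientific_name] unless taxonomies.empty?
    GenericNode.new(label)
  end
end

node = RichNode.new
node.taxonomies << { :scientific_name => 'Octopus vulgaris' }
puts node.to_generic_node.name
```

Keeping the rich node separate and converting on demand means the generic node class does not need to know about phyloXML-specific fields.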

Weeks 5,6 (June 20 - July 3):

  • Tasks:
    • Continue writing parser.
    • Write example usage code. Work on documentation.
    • Do extensive testing of phyloxml parser. Do testing on very very big datasets.
    • Write code to catch XML::Reader exceptions about ill-formed XML document.
  • Deliverable:
    • phyloXML parser.

What was done (week 5):

  • Asked for a code review. I got very good suggestions on what to improve and how. I did some of them this week; others will come later.
  • Documented requirement of libxml-ruby.
  • Documented more of the PhyloXML::Node elements.
  • Wrote code so that the phyloxml test suite exits if the libxml-ruby library is not present. (This took me quite a long time to figure out. Eventually I sent an email to the ruby-talk mailing list and got great help.)
  • Created a branch testbig. There I created the file test_phyloxml_big.rb and wrote the method parse_tree_dummy.
  • Did code profiling. Discovered that ~99% of the time is spent in Bio::Tree#parent. Changed the code to keep track myself of the current node in an array. Speed increase was tremendous. When parsing mollusca xml (1.5MB of data) it went down from 443 to 2 seconds. When parsing tree of life xml (45MB of data) it took 34 seconds instead of more than 3 hours.
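
The speed-up described in the last item comes from keeping the chain of open nodes in an array used as a stack, so the current parent is always the last element. A minimal sketch of that technique, assuming a simplified event stream of [:open, name] and [:close] pairs in place of real XML reader events (the Node struct and build_tree are illustrative names, not the project's code):

```ruby
# Minimal node: just a name and a list of children.
Node = Struct.new(:name, :children)

# Build a tree from a flat stream of [:open, name] / [:close] events,
# similar to what a streaming reader reports for nested <clade> tags.
def build_tree(events)
  root = nil
  stack = [] # stack.last is always the node currently being filled
  events.each do |type, name|
    case type
    when :open
      node = Node.new(name, [])
      if stack.empty?
        root = node
      else
        stack.last.children << node # constant time, no parent lookup in the tree
      end
      stack.push(node)
    when :close
      stack.pop # step back up to the parent
    end
  end
  root
end

events = [[:open, 'root'], [:open, 'A'], [:close],
          [:open, 'B'], [:open, 'B1'], [:close], [:close], [:close]]
tree = build_tree(events)
puts tree.children.map { |child| child.name }.join(', ')
```

Each event is handled with a push or a pop, so the cost per node no longer depends on the size of the tree built so far.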

What was done (week 6):

  • Split up code into 2 files: phyloxml_parser.rb and phyloxml_elements.rb
  • Incorporated the changes from the testbig branch into the dev branch to make the code run faster (keeping track of the current node in an array).
  • Did some more profiling.
    • Now, after previous improvements, 24.05% of the time is spent in Bio::PhyloXML::Parser#is_element? method and 12.56% of time in LibXML::XML::Reader#node_type (which is called by Bio::PhyloXML::Parser#is_element? ). I have an idea of how to increase the speed, but the gain won't be as noticeable as with previous fix, so I will leave this for later.
  • As per Eric's suggestion, wrote method each to iterate over all trees and wrote usage case in documentation.
   # Create new phyloxml parser
   phyloxml = Bio::PhyloXML::Parser.new('example.xml')
   # Print the names of all trees in the file
   phyloxml.each do |tree|
     puts tree.name
   end


  • Added unit tests. Now 36 tests, 120 assertions (from previous 28 tests, 91 assertions). I think that now most of the code is covered by unit tests.
  • Added validation step of the xml file when creating new instance of parser.
  • Added more documentation.
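
An each method of this kind can be layered on top of next_tree. The sketch below is not the parser's code: TreeSource is a hypothetical stand-in that hands out strings instead of trees, and mixing in Enumerable is an assumption made for the example:

```ruby
# Hypothetical stand-in for a parser that hands out one tree at a time.
class TreeSource
  include Enumerable # provides map, select, to_a, ... on top of each

  def initialize(trees)
    @trees = trees.dup
  end

  # Returns the next tree, or nil when none are left.
  def next_tree
    @trees.shift
  end

  # Yields every remaining tree in order.
  def each
    while (tree = next_tree)
      yield tree
    end
    self
  end
end

source = TreeSource.new(%w[tree1 tree2 tree3])
source.each { |tree| puts tree }
```

Because each consumes the source, a second pass over the same object yields nothing; a streaming parser over a file would behave similarly unless it is reopened.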


I think the goal of having a deliverable of PhyloXML parser has been achieved. It's not perfect, but I think it is in pretty good shape.

Here are some areas, where it will need more work.

  • Conversion from PhyloXML::Sequence to Bio::Sequence (method to_biosequence) might need more work. The Bio::Sequence class holds elements like classification and species, which do not belong to PhyloXML::Sequence but to PhyloXML::Taxonomy. Maybe the PhyloXML::Node class should have the to_biosequence method, since then it could also incorporate taxonomy elements.
  • Profiling revealed that ~10% of the time is spent in LibXML::XML::Reader#node_type, which is called by the is_element? method. The code can be refactored and made a little bit faster.
  • Implement method get_tree_by_name
  • Continue discussion of what to do with invalid xml files.
  • I am sure something will come up while I write PhyloXML::Writer and test it with my parser.
  • One always can improve on documentation.

Week 7 (July 4 - July 10) - mid-term evaluations:

  • Tasks:
    • User testing.
    • Start working on PhyloXML writer.
  • what was done:
    • Wrote general generate_xml method which generates XML::Node objects according to given parameters. This will simplify code and make it more flexible.
    • Wrote to_xml methods for these PhyloXML classes: Sequence, Accession, Confidence, Id, Domain Architecture, ProteinDomain, Taxonomy. Partially implemented to_xml method for Annotation, Node.
    • wrote unit tests: 6 new tests, 19 assertions in test_phyloxml_writer.rb. Now have 36 tests, 120 assertions in the test_phyloxml parser test class.
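
The idea of one general element-building helper that the to_xml methods share can be sketched as follows. This is an illustration only: it builds strings rather than XML::Node objects, and the name generate_element and its parameters are assumptions, not the signature of the project's generate_xml:

```ruby
# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;').gsub('"', '&quot;')
end

# Hypothetical helper: build one element from a tag name, an attribute hash,
# optional text, and a list of already-serialized child elements.
def generate_element(tag, attributes = {}, text = nil, children = [])
  attrs = attributes.map { |key, value| %( #{key}="#{xml_escape(value)}") }.join
  body = xml_escape(text) + children.join
  body.empty? ? "<#{tag}#{attrs}/>" : "<#{tag}#{attrs}>#{body}</#{tag}>"
end

name = generate_element('name', {}, 'Octopus')
puts generate_element('clade', { 'branch_length' => 0.5 }, nil, [name])
```

With a helper like this, each class's to_xml method only has to list its fields in schema order instead of repeating serialization logic.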

Week 8 (July 11 - July 17):

  • Tasks:
    • Develop phyloXML writer. Allow for writing multiple trees (phylogenies).
  • What was done:
    • Added or extended to_xml methods for various classes, like Annotation, Events and branch_length. Extended to_xml of the PhyloXML::Node class, which can now write its confidence elements.
    • The writer can now correctly write the NCBI mollusca taxonomy file and the first 4 trees of taxonomy_example.xml
    • Added also unit tests for testing phyloxml writer. Now have 8 tests, 21 assertions, (from previous 6 tests, 19 assertions).

Week 9 (July 18 - July 24):

  • I will be at OSCON, the Open Source Convention (http://en.oreilly.com/oscon2009), from July 20 to July 24 in San Jose, CA, so I probably won't be able to do much work.
  • What was done:

I did not have much time to work (the conference was quite tiring, running from 9am to 8pm for 5 days), but I managed to add other elements (clade_relation, property, sequence_relation) to the writer. I can now read in the phyloxml_examples.xml file, write it back, and use the written file for the parser unit tests, and all tests pass. I also added a boolean, write_branch_length_as_subelement, to the writer class, with a default value of true. To write branch_length as a clade attribute instead, usage would look like this:

     #read in tree
     tree = Bio::PhyloXML::Parser.new(TestPhyloXMLData.example_xml).next_tree
     # initialize writer
     writer = Bio::PhyloXML::Writer.new('./example_tree1.xml')
     # set boolean to a nondefault value
     writer.write_branch_length_as_subelement = false
     # write tree to file
     writer.write(tree)
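
The effect of such a flag on the output can be shown with a small stand-alone sketch. CladeWriter and Clade are hypothetical stand-ins, not the Bio::PhyloXML::Writer implementation, and the sketch does no XML escaping:

```ruby
# Minimal clade with a name and a branch length.
Clade = Struct.new(:name, :branch_length)

# Hypothetical mini-writer with the same kind of switch.
class CladeWriter
  attr_accessor :write_branch_length_as_subelement

  def initialize
    @write_branch_length_as_subelement = true # default value, as described above
  end

  # Serializes one clade, placing branch_length according to the flag.
  def clade_xml(clade)
    if write_branch_length_as_subelement
      "<clade><name>#{clade.name}</name>" \
        "<branch_length>#{clade.branch_length}</branch_length></clade>"
    else
      %(<clade branch_length="#{clade.branch_length}"><name>#{clade.name}</name></clade>)
    end
  end
end

writer = CladeWriter.new
clade = Clade.new('A', 0.3)
puts writer.clade_xml(clade)
writer.write_branch_length_as_subelement = false
puts writer.clade_xml(clade)
```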

Week 10 (July 25 - July 31):

  • July 25-29th I will attend Protein Society meeting in Boston.
  • Tasks
    • Bug fixing.
    • Code profiling.
    • Code refactoring.
  • What was done:
  • The writer is pretty much done. All objects have to_xml methods, which are called according to the phyloXML schema. I added support for the changes in schema 1.10; support for the additions in schema 1.10 is still needed. When I read in my test XML files, write them back, and use the written files for testing the parser, all tests pass.
  • Added support in Parser and Writer for the Other object which will hold the stuff of other namespaces, which follow after all phylogenies.
  • Added unit tests. Now have 9 tests, 26 assertions for testing writer and 37 tests, 125 assertions for parser.
  • Got the Pickaxe Ruby 1.9 book and have been reading it. Very useful! :)
  • Did a little bit of code refactoring and clean up.

Week 11 (August 1 - August 7):

  • Tasks:
    • Extensive testing. Read big files (real world data), then write back the same file. Check for consistency.
    • Code profiling and optimization
    • Start working on documentation
  • Deliverable:
    • phyloXML writer and parser.
    • example code
  • What was done:
    • Coding. Added changes so that now it is completely compatible with phyloxml schema 1.10
    • Testing. added more unit tests (now writer has 9 tests, 26 assertions; parser: 40 tests, 134 assertions)
    • Profiling. I discovered that the writer is really slow. The reason is the implementation of the Tree#children method, which runs the bfs_shortest_path algorithm. I had the idea of tracking node children inside the node class as an array, but Naohisa Goto pointed out that I would then also have to deal with node and edge addition, removal, etc. So the better solution seems to be to leave it as it is for now and first improve the Bio::Tree class. I am planning to do that after GSoC, since there is only one week left.

Week 12 (August 8 - August 14):

  • Tasks:
    • clean up documentation
    • write code examples showing how to use phyloXML. For example, read phyloXML, query the neighbors of specific taxa/species, BLAST the sequences to find homologs.
    • finish up wiki
  • Deliverable
    • Completed code
    • Extensive documentation
    • Example code
  • What was done:
    • Improved documentation in the code and added howto documentation and code samples in the doc/tutorial file and on wiki (linked from this page).
    • Asked a friend to test a sample code on a different machine, thus testing the downloading and installing process (especially of libxml library).
    • Added method PhyloXML::Node#extract_biosequence
    • Updated documentation on the design of classes http://wiki.github.com/latvianlinuxgirl/bioruby
    • Cleaned up the code.
    • Tagged code with end-of-GSOC tag.

August 24 - Final evaluations deadline

November 7

  • Finally I have updated the Bio::Tree and Bio::Tree::Node classes to improve the writer speed.
  • Added Bio::Tree::Node::parent and Bio::Tree::Node::children (array of nodes) in order to avoid calling Tree::parent(node) or Tree::children(node), because those methods run a breadth-first search on the underlying graph, which makes the PhyloXML writer and parser incredibly slow. In contrast, Bio::Tree::Node::parent and Bio::Tree::Node::children keep references to the respective nodes.
  • Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge, Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep track of Node::parent and Node::children correctly.
  • The PhyloXML writer now takes less than 1 second, instead of ~20 minutes, to write the 1.5MB ncbi_taxonomy_mollusca.xml file.
  • Writing the tree of life taxonomy file (~46MB) takes 10 seconds (on a 2.4GHz machine with 2.9GB RAM, running Ubuntu).
  • The code is at my github repo under tree_class branch.
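
The approach in the items above, where nodes hold direct references that the tree keeps up to date on every edge change, can be sketched as follows. LinkedNode and LinkedTree are hypothetical names for this illustration and cover only add_edge and remove_edge, not the full set of methods listed above:

```ruby
# Node that remembers its own parent and children.
class LinkedNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil
    @children = []
  end
end

# Hypothetical tree wrapper: every edge change updates the references,
# so parent/children lookups never need a graph search.
class LinkedTree
  # Attaches child under parent, detaching it from any previous parent first.
  def add_edge(parent, child)
    remove_edge(child.parent, child) if child.parent
    child.parent = parent
    parent.children << child
    self
  end

  # Detaches child from parent; does nothing if they are not connected.
  def remove_edge(parent, child)
    return self unless child.parent.equal?(parent)
    parent.children.delete(child)
    child.parent = nil
    self
  end
end

tree = LinkedTree.new
root = LinkedNode.new('root')
leaf = LinkedNode.new('leaf')
tree.add_edge(root, leaf)
puts leaf.parent.name
```

The trade-off is the one noted earlier in this log: every method that changes the graph has to maintain the references, otherwise lookups silently return stale data.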

Future

  • Any suggestions?