PhyloSoC:Multi-language bindings to the NEXUS Class Library

From Phyloinformatics

Details

Name: Multi-language bindings to the C++ NEXUS Class Library

Student: David Suárez Pascal (aka Pascalin)

Mentor: Mark Holder

Sponsor: Google Summer of Code 2007

Goals

  • Provide the phyloinformatics community with three alternative ways to use NCL in addition to the existing C++ code.
  • Employ SWIG to define the most convenient interfaces to NCL in the following languages:
    1. Perl
    2. Python
    3. Ruby
  • Document and test each version of the NCL bindings in depth.

Introduction

The NEXUS format is the common denominator of many applications focused on phyloinformatics, a field of bioinformatics. As such, NEXUS is used to represent phylogenetic (evolutionary relationship) data such as 'taxa', 'characters', 'data matrix' and 'trees'. The format was designed to be extensible through the addition of new types of data blocks, and this capability has been exploited in the design of various programs focused on evolutionary and systematics problems.
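
To make the block structure concrete, the following sketch splits a tiny NEXUS document into its named blocks using plain Python. The sample content and the helper name split_blocks are assumptions made for this illustration; it is not how NCL parses files, and it ignores comments, quoting and other details of the format.

```python
import re

# A made-up, minimal NEXUS document with two standard blocks.
SAMPLE = """#NEXUS
BEGIN TAXA;
  DIMENSIONS NTAX=3;
  TAXLABELS A B C;
END;
BEGIN TREES;
  TREE alpha = ((A,B),C);
END;
"""

# Naive pattern: "BEGIN <name>; ... END;" (keywords are case-insensitive).
BLOCK_RE = re.compile(r"BEGIN\s+(\w+)\s*;(.*?)END\s*;", re.IGNORECASE | re.DOTALL)


def split_blocks(text):
    """Return a list of (block_name, [commands]) pairs, in file order."""
    if not text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("not a NEXUS file")
    blocks = []
    for name, body in BLOCK_RE.findall(text):
        # Commands inside a block are separated by semicolons.
        commands = [c.strip() for c in body.split(";") if c.strip()]
        blocks.append((name.upper(), commands))
    return blocks


blocks = split_blocks(SAMPLE)
```

Each new block type simply adds another (name, commands) pair, which is the extensibility the format was designed for.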

The NCL (NEXUS Class Library) is a portable C++ library that provides a set of useful methods for reading and writing NEXUS-compliant files. The availability of such a portable implementation of these functions can be considered of central importance for the evolution of both the NEXUS format and the programs employing it, since it provides a standard reference and common code shared between multiple software projects.

In this framework, providing multi-language bindings to NCL will extend its benefits and impact to applications developed in languages other than C++, such as Python, Perl and Ruby.

An additional benefit is the productivity gained from the rapid prototyping that scripting languages such as these three allow.

The approach selected is to employ the Simplified Wrapper and Interface Generator (SWIG), a standard and mature way of developing bindings from C/C++ code to a multiplicity of programming languages.

Get the code

Roadmap

Week | Task | Report date
I | Design of Perl, Python and Ruby interfaces to NCL, including unit tests. | 04-jun-2007
II | Implementation of the first SWIG interface to NCL, including a makefile to automate it. | 11-jun-2007
III | Documenting and discussing design decisions of the interface. | 18-jun-2007
IV | Testing general functionality of the bindings, with a focus on design problems. | 25-jun-2007
V | Fixing general design problems. Specializing and documenting Python bindings. | 02-jul-2007
VI | Python bindings testing and fixing. | 09-jul-2007
VII | Specializing and documenting Ruby bindings. | 16-jul-2007
VIII | Ruby bindings testing and fixing. | 23-jul-2007
IX | Specializing and documenting Perl bindings. | 30-jul-2007
X | Perl bindings testing and fixing. | 06-aug-2007
XI | Portability. Testing on different platforms. | 13-aug-2007
XII | Fine-tuning distribution and portability. | 20-aug-2007

Documentation

Documenting software is surely an art, in the proprietary world as much as in the open source one; but in open source, where developers are distributed around the globe, it becomes an imperative necessity.

Design

Exploring languages

NCL provides a library with a good balance between flexibility and ease of use, employing a callback model that reminds me of SAX, the Simple API for XML. Like SAX, NCL is object oriented and geared toward text processing. Indeed, if you have used SAX, using NCL will seem familiar to you.
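
The callback model can be sketched in a few lines of Python. The classes below are pure-Python stand-ins with hypothetical names (Reader, Block, add, execute); they only mimic the general shape of the pattern, in which a reader dispatches each block it finds to a handler registered for that block name, and they are not the NCL API.

```python
class Block:
    """Stand-in for a block handler: knows its name and reads its own commands."""

    def __init__(self, name):
        self.name = name
        self.commands = []

    def read(self, commands):
        # Called back by the reader when a block with a matching name is found.
        self.commands = list(commands)


class Reader:
    """Stand-in for the dispatcher: owns the registered block handlers."""

    def __init__(self):
        self.handlers = {}
        self.skipped = []

    def add(self, block):
        self.handlers[block.name.upper()] = block

    def execute(self, parsed_blocks):
        # parsed_blocks: iterable of (name, commands) pairs, already tokenized.
        for name, commands in parsed_blocks:
            handler = self.handlers.get(name.upper())
            if handler is None:
                self.skipped.append(name)  # no handler registered: skip the block
            else:
                handler.read(commands)


reader = Reader()
taxa = Block("TAXA")
reader.add(taxa)
reader.execute([("TAXA", ["TAXLABELS A B C"]), ("PAUP", ["SET AUTOCLOSE"])])
```

As with SAX handlers, the application only writes the handler objects; the reader decides when to call them.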

The languages to which we intend to extend NCL are also object oriented by nature, although not to the same degree. Python and Ruby were conceived with object orientation in mind, perhaps Ruby more than Python. Perl was blessed with object orientation after its inception; however, given Perl's eclectic nature, object orientation seems to fit well with Perl customs.

As a first step of the development I translated one of the test programs included with NCL: nclsimpletest.cpp. This will serve as a proof of concept for each language's bindings to NCL. The Python and Ruby translations have been relatively easy (especially Ruby, because I already knew Python). The Perl translation remains a challenge, because the language and its approach to OOP seem a little weird to me.

Exploring SWIG

SWIG is a mature tool that automatically generates the wrappers needed to interface existing code with a scripting language. Generating wrappers with SWIG requires a specification of functions, classes and variables which, although it could in theory be inferred by SWIG from the C/C++ header files, in practice requires some degree of customization.

This last requirement is not so much a fault of SWIG as a by-product of source code complexity, especially in C++. The cleanest way to specify the required data is therefore by means of special interface description files that mix ANSI C/C++ and SWIG directives.

In our case, a main interface description file, ncl.i, has been created; its function is to include the headers containing the definitions we want SWIG to wrap.

SWIG and C++

C++ was a difficult subject for the first versions of SWIG, but recent versions have considerably good support for C++ features such as overloading, inheritance and namespaces. Indeed, SWIG's use of proxy classes produces a result that feels almost natural in the target languages.

Besides proxy classes, SWIG provides a set of utilities to modify and customize the behavior of wrapped classes.

Exploring NCL

NCL consists of a collection of classes that this project intends to make available from the Perl, Python and Ruby scripting languages. After an analysis of the NCL classes, the following classification seems useful for planning the task:

NCL classes
Core classes
  • NxsReader
  • NxsToken
  • NxsBlock
Derived classes
  • NxsAssumptionsBlock
  • NxsCharactersBlock
  • NxsDataBlock
  • NxsDistancesBlock
  • NxsEmptyBlock
  • NxsTaxaBlock
  • NxsTreesBlock
  • NxsUnalignedBlock
Support classes
  • Indent
  • NxsString
  • NxsSetReader
  • NxsException
  • NStrCaseInsensitiveEqual
  • NStrCaseSensitiveEqual
  • NxsDiscreteDatum
  • NxsDiscreteMatrix
  • NxsDistanceDatum
  • NxsStringEqual
Nested classes
  • NxsString::NxsX_NotANumber
  • NxsTaxaBlock::NxsX_NoSuchTaxon
  • NxsUnalignedBlock::NxsX_NoDataForTaxon
  • NxsUnalignedBlock::NxsX_NoSuchCharacter

Of all these classes, wrappers for the core and derived classes are strictly necessary if useful bindings are to be produced; the support classes, however, seem mostly superfluous from the scripting language side. Nested classes are used as exceptions, and even if these classes are not essential in the target languages, an adequate exception needs to be raised in each case.
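
As a sketch of what "an adequate exception" could look like on the Python side, the following pure-Python example maps a nested C++ exception such as NxsTaxaBlock::NxsX_NoSuchTaxon to a Python exception class. The names NclError, NoSuchTaxon, TaxaBlock and find_taxon are hypothetical and chosen only for this illustration; the actual bindings may handle this differently.

```python
class NclError(Exception):
    """Hypothetical common base class for errors raised by the bindings."""


class NoSuchTaxon(NclError, LookupError):
    """Hypothetical Python counterpart of NxsTaxaBlock::NxsX_NoSuchTaxon."""


class TaxaBlock:
    """Pure-Python stand-in for a wrapped taxa block."""

    def __init__(self, labels):
        self.labels = list(labels)

    def find_taxon(self, label):
        # Translate the low-level failure into a library-specific exception.
        try:
            return self.labels.index(label)
        except ValueError:
            raise NoSuchTaxon(label)


taxa = TaxaBlock(["P. fimbriata", "P. robusta"])
index = taxa.find_taxon("P. robusta")

try:
    taxa.find_taxon("P. unknown")
    caught = None
except NclError as error:
    caught = error
```

Deriving from a built-in exception such as LookupError lets scripts catch the error either specifically or with the idioms they already use.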

NCL is written in C++ and makes wide use of the STL, templates and streams. While these three features are well supported by SWIG, the problem is to provide a natural translation between native classes in the target language and NCL's own classes. For example, one of the first problems found when interfacing NCL is with NxsToken::NxsToken(istream&), the constructor of the NxsToken class. It expects an object of type istream, but SWIG does not provide a natural way to translate Python file objects to istream; indeed, as Python is implemented in C rather than C++, its file class uses the C FILE struct and not istreams.

Of course, SWIG provides solutions to problems like this. One of these solutions is to include a helper function such as the following:

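%inline %{
  // Helper: build an NxsToken from a file name.
  // The stream is allocated on the heap (and never freed in this sketch)
  // because NxsToken keeps a reference to it; a local ifstream would be
  // destroyed when the function returns.
  NxsToken * new_NxsTokenFromFilename(const char* fname)
    {
      ifstream * in = new ifstream(fname, ios::binary);
      return new NxsToken(*in);
    }
%}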

which could be used from Python:

>>> import ncl
>>> t = ncl.new_NxsTokenFromFilename("characters.nex")

Alternatively, we can use typemaps (one of the most powerful features of SWIG):

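%typemap(in) istream& {
   // Accept a Python (2.x) string holding a file name and open it as a stream.
   if (!PyString_Check($input)) {
      PyErr_SetString(PyExc_TypeError, "Not a String");
      return NULL;
   }
   // The stream is not freed here: NxsToken keeps a reference to it.
   $1 = new ifstream(PyString_AsString($input), ios::binary);
}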

and from Python:

>>> import ncl
>>> t = ncl.NxsToken("characters.nex")

Besides elegance, there are other differences between the two approaches. The first version, which employs a helper function, returns a pointer to an NCL NxsToken instance which, even when wrapped by SWIG, is an opaque object in Python. The second approach returns an ncl.NxsToken instance, that is, an instance of the Python wrapper class for NCL's NxsToken.

Given the object-oriented design of NCL and its resemblance to SAX, which I remarked on previously, the second approach seems better, even though it requires slightly modifying the API.

Deployment

After exploring both SWIG and NCL, the best approach seems to be a top-down one, namely, to wrap just the needed classes of NCL and nothing more. In a certain way, this approach is similar to a test-driven approach, because the best way to see which classes users of the NCL bindings will really need is to follow closely the requirements of a good set of examples.

NCL includes such a set of examples: nclsimpletest, ncltest and basiccmdline. The first two programs use NCL to read and analyze a NEXUS input file, writing the results to an output file. basiccmdline is an interactive program for exploring NEXUS files.

NCL examples

Quoting from the NCL README:

nclsimpletest

"nclsimplest is identical to the program described in the main documentation. [...] This program recognizes only two blocks, TAXA and TREES, and is designed to show how little programming is really necessary to use the NCL."

ncltest

"ncltest is more complete, although still not very user-friendly. This program should be able to read most existing NEXUS formatted data files, assuming those data files follow the original description of the format. Note that some newer programs (e.g. MrBayes and Mesquite) have added new features to the original NEXUS blocks, and thus files created for these programs may not be read correctly by ncltest and other programs based on the NCL library."

basiccmdline

"basiccmdline is designed to illustrate how to write a more user-friendly, command-based application using the NCL. When it starts, it presents the user with a prompt (BASICCMDLINE>) and allows the user to enter several commands, including 'help', 'quit', 'log', 'execute' and 'show'. Try using the 'execute' command to read a NEXUS file, then the 'show' command to output a summary of the information in the file."

So the first task in taking this approach has been the translation of the examples to each target language. However, progress has not been uniform across languages, mainly because the Perl translation has been more difficult than the Ruby and Python ones: Perl's approach to object orientation is rather different. Because of these differences, the Perl translation and the implementation of the Perl bindings have been delayed. The Python and Ruby translations of nclsimpletest were really easy.

Python

Python is well supported by SWIG, and as a consequence most NCL features were properly translated to Python. The following is a description of the main differences between NCL and its Python bindings.

Including NCL

The Python bindings of NCL are contained in a module called ncl, so if you want to use NCL in a Python program you can simply write:

>>> import ncl

Using from ncl import * is not recommended because it could clash with other Python modules created with SWIG.
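
The kind of clash meant here can be reproduced with two stand-in modules built in memory. In this sketch, fake_ncl and fake_other are hypothetical modules that happen to export the same name (version); neither is a real SWIG module, and the names were invented for the demonstration.

```python
import sys
import types


def make_module(name, **attrs):
    """Create an in-memory module and register it so that it can be imported."""
    module = types.ModuleType(name)
    module.__dict__.update(attrs)
    sys.modules[name] = module
    return module


# Two stand-in modules exporting the same public name.
make_module("fake_ncl", version="ncl bindings")
make_module("fake_other", version="another SWIG module")

# Star imports into one namespace: the second import silently wins.
namespace = {}
exec("from fake_ncl import *\nfrom fake_other import *", namespace)
shadowed = namespace["version"]

# A qualified import keeps each module's names apart.
import fake_ncl

qualified = fake_ncl.version
```

With the plain import, every name stays behind its module prefix, so the order of imports no longer matters.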

Using NCL

One goal of the Python bindings is to provide a natural interface to NCL methods, and some NCL definitions have been modified to meet this goal. Most NCL methods, however, already translate naturally between NCL and Python classes. In particular, NCL support classes are translated to and from Python strings and tuples.

The following is an example of a Python session exploring NCL:

>>> import ncl
>>> nexus = ncl.NxsReader()
>>> token = ncl.NxsToken(open('test.nex').read())
>>> taxa = ncl.NxsTaxaBlock()
>>> trees = ncl.NxsTreesBlock(taxa)
>>> for block in taxa, trees:
...     nexus.Add(block)
... 
>>> nexus.Execute(token)
>>> trees.IsEmpty()
False
>>> trees.GetTranslatedTreeDescription(0)
'(P. fimbriata,P. robusta,((((P. americana,P. myriophylla),P. articulata),P. parksii),((P. polygama,P. macrophylla),(P. gracilis,(P. basiramia,P. ciliata)))))'
>>> trees.GetTreeDescription(0)
'(1,2,((((3,4),5),6),((7,8),(9,(10,11)))))'
>>> dir(taxa)
['AddTaxonLabel', 'ChangeTaxonLabel', 'CharLabelToNumber', 'Disable', 'Enable', 'FindTaxon', 'GetID', 'GetImpliedBlocks', 'GetImpliedBlocksConst',
 'GetInstanceIdentifier', 'GetInstanceName', 'GetMaxTaxonLabelLength', 'GetNumTaxonLabels', 'GetTaxonLabel', 'IsAlreadyDefined', 'IsEmpty',
 'IsEnabled', 'IsUserSupplied', 'NeedsQuotes', 'Read', 'Report', 'Reset', 'SetNexus', 'SkippingCommand', 'TaxonLabelToNumber', 'WriteAsNexus',
 'WriteTaxLabelsCommand', 'WriteTitleCommand', '__class__', '__del__', '__delattr__', '__dict__', '__disown__', '__doc__', '__getattr__',
 '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__',
 '__swig_destroy__', '__swig_getmethods__', '__swig_setmethods__', '__weakref__', '_s', 'errormsg', 'this']
>>> taxa.GetNumTaxonLabels()
11
>>> for i in range(11):
...     print taxa.GetTaxonLabel(i)
... 
P. fimbriata
P. robusta
P. americana
P. myriophylla
P. articulata
P. parksii
P. gracilis
P. macrophylla
P. polygama
P. basiramia
P. ciliata
>>> print taxa.Report()

TAXA block contains 11 taxa
        1       P. fimbriata
        2       P. robusta
        3       P. americana
        4       P. myriophylla
        5       P. articulata
        6       P. parksii
        7       P. gracilis
        8       P. macrophylla
        9       P. polygama
        10      P. basiramia
        11      P. ciliata

>>> print trees.Report()

TREES block contains 2 trees
        1       alpha   (rooted,default tree)
        2       beta    (rooted)

>>> 
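
The same steps can be collected into a small reusable function. This is only a sketch: it assumes the compiled ncl module is importable, uses only the calls shown in the session above, and read_taxon_labels is a hypothetical helper name, not part of the bindings.

```python
def read_taxon_labels(path):
    """Return the taxon labels found in the TAXA block of a NEXUS file."""
    # Imported lazily because it requires the compiled SWIG bindings.
    import ncl

    nexus = ncl.NxsReader()
    taxa = ncl.NxsTaxaBlock()
    trees = ncl.NxsTreesBlock(taxa)

    # Register the blocks the reader should recognize.
    for block in (taxa, trees):
        nexus.Add(block)

    # The token is built from the file contents, as in the session above.
    with open(path) as handle:
        token = ncl.NxsToken(handle.read())
    nexus.Execute(token)

    return [taxa.GetTaxonLabel(i) for i in range(taxa.GetNumTaxonLabels())]
```

Assuming the bindings are installed, calling read_taxon_labels('test.nex') on the file used in the session would be expected to return the same eleven labels, in the same order, as a Python list.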

Ruby

The Ruby bindings of NCL provide a natural translation of NCL classes and methods, following Ruby language requirements and style. The few changes to the NCL API are the renaming of some constants according to Ruby's capitalization convention, and the declaration of methods returning a boolean value as predicate methods, whose names end in a question mark, such as Ncl::NxsTaxaBlock.IsEmpty? (IsEmpty in C++).

Including NCL

To use the NCL bindings in a Ruby program you just need the following line:

irb(main):001:0> require 'ncl'

Once loaded, the NCL classes are located inside the Ncl namespace.

Using NCL

An example is worth more than a thousand words...

irb(main):001:0> require 'ncl'
=> true
irb(main):002:0> nexus = Ncl::NxsReader.new
=> #<Ncl::NxsReader:0x2b57789cbf08>
irb(main):003:0> token = Ncl::NxsToken.new(File.open('test.nex').read)
=> #<Ncl::NxsToken:0x2b57789c3a88>
irb(main):004:0> taxa = Ncl::NxsTaxaBlock.new
=> #<Ncl::NxsTaxaBlock:0x2b57789b9790>
irb(main):005:0> trees = Ncl::NxsTreesBlock.new(taxa)
=> #<Ncl::NxsTreesBlock:0x2b57789b3bb0>
irb(main):006:0> nexus.Add(taxa)
=> nil
irb(main):007:0> nexus.Add(trees)
=> nil
irb(main):008:0> nexus.Execute(token)
=> nil
irb(main):009:0> trees.IsEmpty?
=> false
irb(main):010:0> trees.GetTranslatedTreeDescription(0)
=> "(P. fimbriata,P. robusta,((((P. americana,P. myriophylla),P. articulata),P. parksii),((P. polygama,P. macrophylla),(P. gracilis,(P. basiramia,P. ciliata)))))"
irb(main):011:0> trees.GetTreeDescription(0)
=> "(1,2,((((3,4),5),6),((7,8),(9,(10,11)))))"
irb(main):012:0> taxa.GetNumTaxonLabels()
=> 11
irb(main):013:0> (0..10).each {|i| puts taxa.GetTaxonLabel(i)}
P. fimbriata
P. robusta
P. americana
P. myriophylla
P. articulata
P. parksii
P. gracilis
P. macrophylla
P. polygama
P. basiramia
P. ciliata
=> 0..10
irb(main):014:0> puts taxa.Report

TAXA block contains 11 taxa
        1       P. fimbriata
        2       P. robusta
        3       P. americana
        4       P. myriophylla
        5       P. articulata
        6       P. parksii
        7       P. gracilis
        8       P. macrophylla
        9       P. polygama
        10      P. basiramia
        11      P. ciliata
=> nil
irb(main):015:0> puts trees.Report

TREES block contains 2 trees
        1       alpha   (rooted,default tree)
        2       beta    (rooted)
=> nil
irb(main):016:0>

Log

Week 1

  • Google SOC 2007 coding phase started --pascalin 10:57, 28 May 2007 (EDT)

Week 2

  • Python and Ruby versions of nclsimpletest --pascalin 23:57, 4 June 2007 (EDT)
  • Subversion repository for NCL Bindings created --pascalin 23:57, 4 June 2007 (EDT)
  • Beginning of SWIG coding --pascalin 23:57, 4 June 2007 (EDT)

Week 3

  • Set of SWIG interfaces created, one for each NCL header file --pascalin 00:54, 11 June 2007 (EDT)
  • Starting review of SWIG interface design --pascalin 12:59, 11 June 2007 (EDT)

Week 4

  • Rethinking the SWIG interface deployment: wrapping every class and function in NCL seems too difficult and unnecessary. Choosing a top-down approach: instead of trying to implement wrappers for every object in NCL, wrapping the main classes and methods seems better suited to the SWIG mindset --pascalin 13:06, 19 June 2007 (EDT)
  • Wrapped NxsReader, NxsToken, NxsBlock, NxsTaxaBlock, NxsTreesBlock. Corrections made to the Python and Ruby nclsimpletests. The NxsToken constructor wrapper raises an error --pascalin 13:06, 19 June 2007 (EDT)

Week 5

  • Redefined the constructor of NxsToken to receive a string instead of istream&, because it allows a more natural approach from the script side --pascalin 03:25, 3 July 2007 (EDT)
  • Report method of NxsBlock, NxsTaxaBlock, NxsTreesBlock, NxsAssumptionsBlock, NxsDistancesBlock, NxsCharactersBlock and NxsDataBlock redefined to return a string --pascalin 03:25, 3 July 2007 (EDT)
  • nclsimpletest.py, ncltest.py and the Python makefile modified according to the changes in the API --pascalin 03:25, 3 July 2007 (EDT)

Week 6

  • Reviewing SWIG directors documents --pascalin 11:22, 3 July 2007 (EDT)
  • Implementing directors for NxsReader, NxsToken and NxsBlock --pascalin 19:48, 5 July 2007 (EDT)

Week 7

  • Filling the SOC mid-term survey --pascalin 16:14, 9 July 2007 (EDT)
  • Reviewing SWIG Ruby docs --pascalin 13:14, 12 July 2007 (EDT)
  • Committed the most recent changes to Subversion --pascalin 22:42, 12 July 2007 (EDT)

Week 8

  • Small fixes to both Ruby and Python versions of ncltest. --pascalin 14:48, 24 July 2007 (EDT)
  • Added NxsString& typemap to cope with changes in the NCL trunk version. --pascalin 14:48, 24 July 2007 (EDT)

Week 9

Week 10

  • Implementing build management with CMake --pascalin 16:01, 9 August 2007 (EDT)

Week 11

Week 12

  • Including documentation files (README, INSTALL and TODO) in the package --pascalin 01:21, 20 August 2007 (EDT)