NeXML to ISAtab Mapping v0.1

From Phyloinformatics
Revision as of 11:05, 20 June 2012 by Hilmar (talk) (Diagram Explanation & Issues)

NeXML to ISA v0.1.jpg

The diagram above shows the logical mapping of NeXML elements to ISAtab files. We'll review the basics of each format before explaining the diagram and identifying specific issues for comment.

NeXML Basics

NeXML is a richly annotatable XML specification for phylogenetic data. The examples below are taken from a file named treebase-record.xml, available here. You may see the tb: prefix in these examples. This prefix is not part of the NeXML standard itself but rather TreeBASE's extension of it, and it illustrates how that extensibility can increase the detail within a file.


Each NeXML file has a <<nexml>> root-level element. It is little more than a container for the rest of the file, though its <<meta>> children contain important file-level metadata.

Nexml nexml.jpg
Above: screenshot of a NeXML <<nexml>> element


<<otus>> elements are groupings of Operational Taxonomic Units (i.e. taxa of one form or another, such as a species or a genus). The <<otu>> children of these elements, unsurprisingly, contain the specific information on each unit. It's important to understand that OTUs are sometimes named taxa and at other times are simply statistically inferred. They can be linked out to controlled vocabularies, but sometimes are not.

NeXML OTUs.jpg
Above: screenshot of a NeXML <<otus>> element


<<characters>> elements contain character matrices of some sort, which are used to build the phylogenetic trees. The matrices can contain morphological characters or one of several types of molecular data: DNA sequences, RNA sequences, protein sequences, etc. The <<characters>> element refers to an <<otus>> element, while each <<row>> of the <<matrix>> refers to an <<otu>> element.

Nexml characters.jpg
Above: screenshot of a NeXML <<characters>> element


<<trees>> elements are groupings of phylogenetic trees. The trees themselves are stored in <<tree>> elements.

Nexml trees.jpg
Above: screenshot of a NeXML <<trees>> element
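The cross-references described above (a <<characters>> element pointing at an <<otus>> element, and each <<row>> pointing at an <<otu>>) can be sketched in Python with the standard library. This is only an illustration: the XML fragment is a hypothetical, heavily trimmed sample rather than an excerpt from treebase-record.xml, it omits attributes a valid NeXML file would need, and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

# NeXML namespace; the "nex" prefix is only a local alias for lookups.
NS = {"nex": "http://www.nexml.org/2009"}

# Hypothetical, trimmed fragment (assumption: not valid against the schema).
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1">
    <otu id="otu1" label="Homo sapiens"/>
    <otu id="otu2" label="Pan troglodytes"/>
  </otus>
  <characters id="chars1" otus="otus1">
    <matrix>
      <row id="row1" otu="otu1"><seq>ACGT</seq></row>
      <row id="row2" otu="otu2"><seq>ACGA</seq></row>
    </matrix>
  </characters>
  <trees id="trees1" otus="otus1">
    <tree id="tree1"/>
    <tree id="tree2"/>
  </trees>
</nexml>"""


def otu_labels(root):
    """Map each <otu> id to its label, falling back to the id itself."""
    return {
        otu.get("id"): otu.get("label", otu.get("id"))
        for otu in root.iterfind("nex:otus/nex:otu", NS)
    }


def matrix_rows(root):
    """Yield (characters id, OTU label, sequence) for every matrix row."""
    labels = otu_labels(root)
    for chars in root.iterfind("nex:characters", NS):
        for row in chars.iterfind("nex:matrix/nex:row", NS):
            seq = row.findtext("nex:seq", default="", namespaces=NS)
            yield chars.get("id"), labels.get(row.get("otu")), seq


def tree_counts(root):
    """Count the <tree> children of each <trees> grouping."""
    return {
        trees.get("id"): len(trees.findall("nex:tree", NS))
        for trees in root.iterfind("nex:trees", NS)
    }


root = ET.fromstring(SAMPLE)
rows = list(matrix_rows(root))
counts = tree_counts(root)
```

Resolving each row's otu attribute through the id-to-label dictionary is what lets OTU information be pulled into whichever output needs it, without duplicating it in the matrix itself.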


NeXML's rich annotation capabilities come via <<meta>> elements, which can occur as children of any other element. They are not shown on the diagram above because of their potential ubiquity. In practice, <<meta>> elements under <<nexml>> contain bibliographic-style metadata such as author, publication, title, etc., while they tend to carry more specific information about other elements when they are children of those elements.
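As a rough illustration of how <<meta>> annotations might be collected, the sketch below gathers the property/content pairs found directly under a given element. The sample XML, its values, and the helper name literal_meta are assumptions made for this example; real files may also carry resource-style annotations (links rather than literal values), which this sketch ignores.

```python
import xml.etree.ElementTree as ET

NS = {"nex": "http://www.nexml.org/2009"}

# Hypothetical sample; the titles, names, and identifiers are invented.
SAMPLE_META = """<nexml xmlns="http://www.nexml.org/2009"
       xmlns:dc="http://purl.org/dc/elements/1.1/" version="0.9">
  <meta property="dc:title" content="An example study"/>
  <meta property="dc:creator" content="A. Author"/>
  <otus id="otus1">
    <otu id="otu1" label="Homo sapiens">
      <meta property="dc:source" content="hypothetical-taxon-record"/>
    </otu>
  </otus>
</nexml>"""


def literal_meta(element):
    """Return {property: content} for the direct <meta> children of element.

    Only annotations with a 'property' attribute are kept; if a property
    repeats, the last value wins (a simplification for this sketch).
    """
    return {
        m.get("property"): m.get("content")
        for m in element.findall("nex:meta", NS)
        if m.get("property") is not None
    }


root = ET.fromstring(SAMPLE_META)
file_meta = literal_meta(root)  # file-level, bibliographic-style metadata
otu = root.find("nex:otus/nex:otu", NS)
otu_meta = literal_meta(otu)    # metadata specific to one OTU
```

Because findall with a plain tag path only matches direct children, the file-level dictionary does not pick up annotations that belong to nested elements such as the <<otu>>.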

ISAtab Basics

ISAtab is a tab-delimited file format underpinning the ISAtools software suite. It breaks microbiology studies into three logical units:

Investigations, representing the overall inquiry;
Studies, representing a biological sample, usually a physical specimen; and
Assays, representing analysis of the sample via a variety of methods.

I'll refer to these as I, S, and A files below, respectively.

We are adapting this paradigm, built around wet lab experiments, to 'in silico' experiments such as the creation of phylogenetic trees from character state matrices. We have the support of the ISAtools team in doing this, as a successful mapping will open their software up to much of computational biology more broadly.

Diagram Explanation & Issues

The goal of this mapping is to accurately represent as much of the information in a NeXML file as possible in ISA format, allowing the full power of the ISAtools suite to then be used for data curation, sharing, and reuse.

In general, information directly within the <<nexml>> element and its <<meta>> children will be mapped to the appropriate field of the I file. This is indicated in the diagram via the red boxes.
OTU information will be referenced and pulled into any file as needed during transformation.
Information in the <<characters>> element and its children will be mapped to the study (the S file). This makes character state matrices analogous to the biological specimens in, for example, high-throughput sequencing.
As you can imagine, there may be some complexity with this approach. First, when the matrices are generated from molecular sequences, these sequences will have been obtained from specimens. This is representable within NeXML, but often the source specimens are not referenced in NeXML files (unlike, for example, in the Genbank records corresponding to those sequences). So, though the character state matrix will form an ISA assay input, it will likely not be linked back to the ultimate biological source(s) used for obtaining the character state values. Does this pose a problem that should be addressed?
Additionally, with some matrices, particularly morphological matrices, the researchers themselves may have a large hand in producing the actual data via coding, etc. It is conceivable for two different research teams to generate slightly different morphological matrices, even if the species and characters remain the same. Does this pose a problem that should be addressed?
Information in the <<trees>> element will be logically mapped as an assay of the character matrix, with the software, algorithm type, and other relevant details preserved as Parameters of the assay.
Potential Problem: Some files have multiple <<trees>> elements, each with one or more <<tree>> elements. Others have a single <<trees>> element containing multiple <<tree>> children. I am currently uncertain whether this reflects differences in study structure or inconsistent encoding of these studies in NeXML. Any concerns here?
Note: It will likely be impossible to get the full richness of the NeXML into a usable tabular format. ISA provides a mechanism for referencing raw data files, and this will most likely be how we'll have to represent trees.
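
To make the proposed mapping more concrete, here is a minimal Python sketch that turns a summary of one NeXML file into tab-delimited text for the I, S, and A files. Everything specific in it is an assumption for illustration: the record dictionary, the file names, the helper names, and the choice of columns. It is not a validated ISAtab layout. The Source Name column is left empty to reflect the open question above about unreferenced source specimens, and each tree is represented only by a reference to a data file, as suggested in the note above.

```python
import csv
import io

# Hypothetical summary of one NeXML file. In practice these values would be
# read from the nexml, characters, and trees elements and their meta children.
record = {
    "title": "An example study",
    "matrices": [{"id": "chars1", "type": "DNA"}],
    "trees": [
        {"id": "tree1", "matrix": "chars1", "software": "unknown",
         "file": "tree1.xml"},
        {"id": "tree2", "matrix": "chars1", "software": "unknown",
         "file": "tree2.xml"},
    ],
}


def to_tsv(rows):
    """Serialize rows (lists of strings) as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def map_record(rec):
    """Return (investigation, study, assay) text for one summarized file."""
    # I file: file-level metadata, one "field name, value" pair per row.
    investigation = to_tsv([["Investigation Title", rec["title"]]])

    # S file: one row per character matrix, treated like a sample.
    study_rows = [["Source Name", "Sample Name", "Characteristics[matrix type]"]]
    for m in rec["matrices"]:
        study_rows.append(["", m["id"], m["type"]])  # source unknown

    # A file: one row per tree, with the matrix as the assay input.
    assay_rows = [["Sample Name", "Assay Name",
                   "Parameter Value[software]", "Raw Data File"]]
    for t in rec["trees"]:
        assay_rows.append([t["matrix"], t["id"], t["software"], t["file"]])

    return investigation, to_tsv(study_rows), to_tsv(assay_rows)


investigation, study, assay = map_record(record)
```

Under this sketch it makes no difference whether the trees came from one <<trees>> element or several, since each tree simply becomes one assay row; whether that flattening loses meaningful study structure is exactly the question raised above.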

Request for Comments

Comments on the above issues or any other area of interest are welcome. Please email both the NeXML group on Sourceforge and the MIAPA Google Group with comments. Those are also good places to email questions, though you can also email me if you prefer.