NeXML to ISAtab Mapping v0.1

From Phyloinformatics
Revision as of 11:33, 13 June 2012 by Eah13 (talk)

NeXML to ISA v0.1.jpg

The diagram above shows the logical mapping of NeXML elements to ISAtab files. We'll review the basics of each format before explaining the diagram and identifying specific issues for comment.

NeXML Basics

NeXML is an XML specification for phylogenetic data. It is designed to be richly annotatable.

ISAtab Basics

ISAtab is a tab-delimited file format underpinning the ISAtools software suite. It breaks microbiology studies into three logical units:

Investigations, representing the overall inquiry,
Studies, representing a biological sample, usually a physical specimen, and
Assays, representing analysis of the sample via a variety of methods.

I'll refer to these as I, S, and A files below, respectively.
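As a rough illustration of the three units: by ISA-Tab convention, investigation, study, and assay files are named with an `i_`, `s_`, and `a_` prefix respectively. The helper below is only a sketch. The function name and the example filenames are hypothetical and are not part of the ISAtools suite.

```python
# Hypothetical helper: classify ISA-Tab files by their conventional prefix.
ISA_PREFIXES = {"i_": "investigation", "s_": "study", "a_": "assay"}


def isa_file_kind(filename):
    """Return 'investigation', 'study', or 'assay', or None if unrecognized."""
    if not filename.endswith(".txt"):
        return None
    for prefix, kind in ISA_PREFIXES.items():
        if filename.startswith(prefix):
            return kind
    return None


if __name__ == "__main__":
    # Example filenames are made up for illustration.
    for name in ["i_investigation.txt", "s_matrix.txt", "a_trees.txt"]:
        print(name, "->", isa_file_kind(name))
```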

We are adapting this paradigm, built around wet lab experiments, to 'in silico' experiments such as the creation of phylogenetic trees from character state matrices. We have the support of the ISAtools team in doing this, since a successful mapping would open their software up to computational biology more broadly.

The goal of this mapping is to accurately represent as much of the information in a NeXML file as possible in ISA format, allowing the full power of the ISAtools suite to then be used for data curation, sharing, and reuse.

In general, information directly within the <nexml> element and its <meta> children will be mapped to the appropriate field of the I file. This is indicated in the diagram by the red boxes.
OTU information will be referenced and pulled into any file as needed during transformation.
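The two points above could be sketched roughly as follows, using Python's standard `xml.etree.ElementTree`. This is a minimal sketch, not the actual transformation: the sample document is invented, and the pairing of `dc:title` and `dc:description` with the I-file fields "Investigation Title" and "Investigation Description" is my assumption about what "the appropriate field" might mean.

```python
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"  # NeXML namespace in ElementTree notation

# Assumed (illustrative) mapping from NeXML meta properties to I-file fields.
META_TO_I_FIELD = {
    "dc:title": "Investigation Title",
    "dc:description": "Investigation Description",
}

# Invented sample document, trimmed to the parts used here.
SAMPLE = """<nex:nexml xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dc="http://purl.org/dc/elements/1.1/" version="0.9">
  <meta id="m1" xsi:type="nex:LiteralMeta" property="dc:title" content="Example study"/>
  <otus id="otus1">
    <otu id="t1" label="Homo sapiens"/>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
</nex:nexml>"""


def investigation_fields(root):
    """Collect I-file fields from <meta> elements directly under <nexml>."""
    fields = {}
    for meta in root.findall(NEX + "meta"):  # direct children only
        field = META_TO_I_FIELD.get(meta.get("property"))
        if field and meta.get("content") is not None:
            fields[field] = meta.get("content")
    return fields


def otu_labels(root):
    """Build an id -> label lookup so other files can pull in OTU details."""
    return {otu.get("id"): otu.get("label") for otu in root.iter(NEX + "otu")}


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(investigation_fields(root))
    print(otu_labels(root))
```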
Information in the <characters> element and its children will be mapped to the study. This makes character state matrices analogous to biological specimens in, for example, high-throughput sequencing.
As you can imagine, there may be some complexity with this approach. First, when the matrices are generated from sequences from GenBank or a similar repository, the sequences themselves may have come from a particular specimen. This is representable within NeXML, but I am not sure that it happens this way often. So, though the character state matrix will form an ISA study sample, it will not be a biological sample as in other flavors, and will likely not be linked back to its ultimate biological source(s). Does this pose a problem that should be addressed?
Additionally, with some matrices, particularly morphological matrices, the researchers themselves may have a large hand in producing the actual data via coding, etc. It is conceivable for two different research teams to generate slightly different morphological matrices, even if the species and characters remain the same. Does this pose a problem that should be addressed?
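To make the "matrix as study sample" idea concrete, here is one way the S-file rows might be produced, with one row per character matrix. This is a sketch under assumptions: the sample document is invented, using the OTU set id as "Source Name" is a placeholder choice (as noted above, the real biological source is usually not recoverable), and "Characteristics[matrix type]" is a column name I made up following the ISA-Tab `Characteristics[...]` pattern.

```python
import csv
import io
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Column names follow ISA-Tab conventions; the bracketed term is an assumption.
STUDY_COLUMNS = ["Source Name", "Sample Name", "Characteristics[matrix type]"]

# Invented sample document with two character matrices and no cell data.
SAMPLE = """<nex:nexml xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.9">
  <otus id="otus1"/>
  <characters id="m1" label="cytb alignment" otus="otus1" xsi:type="nex:DnaSeqs"/>
  <characters id="m2" label="skull characters" otus="otus1" xsi:type="nex:StandardCells"/>
</nex:nexml>"""


def study_rows(root):
    """Return one S-file row per <characters> element."""
    rows = []
    for chars in root.findall(NEX + "characters"):
        rows.append({
            "Source Name": chars.get("otus"),  # placeholder, see lead-in
            "Sample Name": chars.get("label") or chars.get("id"),
            "Characteristics[matrix type]": chars.get(XSI_TYPE),
        })
    return rows


def to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=STUDY_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_tsv(study_rows(ET.fromstring(SAMPLE))))
```

Note that two teams coding the same morphological characters differently would simply show up here as two different sample rows; nothing in this sketch records who did the coding.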
Information in the <trees> element will be logically mapped as an assay of the character matrix, with the software, algorithm type, and other relevant details preserved as Parameters of the assay.
Potential Problem: Some files have multiple <trees> elements, each with one or more <tree> elements. Others have a single <trees> element containing multiple <tree> children. I am not yet sure whether this reflects different study structures or inconsistent NeXML representations of these studies. Any concerns here?
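One way to survey how files differ here is to count the <tree> children of each <trees> block. The sketch below does only that and makes no attempt to decide which layout is "right". The sample document and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

NEX = "{http://www.nexml.org/2009}"

# Invented sample: two <trees> blocks, holding two trees and one tree.
SAMPLE = """<nex:nexml xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1"/>
  <trees id="trees1" otus="otus1">
    <tree id="tree1"/>
    <tree id="tree2"/>
  </trees>
  <trees id="trees2" otus="otus1">
    <tree id="tree3"/>
  </trees>
</nex:nexml>"""


def tree_layout(root):
    """Return (trees id, number of <tree> children) for each <trees> block."""
    return [
        (trees.get("id"), len(trees.findall(NEX + "tree")))
        for trees in root.findall(NEX + "trees")
    ]


if __name__ == "__main__":
    layout = tree_layout(ET.fromstring(SAMPLE))
    print(layout)
    print("multiple <trees> blocks:", len(layout) > 1)
```

Running something like this over a collection of real files would at least show which of the two layouts is more common.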
Note: It will likely be impossible to get the full richness of the NeXML into a usable tabular format. ISA provides a mechanism for referencing raw data files, and this will most likely be how we'll have to represent trees.
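As a sketch of what that might look like, an A-file row could carry the software as a parameter and point back at the NeXML file through the "Raw Data File" column. The column set, the parameter name, the protocol label, and the example values below are all assumptions made for illustration.

```python
import csv
import io

# Column names follow ISA-Tab conventions; the bracketed term is an assumption.
ASSAY_COLUMNS = [
    "Sample Name",
    "Protocol REF",
    "Parameter Value[software]",
    "Raw Data File",
]


def assay_row(sample_name, software, raw_file, protocol="tree inference"):
    """Build one A-file row that references the tree data by file name."""
    return dict(zip(ASSAY_COLUMNS, [sample_name, protocol, software, raw_file]))


def assay_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=ASSAY_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical values: the software name and file name are placeholders.
    row = assay_row("cytb alignment", "ExampleTreeBuilder 1.0", "study.xml")
    print(assay_tsv([row]))
```

The trees themselves stay in the NeXML file; only the pointer and the parameters end up in the table.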