Difference between revisions of "NeXML Elements"

From Phyloinformatics

Revision as of 02:31, 17 June 2010

Preface

The following document exhaustively covers the NeXML elements. The coverage is purely technical, and I have linked to the appropriate schema definitions from the NeXML site for completeness. Knowledge of the XML Schema Definition (XSD) language is not required to read the document. A discussion of the schema design can be found here: NeXML schema discussion. Note that this documentation is not yet final.

Root

The root element of NeXML is called nexml. This element is an instance of the Nexml class.

Attributes:

  • version - a decimal number indicating the nexml schema version. At present this value is 0.9.
  • generator - an optional attribute of type string, which is used to identify the program that generated the file.

Namespaces: (Where it says "by convention" in the list below, the convention applies to the three-letter prefixes which are free to vary in most cases, not the namespaces themselves):

  • the XML Schema instance namespace prefix, which identifies xml schema semantics that might be inlined in the file. By convention this is of the format xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" so that parts of the schema language used inside nexml (e.g. where a concrete subclass must be specified) are identified by the xsi prefix.
  • the nexml namespace prefix, by convention of the format xmlns:nex="http://www.nexml.org/1.0", so that nexml specific types (e.g. data type subclasses) are identified by the nex prefix wherever they are referenced.
  • the default namespace, xmlns="http://www.nexml.org/1.0", which is necessary for namespace aware processors (such as the NeXML-to-CDAO xslt stylesheet produced by an EvoInfo subgroup at the Spring 2009 hackathon).
  • the xml namespace prefix, required to be of the format xmlns:xml="http://www.w3.org/XML/1998/namespace". This may be used, for example, to specify the base address of the document (using the xml:base attribute).

Lastly, to associate the instance document with the nexml schema, the root element requires an attribute that specifies the nexml schema location and the namespace it applies to. This is of the format xsi:schemaLocation="http://www.nexml.org/1.0 http://www.nexml.org/1.0/nexml.xsd". Notice that this attribute is a schema language snippet (identified by the xsi: prefix) that identifies a namespace (http://www.nexml.org/1.0) and associates it with a physical schema location (http://www.nexml.org/1.0/nexml.xsd).

An example of a root element would be:

<xml>

<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
    version="0.9"
    generator="eclipse"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xml="http://www.w3.org/XML/1998/namespace"
    xmlns:nex="http://www.nexml.org/1.0"
    xmlns="http://www.nexml.org/1.0"
    xsi:schemaLocation="http://www.nexml.org/1.0 http://www.nexml.org/1.0/nexml.xsd">
</nex:nexml>

</xml>

The root element can contain:

  • zero or more semantic annotations
  • one or more otus elements
  • zero or more characters elements
  • zero or more trees elements (in mixed order with characters elements).
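
The list above can be sketched as a small document skeleton. This is only an illustration under stated assumptions: all ids (tax1, t1, trees1, and so on) are made up, and placing the otus element before the blocks that refer to it follows from the rule, described later, that characters and trees reference a previously defined otus element.

<xml>

<nex:nexml
    version="0.9"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:nex="http://www.nexml.org/1.0"
    xmlns="http://www.nexml.org/1.0">
    <!-- one or more otus elements -->
    <otus id="tax1">
        <otu id="t1"/>
        <otu id="t2"/>
    </otus>
    <!-- zero or more characters and trees elements, in mixed order -->
    <trees id="trees1" otus="tax1">
        <tree id="tree1" xsi:type="nex:IntTree">
            <node id="n1" root="true"/>
            <node id="n2" otu="t1"/>
            <node id="n3" otu="t2"/>
            <edge id="e1" source="n1" target="n2" length="1"/>
            <edge id="e2" source="n1" target="n3" length="1"/>
        </tree>
    </trees>
</nex:nexml>

</xml>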

OTUS

OTUS stands for Operational Taxonomic Units. The otus element defines a collection of taxa. An otus is an instance of the Taxa class.

Attributes:

  • id - a file level unique id.
  • label - a human readable description of the otus.

An otus element can take:

  • zero or more semantic annotations
  • zero or more otu elements.

OTU

An otu element defines a taxon. An otu is an instance of the Taxon class.

Attributes:

  • id - a file level unique id
  • label - a human readable description of the otu.
  • class - takes the id of a class element. This attribute is optional.

An otu element can take:

  • zero or more semantic annotations.

Example

<xml>

   <otus id="tax1">
       <otu id="t1"/>
       <otu id="t2"/>
   </otus>

</xml>

Class

A set is defined in NeXML with a class element. A class is an instance of the Class class.

Attributes:

  • id - a file level unique id
  • label - a human readable description of the class. A label is optional.

Note: This attribute has not been implemented in the schema fully.

Trees

Trees in NeXML are described following the GraphML syntax. Nodes and edges of a tree are defined using the node and edge elements respectively. Acyclic trees are defined with the tree tag, while cyclic trees, or networks, are defined with the network tag, so that in a network the in-degree of a node can be more than one. A trees element is a container for both tree and network elements.

Attributes:

  • id - a file level unique id.
  • otus - takes the id of an otus element defined previously.
  • label - a human readable description of the trees. A label is optional.

A trees element is of the Trees concrete type.

A trees element can have:

  • one or more tree elements
  • one or more network elements (in mixed order with tree)

Tree

Phylogenetic trees are defined in NeXML with the tree tag. Two types of tree can be defined: IntTree and FloatTree. An IntTree can only have integer edge lengths, whereas a FloatTree has IEEE 754-1985 compliant floating point edge lengths.

Attributes:

  • id - a file level unique id
  • xsi:type - defines the type of the tree. It can be either IntTree or FloatTree.
  • label - a human readable description of the tree. A label is optional.

With the AbstractTree class at the root of the hierarchy, the following subtypes are defined: IntTree and FloatTree.

A tree must contain the following, in this order:

  • one or more node
  • zero or one rootedge and
  • one or more edge

Network

Phylogenetic networks are defined with the network tag.

Attributes:

  • id - a file level unique id
  • xsi:type - defines the type of the network. It can take any of the two values: IntNetwork, FloatNetwork
  • label - a human readable description of the network. A label is optional.

The difference between an int and a float network is the same as that between an int and a float tree.

A network is modeled by the AbstractNetwork class and two concrete implementations exist: IntNetwork and FloatNetwork.

A network must contain the following, in this order:

  • one or more node
  • one or more edge

Nodes

A node tag defines a node of a tree or a network.

Attributes:

  • id - a file level unique id.
  • otu - id of a previously defined otu element. This is optional.
  • root - takes true if the node is a root node, false otherwise.

The tree is considered rooted, multiply rooted or unrooted depending on how many root nodes it has.

Nodes have the following hierarchy:

Edges

An edge tag defines an edge in a tree or a network.

Attributes:

  • id - a file level unique id
  • source - id of the node that serves as the source
  • target - id of the node that serves as the target
  • length - length of the edge

Inheriting from AbstractEdge, edges can be of the following concrete types, based on whether they are defined for a tree (int or float) or a network (again, int or float):

Rootedge

A rootedge tag is used to indicate a time span leading up to the root (principally for coalescent trees).

Attributes:

  • id - a file level unique id
  • target - id of the root node
  • length - length of the edge

Rootedges have the following hierarchy:

Example

<xml> <trees otus="taxa1" id="Trees" label="TreesBlockFromXML">

   <tree id="tree1" xsi:type="nex:FloatTree" label="tree1">
       <node id="n1" label="n1" root="true"/>
       <node id="n2" label="n2" otu="t1"/>
       <node id="n3" label="n3"/>
       <node id="n4" label="n4"/>
       <node id="n5" label="n5" otu="t3"/>
       <node id="n6" label="n6" otu="t2"/>
       <node id="n7" label="n7"/>
       <node id="n8" label="n8" otu="t5"/>
       <node id="n9" label="n9" otu="t4"/>
       <rootedge target="n1" id="re1" length="0.34765" />
       <edge source="n1" target="n3" id="e1" length="0.34534"/>
       <edge source="n1" target="n2" id="e2" length="0.4353"/>
       <edge source="n3" target="n4" id="e3" length="0.324"/>
       <edge source="n3" target="n7" id="e4" length="0.3247"/>
       <edge source="n4" target="n5" id="e5" length="0.234"/>
       <edge source="n4" target="n6" id="e6" length="0.3243"/>
       <edge source="n7" target="n8" id="e7" length="0.32443"/>
       <edge source="n7" target="n9" id="e8" length="0.2342"/>
   </tree>
   <tree id="tree2" xsi:type="nex:IntTree" label="tree2">
       <node id="n1" label="n1"/>
       <node id="n2" label="n2" otu="t1"/>
       <node id="n3" label="n3"/>
       <node id="n4" label="n4"/>
       <node id="n5" label="n5" otu="t3"/>
       <node id="n6" label="n6" otu="t2"/>
       <node id="n7" label="n7"/>
       <node id="n8" label="n8" otu="t5"/>
       <node id="n9" label="n9" otu="t4"/>
       <edge source="n1" target="n3" id="e1" length="1"/>
       <edge source="n1" target="n2" id="e2" length="2"/>
       <edge source="n3" target="n4" id="e3" length="3"/>
       <edge source="n3" target="n7" id="e4" length="1"/>
       <edge source="n4" target="n5" id="e5" length="2"/>
       <edge source="n4" target="n6" id="e6" length="1"/>
       <edge source="n7" target="n8" id="e7" length="1"/>
       <edge source="n7" target="n9" id="e8" length="1"/>
   </tree>
   <network id="tree3" xsi:type="nex:IntNetwork" label="tree3">
       <node id="n1" label="n1"/>
       <node id="n2" label="n2" otu="t1"/>
       <node id="n3" label="n3"/>
       <node id="n4" label="n4"/>
       <node id="n5" label="n5" otu="t3"/>
       <node id="n6" label="n6" otu="t2"/>
       <node id="n7" label="n7"/>
       <node id="n8" label="n8" otu="t5"/>
       <node id="n9" label="n9" otu="t4"/>
       <edge source="n1" target="n3" id="e1" length="1"/>
       <edge source="n1" target="n2" id="e2" length="2"/>
       <edge source="n3" target="n4" id="e3" length="3"/>
       <edge source="n3" target="n7" id="e4" length="1"/>
       <edge source="n4" target="n5" id="e5" length="2"/>
       <edge source="n4" target="n6" id="e6" length="1"/>
       <edge source="n7" target="n6" id="e7" length="1"/>
       <edge source="n7" target="n8" id="e8" length="1"/>
       <edge source="n7" target="n9" id="e9" length="1"/>
   </network>

</trees> </xml>

Characters

The characters element defines comparative data. NeXML dictates that the allowed parameter space for the observations be declared with a format element and the actual observations with a matrix element, both nested within the characters element. Two broad observation categories are allowed: raw character sequences and granular character state observations. Under both categories NeXML provides means of describing observed DNA, RNA, Protein, Restriction, Standard, and Continuous data.

Attributes:

  • id - a file level unique id.
  • otus - takes the id of an otus element defined previously.
  • xsi:type - can be one of the following: DnaSeqs, DnaCells, RnaSeqs, RnaCells, ProteinSeqs, ProteinCells, RestrictionSeqs, RestrictionCells, StandardSeqs, StandardCells, ContinuousSeqs, ContinuousCells.
  • label - a human readable description of the character block. A label is optional.

The AbstractBlock class is at the root of the characters hierarchy with the following subtypes:

An NeXML characters block must have:

  • format - one and only one for each concrete type.
  • matrix - one and only one for each concrete type.
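
As a minimal sketch of this structure, a characters block of type DnaSeqs might look like the following. The ids (m1, states1, c1, r1) and the referenced otus and otu ids (tax1, t1) are hypothetical, the xsi and nex prefixes are assumed to be declared on the root element, and the child elements are described in the sections below.

<xml>

<characters id="m1" otus="tax1" xsi:type="nex:DnaSeqs" label="Example DNA block">
    <!-- format: declares the allowed states and the matrix columns -->
    <format>
        <states id="states1">
            <state id="s1" symbol="A"/>
            <state id="s2" symbol="C"/>
            <state id="s3" symbol="G"/>
            <state id="s4" symbol="T"/>
        </states>
        <char id="c1" states="states1"/>
        <char id="c2" states="states1"/>
    </format>
    <!-- matrix: holds the actual observations, one row per otu -->
    <matrix>
        <row id="r1" otu="t1">
            <seq>AC</seq>
        </row>
    </matrix>
</characters>

</xml>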

Format

The format element defines the allowed characters and states in a matrix, and their ambiguity mapping. The states and char definitions are enclosed within format.

Attributes: none.

format is modeled by the AbstractFormat type with the following concrete types:

A format element can have:

  • states - one or more for each concrete type except ContinuousFormat. A Continuous type is stateless.
  • char - one or more for each concrete type.

states

states is a container for defined states and their mappings. A states element can have zero or more state, uncertain_state_set, and polymorphic_state_set elements.

Attributes:

  • id - a file level unique id.
  • label - a human readable description. This attribute is optional.

With the AbstractStates class at the root of the hierarchy, states can take one of the following concrete implementations based on the type of format:

The type of states depends on the type of its parent format. Continuous data is stateless, hence there is no states type corresponding to it. Restriction data implies only two states: presence (1) or absence (0). Naturally, two (and only two) state elements are defined for the RestrictionStates type. All other types may have zero or more state elements defined for them.
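
For illustration, the two states of restriction data could be declared as in the sketch below. The ids and labels are made up for the example; only the symbols 0 and 1 come from the description above.

<xml>

<states id="rs1" label="restriction site states">
    <!-- absence of the restriction site -->
    <state id="rs1s0" symbol="0" label="absent"/>
    <!-- presence of the restriction site -->
    <state id="rs1s1" symbol="1" label="present"/>
</states>

</xml>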

state

state defines a possible observation state with its symbol attribute.

Attributes:

  • id - a file level unique id
  • symbol - a simple type that restricts from AbstractSymbol
  • label - a human readable description. A label is optional.

state is of the AbstractState type. A state can be of the following kind depending on the type of the parent states element:

polymorphic_state_set

polymorphic_state_set resolves ambiguous states in an "and" context. To define the possible ambiguities, polymorphic_state_set nests zero or more member elements.

Attributes: same as state.

AbstractPolymorphicStateSet

uncertain_state_set

uncertain_state_set resolves ambiguous states in an "or" context. To define the possible ambiguities, uncertain_state_set nests zero or more member elements.

Attributes: same as state.
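
A sketch of how the two kinds of state set might be written for standard data follows. The state attribute on member, which points at the id of a previously declared state, is an assumption here, since this document does not describe the member element; the ids and symbols are illustrative only.

<xml>

<states id="states1">
    <state id="s1" symbol="1"/>
    <state id="s2" symbol="2"/>
    <!-- "and" context: both state 1 and state 2 are observed -->
    <polymorphic_state_set id="s3" symbol="3">
        <member state="s1"/>
        <member state="s2"/>
    </polymorphic_state_set>
    <!-- "or" context: the observation is state 1 or state 2, but it is not known which -->
    <uncertain_state_set id="s4" symbol="4">
        <member state="s1"/>
        <member state="s2"/>
    </uncertain_state_set>
</states>

</xml>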

AbstractUncertainStateSet

member

char

A char specifies which states apply to which matrix columns. A char is of the AbstractChar type.

Attributes: Depending on its type, a char can take one or more of the following attributes:

  • id - a file level unique id, required for all char elements.
  • states - the id of a previously defined states element. All char types except ContinuousChar take a states attribute.
  • codon - specifies the codon position. DnaChar and RnaChar optionally take the codon attribute.
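
The attributes above might be combined as in the following sketch for DNA data, where three columns share one states definition. The ids are hypothetical, and the codon values 1, 2 and 3 are an assumption about how codon positions are written, not something this document specifies.

<xml>

<format>
    <states id="states1">
        <state id="s1" symbol="A"/>
        <state id="s2" symbol="C"/>
        <state id="s3" symbol="G"/>
        <state id="s4" symbol="T"/>
    </states>
    <!-- each char is one matrix column and points at the states that apply to it -->
    <char id="c1" states="states1" codon="1"/>
    <char id="c2" states="states1" codon="2"/>
    <char id="c3" states="states1" codon="3"/>
</format>

</xml>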

Depending on the type of the parent format element, a char can be an instance of the following type:

Matrix

A matrix element holds the state observations. One or more row elements can be defined in a matrix. Depending on the characters type, a matrix element can be of two abstract types, each with six concrete types.

Attributes: none

Row

The row element defines a single row of a matrix.

Attributes:

  • id - a file level unique id
  • label - a human readable description of the row. A label is optional.
  • otu - takes the id of an otu defined previously

Just like its parent matrix, a row can be of two abstract types, each with six concrete types.
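
The two abstract kinds of row correspond to the seq and cell children described in the following sections. A rough sketch of each is shown here; the ids are invented, and the char and state attributes on cell (pointing at a char id and a state id) are assumptions, as this document does not list the attributes of cell.

<xml>

<!-- inside a characters block of a Seqs type: one raw sequence per row -->
<matrix>
    <row id="r1" otu="t1">
        <seq>ACGT</seq>
    </row>
</matrix>

<!-- inside a characters block of a Cells type: one cell per observation -->
<matrix>
    <row id="r2" otu="t2">
        <cell char="c1" state="s1"/>
        <cell char="c2" state="s3"/>
    </row>
</matrix>

</xml>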

seq

seq defines a raw character sequence. Depending on the type of observed data (strictly, on the type of the parent row for which the sequence is defined), NeXML allows the following types of sequence:

cell

A cell defines a granular observation. Depending on the kind of observed data, a cell can be:

Semantic Annotation

Fundamental data objects in NeXML can be annotated using RDFa. This way, the annotations are directly available to any off-the-shelf RDFa parser but are also simple to integrate into non-RDF-aware processing libraries. The annotations are expressed using recursively nested meta elements, and are essentially triples whose subjects are identified by the about attribute. To specify that a NeXML element such as a tree is the subject, the about attribute and the id attribute of the element must match. This way, core NeXML elements can be converted to RDF (for example by using an XSL stylesheet), RDFa annotations can be extracted from the NeXML (using another stylesheet), and the two resulting graphs can be aligned by the subjects of their respective triples.

  • If the triple’s object value is a literal such as a string or a number, the exact data type is specified by the datatype attribute; these types are typically core XML schema types such as xsd:string for atomic types, or rdf:Literal for nested XML element structures that are to be parsed as opaque literals in an RDF graph. The object value is enclosed inside the meta element and the predicate is specified using the property attribute (whose value must be a CURIE). meta elements of this type are of the subclass nex:LiteralMeta.
  • If the triple’s object value is a remote resource, its location is specified using the href attribute. The predicate for this class of triples is specified as a CURIE using the rel attribute. meta elements of this type are of the subclass nex:ResourceMeta.
  • If the triple’s object value is a nested annotation, its predicate is specified as a CURIE using the rel attribute. Since this enclosing meta element is to be transformed into an anonymous RDF node, it needs to be identified as the subject of a reification by assigning it an about attribute (as per the RDFa rules for identifying triple subjects).
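
The first two cases above can be sketched on a tree as follows. Everything specific in this fragment is an assumption made for illustration: the dc predicates, the example.org address, the ids on the meta elements, and the "#tree1" form of the about value, which is one way of making about match the id. The xsi and nex prefixes are assumed to be declared on the root element.

<xml>

<tree id="tree1" about="#tree1" xsi:type="nex:IntTree"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <!-- literal object: predicate in property, value enclosed in the element -->
    <meta id="meta1" xsi:type="nex:LiteralMeta" property="dc:title" datatype="xsd:string">Example tree</meta>
    <!-- resource object: predicate in rel, location in href -->
    <meta id="meta2" xsi:type="nex:ResourceMeta" rel="dc:source" href="http://example.org/study/1"/>
    <node id="n1" root="true"/>
    <node id="n2" otu="t1"/>
    <node id="n3" otu="t2"/>
    <edge id="e1" source="n1" target="n2" length="1"/>
    <edge id="e2" source="n1" target="n3" length="1"/>
</tree>

</xml>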