Supporting NEXUS

Summary

This Phyloinformatics Hackathon page describes the NEXUS format and efforts to standardize and extend it.

NEXUS and comparative evolutionary analysis

NEXUS (Maddison, et al., 1997) is a data exchange file format for systematic data, designed for use in comparative evolutionary analysis (see NEXUS Specification). NEXUS is a de facto standard among researchers who focus on phylogenetic inference and hypothesis-testing using models of character evolution. Though alternative formats exist, NEXUS is highly expressive and offers a flexible means to represent diverse types of data. On the other hand, for a flat file format NEXUS tries to do many things, often using context-sensitive tokens and parser hints. Several legacy design features can make files hard to parse (e.g., character data matrices in 'interleaved' format require parsers to take line breaks into account, whereas elsewhere commas and semicolons are used as delimiters). NEXUS files typically include a set of character data (e.g., a sequence alignment) along with a tree, and they often include additional information describing the characters or the assumptions underlying an analysis.

Rationale for supporting NEXUS

Developers at the NESCent hackathon chose "Supporting NEXUS" as a high-priority development target even though it was not listed in our set of use cases, and even though it was agreed that NEXUS should be replaced by a simpler and more standardized format. Support for NEXUS now is important for the following reasons:

  1. NEXUS is poorly supported by software today, with many inconsistent or radically incomplete implementations.
  2. Though NEXUS should be replaced in the future, it will continue to be a de facto exchange format for at least a few more years.
  3. Meanwhile, comparative evolutionary analysis is increasing in terms of impact, amount of software, and number of users.

Plans and progress in supporting NEXUS via software development and testing

To support NEXUS with software means providing users the tools to read, write, search, edit and visualize the contents of a NEXUS file. Because the NEXUS format is complex, most software does not implement much of the standard, and may violate it in various ways. Therefore, the focus here is on reading and writing conformant files. How do we encourage conformance?

  1. assemble a set of NEXUS test files to challenge parsers
  2. develop a library reference implementation that can be included in application software (NCL or Bio::NEXUS)
  3. evaluate the conformance of existing tools and libraries using test files
  4. write wrappers for non-conforming programs (target: clustalw)
  5. develop a web-based validation tool (format validation and round-tripping)

At the hackathon we discussed plans for all of these items, but our effort was mostly on items 1 and 3. We assembled a set of NEXUS files starting with files from the test suites for Bio::NEXUS and NCL. Other developers then used these test files to test their NEXUS implementations. We also found some bugs in the test files themselves (e.g., test files with double-quoting used to encapsulate tokens were removed because there is no explicit support for double-quoting in the standard).

Best Practices for Users and Developers

Recommendations for those who . . . manage data in NEXUS files:

  1. put NEXUS files under version control using RCS, cvs or svn (ask your sys admin)
  2. attempt to validate the format of your file using one of the methods recommended below
  3. remove any dependence on deprecated elements such as DATA blocks or newtaxa commands

. . . develop software that reads or writes NEXUS files:

  1. test the current ability of your software to read and write NEXUS files
  2. try to meet at least level II conformance
    1. only read, never write, files with DATA blocks (deprecated)
    2. read TAXA, CHARACTERS and TREES blocks

Future Challenges

  • use a systematic survey to gauge the relative importance of various use cases, various users, and various software applications.
  • identify and remediate ambiguities
  • reach out to users and developers to identify strategies to increase conformance

NEXUS and Comparative Evolutionary Analysis

Comparative analysis is the foundation of contemporary genome analysis. When a new genome sequence is determined, we gain an enormous amount of useful information. A non-biologist might imagine that this useful information comes from plugging the nucleotide sequence information into an a priori biophysical model of the cell to predict its behavior. But there is no such model. Instead, useful information is squeezed from genome sequences primarily by comparing them to related sequences (e.g., human compared to mouse or chimpanzee) and interpreting the similarities and differences.

Because these similarities and differences emerge by a process of descent with modification, the ultimate framework for comparative analysis is evolutionary biology. Methods of comparative analysis (ideally) convert fuzzy and difficult questions about how to interpret such similarities and differences into well posed questions about processes of change along a phylogenetic tree. This kind of analysis can be applied to any kind of data-- evolved differences in gene sequences, genome size, protein activity, gene expression, morphological characters, and so on.
[Figure: KOG0011 cdat.jpg]

The methodological generality of evolutionary analysis is based on the foundation provided by what may be called the character-state data model, analogous to the Entity-Attribute-Value model. In this model, the features of a set of OTUs (Operational Taxonomic Units) are represented as the observable states (or character states) of a set of underlying characters. In Entity-Attribute-Value terminology, the states are values, the characters are attributes, and the OTUs are entities. For a protein sequence alignment, for instance, each OTU is a protein, each column represents a character, and the state of each OTU for a particular character (column) is an amino acid or a gap (often "-"), unless the data are missing (often "?"). Characters for which the states are finite are discrete characters; continuous characters are also possible (e.g., beak length, protein activity).
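
As a minimal sketch of this correspondence, a small alignment can be flattened into Entity-Attribute-Value triples. The function name and the toy alignment below are illustrative assumptions and are not taken from any NEXUS library.

```python
def alignment_to_triples(alignment):
    """Flatten {OTU name: aligned sequence} into (entity, attribute, value) triples.

    Each OTU is an entity, each 1-based column index is a character
    (attribute), and each residue, gap ("-") or missing-data symbol ("?")
    is a state (value).
    """
    triples = []
    for otu, row in alignment.items():
        for column, state in enumerate(row, start=1):
            triples.append((otu, column, state))
    return triples


# Toy protein alignment: two OTUs, three characters (hypothetical data).
toy = {"protein_1": "MK-", "protein_2": "MR?"}
triples = alignment_to_triples(toy)
```

Here the third character of protein_1 is a gap and the third character of protein_2 is missing, which are exactly the two special states mentioned above.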

NEXUS (Maddison, et al., 1997) is a data exchange file format for character-state data and trees, designed for use in comparative evolutionary analysis. It has been a de facto standard among researchers who focus on phylogenetics and analysis of character evolution, and is used in popular software such as PAUP*, Mesquite, and MrBayes. The terminology used in NEXUS files, and in the NEXUS standard, draws on the implicit character-state data model:

  • OTUs (entities) are referred to as "TAXA" (e.g., TAXA block; taxlabels and taxset commands; ntax subcommand)
  • columns of comparative data (attributes) are referred to as "characters" (e.g., CHARACTERS block, charlabels and charset commands; nchar subcommand)
  • attribute values are referred to as "states" (e.g., statelabels and stateset commands)

Although NEXUS files initially were used almost entirely for discrete morphological characters, the developers of NEXUS had diverse types of data in mind when they wrote the standard (Maddison, et al., 1997). The standard includes provisions for such things as ambiguous characters, continuous characters, and frequency distributions of states, as well as commands specific to molecular data that specify codon locations and define alternative genetic codes. Thus, NEXUS provides a rich way to represent diverse types of data that might be used in a comparative analysis of genes, proteins, or genomes.

Rationale for Supporting NEXUS

Support for NEXUS now is important for the following reasons:

  1. NEXUS is poorly supported by software today, with many inconsistent or radically incomplete implementations.
  2. Though NEXUS should be replaced in the future, it will continue to be a de facto exchange format for at least a few more years.
  3. Meanwhile, comparative evolutionary analysis is increasing in terms of impact, amount of software, and number of users.

At the time the NEXUS standard was developed, the main emphasis was on flexibility and expressiveness, and there was no great need for strict conformance because the format was not used in multi-step pipelines. Thus, in the original description the authors wrote:

Any particular implementation of the NEXUS format, even by the authors of this document, should not necessarily be considered as complying exactly with the NEXUS standard.

Though today we do not quite have a 'tower of Babel' in evolutionary informatics, it is clear that incompatibilities create problems for users who attempt to do integrative analyses using multiple software applications.

Is it too late to support NEXUS? Should we instead move to a new standard based on XML? Questions such as these were discussed at the hackathon. Although we should begin to develop a new standard, the development of such a standard and the software to support it will take years to accomplish. Supporting NEXUS now addresses an immediate need, and will make the transition to a new standard easier (i.e., the greater the conformance to an old standard, the easier the translation to a new one).

Identifying Current Challenges

At the NESCent hackathon, various users and developers identified challenges in supporting the NEXUS format.

Poorly formed files

A surprising fraction of "NEXUS" files submitted to TreeBase are poorly formed.

Incomplete implementations

An appropriate response to this is to make conformance easier by providing a validating parser, test files, and a reference implementation.

These problems are known to occur in some implementations

  • Name length unlimited in standard but limited in implementation (common)
  • Quoted strings allowed in standard but disallowed in implementation (common)
  • Syntax of output files invalid
  • "utree" command in PAUP (should use [&U] as in standard)
  • Incompatibilities due to platform-specific treatment of line endings
  • Tokens not implemented
  • Continuous data not implemented
  • Interleave not implemented
  • Comments lost

These problems are known to occur in some files

  • trees without commas!
  • unencapsulated punctuation symbols in OTU names

Mistaken implementations

Underutilization of the standard

Maddison, et al. (1997) urged developers to rely on the published standard when possible, rather than to extend it. Below are some examples of extensions or comments that could instead be expressed formally within the NEXUS standard:

  1. Partitions can be specified in the SETS block
  2. Character subsets can be specified in the SETS block
  3. PAUP uses the "utree" command instead of the [&U] comment before the Newick tree definition
  4. Species names are put in comments instead of in the NOTES block, for example:

begin TAXA; 
    dimensions ntax=2;
    taxlabels AAF59352 [Drosophila melanogaster] T07660 [Solanum tuberosum];
end;

The same information can be recorded using the NOTES block:

begin NOTES; 
    text taxon=AAF59352 text='species=Drosophila melanogaster';
    text taxon=T07660 text='species=Solanum tuberosum';
end; 

Synopsis of current software with some support for NEXUS

NEXUS files are supported to varying degrees by the Bio* toolkits, by single-purpose libraries such as NCL, and by applications.

Libraries

  1. Nexus Class Library (NCL)
  2. Bio::NEXUS
  3. Bio::Phylo
  4. NEXUS API in C with wrappers for Fortran 77, Fortran 90 and Java. Interface also described as IDL.

Toolkits

  1. BioPerl The SimpleAlign alignment module was modified to make it annotatable (AnnotatableI-compliant) and capable of containing features (FeatureHolderI-compliant). This makes further work possible, such as allowing the addition of tree objects to an alignment, analogous to the way the NEXUS format allows the packaging of trees and alignments together.
  2. BioJava The BioJava group created a Level I- and Level II-compliant NEXUS parser based on ideas from JEBL. This parser implements the DATA, CHARACTERS, DISTANCES, TREES, and TAXA blocks.
  3. BioRuby The BioRuby group developed a new NEXUS parser. The parser returns trees, data, characters, distances, and taxa as objects that give access to the individual data fields (such as the number of characters). Trees can be returned either as Newick-parsed tree objects or as strings. Sequences can be returned either as sequence objects or as strings. Distances are returned as a matrix. Other blocks are returned as generic blocks, which can return their content in tokenized form. Blocks can also be returned as NEXUS-formatted strings. Documentation and unit tests have been submitted as well.
  4. BioSQL The BioSQL group created a new set of tables, optional within BioSQL, for storing phylogenetic trees, both sequence trees and species trees. A script was added to the package that allows one to read NEXUS files and write any trees found in the files to the database.

Applications

  1. Clustalw
  2. Dendroscope
  3. HyPhy - somewhere between level II and level III support in the built-in data reader function
  4. MacClade
  5. Mesquite
  6. MrBayes
  7. PAUP
  8. SplitsTree

Supporting the Current Standard via Conformance Testing

Minimally we need one or more standards or references (a standard NEXUS file example, a reference implementation of parsing), and we need a set of one or more metrics to evaluate conformance to the standard.

One obvious metric is simply the number of blocks, commands, or reserved tokens that are understood. The NEXUS Specification page provides a table that lists 119 (all?) keyword-block combinations. Currently Bio::NEXUS implements 67 % of these, according to the documentation. This list does not include the syntax used for formatting display names.

However, the blocks and commands are not all equal. Few people use the codons, unaligned, or notes blocks. For this and other reasons, it seems useful to designate levels of conformance. Conformance has to be tested in different ways for files (is the file a valid NEXUS file?), applications (does the application read and write NEXUS files?) and libraries. To test conformance of files, we need a validating parser. To test conformance of implementations, we need a set of test files.

Levels of conformance

Provisional specification for levels of conformance (examples are guesswork)

  • level I - barely useful NEXUS implementation (e.g., clustalw)
    • Data block - reads and writes
    • May fail to comply with rules for syntax and grammar
    • May fail to parse more complex files
  • level II - preferred minimum standard for NEXUS implementation (e.g., BioPerl)
    • Understands CHARACTERS, TAXA and TREES blocks
    • Will read (but not necessarily understand) level III files
    • May limit names to < 255 chars
  • level III - handles widely used data types, allows private blocks (? Mesquite, Bio::NEXUS)
    • Understands TAXA, CHARACTERS, DISTANCES, UNALIGNED, SETS, ASSUMPTIONS, CODONS, TREES, NOTES
    • processes all widely used commands in the above listed blocks
    • Will read (but not necessarily understand) any properly formed file
  • level IV - complete: understands public blocks and commands in all properly formed files

Testing conformance of NEXUS files

We want to set up a web service for this, but we are waiting until we have a validating parser based on the formal BNF grammar provided by Iglesias, et al. (2001).

Meanwhile, you could try one of two things:

  1. upload the file to Nexplorer, a browser for NEXUS files
  2. attempt to read the file with a level III parser such as MacClade, Mesquite or Bio::NEXUS.

Here is a one-liner for reading and re-writing with Bio::NEXUS.

perl -MBio::NEXUS -e 'new Bio::NEXUS("t/data/testaln.nexus")->write("test.nex")' 

Conformance of applications or library software

The simplest kind of test that we can do with either an application or a library is a round-trip test, which tests whether the target software (the software to be tested) can read a NEXUS file and write a new NEXUS file with the same content:

  1. read in the file and parse into some object stored in memory
  2. write out the object in NEXUS format
  3. compare the files to see if they have identical content (Bio::NEXUS provides tools for this)

We can automate this process by writing a wrapper with the interface:

 wrapper in.nexus > out.nexus

For instance, the latest version of ncl has code in the "examples" subdirectory to build a program called "normalizer" that takes an input NEXUS file and writes out a NEXUS file on STDOUT.

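<perl>
use strict;
use warnings;
use Bio::NEXUS;

# program under test: reads a NEXUS file, writes NEXUS to STDOUT
my $test_program = "normalizer";

my $file = shift @ARGV or die "usage: $0 in.nexus\n";
my $testfile = "$file.$test_program";    # round-tripped output file

# run the wrapper and capture its output
system("$test_program $file > $testfile") == 0
    or die "$test_program failed on $file\n";

# parse the original file and the round-tripped file
my $nexus       = new Bio::NEXUS($file, 0);
my $other_nexus = new Bio::NEXUS($testfile, 0);

unless ($nexus->equals($other_nexus)) {
    print "content of files is not the same\n";
}
exit;
</perl>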
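
When Bio::NEXUS is not available, the comparison step can be roughly approximated by comparing token streams. The following Python sketch is an assumption-laden stand-in, not the algorithm behind the equals method: it drops every bracketed comment (including meaningful ones such as [&U]), does not handle nested comments, and treats unquoted tokens as case-insensitive. The function names are hypothetical.

```python
import re

_TOKEN = re.compile(r"""
    \[[^\]]*\]            # bracketed comment (nested comments not handled)
  | '(?:[^']|'')*'        # single-quoted token; a doubled quote is an escape
  | [^\s\[\]';,=():]+     # bare word
  | [;,=():]              # single-character punctuation
""", re.VERBOSE)


def nexus_tokens(text):
    """Return a normalized token list: comments dropped, bare words upper-cased."""
    tokens = []
    for match in _TOKEN.finditer(text):
        tok = match.group(0)
        if tok.startswith("["):
            continue                      # ignore comments entirely
        if tok.startswith("'"):
            tokens.append(tok)            # quoted tokens keep their case
        else:
            tokens.append(tok.upper())
    return tokens


def same_content(text_a, text_b):
    """True if both texts yield the same normalized token stream."""
    return nexus_tokens(text_a) == nexus_tokens(text_b)


original = "#NEXUS\nbegin taxa;\n dimensions ntax=2;\n taxlabels A 'OTU C';\nend;\n"
rewritten = "#nexus [a comment]\nBEGIN TAXA; DIMENSIONS NTAX = 2; TAXLABELS A 'OTU C'; END;"
```

A comparison of this kind tolerates differences in whitespace, line endings, letter case and comments, but it will report a difference if the target software reorders commands or rewrites names.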

Test files for NEXUS parsers

NEXUS files with UN*X, DOS or Mac line endings

Level | Result | Test | Description
I | | basic-bush-unix | newline = linefeed (ASCII 0x0A)
I | | basic-bush-dos | newline = carriage return (ASCII 0x0D) followed by linefeed (ASCII 0x0A)
I | | basic-bush-mac | newline = carriage return (ASCII 0x0D)
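
A parser that does not handle all three conventions natively can normalize them before tokenizing. The following is a minimal Python sketch; the function name is an illustrative assumption.

```python
def normalize_newlines(text):
    """Convert DOS (CR LF) and old Mac (CR) line endings to UN*X (LF).

    CR LF pairs are replaced first so that they do not become two newlines.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix_text = normalize_newlines("#NEXUS\r\nbegin taxa;\rend;\n")
```

The file should be read in binary mode (or with newline translation disabled) so that the original line endings reach this function unchanged.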

NEXUS files with imaginary character data and a taxa block

These files (all in UN*X newline = linefeed format) present various minor challenges such as comments, quoted strings, multiple trees, etc. They are all based on files with TAXA, CHARACTERS, TREES and DISTANCES blocks. Nearly all of them are derived from the basic-bush or basic-ladder file.

Level | Result | Test | Description
I | | basic-bush | symmetric bifurcating tree, all branch lengths = 1
I | | basic-rake | basal polytomy, all branch lengths = 1
I | | basic-ladder | maximally asymmetric tree, branch lengths = time
I | | long-names | labels up to 31 characters for taxlabel, charlabel, tree
I | | quoted-strings2 | OTU name 'OTU C', charlabel 'Char 3', tree name 'the ladder tree'
I | | radical-whitespace | ridiculous but legal use of whitespace
I | | top-level-comment | comment at the top level
I | | intrablock-comment | comments within every block
I | | multiline-intrablock-comment | multi-line comments within every block
I | | char-matrix-spaces | char data are in blocks separated by spaces
I | | char-interleave | char data are in interleaved format
I | | trees-translate | bush tree with translated names
I | | trees-tree-bush | basic bush
I | | trees-tree-ladder | basic ladder
I | | trees-tree-rake | basic rake
I | | trees-tree-bush-cladogram | bush, no branch lengths
I | | trees-tree-ladder-cladogram | ladder, no branch lengths
I | | trees-tree-rake-cladogram | rake, no branch lengths
I | | trees-tree-bush-quoted-string-name1 | name of tree is "bush quoted string name1"
I | | trees-tree-bush-quoted-string-name2 | name of tree is 'bush quoted string name2'
I | | trees-tree-bush-branchlength-zero | branch to OTU C with zero length
I | ND | trees-tree-bush-branchlength-scientific | sci format for length of branches to OTU B (2e+01), C (9e-01), F (9E-01), G (2E+01)
I | | trees-tree-bush-branchlength-negative | branch to CD ancestor with length = -0.25
I | | trees-tree-bush-inode-labels | internal node labels
I | | trees-tree-bush-inode-labels-partial | only some internal node labels
I | | trees-tree-bush-inode-labels-quoted1 | double quotation marks around inode labels
I | | trees-tree-bush-inode-labels-quoted2 | single quotation marks around inode labels
I | | trees-tree-basal-trifurcation | basal trifurcation
I | | trees-tree-bush-extended-root-branch | bush with branch above root node
I | | trees-tree-multiple | trees "bush", "ladder" and "rake" in one trees block
I | | trees-tree-multiple-challenges | one tree with each challenge above (see the tree list below)
II | | really-long-names | labels up to 255 characters for taxlabel, charlabel, tree
III | | dist_rect.nex | Non-interleaved, rectangular distances block with diagonal entries
III | | dist_lower.nex | Non-interleaved, lower-triangular distances block with diagonal entries
III | | dist_lower_no-diag.nex | Non-interleaved, lower-triangular distances block with no diagonal entries
III | | dist_upper.nex | Non-interleaved, upper-triangular distances block with diagonal entries
III | | dist_upper_no-diag.nex | Non-interleaved, upper-triangular distances block with no diagonal entries
III | | dist_interleave_rect_nodiag.nex | Interleaved, rectangular distances block with no diagonal entries
III | | dist_interleave_lower.nex | Interleaved, lower-triangular distances block with diagonal entries
III | | dist_interleave_lower_nodiag.nex | Interleaved, lower-triangular distances block with no diagonal entries
III | | dist_interleave_upper.nex | Interleaved, upper-triangular distances block with diagonal entries
III | | dist_interleave_upper_nodiag.nex | Interleaved, upper-triangular distances block with no diagonal entries

Trees are shown exactly as in the "tree" command in NEXUS trees block.

  • tree basal_trifurcation = (((A:1,B:1):1,(C:1,D:1):1):1,(E:1,F:1):2,(G:1,H:1):2);
  • tree bush = (((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);
  • tree bush_branchlength_negative = (((A:1,B:1):1,(C:1,D:1):-0.25):1,((E:1,F:1):1,(G:1,H:1):1):1);
  • tree bush_branchlength_scientific = (((A:1,B:2e+01):1,(C:9e-01,D:1):1):1,((E:1,F:9E-01):1,(G:2E+01,H:1):1):1);
  • tree bush_branchlength_zero = (((A:1,B:1):1,(C:0,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);
  • tree bush_cladogram = (((A,B),(C,D)),((E,F),(G,H)));
  • tree bush_extended_root_branch = (((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1):1;
  • tree bush_inode_labels = (((A:1,B:1)AB:1,(C:1,D:1)CD:1)ABCD:1,((E:1,F:1)EF:1,(G:1,H:1)GH:1)EFGH:1);
  • tree bush_inode_labels_partial = (((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1)EF:1,(G:1,H:1)GH:1)EFGH:1);
  • tree bush_inode_labels_quoted2 = (((A:1,B:1)'inode AB':1,(C:1,D:1)'inode CD':1)'inode ABCD':1,((E:1,F:1)'inode EF':1,(G:1,H:1)'inode GH':1)'inode EFGH':1);
  • tree 'bush quoted string name2' = (((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);
  • tree bush_uneven = (((A:1,B:2):1,(C:1,D:2):1):1,((E:1,F:2):1,(G:1,H:2):1):1);
  • tree ladder = (((((((A:1,B:1):1,C:2):1,D:3):1,E:4):1,F:5):1,G:6):1,H:7);
  • tree ladder_cladogram = (((((((A,B),C),D),E),F),G),H);
  • tree ladder_uneven = (((((((A:1,B:2):1,C:2):1,D:4):1,E:4):1,F:6):1,G:6):1,H:8);
  • tree rake = (A:1,B:1,C:1,D:1,E:1,F:1,G:1,H:1);
  • tree rake_cladogram = (A,B,C,D,E,F,G,H);
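
Because all of these trees describe the same eight OTUs (A through H), a quick sanity check on a parser's output is to confirm that the leaf set and the parenthesis nesting survive a round trip. The Python sketch below is a simplification that assumes unquoted labels made of letters, digits, underscores and periods; it is not a full Newick parser, and the function names are hypothetical.

```python
import re


def leaf_labels(newick):
    """Return unquoted leaf labels in order of appearance.

    A leaf label directly follows "(" or ","; internal node labels follow
    ")" and are therefore skipped. Quoted labels are not handled.
    """
    return re.findall(r"[(,]\s*([A-Za-z0-9_.]+)", newick)


def parentheses_balanced(newick):
    """True if every "(" has a matching ")" and nesting never goes negative."""
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


bush = "(((A:1,B:1):1,(C:1,D:1):1):1,((E:1,F:1):1,(G:1,H:1):1):1);"
ladder = "(((((((A:1,B:1):1,C:2):1,D:3):1,E:4):1,F:5):1,G:6):1,H:7);"
inode = "(((A:1,B:1)AB:1,(C:1,D:1)CD:1)ABCD:1,((E:1,F:1)EF:1,(G:1,H:1)GH:1)EFGH:1);"
```

Trees with quoted internal node labels (such as bush_inode_labels_quoted2) still yield the correct leaves here, because only the labels that follow an opening parenthesis or a comma are collected.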

NEXUS files of DNA or protein data with a taxa block

TAXA, CHARACTERS, TREES. May contain an ASSUMPTIONS block with character weights.

Level | Result | Test | Description
II | | SPAN_Family1nlw | SPAN export file, family 1, nucleotides, weights in ASSUMPTIONS block. Internal node labels.
III | | SPAN_Family2alw | SPAN export file, family 2, amino acids, weights in ASSUMPTIONS block. Internal node labels.
III | | SPAN_Family4nl | SPAN export file, family 4, nucleotides. Internal node labels.
III | | SPAN_Family5alw | SPAN export file, family 5, amino acids, weights in ASSUMPTIONS block. Internal node labels.
III | | SPAN_Family7n | SPAN export file, family 7, nucleotides. Internal node labels.
III | | SPAN_Family8a | SPAN export file, family 8, amino acids. Internal node labels.
III | | FaresWolfeCCT | CCT chaperonin amino acids, tree; normalized by nextool; data from Mario Fares (Fares & Wolfe, 2003, Mol Biol Evol. 20: 1588-97)

NEXUS files of DNA or protein data with NO taxa block

OTUs are defined implicitly in the data block instead.

Level | Result | Test | Description
 | | UnaSmithHIV-both | HIV sequences. Fusion of NEXUS data and trees blocks from separate files supplied by Una Smith. No branch lengths.
 | | barns-combined | fusion of two files from Chuck Delwiche. No TAXA block; tree & data from separate files, names may not match
 | | Bird_Ovomucoids | MacClade example file. No TAXA block. Trees without branch lengths.
 | | Human_mt_DNA | MacClade example file. No TAXA block. Large data set. Trees without branch lengths. Multiple trees.
 | | Kingdoms_DNA | MacClade example file. No TAXA block. Trees without branch lengths.
 | | Marsupial_Wolf | MacClade example file. No TAXA block. Trees without branch lengths.
 | | Primate_mtDNA | MacClade example file. No TAXA block. Trees without branch lengths. Multiple trees.
 | | Omland-Orioles | nucleotide data from Kevin Omland. No TAXA, no TREES blocks. Nucleotides are written in sets of three, with spaces between each codon. Names may not match within data block.
 | | Omland-Ravens | nucleotide data from Kevin Omland. No TAXA, no TREES blocks. Nucleotides are written in sets of three, with spaces between each codon.
 | | Treebase-liverwort-rbcl | Treebase export. rbcL sequence data with tree from analysis of liverwort phylogeny by Lewis, et al., 1997. No TAXA block. No branch lengths.
 | | Treebase-horsetails-dna | dna sequence data with tree. No TAXA block. No branch lengths.
 | | Treebase-chlamy-dna | dna sequence data with tree. No TAXA block. No branch lengths.

NEXUS files of non-DNA non-protein data with a taxa block

Level | Result | Test | Description
 | | SPAN_Family3il | SPAN export file, family 1, intron presence/absence data. Internal node labels.
 | | SPAN_Family6iw | SPAN export file, family 6, intron presence/absence data, weights in ASSUMPTIONS block. Internal node labels.
 | | SPAN_Family9iw | SPAN export file, family 9, intron presence/absence data, weights in ASSUMPTIONS block. Internal node labels.
 | | SPAN_Family10i | SPAN export file, family 10, intron presence/absence data. Internal node labels.

NEXUS files of non-DNA non-protein data with NO taxa block

OTUs are defined implicitly in the data block instead

Level | Result | Test | Description
 | | Treebase-horsetails-classic | classic character data with tree. No TAXA block. No branch lengths.

Non-conformant files

Several of the test files have illegal elements, which are documented in the following table. This list allows developers to check the error-handling capabilities of their parsers. The list is not exhaustive because not all files have been checked. Page numbers are from the NEXUS standard, Maddison, Swofford and Maddison (1997). Temporary workspace created by Paul to avoid conflicts: NonConformantWorkspace.

filename caveats
barns-combined
  • double-quote is not a valid gap symbol (p. 599)
  • datatype must be first in format command (p. 599)
char-interleave
  • datatype must be first in format command (p. 599)
Omland-Ravens
  • taxon names not identical across interleave pages (p. 603)
quoted-strings1
  • single (not double) quotes should be used to define tokens with embedded spaces; double-quoted tokens are used in taxlabel, charlabel, matrix and tree commands (p. 621)

Supporting NEXUS via Enhanced Data IO

The usefulness of NEXUS files increases to the extent that users (including power-users) find it easy to store and exchange data in NEXUS format, and to move data in and out of NEXUS files. The text-based format of NEXUS makes it very portable between different platforms, and also makes it easy to manage versions via source code control (e.g., cvs).

Translation to other file formats

to do: fill in Bio::SeqAlignI using Bio::NEXUS. This will provide immediate access to diverse alignment formats.

Relational schema

Does anyone want to try this? Probably it would clarify the concepts involved and would suggest the kinds of rules consistent with normalization.

Application Support

Editing

Manipulation

Visualization

Clarifying, extending, and replacing NEXUS as a standard

Note: also see a proposal for formalization of NEXUS extensions by David Maddison.

Some questions

  • Do business rules allow (or require) enacting equate irreversibly?
  • Are "comments" comments or annotations?
  • Are square-bracketed strings in trees block comments?
  • Are redundant blocks allowed? How should the user represent mixed data?
  • Arbitrary annotations to nodes (as in NHX) or branches (extends NOTES?)
  • What if ntax says 20 but the characters block has 12 otus?
  • What if symbols are defined ambiguously, e.g., gap = missing = -?
  • Rooted and unrooted trees?

The X symbol

We need to clarify the issue of what an X means in DNA or Protein. Most people seem to use it as a code for ambiguity. The IUB code allows this.

In DNA:

X=>{A,C,G,T}
N=>{A,C,G,T}
?=>{A,C,G,T,-}

According to the NEXUS standard X should be treated as an N, and not a ?, but both PAUP and Mesquite seem to be treating it as ?.

In amino acids:

?=> {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*,-}

PAUP maps

X=> {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*,-}

In NCL we will not recognize X in amino acids. It is not clear to me if X should contain * or -. There is also a move to make X stand for selenocysteine [1].
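
The two readings of X in DNA can be made explicit in code. The Python sketch below encodes only the three DNA mappings listed above; the function name and the x_as_missing flag are hypothetical and are there only to show where the two interpretations diverge.

```python
DNA_BASES = frozenset("ACGT")


def expand_dna_symbol(symbol, x_as_missing=False):
    """Return the set of states that a DNA symbol may stand for.

    x_as_missing=False treats X like N (the reading attributed to the
    standard above); x_as_missing=True treats X like "?" (the behavior
    attributed above to PAUP and Mesquite).
    """
    symbol = symbol.upper()
    if symbol in DNA_BASES:
        return frozenset(symbol)
    if symbol == "N":
        return DNA_BASES
    if symbol == "?":
        return DNA_BASES | {"-"}
    if symbol == "X":
        return DNA_BASES | {"-"} if x_as_missing else DNA_BASES
    raise ValueError("symbol not covered by this sketch: %r" % symbol)


strict_x = expand_dna_symbol("X")
lenient_x = expand_dna_symbol("X", x_as_missing=True)
```

The only difference between the two results is whether the gap state is included, which is exactly the point on which implementations disagree.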

Appendix: Analysis of TreeBase submissions

7642 submissions to TreeBase. Converted from Mac to UNIX format using

# convert file in place, save backup with .orig extension
perl -p -i.orig -e 's/\r/\n/g' 

This still left 529 files apparently 0 lines in length. When I checked I found that some have apparently been translated with HTML escape codes for spaces, tabs and line-endings.

# shell command uses grep to find bad files, fix with perl, leave *.orig as backup
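# tabs become spaces, runs of CR/LF escapes become one newline;
# last substitution is an assumed intent: delete any remaining numeric escapes
for f in `egrep -l ";\&\#" *UL`; do perl -p -i.orig -e 's/\&\#9;/ /g; 
      s/\&\#1[03];(\&\#1[03];)*/\n/g; s/\&\#[0-9][0-9]*;?//g' $f; done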

This still leaves 129 files of length 0, i.e., no newline. There are also 22 files of length 1 line.

Overall, a very substantial fraction of these files, certainly over 10 %, are not valid NEXUS files. If this is a representative sample of user implementations of NEXUS files, we are in big trouble. I even saw a FASTA file in the batch. Over 500 files have the "#NEXUS" token more than once! Over 5699 of them, i.e., the majority, use the deprecated DATA block. Only 734 seem to have a TAXA block. This means that there are over 800 files with neither a DATA block nor a TAXA block. Some examples follow.
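
Counts of this kind can be gathered with a simple text scan. The Python sketch below is an assumption about how such a survey might be scripted, not the procedure actually used for the numbers above; it does not strip comments first, so a block name that appears only inside a comment would be miscounted.

```python
import re


def survey_nexus(text):
    """Report the #NEXUS token count and the blocks declared in one file."""
    blocks = [name.upper()
              for name in re.findall(r"\bbegin\s+(\w+)\s*;", text, flags=re.IGNORECASE)]
    return {
        "nexus_tokens": len(re.findall(r"#NEXUS", text, flags=re.IGNORECASE)),
        "blocks": blocks,
        "uses_data_block": "DATA" in blocks,   # deprecated block
        "has_taxa_block": "TAXA" in blocks,
    }


sample = ("#NEXUS\nbegin data;\n dimensions ntax=2 nchar=3;\nend;\n"
          "#NEXUS\nBegin Trees;\nend;\n")
report = survey_nexus(sample)
```

For the sample text, the report flags a repeated "#NEXUS" token, a DATA block, and no TAXA block, which are the three conditions tallied above.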

Illegal OTU names (should not start with numbers, should not contain dashes unless protected in single-quoted string)

begin data;
 dimensions ntax=154 nchar=317;
 format datatype=dna interleave missing=-;
  matrix
   187-87  AACCC-TTTGTGAACTTATA CCTAT-C-GTTGCC-TCGGC G-AAGG-CCGGCC-AGGCCC 
       I4  AACCC-TTTGTGAACTTATA CCTAT-CTGTTGCC-TCGGC G-AAGG-CCGGCC-AGGCCC 
    21-99  AACCC-TTTGTGAACTTATA CCT-TACTGTTGCC-TCGGC GC-AGG-CCGGCC-GGGCCC 
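
Names such as 187-87 and 21-99 can be protected by writing them as single-quoted tokens. The Python sketch below is deliberately conservative: it quotes any label that contains a character other than a letter, digit, underscore or period, and any label that starts with a digit. That rule is an assumption chosen for safety rather than a transcription of the punctuation list in the standard, and the function name is hypothetical.

```python
import re

# Labels made only of these characters are written without quotes.
_SAFE = re.compile(r"^[A-Za-z0-9_.]+$")


def nexus_quote(label):
    """Return label as a NEXUS token, single-quoted when needed.

    An embedded single quote is escaped by doubling it.
    """
    if _SAFE.match(label) and not label[0].isdigit():
        return label
    return "'" + label.replace("'", "''") + "'"


quoted = [nexus_quote(name) for name in ["187-87", "I4", "21-99", "OTU C"]]
```

With this rule the three names in the matrix above would be written as '187-87', I4 and '21-99'.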

References

Huelsenbeck, J.P. and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17: 754-755.

Huson, D.H. and D. Bryant. 2006. Application of Phylogenetic Networks in Evolutionary Studies. Mol. Biol. Evol. 23(2): 254-267. (SplitsTree4)

Huson, D.H., D.C. Richter, C. Rausch, T. Dezulian, M. Franz and R. Rupp. 2007. Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8: 460. (Dendroscope)

Lewis, P.O. 2003. NCL: A C++ class library for interpreting data files in NEXUS format. Bioinformatics 19: 2330-2331. (Sourceforge)

Iglesias, J.R., G. Gupta, D. Ranjan, E. Pontelli, and B. Milligan. 2001. Logic Programming Technology for Interoperability between Bioinformatics Software Tools. In Symposium on Practical aspects of Declarative Languages, pp. 153-168. Springer-Verlag.

Maddison, D.R., D.L. Swofford, and W.P. Maddison. 1997. NEXUS: an extendible file format for systematic information. Systematic Biology 46: 590-621.

Maddison, W.P. and D.R. Maddison. 1992. MacClade, Version 3: Analysis of Phylogeny and Character Evolution. Sinauer Associates, Sunderland, Mass.

Maddison, W. and D. Maddison. 2000. Mesquite: A modular Programming System for Evolutionary Analysis. University of Arizona, http://spiders.arizona.edu/mesquite.

Morell, V. 1996. TreeBASE: The roots of phylogeny. Science 273: 569-569.