R Hackathon 1/Data Standards

From Phyloinformatics

Here is the page for discussing the establishment of data standards for R.

Requirements

Tree standard

  • rooting
  • branch lengths
  • tree weights
  • information about ancestral state reconstructions/assignments
  • labels

Network standard

  • same as tree, plus...
  • branch widths? (e.g., 90% of the genes in a hybrid come from parent A, 10% from parent B)

Continuous data

  • real numbers (i.e., not just non-negative)
  • allow information about sample size, range, variance? (described in NEXUS standard but rarely if ever used)
  • character labels ("body mass")

Discrete data

  • different number of states at different sites
  • character labels ("wings")
  • state labels (0=wingless, 1=winged)

Model descriptions

See also the Transition Model Language specification being developed by the EvoInformatics Working Group at NESCent.

  • continuous model (e.g., OU, BM)
  • discrete model (e.g., HKY, irreversible)

Analysis settings

  •  ?

Proposed standards

Original Discussion

Tree

phylo (APE)

Pros:

  • Already convertible (and documented)
  • Can be read from/written to Newick and NEXUS files
  • extensive plotting functionalities
  • tree manipulations (root, unroot, drop tips, graft trees)
  • other high-level functions (resolving multichotomies, extracting branching times, testing for rootedness, ...)
  • internal structures can be passed to C easily
  • tree-building methods from distances (NJ, BIONJ, FastME)
  • some tree dating methods (discussion has already started on improving them)
  • extensive comparative and diversification methods
  • can be extended easily
  • functions to get bipartitions from a list of trees (recently improved with C code so that getting frequencies can be fast).

Cons:

  • Maybe not as efficient to handle in C as the traditional "struct of pointers" (in fact, this needs to be checked)
    • My experience is that this is the case. However, the C format really requires recursive functions, which (in my experience) are very slow in R. --Lukeh@uidaho.edu 03:02, 30 November 2007 (EST)
      • Recursive R functions can be very fast. Maybe you mean "handling recursive structures in R" like lists? I really meant handling the 'edge' matrix in C. R has no pointers, so the comparison is asymmetric. --Emmanuel 13:01, 30 November 2007 (EST)
  • tree manipulations actually limited

phylog (ADE4)

Pros:

  • Various methods/graphics implemented: plot.phylog, radial.phylog
  • matrices of proximities among taxa defined by Abouheif's test
  • representation of quantitative traits along with phylogeny (dotchart.phylog, symbols.phylog, table.phylog)
  • Conversion from Newick format, but also hclust (hierarchical clustering) or taxo (taxonomy).

Cons:

  • Heavy: instead of simply containing the data of a phylogeny, the class 'phylog' stores various other information (Abouheif's 'A' matrix, distance matrix, etc.).

Networks

  • Are these networks similar to the spatial ones? If yes, the class 'nb' from the package 'spdep' is a standard. If no, how do they differ?
    • I don't think they're similar: a network in this context is something that would be a tree but for reticulation (i.e., a phylogeny of sunflowers where one hybrid sunflower species has two parent species, forming a cycle in the graph), not a set of geographic localities. There are some references at Wikipedia. --Bco 14:29, 20 November 2007 (EST)

I've got interest in this, but had no time to dig... --Emmanuel 11:03, 6 December 2007 (EST)

Continuous data

It would be useful to standardize continuous data (and discrete data) as vectors with labels, where the labels are identical to the tip labels. This would allow users to submit the same data vector for different subtrees, without continually subsetting and double-checking to make sure that the data and the tips are in the same order.

  • That's really standard R stuff, and used a lot in ape at different levels. Something probably useful to consider would be matching lists of names with some fuzziness, so that "Home sapiens" would be proposed as a match to "Homo sapiens". --Emmanuel 13:12, 30 November 2007 (EST)
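
To make the labeled-vector idea concrete, here is a minimal base-R sketch. The taxon names and values are made up for illustration, and using agrep() for the fuzzy step is only one possible choice, not something agreed on this page.

```r
# Hypothetical tip labels for a subtree, and a trait vector in a different
# order, with one extra taxon and one misspelled name.
tip_labels <- c("Homo sapiens", "Pan troglodytes", "Gorilla gorilla")
body_mass  <- c("Gorilla gorilla" = 120, "Home sapiens" = 65,
                "Pan troglodytes" = 45, "Pongo abelii" = 70)

# Exact matching by name: subset and reorder the data to follow the tips.
matched <- body_mass[tip_labels]
names(matched) <- tip_labels
missing_tips <- tip_labels[is.na(matched)]

# For each unmatched tip, propose approximate matches among the data names
# (agrep's default tolerance is 10% of the pattern length).
suggestions <- lapply(missing_tips, function(tip) {
  agrep(tip, names(body_mass), value = TRUE)
})
names(suggestions) <- missing_tips
suggestions
```

The same data vector can be reused for any subtree by changing tip_labels only; whether a proposed fuzzy match should be accepted automatically is left open here.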

Discrete data

It sounds like the 'factor' data type in R is ideal for this: data are stored internally as integers, and an additional attribute stores the level labels. There are many functions to manipulate factors in R. --Emmanuel 07:17, 22 November 2007 (EST)
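
A small sketch of how factors would cover the requirements listed for discrete data (character labels, state labels, different numbers of states per character). The taxa and scores are invented for the example.

```r
# Hypothetical character "wings" scored 0/1 for four taxa.
scores <- c(t1 = 0, t2 = 1, t3 = 1, t4 = 0)
wings  <- factor(scores, levels = c(0, 1), labels = c("wingless", "winged"))

levels(wings)      # state labels
as.integer(wings)  # internal integer codes (1-based, so 0/1 become 1/2)

# Characters with different numbers of states can sit side by side in a
# data frame; the column names act as character labels.
chars <- data.frame(
  wings  = wings,
  colour = factor(c("red", "blue", "green", "red")),
  row.names = names(scores)
)
sapply(chars, nlevels)
```

Note that the internal codes start at 1, so the original 0/1 coding is only kept in the labels if one chooses to store it there.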

Model descriptions

Analysis settings

New standard 1

S4 class(es). Tree class is just phylo converted to S4. Question: Should we standardize on node ordering (for easy pre- or post-order tree traversal)? --Bco 17:48, 30 November 2007 (EST)

Flavor 1

  • Tree class
  • Class for data (labeled vector)
  • Combined tree+data class

Flavor 2

  • Tree class
  • Class for data (data frame)
  • Combined tree+data class

Flavor 3

  • Tree class only

Comments

  • Advantage of a class including (tree+data): consistency between data and tree can be checked easily, especially since it is S4.
  • A class including (tree+data) should be able to store any kind of data: list (and therefore data.frames) are acceptable, maybe not vectors.

--Jombart 04:56, 3 December 2007 (EST)

  • If we decide that the (tree+data) class is the preferred standard for getting data into and out of functions, we might nonetheless give users some flexibility by creating an easy generating function that accepts a data.frame or labeled vector for the data and a phylo object for the tree, then creates the extended object on the fly. This would allow users to use a single dataset for all analyses, as the generating function would standardize the object.
  • If the (tree+data) class is a straightforward multiple extension of a data class (currently not extant) and the phylo class, with some format-checking to help users out, then it would be very simple for advanced users to generate this class on the fly themselves, which might speed analyses of large numbers of datasets and trees up by allowing the user to decide what needs to be pruned or rearranged rather than making the generating function go through this decision-making process at every tree-dataset combination.

ahipp 11:12, 5 December 2007 (EST)
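
A rough sketch of what such a generating function could look like. Everything here is hypothetical: the class names treeDemo and treeDataDemo and the function make_tree_data are invented for illustration, and the tree class is reduced to its tip labels rather than being a real S4 version of phylo.

```r
library(methods)

# Stand-in tree class holding only tip labels (an assumption for this sketch).
setClass("treeDemo", representation(tip.label = "character"))

# Combined tree+data class; the validity check enforces label consistency.
setClass("treeDataDemo",
         representation(tree = "treeDemo", data = "data.frame"),
         validity = function(object) {
           if (!identical(rownames(object@data), object@tree@tip.label))
             return("row names of data must equal the tip labels, in order")
           TRUE
         })

# Hypothetical generating function: accepts a named vector or a data.frame,
# then subsets and reorders the rows to match the tips of the tree.
make_tree_data <- function(tree, data) {
  if (!is.data.frame(data)) data <- data.frame(value = data)
  missing_tips <- setdiff(tree@tip.label, rownames(data))
  if (length(missing_tips) > 0)
    stop("no data for tips: ", paste(missing_tips, collapse = ", "))
  new("treeDataDemo", tree = tree,
      data = data[tree@tip.label, , drop = FALSE])
}

tree <- new("treeDemo", tip.label = c("A", "B", "C"))
x <- c(C = 3.1, A = 1.2, D = 9.9, B = 2.4)  # extra taxon D, shuffled order
td <- make_tree_data(tree, x)
td@data
```

Advanced users could skip make_tree_data and call new() directly on data they have already pruned, in which case only the validity check runs.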

As an example, here's how Aaron designed the tree and data representation in OUCH. The tree is stored in an "ouchtree" object (S4) created by the ouchtree() function.

  • The user makes an ouchtree by supplying three pieces of information: node names, corresponding ancestors, and times of the nodes.
  • This information can be supplied via a dataframe or three vectors, so long as there is a rownames() or a names() attribute with the node names (thus, the order of the nodes in the vectors doesn't have to match up, since they are checked for the proper node name).

The actual ouchtree object contains a number of slots (variables), which are bits of information obtained from parsing the tree and useful for OUCH's computations (e.g., branch times, which nodes are tips, epochs, etc.), but the user doesn't need to worry about these.
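
The name-based matching described above can be sketched in base R as follows. This is not OUCH code: align_nodes is an illustrative helper and the five-node tree is invented; it only shows why the input order stops mattering once every vector carries node names.

```r
# Hypothetical inputs: ancestors and times supplied in different orders,
# each carrying the node names in its names() attribute.
ancestors <- c(n1 = NA, n2 = "n1", t1 = "n2", t2 = "n2", t3 = "n1")
times     <- c(t3 = 1.0, t1 = 1.0, n2 = 0.6, t2 = 1.0, n1 = 0.0)

# Align everything on the node names (illustrative helper, not from OUCH).
align_nodes <- function(ancestors, times) {
  nodes <- names(ancestors)
  stopifnot(!is.null(nodes), setequal(nodes, names(times)))
  data.frame(node = nodes,
             ancestor = unname(ancestors),
             time = unname(times[nodes]),
             stringsAsFactors = FALSE)
}

tree_table <- align_nodes(ancestors, times)

# One piece of derived information: tips are the nodes that never appear
# as an ancestor.
tips <- setdiff(tree_table$node, tree_table$ancestor)
tips
```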

The ouchtree class is extended by "hansentree" and "browntree" classes.

  • These objects are created when the user fits an OU or a Brownian-motion model to the data and tree by calls to the hansen() or brown() functions.
  • These functions require as arguments the data (with node labels), the tree (ouchtree object), the model that the user wishes to fit, and options.

The hansentree and browntree objects summarize the entire analysis, as they contain not only the

  • tree
  • data
  • regimes (these are our hypothesized arrangement of selective regimes on the phylogeny)
  • but also the call (containing the model and all specified parameters)
  • all of the parameter estimates + likelihood
  • optimization method chosen, etc.

Thus it is easy to (1) see exactly what you did, (2) rerun the analysis, or (3) modify the calls to brown() or hansen() using R's update() function or whatever.
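
To illustrate the update() pattern without depending on OUCH, here is the same idea with lm() from base R standing in for hansen() or brown(). The point is only that a fitted object which stores its call can be inspected and refit with modifications.

```r
# lm() is used purely as a stand-in for a model-fitting function that
# stores its call in the returned object.
fit1 <- lm(dist ~ speed, data = cars)

fit1$call                                 # (1) see exactly what was run
fit2 <- update(fit1, . ~ . + I(speed^2))  # (3) modify the call and refit
fit3 <- eval(fit1$call)                   # (2) rerun the original analysis

formula(fit2)
```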

So, in terms of generalizing, an easy-to-implement, straightforward solution might be to have a basic tree class and an agreed-upon format for data (vectors with node labels in the names() attribute, for example).

  • Developers can either extend the tree class to suit their own needs,
  • And/or write translation code between their format and the standard
  • The only issue is whether you want to pass output from one package to another. Is there any output that would commonly be swapped back and forth? Other than the trees or data themselves, I can't really think of anything.

--Marguerite 02:59, 6 December 2007 (EST)

New standard 2

ANYONE?

Chosen standard

Generalized analyses (uses of objects by endusers)

Here, I tried to describe the generalized tasks that users may want to perform with the proposed objects (or what kinds of analyses will comparative biologists run, and what would be helpful?). Please comment.

Inputs:

  • one tree, one dataset -> standard comparative analysis
  • one tree, multiple data vectors ->
    • simulated data to generate distribution of scores for statistical tests
    • multivariate comparative study? Still challenging, may involve some form of data reduction (PCA) or model reduction (of the evolutionary model)
      • there is an example of something like this in the Geiger function ic.sigma --Lukeh@uidaho.edu 00:00, 13 December 2007 (EST)
    • data mining (scary, statistically speaking)
    • additional input vectors which map to the tree (for example, adaptive regimes -- statistically, these are parameters to the model and therefore form part of the hypothesis, but in terms of computation, they are input parameters, and thus look like data objects)
  • multiple trees, one dataset -> test of phylogenetic uncertainty

Outputs:

  • useful to have the phylogeny, data, and output wrapped together, especially if models are fit. If independent contrasts, then useful to have contrasts wrapped with output. There is probably something that makes sense as an "output" for each type of "processing" or model fitting.
  • Specific output can have relevant summary statements.
  • separate plot methods for data output (i.e. pic or ancestral character states) and tree (plain tree, or tree with some form of data or adaptive regimes or location of suspected variation in evolutionary rate added -- i.e., some form of variation in the evolutionary process that enters the analysis as a hypothesis) (probably done by each package developer, but maybe some generic makes sense? which would be easier for people to extend?).

Print/plot of input:

  • print/show data, mainly for checking, make sure input was translated correctly.
  • print/show tree, mainly for checking
  • plot tree (again, needed for checking) -- may need to plot tree with node numbers for editing/pruning the tree or other manipulations

--Marguerite 00:38, 12 December 2007 (EST)

Components and methods of data standards

In the following two sections, I try to further formalize the proposed data standard. Please correct, comment, etc. Everything is intended in the S4 paradigm.

Classes and corresponding slots

  • Class phylo4

i.e. class phylo converted to S4; slots description taken from 'read.tree' manpage.

    • @edge (required): a two-column matrix of mode numeric where each row represents an edge of the tree; the nodes and the tips are symbolized with numbers; the tips are numbered 1, 2, ..., and the nodes are numbered after the tips. For each row, the first column gives the ancestor. REQUIRED. We think edge will always be stored in "cladewise"/Newick traversal order, but eventually (after we steal reorder from ape), the accessor method for edges should allow the user to pull them out in a different order. (This would be inefficient if one were doing a lot of processing within phylo4.)
    • @edge.length (optional): a numeric vector giving the lengths of the branches given by 'edge'. May have length 0 (no branch lengths), in which case the $ operator will return NULL (so it will work with existing ape methods).
    • @Nnode (required, but computed by the constructor): the number of (internal) nodes
    • @tip.label (required, but set by default): a vector of mode character giving the names of the tips; the order of the names in this vector corresponds to the (positive) number in 'edge'. Set to T1:T{nTips} by default.
    • @node.label (required, but set by default): a vector of mode character giving the names of the nodes; set to N1:N{nNodes} by default.
      • TRICKY DECISION BECAUSE: it automatically adds overhead to the phylo object that would presumably be exported with the tree (likely undesirable). An additional issue could occur when plotting a tree: if we fill in missing node labels, plots will be polluted with generic label names. The primary benefit of automatically adding node labels is that keeping track of nodes, even through manipulation, is easier. --Pdc 16:19, 12 December 2007 (EST)
      • You must have node labels or things will get very confusing when trying to associate the data with the correct part of the tree. With regard to plotting, couldn't there be a toggle for plotting node labels (default=off)? --Marguerite 18:04, 13 December 2007 (EST)
      • I think this is settled now. The node labels are set by default only if the node labels are not specified on creation. However, node labels provided by the user don't have to be unique (e.g. c("mammals","","","","","birds","")). When we rearrange the tree (i.e. subset/prune/drop.tips), we create "tags" for the nodes that are guaranteed to be unique, so we can do the bookkeeping properly, but those are created and discarded every time we rearrange. --Bbolker 10:36, 14 December 2007 (EST)
    • @edge.label (required, but set by default): a vector of mode character giving the names of the edges; set to values corresponding to the descendant by default.
      • Isn't this just tip label + node label? --Marguerite 18:04, 13 December 2007 (EST)
      • No, because there are situations where you might want to label edges and internal nodes separately (and unlike the data, you can't just make one column each for nodes and edges). --Bbolker 10:36, 14 December 2007 (EST)
    • @root.edge (optional): a numeric value giving the length of the branch at the root if it exists. NA if none, in which case the $ operator returns NULL.
    • Perhaps also a tree weight slot? This could store bootstrap weight, AIC weight, or posterior probability, etc. --Bco 10:28, 6 December 2007 (EST) This is worth thinking about. --Bbolker 10:36, 14 December 2007 (EST)
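
For orientation, a stripped-down sketch of the slots listed above as an S4 class with a constructor that fills in default labels. The class is deliberately named phylo4demo (a hypothetical name) to make clear it is not the actual implementation; @edge.label and a tree weight slot are left out to keep it short.

```r
library(methods)

setClass("phylo4demo",
         representation(edge = "matrix",
                        edge.length = "numeric",
                        Nnode = "integer",
                        tip.label = "character",
                        node.label = "character",
                        root.edge = "numeric"))

# Constructor: Nnode is computed, labels default to T1..Tn and N1..Nm.
phylo4demo <- function(edge, edge.length = numeric(0),
                       tip.label = NULL, node.label = NULL,
                       root.edge = NA_real_) {
  # Tips are the nodes that never appear in the ancestor column.
  ntips  <- length(setdiff(edge[, 2], edge[, 1]))
  nnodes <- length(unique(edge[, 1]))
  if (is.null(tip.label))  tip.label  <- paste0("T", seq_len(ntips))
  if (is.null(node.label)) node.label <- paste0("N", seq_len(nnodes))
  new("phylo4demo", edge = edge, edge.length = edge.length,
      Nnode = as.integer(nnodes), tip.label = tip.label,
      node.label = node.label, root.edge = root.edge)
}

# ((T1,T2),T3): tips are numbered 1..3, the root is 4, the other internal
# node is 5; rows are (ancestor, descendant) in cladewise order.
edge <- matrix(c(4, 5,
                 5, 1,
                 5, 2,
                 4, 3), ncol = 2, byrow = TRUE)
tr <- phylo4demo(edge, edge.length = c(0.5, 1, 1, 1.5))
tr@tip.label
tr@node.label
```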
  • Class phylo4d

storing comparative data, i.e. at least one phylogeny and one variable. Inherits the 'phylo4' class and adds slots for data.

    • @tree: an object of class 'phylo4'
    • @tipdata: a data.frame giving values for the tips, possibly (???) with attributes "type" (a factor with slightly extended information from the Nexus type data), "comment" (character vector of length 0, 1?, or equal to number of tips), "metadata" (an arbitrary list)
    • @nodedata: same but for nodes/edges

The @dat slot should be given some thought, as it needs to include (minimally) the mean or median and some estimate of measurement error; it should also probably include sample size for methods that use measurement error estimates or estimate measurement error directly. A data frame would provide the most flexibility, though it could be more complicated. For example, the data-only class could have as one slot the data frame with mean values and measurement error (0 if measurement error is being ignored) and sample size; a second optional slot containing the raw data in whatever form, to be drawn on by different methods as needed and to give the user a convenient way of keeping raw and summarized data together in the R project; and a third slot allowing comments on the data source (a character vector; this of course could also be accomplished by commenting the data-only object). ahipp 09:20, 7 December 2007 (EST)

If we have estimates of measurements of error for a trait, these could simply be added as another column ... --Bbolker 00:03, 13 December 2007 (EST)

Probably should have the ability to enter data either as vectors or as data frames. It can be frustrating to a new user if the function doesn't work just because they don't understand the difference. The vectors can be immediately turned into data frames internally. --Marguerite 00:43, 12 December 2007 (EST)

Here, in somewhat excruciating detail, are the design details behind the decision that we have reached to include the data slot simply as a data frame for the time being, with or without attributes that give types, comments, and other metadata. The two big challenges are (1) handling molecular data [potentially large and with specialized methods etc.] in conjunction with non-molecular data and (2) incorporating metadata while maintaining a clean, simply accessible design.

    • Multiple vs single slots: we could simply have several slots for each type of data (tip, node, edge?): e.g. @tip.data, @tip.molec.data, @tip.type, @tip.comments, @tip.metadata, but this leads to a lot of slots -- at least double this (one set for tips and one for internal nodes, if not for edges)
    • Single slots: if we use a single slot for each piece of data (tip data, node data), then we have several further decisions to make about packing in metadata and molec/other data
      • Extended data frame: we would ideally like a more flexible form of data frame (data frames can only include certain basic types --- numeric/integer/character/logical plus Dates and factors) that could include a vector (list) of any type provided that it was the appropriate length and had appropriate accessor methods. The concern is that building this class so that it was robust and provided enough of the functionality we can get automatically from data.frames (i.e. row subsetting, summary method, etc.) is too hard to do quickly.
      • data.frame with attributes: we can simply store (e.g.) alignments as character vectors, each alignment in one column of the data frame. This is a bit inefficient (8 bits per base pair vs 2 really needed), and clunky for real operations on genetic data, but maybe not too bad as a temporary storage mode? We can add attributes to tag types, comments, metadata, but this loses some of the benefits of S4 classes.
      • class extending a data frame: in principle, we should be able to extend a data frame by using something like setClass("pdata",representation(type="factor",comment="character",metadata="list"),contains="data.frame"), but this doesn't actually work -- argh. (It does seem to work with integers.) Best suggestion from r-devel is to have a slot @data and design the accessor methods so that [, [[ go immediately to @data[, @data[[ .
      • raw data.frame: pack molecular data in as character, forget about the metadata for now.

I lean towards the data frame with attributes (or, if possible, the extended data frame class).
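
A sketch of the r-devel suggestion mentioned above (a @data slot plus accessor methods that forward [ and [[ to it). The class name pdataDemo and the example values are hypothetical; only row/column subsetting and single-column extraction are forwarded here.

```r
library(methods)

setClass("pdataDemo",
         representation(data = "data.frame", type = "factor",
                        comment = "character", metadata = "list"))

# Forward row/column subsetting to the underlying data frame.
setMethod("[", "pdataDemo", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(nrow(x@data))
  if (missing(j)) j <- seq_len(ncol(x@data))
  x@data[i, j, drop = drop]
})

# Forward single-column extraction.
setMethod("[[", "pdataDemo", function(x, i, j, ...) x@data[[i]])

pd <- new("pdataDemo",
          data = data.frame(mass = c(1.2, 2.4, 3.1),
                            row.names = c("A", "B", "C")),
          type = factor("continuous"),
          comment = "hypothetical measurements",
          metadata = list(source = "made up for illustration"))

pd[c("A", "C"), ]  # row subset, returned as a data frame
pd[["mass"]]       # one column as a vector
```

In this sketch the subset comes back as a plain data frame, so the type, comment, and metadata slots are not carried along; whether [ should return a pdataDemo instead is a design decision left open.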


  • Class multiPhylo4

Currently defined as

    • @phylolist: a (possibly named) list of phylo4 or phylo4d objects. If they are phylo4d objects, then the tip.data slots (but not necessarily the node.data slots) are expected to be empty

    • @tip.data: tip data
    • @tip.label: tip labels

We have done very little with the multiPhylo4 class (conversion to and from multiPhylo is in principle implemented, but completely untested; didn't bother writing conversions from multi.tree since that will be going away), but the structure seems sensible.

Definitely needed, already exists in ape as c("multi.tree", "phylo") (but see my comment on the Data Reps page). --Emmanuel 10:58, 6 December 2007 (EST)

To be defined by Emmanuel, Damien, etc.

You can have an attribute 'labels' on the list, and the existence of it can be tested (that's what is done with the class "prop.part"). --Emmanuel 10:58, 6 December 2007 (EST)

    • @...

Umm, I'm not comfortable with the use of the @ operator in R, but it seems that a standard list would be simpler here, so that length(x) and names(x) would give you the same information as the first two slots, and reordering, sorting, and extracting can be done easily with the usual [ and [[ operators. Of course, such a list could be embedded in an S4 class. --Emmanuel 10:58, 6 December 2007 (EST)
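
A quick base-R illustration of the plain-list alternative, with stand-in "trees" that hold only tip labels (hypothetical objects, not phylo). It also shows one caveat: ordinary [ subsetting keeps the names but drops other attributes such as 'labels'.

```r
# Stand-in "trees": plain lists holding only tip labels.
trees <- list(boot1 = list(tip.label = c("A", "B", "C")),
              boot2 = list(tip.label = c("A", "B", "C")),
              boot3 = list(tip.label = c("A", "B", "C")))
attr(trees, "labels") <- c("A", "B", "C")

length(trees)  # number of trees
names(trees)   # tree names

# Reordering and extracting work with the usual operators.
subset_trees <- trees[c("boot3", "boot1")]
one_tree     <- trees[["boot2"]]

# The existence of the attribute can be tested, but it does not survive [.
is.null(attr(trees, "labels"))
is.null(attr(subset_trees, "labels"))
```

So a list-based design would still need its own [ method (or an enclosing S4 class) if shared tip labels are to be kept after subsetting.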

Methods

  • Class phylo4
    • .validPhylo4: a routine checking the validity of a phylo4 object, called by setValidity. [This is handled automatically by "validObject()", although more stringent checks are probably needed: check_phylo4 is the actual function --Bbolker 21:22, 28 December 2007 (EST)]
    • is.phylo4: function checking if an object inherits the class 'phylo4' and checks its validity [not needed: is(x,"phylo4") works automatically]
    • as(...,"phylo4"): conversion to phylo4
    • methods show/plot/summary: taken from ape standards but set as S4 methods (setMethod(...)). Can be refined; we need to ask endusers what should be implemented.
    • na.omit: same as the generic.
    • plot() same as phylo?
  • Class phylo4d
    • .validphylo4d: a routine checking the validity of a phylo4d object, called by setValidity. In particular, the length or dimensions of the @dat can be compared to the number of taxa in @phylo4. Note: it should also check that nodes and data match.
    • is.phylo4d: same as for the class phylo4
    • as.phylo4d: definitely useful, can take two args, or more (say k) and the 2,...,k are merged in @dat.
    • na.omit: same as the generic.
    • tdata: get data back out of phylo4d object
    • plot(): plot of dataframes (bivariate plots for checking)? [hmmm. not implemented ... --Bbolker 21:22, 28 December 2007 (EST)]
    • plot(): tree method? [we use plot.phylog for this --Bbolker 21:22, 28 December 2007 (EST)]
    • is phylo4d meant as an input class or an output class? Or perhaps a (hidden) base output class that is extended by package developers for their specific analyses (a convenient template)? --Marguerite 01:04, 12 December 2007 (EST) [I'm not sure what you mean here --Bbolker 21:22, 28 December 2007 (EST)]
  • Class multiPhylo4
    • .validMultiphylo4: as previously.
    • na.omit: same as the generic.
    • What will plot return for a multitree, or should it be prevented?--Marguerite 01:04, 12 December 2007 (EST) [Emmanuel has written a plot.multiPhylo, so we can use that ... --Bbolker 21:22, 28 December 2007 (EST)]

--Jombart 10:59, 11 December 2007 (EST)
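
As a sketch of how the validity-related items above fit together in S4: setValidity() registers the check, new() and validObject() run it, and is() replaces a hand-written is.phylo4. The class treeCheckDemo and its two checks are hypothetical and far less stringent than a real check would need to be.

```r
library(methods)

setClass("treeCheckDemo",
         representation(edge = "matrix", tip.label = "character"))

# Register the validity routine; it returns TRUE or a message string.
setValidity("treeCheckDemo", function(object) {
  if (ncol(object@edge) != 2)
    return("edge must be a two-column matrix")
  ntips <- length(setdiff(object@edge[, 2], object@edge[, 1]))
  if (length(object@tip.label) != ntips)
    return("number of tip labels must equal the number of tips")
  TRUE
})

edge <- matrix(c(3, 1,
                 3, 2), ncol = 2, byrow = TRUE)
ok <- new("treeCheckDemo", edge = edge, tip.label = c("A", "B"))
is(ok, "treeCheckDemo")  # class test, no custom function needed

# Direct slot assignment does not rerun the validity check ...
bad <- ok
bad@tip.label <- "A"
# ... so validObject() has to be called explicitly.
result <- tryCatch(validObject(bad), error = function(e) conditionMessage(e))
result
```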