Difference between revisions of "R Hackathon 1/Phylobase"

From Phyloinformatics
(major edits 1)
(major edits 2)
Line 11: Line 11:
 
=== Finish data classes ===
 
  
* '''Data model'''There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, that we should create a new data class (there is some more discussion of this over in [[R Hackathon 1/Data Standards|Data Standards]]).  There's a draft standard in pdata.R in the current package.
+
* Tree model:
 +
** root node data characteristics (compare what OUCH does): root node doesn't have corresponding entry in the edge matrix:  ouch adds a row to the edge matrix (1-row data frame matching node column info). Not entirely clear how we should deal with this, MB has a proposal.
 +
 
 +
* Data model: There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, that we should create a new data class (there is some more discussion of this over in [[R Hackathon 1/Data Standards|Data Standards]]).  There's a draft standard in pdata.R in the current package.
 
** "Basic" metadata (slot 'type' in the data class) is implemented as a factor with allowed levels ("multitype","binary","continuous","DNA","RNA","aacid","other","unknown"). There's an argument for sticking with the NEXUS types exactly, and one for extending them.
 
 
** "Extended" metadata (slot 'metadata' in the data class) is implemented as an arbitrary list.
 
Line 39: Line 42:
 
* assigning edge and node labels -- is there/should there be an option to turn this off?  assign empty ("") labels?
 
 
* should default node labels start at N(nTips+1), to match internal node numbers?
 
* information returned by check_data isn't quite the same format as geiger's matching function. Clear enough?
+
 
 
* should coercing phylo4d to phylo4 give a warning message about losing data?  Should phylo4d to phylo4?
 
 +
 +
=== Data checking ===
 
* should check_phylo4 check for cyclicity, or is this overkill?
 
 +
* what other checks should there be on a phylo4 tree?  right now we check for (1) right number of edge lengths (if >0), (2) right number of tip labels, (3) at most one root, (4) only one ancestor per node.
 +
* information returned by check_data isn't quite the same format as geiger's matching function. Clear enough?
 +
* should there be an option for pruning tree to data? does this already exist?
 +
* option for fuzzy matching with agrep?
 +
* extend label-matching code to node labels
  
 
=== End users ===
 
Line 52: Line 62:
 
* Does S4-style help access really work?
 
 
* is the phylo4 constructor actually documented anywhere?
 
 +
* work through Sidlauskas wiki page
  
 +
=== Plotting ===
  
 +
* other ways of plotting tree+data?  isoMDS + color space?
 +
* ggplot2 extension?  grid version?
 +
* modify phylo4 method to allow "Singles" (plot.phylo doesn't) ?
 +
* plotting with data and show.node.label scrambles order of node labels
 +
* plotting with subsetted data can throw error (see tests/plottest.R)
 +
* multiPhylo4x plotting
 +
* look for nice examples of ways people have plotted trees+data
  
 
=== Tree manipulation ===
 
  
We should look at the documentation group's lists of useful operations.
+
* Movement along branches, traversal, etc.: what operations are needed?
 
+
* General practice: how much should we work with/return internal node numbers, and how much node names/tags?  Current methods are a mishmash
* Movement along branches, traversal, etc.: set up a "tagged" form of a tree that saves the current position on the tree (as an attribute or a slot?) and can be modified by operations like Up, Down?
+
* allDescend gets tips only -- option to get list including internal nodes?
 
+
* Check O'Meara list of functions
Would it be more robust to have to specify a node in the function? So, for instance getAncestors(phy, 25) or getDescendents(phy, 25), where 25 is a node number. These could be used in recursions and it avoids the problem of maintaining or resetting a pointer on the tree. They would need to report NULL or something like it when falling off the end of the tree. --[[User:DavidOrme|DavidOrme]] 14:35, 12 December 2007 (EST)
+
* Check against Mesquite list of functions
 +
* from geiger: steal prune.extinct.taxa, prune.random.taxa?
 +
* Subsetting: by data characteristics or by tree
 +
** get subtree with all descendants of a node  
 +
** time slices: prune back by time ?
 +
* reordering (ordering may be an efficiency issue, but not for a while): all phylo4 edge matrices are internally stored as "cladewise"/Newick traversal order. Extend edges accessor method with optional argument that allows edges to be extracted in a different order (possibly with an attribute as in ape); need to steal 'reorder' from ape, extend for phylo4d keeping track of traits. Also see ape::ladderize
 +
* rooting
 +
** implement re-rooting
 +
** check with Sidlauskas: is there an "infelicity" in the way ape does it that we should modify?
  
 +
=== Other ===
 +
* options
 +
** phylo.compat: default=TRUE, allows accessing internal phylo4 elements with $ and $<-
 +
** check.data defaults
 +
** cf implementations in other packages (e.g. bbmle)
  
* Drop/prune/na.omit
+
=== General issues ===
* Subsetting: by data characteristics or by tree
+
* reimplementing methods as native phylobase rather than wrapping ape/ade4. How much? What criteria?
** get subtree with all descendants of a node
+
* synch with ape on name conflicts, multi.tree/multiPhylo
** get largest subtree containing species X and Y
+
* implement NAMESPACE?
** identify node graphically (locator/identify.node) to get subtree
+
* TreeLib?
** time slices: prune back by time
 
* (list from Wayne about what methods exist in Mesquite)
 
* What's available in other packages?
 
* node identification
 
* reordering (ordering may be an efficiency issue, but not for a while)
 
  
 
=== Testing ===
 
Line 79: Line 106:
 
* end users need to start reading documentation, testing ...
 
  
 +
=== I/O ===
 +
 +
* integrate ioNCL.  Looks like this may be a moderately big task, because of the complexity of incorporating the entire ncl library in the package ...
 
[[Category:R Hackathon 1]]
 

Revision as of 00:25, 29 December 2007

Phylobase package

[I've renamed this from phylo4, and I'm now transcribing bits and pieces of the TODO list, and a variety of my own questions; I think it will be easier to 'discuss' in this format than in the TODO list within the package within the SVN repository. --Bbolker 22:57, 28 December 2007 (EST)]

Class definitions

See Data Standards page for definitions and discussion of the data classes (phylo4[d]/multiPhylo4[d]), including arguments about internal structure and some details on methods.

To do

Finish data classes

  • Tree model:
    • root node data characteristics (compare what OUCH does): root node doesn't have corresponding entry in the edge matrix: ouch adds a row to the edge matrix (1-row data frame matching node column info). Not entirely clear how we should deal with this, MB has a proposal.
  • Data model: There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, that we should create a new data class (there is some more discussion of this over in Data Standards). There's a draft standard in pdata.R in the current package.
    • "Basic" metadata (slot 'type' in the data class) is implemented as a factor with allowed levels ("multitype","binary","continuous","DNA","RNA","aacid","other","unknown"). There's an argument for sticking with the NEXUS types exactly, and one for extending them.
    • "Extended" metadata (slot 'metadata' in the data class) is implemented as an arbitrary list.
    • There are two camps over how molecular data should be incorporated. Do we want two slots -- one for molecular and one for non-molecular -- or do we want to stick the molecular data (especially long alignments) in a data frame as (e.g.) a character? Doing it the first way makes it hard to do subsetting operations transparently; doing it the second way makes handling genetic data harder (and wastes space etc.). If we do it the first way we can just extend the data.frame class. A richer or more extensible data frame class would solve the problem, if one exists. If we design the accessor functions really well, and people use them rather than digging into the guts, we can change things later. It is possible to have a "raw" vector within a data frame (which is the base type for "DNAbin" objects in ape), but we can't have a DNAbin object itself. We could either convert to/from DNAbin as we accessed slots from the data frame (ugh), or we could try to write the rich data frame. Do the people at the R Genetics project have any ideas about this?
    • do we want to allow an accessor method to do the equivalent of tdata(object)[i,j] <- value ? or should people mess with their data before they attach it to the tree? (is this harder once the data are packed inside an additional layer of structure?)
  • multiPhylo classes
    • the basic idea for these is defined, but we are missing most of the methods
    • need appropriate conversions. ctree (cbindtree?) should collapse a list of phylo4x/multiPhylo4x objects into a single multiPhylo4x object (defaulting to the "richest" of its constituents -- i.e. multiPhylo4d unless all items are phylo4/multiPhylo4)
    • going back the other way, would like something like unpack() ? to produce a list of trees from a single multiPhylo4x object
    • I'm not sure if the argument about whether multiPhylo can or should formally extend phylo is settled, but I don't think it can/should [--Bbolker 22:57, 28 December 2007 (EST)]
    • show, summary, plot methods can mimic ape's new multiphylo class (although we will have to do something different for multiPhylo4d, to match what we do for phylo4d)
    • eventually may want to use something like the [trackObjs package] if we have lots and lots of trees -- although I don't know if this will actually work well here
    • need to think about whether we want to write some kind of apply() function for these -- or, alternatively, just provide documentation for how you would use [ls]apply on a multiPhylo object (sapply(object@phylolist,fn,data=tdata(object),...))
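
To make the data-model discussion concrete, here is a minimal sketch of a data class with a 'type' factor and a free-form 'metadata' list. It is not the draft in pdata.R: the class name pdataSketch, the constructor, and the rule of one 'type' entry per data column are all assumptions made for illustration.

```r
library(methods)

# Allowed levels for the "basic" metadata slot, as listed above.
pdata_types <- c("multitype", "binary", "continuous", "DNA", "RNA",
                 "aacid", "other", "unknown")

# Hypothetical class; the slot names 'type' and 'metadata' follow the text,
# the 'data' slot and the validity rules are assumptions.
setClass("pdataSketch",
         representation(data = "data.frame",
                        type = "factor",
                        metadata = "list"),
         validity = function(object) {
           if (!identical(levels(object@type), pdata_types))
             return("'type' must be a factor with the allowed levels")
           if (any(is.na(object@type)))
             return("'type' contains a value outside the allowed levels")
           if (length(object@type) != ncol(object@data))
             return("'type' needs one entry per data column (assumed rule)")
           TRUE
         })

# Hypothetical constructor: coerces 'type' to a factor with the fixed levels,
# so an unknown type becomes NA and is rejected by the validity check.
pdataSketch <- function(data, type, metadata = list()) {
  new("pdataSketch", data = data,
      type = factor(type, levels = pdata_types),
      metadata = metadata)
}

x <- pdataSketch(data.frame(size = c(1.2, 3.4), winged = c(0, 1)),
                 type = c("continuous", "binary"),
                 metadata = list(source = "made-up example"))
```

Keeping the allowed levels in one vector means that switching to the exact NEXUS types, or extending them, would be a one-line change as long as users go through the constructor and accessors.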

Terminology

Some things to think about (re)naming:

  • tree walking: getDescend, getAncest, allDescend, allAncest -- sons and fathers, daughters and mothers, tips/leaves, ?
  • should check_data be check_phylo4d for consistency with check_phylo4 (or should check_phylo4 be check_tree?)
  • EdgeLength/BranchLength?
  • root, Root, rootNode, RootNode?
  • subset(phy,node.subtree= ) = subset(phy,subtree= ) ?

Default behaviors

  • assigning edge and node labels -- is there/should there be an option to turn this off? assign empty ("") labels?
  • should default node labels start at N(nTips+1), to match internal node numbers?
  • should coercing phylo4d to phylo4 give a warning message about losing data? Should phylo4d to phylo4?
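
For the node-label question, a small sketch of what "start at N(nTips+1)" would look like. The helper name default_node_labels is hypothetical and the "N" prefix is taken from the bullet above; this is not existing phylobase code.

```r
# Hypothetical helper: label internal nodes so that the label number
# equals the internal node number (tips are assumed to be 1..n_tips).
default_node_labels <- function(n_tips, n_internal) {
  paste0("N", n_tips + seq_len(n_internal))
}

default_node_labels(5, 4)  # "N6" "N7" "N8" "N9"
```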

Data checking

  • should check_phylo4 check for cyclicity, or is this overkill?
  • what other checks should there be on a phylo4 tree? right now we check for (1) right number of edge lengths (if >0), (2) right number of tip labels, (3) at most one root, (4) only one ancestor per node.
  • information returned by check_data isn't quite the same format as geiger's matching function. Clear enough?
  • should there be an option for pruning tree to data? does this already exist?
  • option for fuzzy matching with agrep?
  • extend label-matching code to node labels
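
As a reference point for the questions above, the four current checks and the agrep idea can be sketched in base R. This is not the actual check_phylo4 or check_data code: the function names are hypothetical, and the edge matrix is assumed to be ape-style (column 1 ancestor, column 2 descendant, no row for the root).

```r
# Hypothetical sketch of checks (1)-(4); returns a character vector of
# problems, empty if none are found.
check_tree_sketch <- function(edge, edge_length = numeric(0),
                              tip_label = character(0)) {
  anc  <- edge[, 1]
  desc <- edge[, 2]
  tips  <- setdiff(desc, anc)   # nodes that are never an ancestor
  roots <- setdiff(anc, desc)   # nodes that are never a descendant
  problems <- character(0)
  if (length(edge_length) > 0 && length(edge_length) != nrow(edge))
    problems <- c(problems, "wrong number of edge lengths")
  if (length(tip_label) != length(tips))
    problems <- c(problems, "wrong number of tip labels")
  if (length(roots) > 1)
    problems <- c(problems, "more than one root")
  if (anyDuplicated(desc) > 0)
    problems <- c(problems, "node with more than one ancestor")
  problems
}

# Hypothetical fuzzy matcher: for each data row name, return the single
# approximately matching tip label, or NA if there are zero or several hits.
match_labels_fuzzy <- function(data_names, tip_label, max.distance = 0.1) {
  sapply(data_names, function(nm) {
    hit <- agrep(nm, tip_label, max.distance = max.distance, value = TRUE)
    if (length(hit) == 1) hit else NA_character_
  })
}

# Tree ((A,B),C): tips 1-3, root 4, internal node 5.
edge <- rbind(c(4, 5), c(5, 1), c(5, 2), c(4, 3))
check_tree_sketch(edge, tip_label = c("A", "B", "C"))
match_labels_fuzzy("Homo_sapens", c("Homo_sapiens", "Pan_troglodytes"))
```

Note that agrep does approximate substring matching, so short names can hit several tip labels; the sketch treats that as "no match" rather than guessing.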

End users

  • What do people like to do in order to check trees and data?
  • USE CASES

Documentation

  • Does S4-style help access really work?
  • is the phylo4 constructor actually documented anywhere?
  • work through Sidlauskas wiki page

Plotting

  • other ways of plotting tree+data? isoMDS + color space?
  • ggplot2 extension? grid version?
  • modify phylo4 method to allow "Singles" (plot.phylo doesn't) ?
  • plotting with data and show.node.label scrambles order of node labels
  • plotting with subsetted data can throw error (see tests/plottest.R)
  • multiPhylo4x plotting
  • look for nice examples of ways people have plotted trees+data

Tree manipulation

  • Movement along branches, traversal, etc.: what operations are needed?
  • General practice: how much should we work with/return internal node numbers, and how much node names/tags? Current methods are a mishmash
  • allDescend gets tips only -- option to get list including internal nodes?
  • Check O'Meara list of functions
  • Check against Mesquite list of functions
  • from geiger: steal prune.extinct.taxa, prune.random.taxa?
  • Subsetting: by data characteristics or by tree
    • get subtree with all descendants of a node
    • time slices: prune back by time ?
  • reordering (ordering may be an efficiency issue, but not for a while): all phylo4 edge matrices are internally stored as "cladewise"/Newick traversal order. Extend edges accessor method with optional argument that allows edges to be extracted in a different order (possibly with an attribute as in ape); need to steal 'reorder' from ape, extend for phylo4d keeping track of traits. Also see ape::ladderize
  • rooting
    • implement re-rooting
    • check with Sidlauskas: is there an "infelicity" in the way ape does it that we should modify?
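
One possible shape for the traversal operations, working purely on node numbers and offering the "include internal nodes" option asked about above. The function names are hypothetical (not the existing getDescend/allDescend methods), and the sketch assumes a valid, acyclic ape-style edge matrix with column 1 as ancestor and column 2 as descendant.

```r
# Hypothetical helpers on a two-column edge matrix.
children <- function(edge, node) edge[edge[, 1] == node, 2]

# All descendants of 'node' in depth-first order; tips only by default.
all_descendants <- function(edge, node, tips_only = TRUE) {
  out <- integer(0)
  for (k in children(edge, node)) {
    is_tip <- length(children(edge, k)) == 0
    if (is_tip || !tips_only) out <- c(out, k)
    out <- c(out, all_descendants(edge, k, tips_only))
  }
  out
}

# Ancestors of 'node' from parent up to the root; empty for the root itself.
# Assumes each node has at most one ancestor.
all_ancestors <- function(edge, node) {
  out <- integer(0)
  repeat {
    parent <- edge[edge[, 2] == node, 1]
    if (length(parent) == 0) break
    out <- c(out, parent)
    node <- parent
  }
  out
}

# Tree ((A,B),C): tips 1-3, root 4, internal node 5.
edge <- rbind(c(4, 5), c(5, 1), c(5, 2), c(4, 3))
all_descendants(edge, 4)                     # tips below the root
all_descendants(edge, 4, tips_only = FALSE)  # internal nodes included
all_ancestors(edge, 1)                       # path from tip 1 to the root
```

Returning a zero-length vector when walking off the root or below a tip keeps the functions usable in recursions without maintaining any pointer on the tree; a subtree could then be extracted by keeping only the edges whose descendant is in all_descendants(edge, node, tips_only = FALSE).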

Other

  • options
    • phylo.compat: default=TRUE, allows accessing internal phylo4 elements with $ and $<-
    • check.data defaults
    • cf implementations in other packages (e.g. bbmle)

General issues

  • reimplementing methods as native phylobase rather than wrapping ape/ade4. How much? What criteria?
  • synch with ape on name conflicts, multi.tree/multiPhylo
  • implement NAMESPACE?
  • TreeLib?

Testing

  • end users need to start reading documentation, testing ...

I/O

  • integrate ioNCL. Looks like this may be a moderately big task, because of the complexity of incorporating the entire ncl library in the package ...