Difference between revisions of "R Hackathon 1/Phylobase"

From Phyloinformatics
== To do ==
=== Finish data classes ===
The to-do section has now been merged with the TODO file inside the package; please refer to that.  Feature requests can be placed at the R-forge page [http://phylobase.r-forge.r-project.org/ phylobase].
* Tree model:
** root node data characteristics (compare with what ouch does): the root node doesn't have a corresponding entry in the edge matrix; ouch adds a row to the edge matrix (a 1-row data frame matching the node column info). It is not entirely clear how we should deal with this; MB has a proposal.
* Data model: There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, that we should create a new data class (there is some more discussion of this over in [[R Hackathon 1/Data Standards|Data Standards]]).  There's a draft standard in pdata.R in the current package.
** "Basic" metadata (slot 'type' in the data class) is implemented as a factor with allowed levels ("multitype","binary","continuous","DNA","RNA","aacid","other","unknown"). There's an argument for sticking with the NEXUS types exactly, and one for extending them.
** "Extended" metadata (slot 'metadata' in the data class) is implemented as an arbitrary list.
** There are two camps over how molecular data should be incorporated.  Do we want two slots -- one for molecular and one for non-molecular -- or do we want to stick the molecular data (especially long alignments) in a data frame as (e.g.) a character? Doing it the first way makes it hard to do subsetting operations transparently; doing it the second way makes handling genetic data harder (and wastes space etc.).  If we do it the first way we can just extend the data.frame class. A richer or more extendable data frame class would solve the problem, if one exists.  '''If we design the accessor functions really well, and people use them rather than digging into the guts, we can change things later.'''  It is possible to have a "raw" vector within a data frame (which is the base type for "DNAbin" objects in ape), but we can't have a DNAbin object itself.  We could either convert to/from DNAbin as we access slots from the data frame (ugh), or we could try to write the rich data frame.  Do the people at the [http://rgenetics.org/ R Genetics project] have any ideas about this?
** do we want to allow an accessor method to do the equivalent of tdata(object)[i,j] <- value ? or should people mess with their data before they attach it to the tree? (is this harder once the data are packed inside an additional layer of structure?)
* multiPhylo classes
** the basic idea for these is defined, but we are missing most of the methods
** need appropriate conversions.  ctree (cbindtree?) should collapse a list of phylo4x/multiPhylo4x objects into a single multiPhylo4x object (defaulting to the "richest" of its constituents -- i.e. multiPhylo4d unless all items are phylo4/multiPhylo4)
** going back the other way, would like something like unpack() ? to produce a list of trees from a single multiPhylo4x object
** I'm not sure if the argument about whether multiPhylo can or should formally extend phylo is settled, but I don't think it can/should [--[[User:Bbolker|Bbolker]] 22:57, 28 December 2007 (EST)]
** show, summary, plot methods can mimic ape's new multiphylo class (although we will have to do something different for multiPhylo4d, to match what we do for phylo4d)
** eventually may want to use something like the [http://wiki.r-project.org/rwiki/doku.php?id=packages:cran:trackobjs trackObjs package] if we have lots and lots of trees -- although I don't know if this will actually work well here
** need to think about whether we want to write some kind of apply() function for these -- or, alternatively, just provide documentation for how you would use [ls]apply on a multiPhylo object (sapply(object@phylolist,fn,data=tdata(object),...))
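The last item above suggests documenting how to use sapply on the list of trees instead of writing a dedicated apply() function. A minimal sketch of that pattern follows. It uses a toy S4 class rather than the real multiPhylo classes: the class name toyMultiPhylo, the slot tdata, and the function count_tips are hypothetical, and each "tree" is reduced to a bare edge matrix. Only the slot name phylolist is taken from the note above.

```r
library(methods)

# Toy stand-in for a multi-tree container (hypothetical class and slot names,
# apart from "phylolist", which follows the note above).
setClass("toyMultiPhylo",
         representation(phylolist = "list", tdata = "data.frame"))

# Each "tree" is just an ape-style edge matrix: column 1 = ancestor,
# column 2 = descendant.
tree1 <- matrix(c(4, 1,
                  4, 5,
                  5, 2,
                  5, 3), ncol = 2, byrow = TRUE)
tree2 <- matrix(c(3, 1,
                  3, 2), ncol = 2, byrow = TRUE)

trees <- new("toyMultiPhylo",
             phylolist = list(a = tree1, b = tree2),
             tdata = data.frame(size = c(1.2, 3.4, 5.6)))

# A function of one tree plus the shared data; the data argument is unused
# here and only shows how it would be passed through.
count_tips <- function(edge, data) {
  # Tips are descendants that never appear in the ancestor column.
  length(setdiff(edge[, 2], edge[, 1]))
}

# Apply the function tree by tree, passing the shared data as an extra argument.
n_tips <- sapply(trees@phylolist, count_tips, data = trees@tdata)
```

The sketch reaches into the slots directly for brevity; with a well-designed accessor the call would read tdata(object) as in the note above.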
=== Terminology ===
Some things to think about (re)naming:
* tree walking: getDescend, getAncest, allDescend, allAncest -- sons and fathers, daughters and mothers, tips/leaves, ?
* should check_data be check_phylo4d for consistency with check_phylo4 (or should check_phylo4 be check_tree?)
* EdgeLength/BranchLength?
* root, Root, rootNode, RootNode?
* subset(phy,node.subtree= ) = subset(phy,subtree= ) ?
=== Default behaviors ===
* assigning edge and node labels -- is there/should there be an option to turn this off?  assign empty ("") labels?
* should default node labels start at N(nTips+1), to match internal node numbers?
* should coercing phylo4d to phylo4 give a warning message about losing data?  Should phylo4d to phylo4?
=== Data checking ===
* should check_phylo4 check for cyclicity, or is this overkill?
* what other checks should there be on a phylo4 tree?  right now we check for (1) right number of edge lengths (if >0), (2) right number of tip labels, (3) at most one root, (4) only one ancestor per node.
* information returned by check_data isn't quite the same format as geiger's matching function. Clear enough?
* should there be an option for pruning tree to data? does this already exist?
* option for fuzzy matching with agrep?
* extend label-matching code to node labels
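The four checks listed above can be sketched in base R on a plain edge matrix. This is an illustration under stated assumptions, not the package's check_phylo4: the function name check_tree_sketch and its arguments are hypothetical, and the tree is assumed to be given as a two-column (ancestor, descendant) matrix in which the root has no row of its own. Returning TRUE or a character vector of problems mirrors the convention used by S4 validity functions.

```r
# Hypothetical helper illustrating the four checks; not the package's code.
check_tree_sketch <- function(edge, edge.length = numeric(0),
                              tip.label = character(0)) {
  ancestors   <- edge[, 1]
  descendants <- edge[, 2]
  tips  <- setdiff(descendants, ancestors)   # nodes with no descendants
  roots <- setdiff(ancestors, descendants)   # nodes with no ancestor
  problems <- character(0)

  # (1) edge lengths, if any are supplied, must match the number of edges
  if (length(edge.length) > 0 && length(edge.length) != nrow(edge))
    problems <- c(problems, "wrong number of edge lengths")
  # (2) one label per tip
  if (length(tip.label) != length(tips))
    problems <- c(problems, "wrong number of tip labels")
  # (3) at most one root
  if (length(roots) > 1)
    problems <- c(problems, "more than one root")
  # (4) each node may appear only once as a descendant
  if (anyDuplicated(descendants) > 0)
    problems <- c(problems, "node with more than one ancestor")

  if (length(problems) == 0) TRUE else problems
}

good <- matrix(c(4, 1,
                 4, 5,
                 5, 2,
                 5, 3), ncol = 2, byrow = TRUE)
bad  <- rbind(good, c(4, 3))   # node 3 now has two ancestors

ok_result  <- check_tree_sketch(good, tip.label = c("a", "b", "c"))
bad_result <- check_tree_sketch(bad,  tip.label = c("a", "b", "c"))
```

None of these checks detects cycles, which is the open question in the first item of this list.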
=== End users ===
* What do people like to do in order to check trees and data?
=== Documentation ===
* Does S4-style help access really work?
* is the phylo4 constructor actually documented anywhere?
* work through Sidlauskas wiki page
=== Plotting ===
* other ways of plotting tree+data?  isoMDS + color space?
* ggplot2 extension?  grid version?
* modify phylo4 method to allow "Singles" (plot.phylo doesn't) ?
* plotting with data and show.node.label scrambles order of node labels [''this is really a problem with ade4 vs ape ordering of nodes; while a round trip from Newick and back again preserves the order of the nodes in the ape version of the geospiza tree, writing out to Newick and reading back with newick2phylog doesn't.  This could be kind of complicated to figure out/fix. For the time being I've added a strong warning --[[User:Bbolker|Bbolker]] 10:26, 29 December 2007 (EST)'']
* plotting with subsetted data can throw error (see tests/plottest.R)
* multiPhylo4x plotting
* look for nice examples of ways people have plotted trees+data
=== Tree manipulation ===
* Movement along branches, traversal, etc.: what operations are needed?
* General practice: how much should we work with/return internal node numbers, and how much node names/tags?  Current methods are a mishmash
* allDescend gets tips only -- option to get list including internal nodes?
* Check O'Meara list of functions
* Check against Mesquite list of functions
* from geiger: steal prune.extinct.taxa, prune.random.taxa?
* Subsetting: by data characteristics or by tree
** get subtree with all descendants of a node
** time slices: prune back by time ?
* reordering (ordering may be an efficiency issue, but not for a while): all phylo4 edge matrices are internally stored as "cladewise"/Newick traversal order.  Extend edges accessor method with optional argument that allows edges to be extracted in a different order (possibly with an attribute as in ape); need to steal 'reorder' from ape, extend for phylo4d keeping track of traits. Also see ape::ladderize
* rooting
** implement re-rooting
** check with Sidlauskas: is there an "infelicity" in the way ape does it that we should modify?
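To make the "option to get list including internal nodes" idea concrete, here is a rough base-R sketch of collecting all descendants of a node from an edge matrix, which is also the core of the "subtree with all descendants of a node" item. The function all_descend_sketch and its tips.only argument are hypothetical names, not the package's allDescend, and the code assumes an acyclic two-column (ancestor, descendant) edge matrix.

```r
# Hypothetical helper; assumes an acyclic (ancestor, descendant) edge matrix.
all_descend_sketch <- function(edge, node, tips.only = TRUE) {
  found <- numeric(0)
  queue <- node
  # Walk down the tree one generation at a time.
  while (length(queue) > 0) {
    children <- edge[edge[, 1] %in% queue, 2]
    found <- c(found, children)
    queue <- children
  }
  # Optionally drop internal nodes, i.e. those that are ancestors themselves.
  if (tips.only) found <- found[!(found %in% edge[, 1])]
  sort(found)
}

# Root is node 5; node 6 has children 2 and 7; node 7 has children 3 and 4.
edge <- matrix(c(5, 1,
                 5, 6,
                 6, 2,
                 6, 7,
                 7, 3,
                 7, 4), ncol = 2, byrow = TRUE)

tips_below_6  <- all_descend_sketch(edge, 6)                     # tips only
nodes_below_6 <- all_descend_sketch(edge, 6, tips.only = FALSE)  # plus node 7
```

The sketch returns node numbers only; whether such functions should return numbers or node names is the "general practice" question raised above.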
=== Other ===
* options
** phylo.compat: default=TRUE, allows accessing internal phylo4 elements with $ and $<-
** check.data defaults
** cf implementations in other packages (e.g. bbmle)
=== General issues ===
* need to reorder files somehow to make sure that identify function comes '''after''' phylo4 definition (avoid warning on pkg build) [could use a "Collate" statement in the DESCRIPTION file '''or''' rename phylo4.R to aaaphylo4.R ...]
* reimplementing methods as native phylobase rather than wrapping ape/ade4.  How much? What criteria?
* synch with ape on name conflicts, multi.tree/multiPhylo
* implement NAMESPACE?
* TreeLib?
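For the file-ordering item above, the "Collate" option would look roughly like the following fragment of the DESCRIPTION file. The file name identify.R is a hypothetical placeholder; only phylo4.R is named in the note. When a Collate field is present it has to list every R file in the package, so a real field would be longer than this two-file sketch.

```
Collate:
    'phylo4.R'
    'identify.R'
```

Compared with renaming phylo4.R to aaaphylo4.R, this keeps the file names meaningful at the cost of maintaining the list by hand.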
=== Testing ===
* end users need to start reading documentation, testing ...
=== I/O ===
* integrate ioNCL.  Looks like this may be a moderately big task, because of the complexity of incorporating the entire ncl library in the package ...
[[Category:R Hackathon 1]]

Latest revision as of 14:49, 6 February 2008

Phylobase package

[I've renamed this from phylo4, and I'm now transcribing bits and pieces of the TODO list, and a variety of my own questions; I think it will be easier to 'discuss' in this format than in the TODO list within the package within the SVN repository. --Bbolker 22:57, 28 December 2007 (EST)]

Class definitions

See Data Standards page for definitions and discussion of the data classes (phylo4[d]/multiPhylo4[d]), including arguments about internal structure and some details on methods.

To do

The to-do section has now been merged with the TODO file inside the package; please refer to that. Feature requests can be placed at the R-forge page phylobase.