R Hackathon 1/Phylobase
[I've renamed this from phylo4, and I'm now transcribing bits and pieces of the TODO list, and a variety of my own questions; I think it will be easier to 'discuss' in this format than in the TODO list within the package within the SVN repository. --Bbolker 22:57, 28 December 2007 (EST)]
See Data Standards page for definitions and discussion of the data classes (phylo4[d]/multiPhylo4[d]), including arguments about internal structure and some details on methods.
Finish data classes
- Data model: There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, and that we should create a new data class (there is some more discussion of this over in Data Standards). There's a draft standard in pdata.R in the current package.
- "Basic" metadata (slot 'type' in the data class) is implemented as a factor with allowed levels ("multitype","binary","continuous","DNA","RNA","aacid","other","unknown"). There's an argument for sticking with the NEXUS types exactly, and one for extending them.
- "Extended" metadata (slot 'metadata' in the data class) is implemented as an arbitrary list.
- There are two camps over how molecular data should be incorporated. Do we want two slots -- one for molecular and one for non-molecular -- or do we want to stick the molecular data (especially long alignments) in a data frame as (e.g.) a character? Doing it the first way makes it hard to do subsetting operations transparently; doing it the second way makes handling genetic data harder (and wastes space etc.). If we do it the first way we can just extend the data.frame class. A richer or more extensible data frame class would solve the problem, if it exists. If we design the accessor functions really well, and people use them rather than digging into the guts, we can change things later. It is possible to have a "raw" vector within a data frame (which is the base type for "DNAbin" objects in ape), but we can't have a DNAbin object itself. We could either convert to/from DNAbin as we access slots from the data frame (ugh), or we could try to write the rich data frame. Do the people at the R Genetics project have any ideas about this?
- do we want to allow an accessor method to do the equivalent of tdata(object)[i,j] <- value ? or should people mess with their data before they attach it to the tree? (is this harder once the data are packed inside an additional layer of structure?)
- multiPhylo classes
- the basic idea for these is defined, but we are missing most of the methods
- need appropriate conversions. ctree (cbindtree?) should collapse a list of phylo4x/multiPhylo4x objects into a single multiPhylo4x object (defaulting to the "richest" of its constituents -- i.e. multiPhylo4d unless all items are phylo4/multiPhylo4)
- going back the other way, would like something like unpack() ? to produce a list of trees from a single multiPhylo4x object
- I'm not sure if the argument about whether multiPhylo can or should formally extend phylo is settled, but I don't think it can/should [--Bbolker 22:57, 28 December 2007 (EST)]
- show, summary, plot methods can mimic ape's new multiphylo class (although we will have to do something different for multiPhylo4d, to match what we do for phylo4d)
- eventually may want to use something like the [trackObjs package] if we have lots and lots of trees -- although I don't know if this will actually work well here
- need to think about whether we want to write some kind of apply() function for these -- or, alternatively, just provide documentation for how you would use [ls]apply on a multiPhylo object (sapply(object@phylolist,fn,data=tdata(object),...))
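To make the [ls]apply idea above concrete, here is a minimal sketch. It does not use the real multiPhylo4d class: toyMultiTree, toy_tdata(), treeapply() and toy_unpack() are hypothetical stand-ins, and each "tree" is just a two-column (ancestor, descendant) edge matrix. It only illustrates the pattern of mapping a function over the tree list while passing the shared data along, plus the "unpack to a plain list" direction.

```r
library(methods)

# Hypothetical stand-in for a multiPhylo4d-like container:
# a list of trees plus one shared data frame.
setClass("toyMultiTree",
         slots = c(phylolist = "list", tdata = "data.frame"))

# Hypothetical accessor, playing the role of tdata() in the text.
toy_tdata <- function(object) object@tdata

# Hypothetical wrapper around the documented pattern:
# sapply(object@phylolist, fn, data = tdata(object), ...)
treeapply <- function(object, fn, ...) {
  sapply(object@phylolist, fn, data = toy_tdata(object), ...)
}

# Going back the other way: a plain list of trees.
toy_unpack <- function(object) object@phylolist

# Two toy trees on tips 1..3, internal nodes 4 and 5.
t1 <- matrix(c(4, 1,  4, 5,  5, 2,  5, 3), ncol = 2, byrow = TRUE)
t2 <- matrix(c(4, 3,  4, 5,  5, 1,  5, 2), ncol = 2, byrow = TRUE)

trees <- new("toyMultiTree",
             phylolist = list(a = t1, b = t2),
             tdata = data.frame(size = c(1.2, 3.4, 5.6)))

# Example function: count edges; `data` is accepted but unused here.
n_edges <- function(tree, data) nrow(tree)
edge_counts <- treeapply(trees, n_edges)
```

Whether a wrapper like treeapply() is worth having, compared with documenting the bare sapply() call, is exactly the open question; the wrapper mainly hides the slot access.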
Some things to think about (re)naming:
- tree walking: getDescend, getAncest, allDescend, allAncest -- sons and fathers, daughters and mothers, tips/leaves, ?
- should check_data be check_phylo4d for consistency with check_phylo4 (or should check_phylo4 be check_tree?)
- root, Root, rootNode, RootNode?
- subset(phy,node.subtree= ) = subset(phy,subtree= ) ?
- assigning edge and node labels -- is there/should there be an option to turn this off? assign empty ("") labels?
- should default node labels start at N(nTips+1), to match internal node numbers?
- information returned by check_data isn't quite the same format as geiger's matching function. Clear enough?
- should coercing phylo4d to phylo4 give a warning message about losing data? Should phylo4d to phylo4?
- should check_phylo4 check for cyclicity, or is this overkill?
- What do people like to do in order to check trees and data?
- USE CASES
- Does S4-style help access really work?
- is the phylo4 constructor actually documented anywhere?
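On the cyclicity question in the list above: one way to judge whether the check is overkill is to see how small it can be. The sketch below is not the check_phylo4 code; has_cycle() is a hypothetical helper that assumes a two-column edge matrix (column 1 = ancestor, column 2 = descendant) and at most one ancestor per node.

```r
# Hypothetical helper: TRUE if following ancestor links ever loops back.
# `edge`: two-column matrix, column 1 = ancestor, column 2 = descendant.
has_cycle <- function(edge) {
  nodes <- unique(as.vector(edge))
  for (start in nodes) {
    current <- start
    # In an acyclic tree a walk towards the root takes fewer than
    # length(nodes) steps, so the inner loop is bounded by that count.
    for (step in seq_along(nodes)) {
      parent <- edge[edge[, 2] == current, 1]
      if (length(parent) == 0) break   # reached a root: no loop on this path
      if (length(parent) > 1) {
        stop("node ", current, " has more than one ancestor")
      }
      current <- parent
      if (current == start) return(TRUE)  # came back to where we started
    }
  }
  FALSE
}

good_tree <- matrix(c(4, 1,  4, 5,  5, 2,  5, 3), ncol = 2, byrow = TRUE)
bad_tree  <- matrix(c(1, 2,  2, 3,  3, 1), ncol = 2, byrow = TRUE)
```

This version does a bounded walk from every node, so it is quadratic in the number of nodes in the worst case; that is probably fine for a validity check but would want rethinking for very large trees.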
We should look at the documentation group's lists of useful operations.
- Movement along branches, traversal, etc.: set up a "tagged" form of a tree that saves the current position on the tree (as an attribute or a slot?) and can be modified by operations like Up, Down?
Would it be more robust to have to specify a node in the function? So, for instance getAncestors(phy, 25) or getDescendents(phy, 25), where 25 is a node number. These could be used in recursions and it avoids the problem of maintaining or resetting a pointer on the tree. They would need to report NULL or something like it when falling off the end of the tree. --DavidOrme 14:35, 12 December 2007 (EST)
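A rough sketch of the node-number approach suggested above, again on a bare edge matrix rather than a phylo4 object. The function names (toy_ancestor, toy_children, toy_all_descendants) are hypothetical and are not the proposed getAncestors/getDescendents; the point is only that recursion on a node number needs no stored position, and that NULL marks "fell off the tree".

```r
# Hypothetical helpers, not phylobase functions.
# `edge`: two-column matrix, column 1 = ancestor, column 2 = descendant.
toy_ancestor <- function(edge, node) {
  parent <- edge[edge[, 2] == node, 1]
  if (length(parent) == 0) NULL else parent   # NULL above the root
}

toy_children <- function(edge, node) {
  kids <- edge[edge[, 1] == node, 2]
  if (length(kids) == 0) NULL else kids       # NULL below a tip
}

# All descendants, recursing on node numbers instead of moving a pointer.
toy_all_descendants <- function(edge, node) {
  kids <- toy_children(edge, node)
  if (is.null(kids)) return(NULL)
  c(kids, unlist(lapply(kids, function(k) toy_all_descendants(edge, k))))
}

# Tips 1..3, root 4, internal node 5.
edge <- matrix(c(4, 1,  4, 5,  5, 2,  5, 3), ncol = 2, byrow = TRUE)

root_parent <- toy_ancestor(edge, 4)         # falls off the top of the tree
below_root  <- toy_all_descendants(edge, 4)  # everything under the root
```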
- Subsetting: by data characteristics or by tree
- get subtree with all descendants of a node
- get largest subtree containing species X and Y
- identify node graphically (locator/identify.node) to get subtree
- time slices: prune back by time
- (list from Wayne about what methods exist in Mesquite)
- What's available in other packages?
- node identification
- reordering (ordering may be an efficiency issue, but not for a while)
- end users need to start reading documentation, testing ...