R Hackathon 1/Phylobase

From Phyloinformatics
Revision as of 23:03, 28 December 2007 by Bbolker (talk) (R Hackathon 1/Phylo4 moved to R Hackathon 1/Phylobase: data class is now a more general class)

Phylo4 class

Class definition

See Data Standards page for definitions and discussion of the data class, including arguments about internal structure and some details on methods.

Tasks

Logistics/maintenance

  • Talk to Emmanuel and figure out how he feels about phylo4: is he comfortable with phylo4 being set up as a parallel standard (and possibly eventually taking over more of the basic data manipulation stuff), but taking a lot of material from ape? Does he have any interest in being the maintainer? [DONE]
  • Will this be a CRAN package? Who will the maintainers be? Will there be a "phylocore"? (Discuss at end of week)

I/O

  • Translation to/from phylo[ape] is working (although not tested much). Translation to/from other packages is low priority, since phylo[ape] is the "Rosetta stone" (at least for now; this may change somewhat if other packages end up wanting to use a richer data model)
  • Liaison with i/o group: target input functions from Nexus/Newick to create phylo4 or phylo4d objects. Final versions will have to wait on the data model. [Brian O'Meara]
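
As a rough illustration of what translation to/from phylo[ape] involves, here is a minimal sketch of a round trip between an S3 list with the fields of an ape "phylo" object (edge, edge.length, tip.label, Nnode) and an S4 object. The class name TreeSketch and the functions asTreeSketch and asPhylo are hypothetical, made up for this example; they are not the phylo4 definition or its converters. The "phylo" list is built by hand so that the example runs without ape installed.

```r
library(methods)

# Hypothetical S4 class, for illustration only (not the real phylo4 class).
setClass("TreeSketch",
         representation(edge = "matrix",
                        edge.length = "numeric",
                        tip.label = "character",
                        Nnode = "integer"))

# Toy tree ((A,B),C): tips are 1..3, the root is 4, the other internal node is 5.
# Each row of `edge` is a (parent, child) pair.
phy <- structure(
  list(edge = matrix(c(4L, 5L,
                       5L, 1L,
                       5L, 2L,
                       4L, 3L), ncol = 2, byrow = TRUE),
       edge.length = c(1, 1, 1, 2),
       tip.label = c("A", "B", "C"),
       Nnode = 2L),
  class = "phylo")

# S3 "phylo" list -> S4 sketch (assumed converter name)
asTreeSketch <- function(x) {
  stopifnot(inherits(x, "phylo"))
  new("TreeSketch",
      edge = x$edge,
      edge.length = if (is.null(x$edge.length)) numeric(0) else x$edge.length,
      tip.label = x$tip.label,
      Nnode = as.integer(x$Nnode))
}

# S4 sketch -> S3 "phylo" list (assumed converter name)
asPhylo <- function(x) {
  out <- list(edge = x@edge, tip.label = x@tip.label, Nnode = x@Nnode)
  if (length(x@edge.length) > 0) out$edge.length <- x@edge.length
  structure(out, class = "phylo")
}

roundTrip <- asPhylo(asTreeSketch(phy))
```

A round trip like this is one cheap way to test the translation: every field of roundTrip should match the corresponding field of phy.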

Data class definition

  • There seems to be consensus that the data model should be at least a little bit richer than a straight data frame, and that we should create a new data class.
    • We probably want at least one "metadata" tag: a factor with levels binary, multistate, DNA (nucleotide?) [? about Nexus definitions ?], amino acid, RNA, continuous, "other".
    • There are two camps over how molecular data should be incorporated. Do we want two slots -- one for molecular and one for non-molecular -- or do we want to stick the molecular data (especially long alignments) in a data frame as (e.g.) a character? Doing it the first way makes it hard to do subsetting operations transparently; doing it the second way makes handling genetic data harder (and wastes space etc.). If we do it the first way we can just extend the data.frame class. A richer or more extensible data class would solve the problem, if it exists. If we design the accessor functions really well, and people use them rather than digging into the guts, we can change things later.
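
To make the "metadata tag plus accessor" idea concrete, here is a minimal sketch, assuming a single data frame with one type tag per column. The class name TipDataSketch, the constructor tipData and the accessor tdata are hypothetical names chosen for this example, and the level set is simply the list above. Users go through tdata() rather than the slots, so the internal layout (one slot or two) could change later without breaking their code.

```r
library(methods)

# Levels from the list above; everything else here is an illustrative assumption.
charTypes <- c("binary", "multistate", "DNA", "RNA",
               "amino acid", "continuous", "other")

# Hypothetical class: a data frame plus one "metadata" tag per column.
setClass("TipDataSketch",
         representation(data = "data.frame", type = "factor"),
         validity = function(object) {
           if (length(object@type) != ncol(object@data))
             return("need exactly one type tag per data column")
           if (anyNA(object@type))
             return("unknown character type")
           TRUE
         })

# Constructor: converts the tags to a factor with the fixed level set.
tipData <- function(data, type) {
  new("TipDataSketch", data = data, type = factor(type, levels = charTypes))
}

# Accessor: return all columns, or only those of the requested type(s).
setGeneric("tdata", function(x, type = NULL, ...) standardGeneric("tdata"),
           signature = "x")
setMethod("tdata", "TipDataSketch", function(x, type = NULL, ...) {
  if (is.null(type)) return(x@data)
  x@data[, x@type %in% type, drop = FALSE]
})

traits <- data.frame(mass = c(1.2, 3.4, 2.2), winged = c(0, 1, 1),
                     row.names = c("A", "B", "C"))
td <- tipData(traits, type = c("continuous", "binary"))
binaryOnly <- tdata(td, "binary")   # data frame holding only the "winged" column
```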

Tree manipulation

We should look at the documentation group's lists of useful operations.

  • Movement along branches, traversal, etc.: set up a "tagged" form of a tree that saves the current position on the tree (as an attribute or a slot?) and can be modified by operations like Up, Down?

Would it be more robust to have to specify a node in the function? So, for instance getAncestors(phy, 25) or getDescendents(phy, 25), where 25 is a node number. These could be used in recursions and it avoids the problem of maintaining or resetting a pointer on the tree. They would need to report NULL or something like it when falling off the end of the tree. --DavidOrme 14:35, 12 December 2007 (EST)
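
A minimal sketch of the node-based alternative, assuming the tree is stored as an ape-style parent/child edge matrix. The function names follow the comment above; the bodies are only one possible implementation, not existing package code. Both functions return NULL when they fall off the tree (the ancestors of the root, the descendants of a tip).

```r
# Toy tree ((A,B),C): tips are 1..3, the root is 4, the other internal node is 5.
# Each row of `edge` is a (parent, child) pair.
phy <- list(edge = matrix(c(4L, 5L,
                            5L, 1L,
                            5L, 2L,
                            4L, 3L), ncol = 2, byrow = TRUE),
            tip.label = c("A", "B", "C"))

# All ancestors of `node`, nearest first; NULL for the root.
getAncestors <- function(phy, node) {
  anc <- integer(0)
  repeat {
    parent <- phy$edge[phy$edge[, 2] == node, 1]
    if (length(parent) == 0) break   # reached the root
    anc <- c(anc, parent)
    node <- parent
  }
  if (length(anc) == 0) NULL else anc
}

# All descendants of `node`, found recursively; NULL for a tip.
getDescendents <- function(phy, node) {
  kids <- phy$edge[phy$edge[, 1] == node, 2]
  if (length(kids) == 0) return(NULL)
  c(kids, unlist(lapply(kids, function(k) getDescendents(phy, k))))
}

ancOfA <- getAncestors(phy, 1)       # nodes 5 and 4
descOfRoot <- getDescendents(phy, 4) # nodes 5, 3, 1, 2
```

Because no position is stored on the tree, there is no pointer to maintain or reset between calls.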


  • Drop/prune/na.omit
  • Subsetting: by data characteristics or by tree
    • get subtree with all descendants of a node
    • get largest subtree containing species X and Y
    • identify node graphically (locator/identify.node) to get subtree
    • time slices: prune back by time
  • (list from Wayne about what methods exist in Mesquite)
  • What's available in other packages?
  • node identification
  • reordering (ordering may be an efficiency issue, but not for a while)
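
Two of the subsetting operations above can be sketched on an ape-style edge matrix: finding the clade that contains species X and Y (read here as the clade rooted at their most recent common ancestor, which is an assumption about what the item means) and listing all tips descended from a node. The helper names pathToRoot, mrcaSketch and cladeTips are hypothetical.

```r
# Toy tree (((A,B),C),D): tips are 1..4, the root is 5, internal nodes are 6 and 7.
phy <- list(edge = matrix(c(5L, 6L,
                            6L, 7L,
                            7L, 1L,
                            7L, 2L,
                            6L, 3L,
                            5L, 4L), ncol = 2, byrow = TRUE),
            tip.label = c("A", "B", "C", "D"))

# Nodes on the path from `node` up to the root, starting with `node` itself.
pathToRoot <- function(phy, node) {
  path <- node
  repeat {
    parent <- phy$edge[phy$edge[, 2] == node, 1]
    if (length(parent) == 0) return(path)
    path <- c(path, parent)
    node <- parent
  }
}

# Most recent common ancestor of two tips, given by label:
# the first node on x's root path that also lies on y's root path.
mrcaSketch <- function(phy, x, y) {
  ix <- match(x, phy$tip.label)
  iy <- match(y, phy$tip.label)
  stopifnot(!is.na(ix), !is.na(iy))
  px <- pathToRoot(phy, ix)
  px[px %in% pathToRoot(phy, iy)][1]
}

# Labels of all tips descended from `node` (the node's own label if it is a tip).
cladeTips <- function(phy, node) {
  kids <- phy$edge[phy$edge[, 1] == node, 2]
  if (length(kids) == 0) return(phy$tip.label[node])
  unlist(lapply(kids, function(k) cladeTips(phy, k)))
}

node <- mrcaSketch(phy, "A", "C")   # node 6
clade <- cladeTips(phy, node)       # tips A, B and C
```

The tip labels returned by cladeTips could then be passed to a drop/prune operation to extract the subtree itself.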

Testing

  • end users need to start reading documentation, testing ...