R Hackathon 1/End User Goals

From Phyloinformatics
Revision as of 11:58, 6 December 2007 by Bco (talk) (Added info on EvolDir user poll)

Goals from a programmer perspective are on a separate page.

Goals from an end-user perspective

Common methodological/logistical challenges for end-users

  • Phylogenetic uncertainty
    • Multiple trees
    • Polytomies
      • Resolving and re-analyzing/averaging automatically
      • Explicitly analyzing
    • Incorporating bootstrap support or posterior probabilities for branches
    • Model averaging over sets of trees
  • Phylogeny format and structure
    • Reading and writing Newick, Nexus tree format
      • e.g. R currently can't read or write Nexus tree notes or Nexus data blocks
      • Major problems reading large files with lots of trees in R. Coding this in C would help dramatically.
      • Reading and writing MrBayes, Mesquite, PAUP and other external package formats
    • Converting among tree and data formats used by different R packages
      • e.g. How do I evolve trees and traits in packages X and Y and analyze in package Z?
  • Tree manipulation
    • Include or exclude taxa based on the data available in your dataset
      • I have a way to do this in the new version of GEIGER, but it's not elegant --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Grafting and pruning subtrees (while maintaining branch lengths)
    • Rooting trees
  • Tree simulation methods (+/- accompanying trait evolution)
    • Needed for methods testing and hypothesis testing. --DavidOrme 05:40, 22 November 2007 (EST)
      • Need for speed and ability to simulate large phylogenetic trees in R - I think this should be coded in C and called from R.
  • Exploration of trait data
    • Testing for phylogenetic signal/choosing the appropriate transformation if needed
    • Conducting analyses with traits exhibiting different levels of phylogenetic signal
  • Easy implementation of error/data checking methods
    • Ensure trait data are linked to the correct tip on the tree
    • Diagnostics for data checking
      • PIC diagnostics
      • Linearity in trait relations
  • Choosing/scaling branch lengths
    • Branch scaling algorithms (e.g. Grafen, ACDC, lambda)
      • You can do ACDC ("exponentialchange.tree") and some other things like that in GEIGER, others are possible --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Incorporating/estimating divergence dates
  • Ability to use large datasets and trees
    • Memory
    • Speed
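
As an illustration of the data-checking items above (including or excluding taxa based on the available data, and ensuring trait data are linked to the correct tip), here is a minimal sketch in base R. The function `match_tips_to_data` and the toy data are hypothetical and are not taken from GEIGER, ape, or any other package; the sketch assumes the trait table uses taxon names as row names.

```r
# Hypothetical helper (not part of GEIGER or ape): reconcile the tip labels
# of a tree with the row names of a trait table.
match_tips_to_data <- function(tip_labels, traits) {
  stopifnot(!is.null(rownames(traits)))
  # intersect() keeps the order of its first argument, i.e. the tip order.
  shared <- intersect(tip_labels, rownames(traits))
  list(
    tips_without_data = setdiff(tip_labels, rownames(traits)),
    data_without_tips = setdiff(rownames(traits), tip_labels),
    # Keep only shared taxa, with rows reordered to follow the tip order.
    traits = traits[shared, , drop = FALSE]
  )
}

# Toy example: tip labels as they might appear in a tree object.
tip_labels <- c("Homo", "Pan", "Gorilla", "Pongo")
traits <- data.frame(
  mass = c(5.5, 45, 160),
  row.names = c("Hylobates", "Pan", "Gorilla")
)

check <- match_tips_to_data(tip_labels, traits)
check$tips_without_data  # tips that would need pruning before analysis
check$data_without_tips  # trait rows with no matching tip
check$traits             # trait rows for shared taxa, in tip order
```

With an ape `phylo` object, the tip labels are stored in `tree$tip.label`, and the tips reported as lacking data could then be pruned with ape's `drop.tip`.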

Filling in the gaps: methods that are not easily accessible in R but would be useful

  • Methods for visualising trees and plotting traits along them
  • Linking phylogenies to geographic maps
  • Implementation of methods for looking at co-evolution and co-speciation
    • Event-based methods, reconciliation methods
    • New methods for testing congruence between trees
    • Related: Gene tree - species tree analyses (or anything with a contained and a containing tree)
  • Quantifying phylogenetic signal
  • Discrete traits
    • CAIC BRUNCH algorithm
      • This is included in the CAIC package I am just building that reimplements CAIC in R --DavidOrme 04:58, 22 November 2007 (EST)
    • Pagel's Discrete
      • This is a part of the new GEIGER --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Multiple discrete and continuous traits in a single analysis
    • Non-linear trait distributions and analyses
  • Stochastic character mapping
    • Ree's (2005) key innovation test
  • Correlating trait evolution with speciation and extinction rates
  • Extending the set of character models that can be fit using likelihood
    • For example, real population genetic models e.g. Estes and Arnold
  • Model averaging across different models of character evolution
  • Better methods for reconstructing ancestral character states and plotting them on trees
  • MacroCAIC - type analyses
    • The CAIC package I'm putting together includes this --DavidOrme 04:58, 22 November 2007 (EST)
  • Simulating trees and characters for other models beyond those available now
  • Fitting Felsenstein's threshold models to comparative data
  • Ree's biogeography method using likelihood
  • Interface or implementation of penalized likelihood from r8s
  • Creating input files with constraints that are acceptable for BEAST and MrBayes
  • Supertree or other tree-combining methods
  • Topology-based tests of diversification ala Moore et al.
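
Several items above ask for simulating characters under a wider range of models. As a reference point, the simplest case (a single continuous trait evolving by Brownian motion on a fixed tree) can be sketched in base R as follows. The function `simulate_bm` and the edge-table layout (columns `parent`, `child`, `length`) are assumptions made for this illustration, not an existing package interface.

```r
# Hypothetical sketch: simulate one continuous trait under Brownian motion.
# `edges` is a data frame with one row per branch (parent, child, length);
# rows must be ordered so that each branch appears after its parent branch.
simulate_bm <- function(edges, root_value = 0, sigma2 = 1) {
  root <- setdiff(edges$parent, edges$child)
  stopifnot(length(root) == 1)
  values <- c(root_value)
  names(values) <- root
  for (i in seq_len(nrow(edges))) {
    parent <- as.character(edges$parent[i])
    child <- as.character(edges$child[i])
    stopifnot(parent %in% names(values))
    # The change along a branch is normal with variance sigma2 * length.
    step <- rnorm(1, mean = 0, sd = sqrt(sigma2 * edges$length[i]))
    values[child] <- values[parent] + step
  }
  # Tips are the nodes that never appear as a parent.
  tips <- setdiff(edges$child, edges$parent)
  values[as.character(tips)]
}

# Toy tree ((A:1,B:1):1,C:2) written as an edge table.
edges <- data.frame(
  parent = c("root", "root", "n1", "n1"),
  child  = c("n1", "C", "A", "B"),
  length = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

set.seed(1)
tip_values <- simulate_bm(edges, root_value = 0, sigma2 = 1)
tip_values
```

Other character models would replace the normal step with a different transition rule, and a compiled implementation would be needed for the large trees mentioned earlier on this page.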

Improved documentation and an easy way to find out what methods are available

  • Summaries of methods available to answer different questions
    • Matrix listing all available functions
    • My Treetapper.org NESCent project will include the ability to find which R package(s) or other software program/package implements a particular method (i.e., a user tells the website she has two continuous chars to correlate, is presented with a list of methods for this, chooses independent contrasts, and then gets a list of all packages and software programs implementing this (and can limit by platform, implementation type, etc.)). It won't list the specific function, but hopefully people can at least look this up. The site won't be really up for another month or so, and won't be really useful until I can add more data to the database. --Bco 13:47, 30 November 2007 (EST)
  • Improve or write documentation for existing functions that are not documented
  • Code Vignettes
  • Common datasets that can be analyzed to illustrate different methods/approaches

To discuss documentation standards, vignettes etc.

EvolDir poll

We sent EvolDir a link to a poll on what users want to see added. Here are the results so far (after one day). We had 53 poll responses and two emailed responses: one was a request for a polymorphism hackathon to implement methods for dealing with polymorphic data, and the other appears after the poll. The poll was not intended to mandate which methods are worked on at the hackathon, only to provide some additional information about community interest.

 Please select three options for inclusion in R:
  Improved documentation and an easy way to find out what methods are available  21        8%
  Linking phylogenies to geographic maps  20        8%
  Incorporating bootstrap support or posterior probabilities for branches  15        6%
  Multiple discrete and continuous traits in a single analysis  14        5%
  Dealing with multiple trees  11        4%
  Gene tree - species tree analyses  10        4%
  Creating input files with constraints that are acceptable for BEAST and MrBayes  10        4%
  Ability to use large datasets and trees        4%
  Methods for visualising trees and plotting traits along them        4%
  Reading and writing Newick, Nexus tree format        3%
  Converting among tree and data formats used by different R packages        3%
  Dealing properly with polytomies        3%
  Tree simulation methods (+/- accompanying trait evolution)        3%
  Testing for phylogenetic signal/choosing the appropriate transformation if needed        3%
  New methods for testing congruence between trees        3%
  Extending the set of character models that can be fit using likelihood        3%
  Interface or implementation of penalized likelihood from r8s        3%
  Model averaging over sets of trees         2%
  Incorporating/estimating divergence dates         2%
  Model averaging across different models of character evolution        2%
  Include or exclude taxa based on the data available in your dataset        2%
  Diagnostics for data checking        2%
  Correlating trait evolution with speciation and extinction rates        2%
  Supertree or other tree-combining methods        2%
  Event-based methods for looking at co-evolution, reconciliation methods        2%
  Non-linear trait distributions and analyses        2%
  Stochastic character mapping        2%
  Better methods for reconstructing & plotting ancestral character states        2%
  Fitting Felsenstein's threshold models to comparative data        2%
  Ree's biogeography method using likelihood        2%
  Simulating trees and characters for other models beyond those available now        1%
  Topology-based tests of diversification ala Moore et al.         1%
  localizing diversification rate        1%
  population genetics and quantita        1%
  Grafting and pruning subtrees (while maintaining branch lengths)        1%
  Conducting analyses with traits exhibiting different levels of phylogenetic signal         1%
  Multivariate Phylogenetic Mixed        0%
  Ensure trait data are linked to the correct tip on the tree        0%
  Branch scaling algorithms (e.g. Grafen, ACDC, lambda)        0%

Free response:

  • Since R does GLMs so well (though it isn't so easy to do ANOVAs with molecular or character data, especially with phylogenetic data), these types of methods are really appropriate (PGLS, PMM). My one big frustration was not having PMM well implemented. Paradis et al. have put it in, but it is only univariate and doesn't properly analyze all types of tree length inputs. I think someone else may also have implemented it, but as Felsenstein's lambda (maybe Garland?). Anyway, R would be a great platform for implementing PMM fully, particularly with multivariate data (see the appendix of Housworth et al. 2002 for the math on this). That would make R capable of doing all the most common comparative trait methods (independent contrasts in PDAP, PGLS in APE, and PMM in ?).