R Hackathon 1/End User Goals

From Phyloinformatics

Goals from a programmer perspective are on a separate page.

Goals from an end-user perspective

Common methodological/logistical challenges for end-users

  • Phylogenetic uncertainty
    • Multiple trees
    • Polytomies
      • Resolving and re-analyzing/averaging automatically
      • Explicitly analyzing
    • Incorporating bootstrap support or posterior probabilities for branches
    • Model averaging over sets of trees
  • Phylogeny format and structure
    • Reading and writing Newick, Nexus tree format
      • e.g. Currently can't read/write Nexus tree notes, Nexus data blocks
      • Major problems reading large files with lots of trees in R. Coding this in C would help dramatically.
      • Reading and writing MrBayes, Mesquite, PAUP and other external package formats
    • Converting among tree and data formats used by different R packages
      • e.g. How do I evolve trees and traits in packages X and Y and analyze in package Z?
  • Tree manipulation
    • Include or exclude taxa based on the data available in your dataset
      • I have a way to do this in the new version of GEIGER, but it's not elegant --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Grafting and pruning subtrees (while maintaining branch lengths)
    • Rooting trees
  • Tree simulation methods (+/- accompanying trait evolution)
    • Needed for methods testing and hypothesis testing.--DavidOrme 05:40, 22 November 2007 (EST)
      • Need for speed and ability to simulate large phylogenetic trees in R - I think this should be coded in C and called from R.
  • Exploration of trait data
    • Testing for phylogenetic signal/choosing the appropriate transformation if needed
    • Conducting analyses with traits exhibiting different levels of phylogenetic signal
  • Easy implementation of error/data checking methods
    • Ensure trait data are linked to the correct tip on the tree
    • Diagnostics for data checking
      • PIC diagnostics
      • Linearity in trait relations
  • Choosing/scaling branch lengths
    • Branch scaling algorithms (e.g. Grafen, ACDC, lambda)
      • You can do ACDC ("exponentialchange.tree") and some other things like that in GEIGER, others are possible --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Incorporating/estimating divergence dates
  • Ability to use large datasets and trees
    • Memory
    • Speed
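
Two of the challenges above, ensuring trait data are linked to the correct tip and including or excluding taxa based on the data available, can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: `match_tips_to_data()` is a hypothetical helper, not a function from any existing package, and the taxon names and trait values are made up. With an ape tree object the labels would come from `tree$tip.label`, and the taxa reported in `tips_without_data` could then be passed to ape's `drop.tip()`.

```r
# Hypothetical helper: compare tip labels against a named trait vector.
match_tips_to_data <- function(tip_labels, traits) {
  if (is.null(names(traits))) {
    stop("traits must be a named vector; names are matched to tip labels")
  }
  if (anyDuplicated(tip_labels) > 0 || anyDuplicated(names(traits)) > 0) {
    stop("tip labels and trait names must be unique")
  }
  shared <- intersect(tip_labels, names(traits))
  list(
    tips_without_data = setdiff(tip_labels, names(traits)),
    data_without_tips = setdiff(names(traits), tip_labels),
    # Traits reordered to follow the order of the tips
    traits_matched = traits[shared]
  )
}

# Made-up example: one tip lacks data, one data row lacks a tip
tips   <- c("Homo", "Pan", "Gorilla", "Pongo")
traits <- c(Pan = 45, Homo = 65, Macaca = 8, Gorilla = 120)

res <- match_tips_to_data(tips, traits)
print(res)
```

Returning both mismatch lists, rather than silently dropping taxa, lets the user decide whether a mismatch is a genuine gap in the data or a spelling difference between the tree and the data file.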

Filling in the gaps: methods that are not easily accessible in R but would be useful

  • Methods for visualising trees and plotting traits along them
  • Linking phylogenies to geographic maps
  • Implementation of methods for looking at co-evolution and co-speciation
    • Event-based methods, reconciliation methods
    • New methods for testing congruence between trees
    • Related: Gene tree - species tree analyses (or anything with a contained and a containing tree)
  • Quantifying phylogenetic signal
  • Discrete traits
    • CAIC BRUNCH algorithm
      • This is included in the CAIC package I am just building that reimplements CAIC in R --DavidOrme 04:58, 22 November 2007 (EST)
    • Pagel's Discrete
      • This is a part of the new GEIGER --Lukeh@uidaho.edu 20:11, 15 November 2007 (EST)
    • Multiple discrete and continuous traits in a single analysis
    • Non-linear trait distributions and analyses
  • Stochastic character mapping
    • Ree's (2005) key innovation test
  • Correlating trait evolution with speciation and extinction rates
  • Extending the set of character models that can be fit using likelihood
    • For example, real population genetic models e.g. Estes and Arnold
  • Model averaging across different models of character evolution
  • Better methods for reconstructing ancestral character states and plotting them on trees
  • MacroCAIC - type analyses
    • The CAIC package I'm putting together includes this --DavidOrme 04:58, 22 November 2007 (EST)
  • Simulating trees and characters for other models beyond those available now
  • Fitting Felsenstein's threshold models to comparative data
  • Ree's biogeography method using likelihood
  • Interface or implementation of penalized likelihood from r8s
  • Creating input files with constraints that are acceptable for BEAST and MrBayes
  • Supertree or other tree-combining methods
  • Topology-based tests of diversification ala Moore et al.
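
For the model-averaging item above, one common approach is to weight each candidate model by its Akaike weight. The sketch below shows that calculation in base R. It is an illustration only: `akaike_weights()` is a hypothetical helper, the model labels (BM, OU, EB) are placeholders, and the AIC scores and estimates are invented, standing in for a quantity estimated under every model.

```r
# Akaike weights from a vector of AIC scores (one per candidate model).
akaike_weights <- function(aic) {
  delta <- aic - min(aic)      # AIC difference from the best model
  rel_lik <- exp(-delta / 2)   # relative likelihood of each model
  rel_lik / sum(rel_lik)       # normalise so the weights sum to 1
}

# Invented AIC scores and per-model estimates of the same quantity
aic      <- c(BM = 100, OU = 102, EB = 110)
estimate <- c(BM = 1.0, OU = 1.4, EB = 0.8)

w <- akaike_weights(aic)
averaged <- sum(w * estimate)  # model-averaged estimate

print(round(w, 3))
print(averaged)
```

The same weights could be applied to any per-model output, which is why the weighting step is kept separate from the averaging step.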

Improved documentation and an easy way to find out what methods are available

  • Summaries of methods available to answer different questions
    • Matrix listing all available functions
    • My Treetapper.org NESCent project will include the ability to find which R package(s) or other software program/package implements a particular method (i.e., a user tells the website she has two continuous chars to correlate, is presented with a list of methods for this, chooses independent contrasts, and then gets a list of all packages and software programs implementing this (and can limit by platform, implementation type, etc.)). It won't list the specific function, but hopefully people can at least look this up. The site won't be really up for another month or so, and won't be really useful until I can add more data to the database. --Bco 13:47, 30 November 2007 (EST)
  • Improve or write documentation for existing functions that are not documented
  • Code Vignettes
  • Common datasets that can be analyzed to illustrate different methods/approaches

To discuss documentation standards, vignettes etc.

Evoldir poll

We sent EvolDir a link to a poll on what users want to see added. Here are the results so far (after one day). We had 53 poll responses and two emailed responses; one was a request for a polymorphism hackathon to implement methods for dealing with polymorphic data, the other appears after the poll. The poll was not intended to mandate which methods are worked on at the hackathon, only to provide some additional information about community interest. Each user was asked to select three items.

 Please select three options for inclusion in R:
  Improved documentation and an easy way to find out what methods are available  22      8%
  Linking phylogenies to geographic maps  21      8%
  Incorporating bootstrap support or posterior probabilities for branches  17      6%
  Multiple discrete and continuous traits in a single analysis  14      5%
  Converting among tree and data formats used by different R packages  13      5%
  Dealing with multiple trees  12      4%
  Gene tree - species tree analyses  11      4%
  Dealing properly with polytomies  10      4%
  Reading and writing Newick, Nexus tree format  10      4%
  Methods for visualising trees and plotting traits along them  10      4%
  Creating input files with constraints that are acceptable for BEAST and MrBayes  10      4%
  Ability to use large datasets and trees  12      3%
  Tree simulation methods (+/- accompanying trait evolution)  13      3%
  Testing for phylogenetic signal/choosing the appropriate transformation if needed  14      3%
  New methods for testing congruence between trees  14      3%
  Extending the set of character models that can be fit using likelihood  14      3%
  Interface or implementation of penalized likelihood from r8s  14      3%
  Model averaging over sets of trees  18      2%
  Incorporating/estimating divergence dates  18      2%
  Model averaging across different models of character evolution  18      2%
  Include or exclude taxa based on the data available in your dataset  21      2%
  Diagnostics for data checking  21      2%
  Stochastic character mapping  21      2%
  Correlating trait evolution with speciation and extinction rates  21      2%
  Fitting Felsenstein's threshold models to comparative data  21      2%
  Ree's biogeography method using likelihood  21      2%
  Supertree or other tree-combining methods  21      2%
  Event-based methods for looking at co-evolution, reconciliation methods  28      1%
  Non-linear trait distributions and analyses  28      1%
  Better methods for reconstructing & plotting ancestral character states  28      1%
  Simulating trees and characters for other models beyond those available now  31      1%
  Topology-based tests of diversification ala Moore et al.  31      1%
  localizing diversification rate  33      1%
  population genetics and quantita  33      1%
  Grafting and pruning subtrees (while maintaining branch lengths)  33      1%
  Conducting analyses with traits exhibiting different levels of phylogenetic signal  33      1%
  Ensure trait data are linked to the correct tip on the tree  33      1%
  Multivariate Phylogenetic Mixed  38      0%
  Branch scaling algorithms (e.g. Grafen, ACDC, lambda)  38      0%

Free response:

  • Since R does GLMs so well (though it isn't so easy to do ANOVAs with molecular or character data, especially with phylogenetic data), these types of methods are really appropriate (PGLS, PMM). My one big frustration was not having PMM [phylogenetic mixed models --Bco] well implemented. Paradis et al. have put it in, but it is only univariate and doesn't properly analyze all types of tree length inputs. I think someone else may also have implemented it, but as Felsenstein's lambda (maybe Garland?). Anyway, R would be a great platform for implementing PMM fully, particularly with multivariate data (see the appendix of Housworth et al. 2002 for the math on this). That would make R capable of doing all the most common comparative trait methods (independent contrasts in PDAP, PGLS in APE, and PMM in ?).