PhyloSoC:Enhancing the representation of ecophylogenetic tools in R

From Phyloinformatics
Revision as of 18:12, 26 May 2008 by Mrhelmus (talk)


My Summer of Code project is to greatly expand the R package picante, a toolbox of comparative methods used to assess phylogenetic community structure. I will code three types of published ecological phylogenetic comparative methods. These methods can be used to assess phylogenetic community structure, the effects of environmental variables on phylogenetic community structure, and phylogenetic structure among species interactions. In other words, the methods can be used to ask questions such as: Are closely related species found within the same community? Are closely related species found in the same environments? Do closely related species interact with the same species? I will also code data handling routines that quickly randomize large matrices. These routines make it possible to test the statistical significance of results from the methods. The product of my Summer of Code experience will be a comprehensive R package of phylogenetic comparative methods that will facilitate the use of phylogenetics in ecological research.

Detailed Description

Project Background

Phylogenetic comparative methods are increasingly used in ecological research. Incorporating phylogenetics into ecology is appealing because it allows ecological questions to be addressed in an evolutionary context, leading to a deeper understanding of natural phenomena. Consequently, new “ecophylogenetic” comparative methods are rapidly being developed (I have adopted this term from Jeannine Cavender-Bares). However, three major barriers prevent many ecologists from using ecophylogenetic methods. First, many of these new tools are not available as open-source software. Second, the available tools are spread across multiple programs, each with its own learning curve. Finally, many of the programs are written in languages that few ecologists know, which prevents end-users from modifying functions to suit their own purposes. The goal of my project is to enhance the ecophylogenetic R package picante by 1) coding more ecophylogenetic methods into the package; and 2) coding new data handling routines that allow randomization tests to be performed quickly with the methods. If funded, I will help integrate phylogenetics and ecology by coding a comprehensive package of ecophylogenetic methods in R, a statistical platform and language that many ecologists already know.

Biography

My research is on how phylogenetics can lead to better understanding of ecological community assembly, species extinction, and species invasion. For example, I resampled roughly 200 freshwater fish communities in central Mexico that had been sampled at least one time previously in the last 100 years. This data set is large with many communities; many species invasions and extinctions; temporal and spatial autocorrelation; environmental data; trait data; and a large phylogenetic tree. Quantifying the interactions of all these factors required me to develop novel statistical tools and computer programs. My interests are thus at the intersection of computational comparative biology, ecology, and evolution.

Two of the papers I have published in graduate school contain open-source R and MATLAB code (Helmus et al. 2007a; Helmus et al. 2007b). I also gained experience with C++ as an undergraduate.

A Summer of Code experience will allow me to greatly expand my programming skills. However, the main reason I am applying is that I want to become part of the NESCent phyloinformatics community. After my Summer of Code is over, I plan to continue to work with the developers of phylobase and picante, and with the phyloinformatics group. I feel that working with this group is important since I will likely continue to use and develop ecophylogenetic methods throughout my academic career.

My only other obligation this summer is to make any edits suggested by my committee to my dissertation (I will defend on April 30, but officially graduate at the summer’s end). I will make the Summer of Code project the focus of my summer and will treat it like a full-time job.

Project Plan

Component 1

I will greatly expand picante by adding three types of ecophylogenetic methods: diversity metrics, species regressions, and bipartite species analyses. Under each section below I have listed: the data each method requires (Data Input), existing code that will guide my R code (Backbone), programs already available for the methods (Current Availability), and my specific products (Deliverable).

1A I will code the PD metric (Faith 1992), the metrics described by Hardy & Senterre (2007), the delta metrics of Clarke & Warwick (1998; 1999), and the metrics developed by my colleagues and me (Helmus et al. 2007a). These metrics represent different phylogenetic components of biodiversity, were developed for different applications and research topics, and/or have different statistical properties. They will add to the metrics already coded in picante (i.e., those of Webb et al. 2002). Coding these metrics will facilitate future research that compares metrics to understand ecological phenomena, as well as research on the similarities and differences among the methods.

-Data Input: a phylo4 object (or other tree object); a community incidence matrix (i.e., an object of species presence/absence or abundance/biomass across communities)

-Backbone: my published and unpublished R code; the cophenetic and vcv.phylo functions; the diversity metric functions in picante (e.g., mnnd)

-Current Availability: Some of these methods are available as computer programs in various formats from the authors.

-Deliverable: a comprehensive R toolkit of the most commonly used phylogenetic biodiversity metrics
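As an illustration of how one such metric combines the two inputs listed above, here is a minimal base-R sketch of phylogenetic species variability (PSV), assuming the form PSV = (n tr(C) - ΣC) / (n(n - 1)) given in Helmus et al. (2007a), where C is the phylogenetic correlation matrix of the n species present in a community. The function name psv_sketch, the hand-written covariance matrix, and the toy community matrix are all hypothetical and are not picante code.

```r
# Sketch only: psv_sketch and the toy inputs are hypothetical, not picante code.
species <- c("A", "B", "C", "D")

# Phylogenetic covariance matrix of the kind ape::vcv.phylo returns, written
# out by hand for the ultrametric tree ((A,B),(C,D)) with root-to-tip depth 1,
# where each sister pair shares half of that depth.
C <- matrix(c(1.0, 0.5, 0.0, 0.0,
              0.5, 1.0, 0.0, 0.0,
              0.0, 0.0, 1.0, 0.5,
              0.0, 0.0, 0.5, 1.0),
            nrow = 4, byrow = TRUE, dimnames = list(species, species))

# Community incidence matrix: sites in rows, species in columns.
comm <- matrix(c(1, 1, 0, 0,
                 1, 0, 1, 0,
                 1, 1, 1, 1),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("site1", "site2", "site3"), species))

psv_sketch <- function(comm, C) {
  # Align the matrix with the community columns and rescale to correlations.
  R <- cov2cor(C[colnames(comm), colnames(comm)])
  apply(comm, 1, function(x) {
    present <- x > 0
    n <- sum(present)
    if (n < 2) return(NA_real_)  # undefined for fewer than two species
    Rsub <- R[present, present, drop = FALSE]
    (n * sum(diag(Rsub)) - sum(Rsub)) / (n * (n - 1))
  })
}

psv <- psv_sketch(comm, C)
print(psv)
```

In this toy case site1, which holds the two sister species A and B, gives 0.5, while site2, which holds one species from each clade, gives 1.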

1B I will code community regression techniques that partition determinants of phylogenetic community structure using environmental data (Helmus et al. 2007b). These techniques are broadly applicable to many data sets and can be used to understand the causes of observed patterns in phylogenetic structure across communities.

-Data Input: a phylo4 object; an incidence matrix; an environmental matrix

-Backbone: my published MATLAB code; stats package functions

-Current Availability: my published MATLAB code specific to my data set

-Deliverable: a set of R functions to perform these analyses with any data set

1C I will code a set of phylogenetic comparative methods that estimate phylogenetic signal in the interactions of species between two trophic levels or two sets of mutualistic species (Ives & Godfray 2006). These methods will help scientists answer specific questions such as: Do closely related herbivores, with similar traits, feed on the same plants? Can phylogeny predict which native plants an invasive pollinator will pollinate?

-Data Input: two phylo4 or phylo4d objects; a species-species matrix of interaction strengths

-Backbone: published and unpublished MATLAB code; stats package functions

-Current Availability: published and unpublished MATLAB code (not user friendly)

-Deliverable: a set of R functions to perform these analyses with any data set

Component 2

Randomization tests are used widely in ecophylogenetic research to provide p-values for test statistics and to test for nonrandom patterns in community structure. All methods described above are used in conjunction with data randomizations for these purposes. The typical data matrix used in ecophylogenetic research is large, and many randomizations are needed to create probability distributions. However, current functions in R do not quickly perform many randomizations of large data matrices. This is a serious hindrance to performing ecophylogenetic data analyses using only R. I propose to write, or augment, a set of functions that will quickly randomize data matrices. I will do this by 1) changing existing functions; 2) using matrix indexing with the existing sample and apply functions; or 3) writing a series of C/C++ extensions to perform the loops. Specifically, I will create functions to randomize a matrix while maintaining only row totals, only column totals, or both row and column totals. These are the most widely used randomizations in ecophylogenetic research. Augmenting existing functions may be adequate. For example, I changed a line in the picante function randomizeSample from apply(samp,1,sample) to samp[,sample(ncol(samp),replace=FALSE)]. The edited function runs ~50 times faster than the original.

-Data Input: a data matrix

-Backbone: the sample function; functions in ecodist; functions in picante

-Current Availability (for R only): for-loops; the sample function; randomizeSample

-Deliverable: a set of functions that will quickly randomize data matrices and edits to picante functions to use these quicker randomizations
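The three randomizations named above (keep only row totals, keep only column totals, keep both) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the picante implementation: the function names and the toy matrix are hypothetical, the matrix is assumed to have at least two rows and two columns, and the "keep both" version uses a simple checkerboard swap that is written here for presence/absence data.

```r
# Sketch only: function names and the toy matrix are hypothetical, not picante code.
# Assumes a sites-by-species matrix with at least two rows and two columns.

# Keep row totals: shuffle the entries of each row independently.
shuffle_within_rows <- function(m) {
  out <- t(apply(m, 1, function(x) x[sample.int(length(x))]))
  dimnames(out) <- dimnames(m)  # restore labels scrambled by the shuffle
  out
}

# Keep column totals: shuffle the entries of each column independently.
shuffle_within_cols <- function(m) {
  out <- apply(m, 2, function(x) x[sample.int(length(x))])
  dimnames(out) <- dimnames(m)
  out
}

# Keep both totals: repeatedly pick a random 2 x 2 submatrix and flip it when
# it forms a checkerboard (1,0 / 0,1 or 0,1 / 1,0). A flip leaves every row
# total and every column total unchanged.
swap_checkerboards <- function(m, n_tries = 1000) {
  for (k in seq_len(n_tries)) {
    rows <- sample.int(nrow(m), 2)
    cols <- sample.int(ncol(m), 2)
    sub <- m[rows, cols]
    if (sub[1, 1] == sub[2, 2] && sub[1, 2] == sub[2, 1] &&
        sub[1, 1] != sub[1, 2]) {
      m[rows, cols] <- sub[2:1, ]
    }
  }
  m
}

# Toy presence/absence matrix: 5 sites by 8 species.
set.seed(1)
samp <- matrix(rbinom(40, 1, 0.5), nrow = 5,
               dimnames = list(paste0("site", 1:5), paste0("sp", 1:8)))

rand_rows <- shuffle_within_rows(samp)
rand_cols <- shuffle_within_cols(samp)
rand_both <- swap_checkerboards(samp)

# Each randomization preserves the totals it is meant to preserve.
stopifnot(all(rowSums(rand_rows) == rowSums(samp)))
stopifnot(all(colSums(rand_cols) == colSums(samp)))
stopifnot(all(rowSums(rand_both) == rowSums(samp)),
          all(colSums(rand_both) == colSums(samp)))
```

The swap loop is the kind of step that interpreted R handles slowly for large matrices and many replicates, which is why a compiled C/C++ extension is one of the options listed above.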


References

1. Clarke K.R. & Warwick R.M. (1999) The taxonomic distinctness measure of biodiversity: weighting of step lengths between hierarchical levels. Marine Ecology Progress Series, 184, 21-29

2. Clarke K.R. & Warwick R.M. (1998) A taxonomic distinctness index and its statistical properties. Journal of Applied Ecology, 35, 523-531

3. Faith D.P. (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1-10

4. Hardy O.J. & Senterre B. (2007) Characterizing the phylogenetic structure of communities by an additive partitioning of phylogenetic diversity. Journal of Ecology, 95, 493-506

5. Helmus M.R., Bland T.J., Williams C.K. & Ives A.R. (2007a) Phylogenetic measures of biodiversity. The American Naturalist, 169, E68-E83

6. Helmus M.R., Savage K., Diebel M.W., Maxted J.T. & Ives A.R. (2007b) Separating the determinants of phylogenetic community structure. Ecology Letters, 10, 917-925

7. Ives A.R. & Godfray H.C. (2006) Phylogenetic analysis of trophic associations. The American Naturalist, 168, E1-E14

8. Webb C.O., Ackerly D.D., McPeek M.A. & Donoghue M.J. (2002) Phylogenies and community ecology. Annual Review of Ecology Evolution and Systematics, 33, 475-505

Project Schedule

May 26-Jun 8: Com1A (2 weeks) -Code diversity metrics

Jun 9-18: Com1B (1.5 weeks) -Code regressions

Jun 19-Jul 6: Com1C (2.5 weeks) -Code bipartite analyses

Jul 7-Jul 14: Midterm evaluation and Com2 learning phase -Complete midterm evaluation -Learn the basics of current R randomizations and of writing C/C++ extensions

Jul 15-Aug 10: Com2 (3.5 weeks) -Code randomization functions -Edit picante functions to use the quicker randomizations

Aug 11-17: Document and debug code

Aug 18-Sep 1: Final evaluation

Weekly Updates

Community Bonding - Met Kembel and talked about project - Installed subversion - Learned about picante

Week of May 26 - Code PSV, PSE, PSR - Code PSR vs. Area