R Hackathon 1/Proposal

From Phyloinformatics
Revision as of 18:31, 6 September 2007 by Hilmar (talk)

This proposal is being developed by Todd Vision, Hilmar Lapp, Amy Zanne, Samantha Price, and Brian O'Meara, based on an Informatics Whitepaper submitted by Samantha Price and Amy Zanne.

Synopsis

There is an increasing variety of statistical methods and tools for comparative evolutionary analysis of species trait data using phylogenetic information. Such methods are used, for example, to investigate questions of trait evolution, such as covariation between traits or correlation of traits with ecological and environmental factors, and many other questions about rates and mode of evolution. Most of these tools use idiosyncratic input and output formats and incompatible user interfaces; their capabilities overlap, but no individual tool's feature set is comprehensive.

NESCent is interested in promoting interoperability, the support of data exchange standards, and greater usability of tools and methods across evolutionary bioinformatics. In keeping with these objectives, we propose to convene a hackathon on integrating comparative methods within the R statistical analysis platform.

Proposal

Background

Analyzing species traits in a comparative framework using phylogenetic information is a powerful way to improve the understanding of the many different aspects of the evolution of traits, adaptations, and species. There is an increasing variety of statistical methods and tools to rigorously test hypotheses about rates and mode of trait evolution, trait covariation, correlation of traits with ecological and environmental factors, host/parasite co-evolution, and others.

Aside from using different implementation platforms and user interfaces, most of these tools use idiosyncratic input and output formats and incompatible data models. Their capabilities overlap, but no individual tool's feature set is comprehensive enough, or scales well enough, to handle the vast majority of datasets one would like to use or hypotheses one would like to test. Furthermore, it is difficult to determine which comparative methods are actually missing. Since very few of the existing implementations are built with reusability and extensibility in mind, implementing a new method typically results in a new tool being built from scratch.

From a scientist's perspective, this data analysis is often inherently statistical in nature, but many of the comparative methods are not easily embedded in a powerful statistical analysis platform such as R. Likewise, there is little if any integration of such methods or platforms with phylogenetic inference tools such as PAUP* and MrBayes.

This situation is in some ways similar to that in genomics at the advent of microarray-based technologies, before the creation of Bioconductor. Bioconductor is a collection of freely available open source packages for the R statistical analysis platform (which is itself open source software); the packages interoperate on the basis of a common data model and implement various bioinformatics analyses. Bioconductor transformed the field by putting rigorous data transformations, analysis methods, and statistical tests in the hands of bench researchers, and by enabling the rapid validation and dissemination of new methods. It made a vast body of work from other fields on testing statistical and probabilistic models readily accessible to the microarray community, because method developers can easily take advantage of it. Meanwhile, newly developed methods in microarray analysis are typically published as Bioconductor packages, allowing them to be rapidly adopted into existing R-based analysis pipelines.

Goals

The goal of the proposed hackathon is to bring together different people and groups who have started to develop comparative methods in the R platform, or who would like to integrate their methods into the R platform or interface a tool with it. The proposed event has the following objectives:

  1. Ensuring compatibility and data flow between packages, for example by agreeing on and implementing a common model for commonly used data.
  2. Proposing an organization of modules and functionality that increases code re-use and eases extensibility.
  3. Identifying areas (data types, package implementation rules, etc.) that interface with Bioconductor, and ensuring compatibility.
  4. Designing and implementing prototypes of functionality that is often needed but missing from the existing comparative method packages in R.
  5. Identifying comparative methods best suited to native integration into the R platform, and prototyping the integration for select targets.
  6. Identifying external tools or platforms best suited for interfacing with the R platform, and prototyping the interface for select targets.

The hackathon will concentrate on writing code. All code will be open source and immediately and freely available to the community.

Organization

We propose to organize the hackathon around subgroups of 3-5 participants each. Each subgroup would be tasked with tackling a set of problems that advance one or more of the objectives of the hackathon.

Subgroups

Planning Steps