R PopGen Hackathon
The broad and inexpensive availability of modern next-generation sequencing and genotyping technologies has led to a wealth of data and analytical methods for investigating population genetics research questions. Natural selection and the resulting changes in gene frequencies take place at the level of populations, and thus population genetics research provides key insights into evolutionary mechanisms and history, including population structure, migration routes, gene flow between populations, divergence time estimation, association of phenotypes with genomic loci, and disease epidemiology. As a consequence, there is now a multitude of packages available for a large variety of population genetics analysis methods in the popular R platform for statistical and mathematical computing. The Genetics Task View alone mentions more than 30 packages, and population genetic analyses will routinely also involve phylogenetics and other bioinformatics packages and methods.
However, the organically grown wealth of methods and packages, combined with the exponential growth of datasets, has also made it challenging for researchers to take full advantage of these resources. Not all of the existing R packages are actively used and maintained. Some packages work well together, but many others do not, or not well enough. A common base class that provides efficient storage of genetic data and promotes interoperability was one of the original goals of the Rgenetics project, but it was never implemented and remains lacking. The method implementations in many packages do not scale well to increasingly common large-volume datasets. Individual population genetics packages are often not designed with integration into beginning-to-end workflows in mind, and as a consequence, creating complex analysis workflows that require moving data, metadata, and other state information from one package to another can be painful.
To capitalize on the opportunities presented by the rich and growing landscape of large datasets and by the active R package user and developer communities, NESCent is sponsoring a hackathon that aims to address the challenges described above and, ultimately, to help foster an interoperating ecosystem of tools and resources for both users and researcher-developers. A hackathon, or codefest, is an event at which population genetics methods users, developers, and others with domain or technical expertise have the opportunity to collaboratively develop code, documentation, testable use cases, and other resources that fill the gaps that are holding back their communities.
Goals and objectives
Participants will collectively identify specific goals and coding targets prior to and during the event. Broadly, however, the event is designed to target the following objectives.
- Increase interoperability between the many population genetics-related packages.
- Improve scalability for big data among population genetics packages.
- Create common core classes and scalable data structures for efficient storage of genetic data, including large-volume data.
- Develop a vignette document (Rnw or Rmd) that provides an example beginning-to-end workflow exercising the key packages. This would reveal the pain points in package interoperability and thereby focus attention on how these problems can be reduced.
- Resurrect, or develop alternatives to, package dependencies that are deprecated (such as genetics).
All code and documentation from the hackathon will be made available immediately and freely to the community under an open-source (OSI-approved) license.
Participation will be arranged through an open Call for Participation.
- Thibaut Jombart, Hilmar Lapp, Stéphanie Manel, Emmanuel Paradis, Bastian Pfeifer, Greg Warnes
- Vince Carey, a core developer of Bioconductor, has agreed to serve in an advisory role during the planning phase of the event.
The hackathon is slated to take place March 16-20, 2015 at NESCent in Durham, North Carolina.
Detailed planning steps are outlined and documented separately. In particular, the following activities are taking place:
Related external resources
R Packages and projects
- Old "R-Genetics" project SVN repository:
- CRAN Task View: Statistical Genetics
- Bioconductor packages in category Genetics
- GStudio (R. Dyer)
- ff-package (big data issue)