Difference between revisions of "R PopGen Hackathon"

From Phyloinformatics
<center><big>'''NESCent Hackathon on Population Genetics in R'''</big></center>
'''''The homepage of the [https://github.com/NESCent/r-popgen-hackathon Population Genetics in R Hackathon] has been moved to GitHub.'''''
NESCent is sponsoring a [http://en.wikipedia.org/wiki/Hackathon hackathon] to be held at NESCent in Durham, North Carolina, on March 16-20, 2015, with the objective of helping to foster an interoperating ecosystem of scalable tools and resources for population genetics data analysis in the popular R platform. The event is designed to target interoperability, scalability, and workflow-building challenges among the many population genetics R packages that already exist. The gathering provides an opportunity for a diverse group of population genetics researchers, method developers, and people with other relevant areas of expertise to collaborate on code, documentation, use cases, and other resources that will aid their communities.
Research into population genetics, a field concerned with the variation of gene frequencies at the level of populations, provides key insights into evolutionary mechanisms and history. Examples of research questions include population structure; migration routes; gene flow between populations; divergence time estimation; association of genomic variation with the environment and phenotypes; and the epidemiology of infectious disease outbreaks. The broad and inexpensive availability of modern next-generation sequencing and genotyping technologies has led to a wealth of data and analytical methods for investigating these and other population genetics research questions. As a consequence, there is now a multitude of packages available for a large variety of population genetics analysis methods in R, a popular platform for statistical and mathematical computing. The [http://cran.r-project.org/web/views/Genetics.html Genetics Task View] alone mentions more than 30 packages, and end-to-end population genetic data analysis routinely also involves other packages, such as those for [http://www.bioconductor.org/packages/release/BiocViews.html#___Genetics molecular genetics] and [http://cran.r-project.org/web/views/Phylogenetics.html phylogenetics].
However, the organically grown wealth of methods and packages, combined with the exponential growth of datasets, has also made it harder for researchers to take full advantage of these resources. Not all of the existing R packages are actively used and maintained. Although some packages work well together, many others do not, or not well enough. A common base class that provides efficient storage of genetic data and promotes interoperability was one of the original goals of the Rgenetics project, but it was never implemented and remains lacking. The method implementations in many packages do not scale well to the large-volume datasets that are increasingly common. Individual population genetics packages are often designed without considering how they will fit into beginning-to-end workflows. As a consequence, creating complex analysis workflows that require moving data, metadata, and other state information from one package to another can be painful. Visualization of the results obtained from analytic methods or workflows is often important, but not all packages return results in a form that is directly compatible with popular visualization packages such as ggplot2.
To capitalize on the opportunities presented by the growing amount of data and by the active researcher and R package developer communities, NESCent is sponsoring a [http://en.wikipedia.org/wiki/Hackathon hackathon] with the aim of addressing the challenges described above, and ultimately of helping to foster an interoperating ecosystem of tools and resources for both users and researcher-developers. A hackathon, or codefest, is an event at which population genetics methods users, developers, and others with domain or technical expertise have the opportunity to collaboratively develop code, documentation, testable use cases, and other resources that fill the gaps that are holding back their communities.
== Goals and objectives ==
Participants will collectively identify specific goals and coding targets prior to and during the event. Broadly, however, the event is designed to increase interoperability between the many population genetics-related packages and to improve their scalability for big data. Targets in line with those objectives include, but are not limited to, the following:
# Create common core classes and scalable data structures for efficient storage of genetic data, including large-volume data.
# Develop a vignette document (Rnw or Rmd) that provides an example beginning-to-end workflow exercising the key packages. This will reveal the pain points in package interoperability and focus attention on how these problems can be reduced.
# Resurrect, or develop alternatives to, package dependencies that are deprecated (such as genetics).
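To make the first target more concrete, the following is a minimal sketch of what a compact core class could look like. It is purely illustrative and is not code from the hackathon or from any existing package: the class name <code>SnpGenotypes</code>, its slots, the function <code>altFrequency</code>, and the encoding (alternate-allele counts 0, 1, 2 stored as one raw byte per call, with 3 standing for a missing call) are all assumptions made for this example.

```r
# Hypothetical sketch: a minimal S4 container for biallelic SNP genotypes.
# All names and the byte encoding are illustrative assumptions.
library(methods)

setClass(
  "SnpGenotypes",
  representation(
    calls = "matrix",       # individuals x loci; raw bytes 0/1/2, 3 = missing
    population = "factor"   # one population label per individual
  ),
  validity = function(object) {
    if (!is.raw(object@calls))
      return("calls must be a raw matrix")
    if (nrow(object@calls) != length(object@population))
      return("one population label is needed per individual")
    TRUE
  }
)

# Alternate-allele frequency per population (rows) and locus (columns).
altFrequency <- function(x) {
  counts <- matrix(as.integer(x@calls), nrow = nrow(x@calls))
  counts[counts == 3L] <- NA                      # decode missing calls
  by_pop <- split(seq_len(nrow(counts)), x@population)
  do.call(rbind, lapply(by_pop, function(idx) {
    # mean allele count divided by 2 (diploid) gives the allele frequency
    colMeans(counts[idx, , drop = FALSE], na.rm = TRUE) / 2
  }))
}

# Four individuals, two loci, two populations (values filled column-wise).
g <- new("SnpGenotypes",
         calls = matrix(as.raw(c(0, 1, 2, 2,  1, 1, 3, 0)), nrow = 4),
         population = factor(c("A", "A", "B", "B")))
freq <- altFrequency(g)
```

Storing each call in a single byte rather than as a character string or a double is one simple way to reduce memory use. A real design would also have to handle multi-allelic loci, ploidy, and on-disk storage, none of which this sketch attempts.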
All code and documentation from the hackathon will be made available immediately and freely to the community under an open-source ([http://www.opensource.org/licenses/alphabetical OSI-approved]) license.
== Participants ==
Participation will be arranged through an open Call for Participation.
== Organization ==
'''Organizing Committee'''
* Thibaut Jombart, [[User:Hlapp|Hilmar Lapp]], Stéphanie Manel, Emmanuel Paradis, Bastian Pfeifer, Greg Warnes
* [http://www.hsph.harvard.edu/carey/ Vince Carey], a core developer of [http://www.bioconductor.org Bioconductor], has agreed to serve in an advisory role during the planning phase of the event.
'''Logistics and Schedule'''
The hackathon is slated to take place March 16-20, 2015 at NESCent in Durham, North Carolina.
== Related external resources ==
=== R Packages and projects ===
* [http://sourceforge.net/p/r-genetics/code/HEAD/tree/trunk/ Old "R-Genetics" project SVN repository]
* [http://cran.r-project.org/web/views/Genetics.html CRAN Task View: Statistical Genetics]
* [http://www.bioconductor.org/packages/release/BiocViews.html#___Genetics Bioconductor packages in category Genetics]
* [http://dyerlab.github.io/gstudio/ GStudio (R. Dyer)]
* [http://cran.r-project.org/web/packages/ff/index.html ff-package (big data issue)]
* [http://ape-package.ird.fr/pegas.html pegas home page] with a doc describing the data classes, and a doc on how to read data files in various formats, including VCF
* [http://ape-package.ird.fr/ ape home page] with a short description of coalescentMCMC
=== Other tools and projects ===
* [http://snpeff.sourceforge.net/ SnpEff]: Genetic variant annotation and effect prediction toolbox.
* [http://vcftools.sourceforge.net/ VCFtools]

Latest revision as of 19:46, 22 March 2015
