Difference between revisions of "R Hackathon 1/Programming Goals"

From Phyloinformatics

Latest revision as of 13:39, 30 November 2007

End-user goals are on a separate page.

Goals from a programmer perspective

Creating an interoperable data representation in R

There are many ways to represent trees in R, and not all of them can be converted into one another (see the diagram of possible conversions, Treeformats.png). A major goal of the hackathon is to establish stable standards for trees, networks, data, and perhaps model descriptions and analysis settings.

Benefits if such a standard is implemented:

  • We should be able to load a tree and character data from a NEXUS file (see Supporting NEXUS and NEXUS Specification) and then use functions from various packages without requiring conversion of the objects holding the tree(s) and data.
  • New R packages for comparative methods will use this standard.
  • Existing packages will ideally be modified to adopt this standard; at a minimum, they should be able to convert easily between the standard and their internal representations. Alternatively, we could create a new package, or modify an existing one, that converts among all the currently implemented tree and data representations for comparative methods.

Desired features:

  • rooting
  • branch lengths
  • tree weights
  • information about ancestral state reconstructions/assignments
  • labels
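
As a starting point for discussion, the desired features above could be collected in a single formal class. The following is only a sketch: the class name TreeSketch and all slot names are hypothetical, not an agreed standard, and ancestral state information is left out to keep the example short.

```r
library(methods)

# Hypothetical class; slot names are illustrative, not an agreed standard.
setClass("TreeSketch",
  representation(
    edge        = "matrix",    # one row per edge: parent node, child node
    edge.length = "numeric",   # branch lengths, or length 0 if absent
    tip.label   = "character", # labels for the tips
    rooted      = "logical",   # explicit rooting flag
    weight      = "numeric"    # tree weight
  ),
  validity = function(object) {
    if (ncol(object@edge) != 2)
      return("edge must have two columns (parent, child)")
    n <- length(object@edge.length)
    if (n != 0 && n != nrow(object@edge))
      return("edge.length must be empty or have one value per edge")
    if (length(object@rooted) != 1 || is.na(object@rooted))
      return("rooted must be a single TRUE or FALSE")
    TRUE
  }
)

# The rooted tree ((A,B),C): tips are nodes 1-3, the root is node 4,
# and node 5 is the ancestor of A and B.
tr <- new("TreeSketch",
  edge        = matrix(c(4, 5,
                         5, 1,
                         5, 2,
                         4, 3), ncol = 2, byrow = TRUE),
  edge.length = c(1.0, 0.5, 0.5, 1.5),
  tip.label   = c("A", "B", "C"),
  rooted      = TRUE,
  weight      = 1
)
```

Storing rooting as an explicit flag, rather than inferring it from the shape of the tree, is one way to avoid losing that information when a tree passes between packages.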

The page for working on the standards is https://www.nescent.org/wg_phyloinformatics/R_Hackathon_1/Data_Standards.

We have a test set of sample trees and datasets (RHackTrees.zip) covering a variety of input formats (trees with and without branch lengths, rooting, labels, etc.). It is useful for evaluating how individual packages load and represent these data. For each package, the following information is summarized at https://www.nescent.org/wg_phyloinformatics/R_Hackathon_1/Current_Data_Representations:

  • the internal representation of the files, with notes on the procedure for loading them (see the 'phylo' class (pdf) or the scheme for coding nucleotides (pdf) in APE as examples)
  • what was lost on loading (for example, is it clear which trees are rooted and which are unrooted?)
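
The rooting question can be made concrete with a small check on the input file itself, before any package has loaded it. The sketch below uses base R only and assumes the common convention (followed, as far as we know, by APE) that a tree whose basal node has exactly two descendants is read as rooted, while a basal polytomy is read as unrooted. The function names are hypothetical, and quoted labels containing parentheses or commas are not handled.

```r
# Count the subtrees attached directly to the outermost node of a Newick string.
count_basal_children <- function(newick) {
  chars <- strsplit(newick, "")[[1]]
  depth <- 0
  commas <- 0
  for (ch in chars) {
    if (ch == "(") {
      depth <- depth + 1
    } else if (ch == ")") {
      depth <- depth - 1
    } else if (ch == "," && depth == 1) {
      # a comma at depth 1 separates two children of the basal node
      commas <- commas + 1
    }
  }
  commas + 1
}

# Assumed convention: two basal children means the tree is read as rooted.
looks_rooted <- function(newick) count_basal_children(newick) == 2

looks_rooted("((A:1,B:1):0.5,C:1.5);")  # TRUE
looks_rooted("(A:1,B:1,C:1);")          # FALSE
```

Comparing this answer with what each package reports after loading the same file is one way to document whether rooting information survived.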

Consolidate redundant and fill in missing functionality

We have compiled all the help documents for functions related to comparative methods in R into a single searchable, sortable table (http://www.brianomeara.info/allpackages3.html), with all the functions categorized into major and minor types. This may help, for example, in finding all the relevant functions for plotting trees across packages. If your package is not listed yet, and you're comfortable releasing info on its functions, please email Brian O'Meara (bcomeara@nescent.org).

Overviews of the relevant packages are at https://www.nescent.org/wg_phyloinformatics/R_Hackathon_1/Package_Overviews.

Interfacing with other programs or architectures

R can call external programs and scripts (see the "System and foreign language interfaces" section of the Writing R Extensions manual). Currently, comparative methods and related approaches that are not available in R are generally implemented as standalone programs (commonly written in C or C++; see Felsenstein's list of relevant programs) or as modules for Mesquite (see the overview of Mesquite's features).

Letting users run such analyses from within R will reduce the need to re-code particular methods. Several relevant programs (PAUP, MrBayes, Mesquite, r8s, Brownie, others?) can be run by passing them a NEXUS file containing trees, data, and a block of commands, and the output(s) they produce can be specified. Other programs (the PHYLIP programs, BayesTraits) can be batched, but not directly through NEXUS files.

Particular goals:

  • Mesquite calling R modules.
    • See JRI for Java -> R calls
  • R calling Mesquite modules (see the somewhat outdated documentation for developing for Mesquite)
    • See rJava for R -> Java calls
  • R calling existing software as external programs (PAUP? MrBayes? others?)
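
The last goal, R calling an external program, might look roughly like the sketch below: write a NEXUS file containing a tree and a program-specific command block, then hand the file to the program with system2(). The helper names are hypothetical. The executable name "paup", the commands in the block, and the idea that the program takes the file path as its only argument are assumptions for illustration; each program's own documentation should be checked. If the executable is not on the PATH, the call is skipped.

```r
# Write a NEXUS file holding one tree plus a program-specific command block.
write_nexus_job <- function(newick, block, commands,
                            path = tempfile(fileext = ".nex")) {
  lines <- c(
    "#NEXUS",
    "BEGIN TREES;",
    paste0("  TREE tree1 = ", newick),
    "END;",
    paste0("BEGIN ", toupper(block), ";"),
    paste0("  ", commands),
    "END;"
  )
  writeLines(lines, path)
  path
}

# Run an external program on the file, capturing its console output.
run_external <- function(program, nexus_path) {
  exe <- Sys.which(program)
  if (!nzchar(exe)) {
    message("'", program, "' not found on PATH; skipping the call")
    return(invisible(NULL))
  }
  system2(exe, args = shQuote(nexus_path), stdout = TRUE, stderr = TRUE)
}

# Illustrative only: the block name and commands are assumptions.
job <- write_nexus_job("((A:1,B:1):0.5,C:1.5);",
                       block = "paup",
                       commands = c("set autoclose=yes;", "quit;"))
out <- run_external("paup", job)
```

Results would then be read back into R from whatever output files the command block asks the program to write.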

New functions

See end user goals for a discussion of missing functionality.

S3/S4

A page of extensive S4 information (mostly PDFs) is at http://wiki.r-project.org/rwiki/doku.php?id=tips:classes-s4.
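
For readers weighing the two systems, the following sketch defines the same tiny tree object and tip-counting function in both. All class and function names are hypothetical. The point it illustrates is that S3 classes are just an attribute on an object, with no check on contents, whereas S4 classes declare typed slots that are checked when an object is created.

```r
library(methods)

# S3: the class is an attribute; nothing checks what the list contains.
tree_s3 <- structure(list(tip.label = c("A", "B", "C")), class = "tree_s3")
n_tips <- function(x, ...) UseMethod("n_tips")
n_tips.tree_s3 <- function(x, ...) length(x$tip.label)

# S4: slots are declared with types, and methods are registered formally.
setClass("TreeS4", representation(tip.label = "character"))
setGeneric("nTips", function(x) standardGeneric("nTips"))
setMethod("nTips", "TreeS4", function(x) length(x@tip.label))
tree_s4 <- new("TreeS4", tip.label = c("A", "B", "C"))

n_tips(tree_s3)  # 3
nTips(tree_s4)   # 3

# S3 silently accepts integer labels; S4 refuses a wrongly typed slot.
bad_s3 <- structure(list(tip.label = 1:3), class = "tree_s3")
bad_s4 <- try(new("TreeS4", tip.label = 1:3), silent = TRUE)
inherits(bad_s4, "try-error")  # TRUE
```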