R Hackathon 1/Documentation

From NESCent Informatics Wiki
Revision as of 12:07, 10 December 2007 by Res20


Vignettes and Documentation

  • Vignettes are built with Sweave as part of the package building process, so if we want vignettes that are officially integrated with R, we have to use Sweave. You need to know LaTeX, but the .Rnw file format that Sweave turns into LaTeX isn't a big step beyond that. You also need access to a LaTeX installation. So there are quite a lot of entry requirements, but once you're there it is a good system for creating nice-looking documentation, and it keeps a direct link between the R code and the documents. --DavidOrme 05:16, 22 November 2007 (EST)
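
To make the .Rnw format concrete, here is a minimal sketch that writes a tiny .Rnw file and runs it through Sweave. The file name demo.Rnw, the chunk label, and the chunk contents are made up for illustration; only the Sweave step is shown, and producing a PDF from the resulting .tex file would still need a LaTeX installation.

```r
# Minimal sketch: a tiny .Rnw source, held as a character vector.
# The file name and chunk contents are invented for illustration.
rnw <- c(
  "\\documentclass{article}",
  "\\begin{document}",
  "A vignette mixes \\LaTeX{} text with R code chunks.",
  "<<mean-example, echo=TRUE>>=",
  "x <- c(1, 2, 3, 4)",
  "mean(x)",
  "@",
  "The mean is \\Sexpr{mean(x)}.",
  "\\end{document}"
)

# Work in a temporary directory so nothing is left behind.
workdir <- tempfile("vignette-demo")
dir.create(workdir)
oldwd <- setwd(workdir)
writeLines(rnw, "demo.Rnw")

# Sweave() runs the code chunk and writes demo.tex next to the source.
utils::Sweave("demo.Rnw", quiet = TRUE)
tex <- readLines("demo.tex")
setwd(oldwd)
```

The code between the `<<...>>=` line and the `@` line is executed by Sweave, and its input and output are inserted into the LaTeX file, which is what keeps the document and the R code in step.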

Good examples?

Important reading matter?

R programming books

Here are a few references on R programming (I only just got the first two, but they look good):

This book has sections on writing functions, methods, classes (including S4 classes), and working with objects and databases in general:

  • Programming with Data: A Guide to the S Language, by John M. Chambers, 1998, ISBN 0-387-98503-4 [1]


This book is much briefer, with chapters on syntax, calls, classes, and calling compiled C code:

  • S Programming, W.N. Venables, B.D. Ripley, 2000 [2]


A general reference for the programming language:

  • The New S Language, R.A. Becker, J.M. Chambers, A.R. Wilks, 1988 [3]

-M. Butler

R package writing

Most of this is probably well-known to all attendees who need to know it, but for the sake of reference here are a few links:

It may also be useful to consult BioConductor's guidelines and standards:

Package citation

R packages can have a file named CITATION in their package structure (in the inst/ sub-directory), giving the information that will be printed by the R function call citation("packageName").

For example this is the CITATION file for the base package (that comes with every standard R installation):

citHeader("To cite R in publications use:")

citEntry(entry="Manual",
         title = "R: A Language and Environment for Statistical Computing",
         author = person(last="R Development Core Team"),
         organization = "R Foundation for Statistical Computing",
         address      = "Vienna, Austria",
         year         = version$year,
         note         = "{ISBN} 3-900051-07-0",
         url          = "http://www.R-project.org",
         
         textVersion = 
         paste("R Development Core Team (", version$year, "). ", 
               "R: A language and environment for statistical computing. ",
               "R Foundation for Statistical Computing, Vienna, Austria. ",
               "ISBN 3-900051-07-0, URL http://www.R-project.org.",
               sep="")
         )

citFooter("We have invested a lot of time and effort in creating R,",
          "please cite it when using it for data analysis.",
          "See also", sQuote("citation(\"pkgname\")"),
          "for citing R packages.")

It gives the following output from the citation() function:

> citation()

To cite R in publications use:

  R Development Core Team (2006). R: A language and environment
  for statistical computing. R Foundation for Statistical
  Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
  http://www.R-project.org.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Development Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2006},
    note = {{ISBN} 3-900051-07-0},
    url = {http://www.R-project.org},
  }

We have invested a lot of time and effort in creating R, please
cite it when using it for data analysis. See also
‘citation("pkgname")’ for citing R packages.

>
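
As a hedged sketch of what a package's own entry could look like: current versions of R also provide bibentry() in the utils package, which builds the same kind of citation object. The package name "phylodemo", the title, the author, and the year below are placeholders, not a real reference.

```r
# Sketch of a citation entry for a hypothetical package "phylodemo";
# title, author, year, and note are placeholders for illustration.
entry <- utils::bibentry(
  bibtype = "Manual",
  title   = "phylodemo: An Example Package for Comparative Methods",
  author  = utils::person("Jane", "Doe"),
  year    = "2007",
  note    = "R package version 0.1",
  header  = "To cite phylodemo in publications use:"
)

print(entry, style = "citation")  # text form, plus the BibTeX entry
bib <- utils::toBibtex(entry)     # BibTeX form as a character vector
```

In a real package only the bibentry() call would go into inst/CITATION; the print() and toBibtex() lines are there to show what users would see.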

S4

These documents contain useful, yet sometimes quite old, information about S4 classes.

  • S4 classes in 15 pages: a quite complete overview of S4 classes. Note that the instruction on p. 5 for declaring a listOrNULL class seems outdated; it can be replaced by setClassUnion("listOrNULL", c("list","NULL")).
  • F. Leisch's presentation of S4 classes: a short yet complete document, using the pixmap package as its example. I did not find any outdated instructions (it was written for UseR 2004).
  • S4 description by J. Chambers himself: a very interesting document, but with few examples of S4 classes and methods. Reading one of the first two documents beforehand may be useful.
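
A small sketch of the setClassUnion() replacement mentioned in the first item. The class "AnnotatedTree" and its slot names are hypothetical and only serve to show a slot that may hold either a list or NULL.

```r
library(methods)

# A class union lets a slot accept either a list or NULL.
setClassUnion("listOrNULL", c("list", "NULL"))

# "AnnotatedTree" and its slots are hypothetical names for illustration.
setClass("AnnotatedTree",
         representation(tip.labels  = "character",
                        annotations = "listOrNULL"))

# Both of these are valid objects.
bare   <- new("AnnotatedTree", tip.labels = c("A", "B"), annotations = NULL)
tagged <- new("AnnotatedTree", tip.labels = c("A", "B"),
              annotations = list(source = "made-up example"))

# Anything that is neither a list nor NULL is rejected when the object is built.
bad <- try(new("AnnotatedTree", tip.labels = "A", annotations = 1),
           silent = TRUE)
```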

I have taken these references from the R wiki ([4]). Other useful links can be found there. Please add yours!

--Jombart 04:57, 30 November 2007 (EST)

Version control

Most of us will be using Subversion for revision (or version) control; you will use CVS only if you are working with an existing CVS-based source code repository. The command-line client of Subversion is svn, which is why you will often find svn used as shorthand for Subversion. Most of the following is shamelessly stolen from the BioConductor summary page on svn, with a few additional comments or examples.

Learning about Subversion:

Getting the code:

  1. Obtain an account at R-Forge.
  2. Request developer status with the PhyloConductor project.
  3. Make sure you can run ssh from the command line.
  4. Upload your ssh public key to your account at R-Forge.
  5. Install a Subversion client (the command-line svn, or a graphical client such as TortoiseSVN or svnX).
  6. Check out the code:
    • svn checkout svn+ssh://developername@svn.r-forge.r-project.org/svnroot/phyloc

Basic svn operations:

  • svn commit commits your changes. You can specify a file (or files), or name none, in which case all of your changes are committed.
    • Don't leave the commit message blank (you can also specify it on the command line using the -m option) - you control how useful it will be later to you and others trying to identify why a particular change was made.
    • The commit message should be succinct, and should not state things that svn will record already (namely, who made the commit, when it was made, which file(s) it affected, and which lines in the file). Rather, state the motivation for the changes you made as succinctly as possible (e.g., "Fixed bug in NEXUS parser that prevented node labels from being read.", or "Added support for HKY model.").
    • It is OK to commit changes that don't fully work yet; just state that in your commit message (e.g., "Added support for continuous characters. Not tested yet.").
    • Annoyingly, svn does not come with an editor for writing the commit message pre-configured. Either set the EDITOR environment variable, or customize your local svn configuration by editing the file $HOME/.subversion/config and adding a line
      editor-cmd = vi
      or whatever your favorite editor is (you may just have to uncomment and modify this line).
  • svn update will update your checkout from the server to get any new changes.
  • svn resolved foo declares that a conflict has been resolved.
  • svn add foo will schedule foo for addition to the repository at the next commit (note that, unlike in CVS, the add is recursive; use the -N switch if you don't want this behavior).
  • svn delete foo will delete foo. If foo is a file, it is removed from your local copy as well. If it is a directory, it is not removed locally but is scheduled for deletion.
  • svn copy foo bar will make a copy of foo named bar and copy the history.
  • svn move foo bar is the same as copy except foo gets deleted.
    • This is by far the preferred way of renaming files under version control (rather than copying them by hand), because the log will record where a file was copied (or renamed) from.
    • This can be equally well applied to directories - in svn, directories are as much versioned as files are.

More advanced commands:

  • svn status foo will show you information about the file, particularly changes that you've made.
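  • svn diff foo will show you exactly how your local copy of foo differs from the version you last retrieved from the server.
  • svn revert foo will discard your local changes to foo and restore the version you last retrieved from the server.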
  • svn log foo will show the log history for that file.
    • Using the --verbose switch and a directory instead of a file (e.g., svn log --verbose .) yields a log of all commits in that directory (and subdirectories) with the commit messages, and for each commit which files were added, modified, and deleted. This lets you determine the revision in which a file was deleted that you want to restore.
    • For files that were copied using svn copy, the log will show which file (and version) it was copied from.
  • svn copy -r 1234 <URL of server>/path/to/foo foo restores (by copying, and therefore recording that act in the log) foo from revision 1234 if that was the last version of foo before it was deleted (use svn log to figure out the revision to restore from).

Many of these commands have extra possible arguments. You can get help on diff, for example, like this:

$ svn help diff

If you forget the name of the command, simply typing

$ svn help

will result in a list of possible commands with short explanations. The Subversion Book will have more complete documentation and examples for all the commands and options.

Q & A

Here are some questions we are facing, and answers, pointers to further information, and comments. Most of the (edited) answers are from Wolfgang Huber, a core developer of Bioconductor.

S4 and design of package and data structures

Q: Use S4 objects?

  • S4 is great. Definitely use S4; it allows modular design, extensibility and encapsulation. Maybe not too early, though: I have often found it useful to write the first version using the atomic data structures of R, or S3, and then, once the structures are clearer, move stuff to S4.
  • There can be performance penalties with S4, so it is important to think about which operations are frequent and expensive, and to make sure they can be vectorized and put into the bottom level of a complex dispatch hierarchy.
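
A minimal sketch of what these two points can look like in practice, assuming a made-up class "TraitData" and a made-up generic traitRange(): the class checks its own consistency, users go through the generic, and the method does its work in a single vectorized call rather than dispatching once per element.

```r
library(methods)

# "TraitData" is a hypothetical class used only for illustration.
setClass("TraitData",
         representation(taxa = "character", values = "numeric"),
         validity = function(object) {
           if (length(object@taxa) != length(object@values))
             return("taxa and values must have the same length")
           TRUE
         })

# Users call the generic; the slot layout stays an internal detail.
setGeneric("traitRange", function(x, ...) standardGeneric("traitRange"))
setMethod("traitRange", "TraitData", function(x, ...) {
  range(x@values)  # one vectorized call on the whole slot
})

td  <- new("TraitData", taxa = c("A", "B", "C"), values = c(2.5, 1.0, 4.2))
rng <- traitRange(td)

# The validity function rejects inconsistent objects.
bad <- try(new("TraitData", taxa = c("A", "B"), values = 1), silent = TRUE)
```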

Q: Priority of designing S4 classes?

  • Having scientifically sound functionality, properly defining the use cases, and writing good documentation are more important.
  • So is the use of namespaces, with detailed import directives in the package NAMESPACE file (to avoid name masking). I would rank using S4 as the most important thing after these.
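
As an illustration of detailed import directives, a NAMESPACE file could look roughly like the fragment below. This is a sketch of a configuration file, not code to run at the R prompt; "TraitData", "traitRange", and "readTraits" are hypothetical names, and the import from ape assumes the package actually uses read.tree.

```r
# NAMESPACE (sketch): import only what is used, export only the interface.
importFrom(methods, setClass, setGeneric, setMethod, new)
importFrom(ape, read.tree)

# Hypothetical names for this package's public interface.
exportClasses(TraitData)
exportMethods(traitRange)
export(readTraits)
```

Naming each imported function, rather than importing whole packages, is what keeps a same-named function in another attached package from masking the one you meant.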

Q: What are the lessons learned from Bioconductor that we should heed? What works well and what doesn't?

  • Avoid premature optimisation. Get a simple working version of the program quickly, then iteratively improve it in those places that need it (rather than taking the top-down approach of a heavy design phase followed by implementation, which rarely works).
  • Mostly this is generic software engineering. Discussing the relevant use cases is important, to avoid over-designing and trying to stuff in too many features. It makes sense to start with simple use cases and implementations and then extend them as the need arises.
  • I have found that writing package vignettes early on, which try to explain how to get a certain task done, has often helped me enormously in deciding on the proper functionality and interface of a piece of software.

Common data structures

Q: Different packages for phylogenetic and comparative analysis use different ways of internally representing the data (and for I/O). Is this important to address?

  • That is a major issue. Using a common data structure as much as possible is really crucial, because it makes it so much more efficient to compare methods and to combine them. Obviously, it can be hard and time-consuming to get different authors to agree on anything, so priority can go to those types of data that most need to be shared.
  • Rather than abstractly defining a data structure and then hoping for someone to implement it, it is probably more practical for someone to take the lead, decide on representation and interfaces (and here S4 really helps), and make it useful enough for others to want to use it. The new ExpressionSet is an example.

Code in R versus code in C

Q: Are there any rules of thumb that have proven or emerged as useful from Bioconductor?

  • Implement everything in R, then profile, and move those parts that are time-critical and cannot be easily optimised into C. Moving to C too early is a waste of (developers') time - which is far more expensive than CPU time.
  • Be sure to understand and use the vectorized functions of R, as well as functions such as split, table, and strsplit.
  • Of course, if you already have legacy code in C, use it.
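
A short sketch of the functions named above, using invented tip labels and body sizes. The timing at the end uses system.time() as a crude stand-in for profiling; for real code, Rprof() gives a per-function breakdown.

```r
# Made-up tip labels of the form "Genus_species" and invented body sizes.
labels <- c("Anolis_carolinensis", "Anolis_sagrei", "Sceloporus_undulatus",
            "Anolis_equestris", "Sceloporus_occidentalis")
sizes  <- c(6.5, 5.8, 7.1, 14.0, 8.3)

# strsplit() splits every label in one call; sapply() picks the genus part.
genus <- sapply(strsplit(labels, "_", fixed = TRUE), `[`, 1)

# table() counts species per genus; split() groups the sizes by genus.
counts      <- table(genus)
sizeByGenus <- split(sizes, genus)
meanSize    <- sapply(sizeByGenus, mean)

# Time an explicit loop against the vectorized equivalent before
# deciding that anything needs to move to C.
x <- runif(1e5)
loopTime <- system.time({
  total <- 0
  for (xi in x) total <- total + xi^2
})
vecTime <- system.time(total2 <- sum(x^2))
```

Both versions compute the same sum of squares; comparing loopTime and vecTime on your own machine shows how much the vectorized form already buys without leaving R.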

Interfacing with external programs

Q: Are there particular issues with this in R, and what are the utility methods being provided by R? Tips about parsing input into R data structures and writing output that are special to R?

  • For interfaces that work via shared memory, it works best in C and can be somewhat harder with C++ or Java. Have a look at the "Writing R Extensions" manual.
  • Communicating with other tools via files is usually trivial, and there exist import and export filters for many different file formats. See the R Data Import/Export manual on CRAN: http://www.stats.bris.ac.uk/R
  • It is of course always preferable to use existing, well-tested code over reimplementing stuff. On the other hand, portability can suffer when too many different tools are necessary on the host system! (For some applications it is not crazy to think of running R on a playstation or a mobile phone.)
  • There are also the RCurl and XML packages for talking to web services.
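
Communicating via files can be sketched in a few lines of base R. The trait table below is invented, and a temporary file stands in for whatever file an external program would read or write.

```r
# Invented trait table, written to and read back from a temporary file.
traits <- data.frame(taxon = c("A", "B", "C"),
                     mass  = c(12.5, 3.2, 48.0),
                     stringsAsFactors = FALSE)
path <- tempfile(fileext = ".tsv")

# Export: tab-separated text that an external program could read.
write.table(traits, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# Import: parse the file back into a data frame.
back <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
```

An external tool would typically be run between the two steps, for example with system(), and its output file read in the same way.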

Visualization

Q: Good reference material for R's advanced plotting facilities?

Documentation

Q: Reference material and/or recommended reading on how best to write vignettes?

  • I think it is the same as with any writing - pick some favorite examples and develop your own style from there. For example, the "vsn" package's vignette (.Rnw source) :)
  • The "weaver" package is very useful, because it lets you "cache" the results of code chunks so you can recompile an .Rnw file without redoing all the computations that have not changed.