R Hackathon 1/Teleconferences

2nd Teleconference 11/30/2007

Agenda

  1. Updates on action items
  2. Test datasets
  3. Data standards
  4. Presentations
    • Update poll on useful lightning talks
  5. Documentation and bootcamps

Synopsis

  1. Package overviews
    • Encouraging package owners to discuss their package in the package overviews page on the wiki (what each package does, future directions and plans for each package), and also to document any data structures their package uses on the current data representations page.
    • We will ask everyone to give a very quick introduction of themselves at the beginning of the hackathon - who are we, why are we at the hackathon? Package owners will be asked to give a very short (<5 minutes, no slides) overview of their code, with emphasis on a recap of package goals and features and on plans for future development.
  2. End user goals
    • Encouraging all to read and prioritize end-user goals on the wiki. Need to start narrowing down the list of potential goals (we will need to decide at the meeting).
    • Several noted the potential to implement not just a few specific methods but a more general emerging framework for likelihood based methods, creating an extensible base that others can build on and taking advantage of R's existing likelihood functions.
    • Some discussion of whether to re-implement existing methods natively in R. There seems to be some consensus that re-implementing things natively in R has both costs and benefits. The benefits include being able to cross-validate results and, once implemented, having a framework for future extension and use. All agree the priority should be on creating fundamentals that can be built on in the future. Other specific suggestions included moving beyond Brownian motion and incorporating extinction as ways to move the field forward in the long term.
    • Question of what type of end-user we are targeting - total newcomers to R or more experienced programmer types? Do we want to broaden the user base or just improve existing functions? All agree on the importance of worked examples and of interoperability among packages and with other software to make things friendlier to end-users.
  3. Programming goals
    • Redundancies in existing code (e.g. eight different implementations of Brownian trait evolution). Need to discuss what we do about this - do we merge code, keep separate code bases, or do something else? More generally, what do people suggest developers do with their existing packages if they are somewhat heterogeneous collections of functions - should they be split up, merged into other existing packages, or handled some other way? We will discuss further.
    • Mesquite can now call out to R functions (and vice versa). Please email Maddison with your ideas for priorities for connections between Mesquite and R. What are the most important R functions people would like to be able to call from Mesquite?
    • Discussion of what to do with the code we produce at the hackathon. The suggestion is that we produce a hackathon package that will contain everything we produce, including new and revised data standards. Others can then merge code from/to this package and/or depend on this package. This is how it is done currently with the sp spatial package, which provides a data standard and set of functions used by numerous other packages.
    • Need to decide how to give credit - currently many packages give credit in individual function documentation and package authors may be any person who contributed any code to the package.
  4. Source code repositories and version control
    • Encourage all package owners to fill in the source code management table on the wiki.
    • Contact Hilmar Lapp with any questions about using/setting up a source code repository or version control software (such as Subversion).
    • Consensus seems to be that we will have everyone place their code on R-Forge or a similar public source repository for the hackathon. Individual package owners can decide whether to branch their own code base for the hackathon if they are using version control.
  5. Benchmarks and test data sets
    • Let's share some big datasets and collect data from existing packages that can be used to benchmark and test code in different functions.
    • Data can be uploaded to the wiki (public) or to the WebDAV server (password protected).
  6. Data standards
    • Seems to be consensus that we start with the ape phylo tree structure as our data standard.
    • Discussion of whether we should build an S4 phylo data class. It was suggested that we can implement a phylo class in S4 and include it in the hackathon package for future use. We will try to start a discussion of new S3 and S4 data class proposals on the wiki.
    • If we implement a new standard in S4 or rework the existing standard, as long as we have a way to translate among structures the change will be transparent to the end-user. Future code can use phylo or any new standard and translate as appropriate.
    • Lots of discussion of whether to embed other data in the tree structure - do we extend the phylo class? There were several arguments for and against, but all agree that we need more functions and standards for linking trait data to trees and for checking those links for errors.
  7. Data input/output
    • All agree it will be important to work on implementing reading/writing of Nexus and Newick data (esp. on non-tree data - i.e. character data in Nexus files).
  8. Bootcamps
    • Several short tutorials/bootcamps will be held to get people up to speed on topics of interest.
      • Paradis will hold a tutorial on Sweave and vignette writing.
      • Bolker will hold a tutorial on S4 classes and methods.
      • Several expressed interest in a bootcamp on SVN or other version control software.
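
The need noted under item 6 for linking trait data to trees, and for checking those links for errors, can be illustrated with a small sketch. It is written in Python only to keep it self-contained and is not R code from any participating package. The function name `check_trait_links` and the dictionary used as a trait table are assumptions made for this example.

```python
from typing import Dict, List


def check_trait_links(tip_labels: List[str], traits: Dict[str, float]) -> Dict[str, List[str]]:
    """Compare tree tip labels with the taxa named in a trait table.

    Hypothetical helper: reports tips with no trait value, trait rows that
    match no tip, and tip labels that occur more than once in the tree.
    """
    tips = set(tip_labels)
    taxa = set(traits)
    duplicated = sorted({t for t in tip_labels if tip_labels.count(t) > 1})
    return {
        "tips_without_data": sorted(tips - taxa),
        "data_without_tips": sorted(taxa - tips),
        "duplicated_tips": duplicated,
    }


# Made-up example data: species "D" was measured but is not in the tree,
# and tip "C" has no measurement.
report = check_trait_links(["A", "B", "C"], {"A": 1.2, "B": 0.7, "D": 3.1})
print(report)
```

A check of this kind could run whenever a trait table is attached to a tree, whether the traits end up embedded in the tree object or kept alongside it.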

Action Items

  1. Everyone prepare for short introduction of themselves and their goals at the beginning of the hackathon.
  2. Package owners outline package functionality, goals, future directions, data structures on the wiki. Package owners also prepare a short overview of their package's goals/future directions at the beginning of the hackathon (<5 minutes, no slides).
  3. Discuss/prioritize specific end-user goals and functionality we most want to achieve.
  4. Start a discussion of new S3 and S4 data class proposals on the wiki, learn about S4 (resources on wiki).
  5. Convert test datasets and trees to package internal representations. Package owners document data representations on wiki.
  6. Look into ways to improve reading/writing of external data formats such as Nexus.
  7. Identify potential datasets to be used for benchmarking and comparisons, upload to wiki or WebDAV.
  8. Send help files for functions not already available here to Brian O'Meara.
  9. Learn how to write documentation, including vignettes, for R. This may include learning some LaTeX, though there are various programs to make using this easier.
  10. There will be vignette/documentation, S4, and SVN bootcamps. Identify any other potential bootcamp topics and prepare bootcamps.
  11. Package owners contact organizers to set up version control for code not yet under version control, or copy code snapshot from existing version control software if the repository isn't public.
  12. Decide which public source code repository to use (rforge? r-forge? sourceforge? Google Code?).
  13. Contact Wayne Maddison with your ideas for priorities for connections between Mesquite and R (what are the most important R functions people would like to be able to call from Mesquite?).

1st Teleconference 11/16/2007

Please note the planning steps page for additional information on specific items.

Agenda

Note for participants: all of the items following the first two are topics posed for discussion, rather than decisions already made. We encourage you to voice any and all feedback, including that pursuing the topic is not desirable (if that is what you feel).

  1. Welcome, review of purpose and general objectives (TJV)
  2. Outline of planning schedule ahead (HL)
  3. Introductions (everyone)
  4. Priorities and subgroups (SK, BCO) (5mins)
  5. Preparations by participants (HL) (25mins)
    • Presentations
      • Lightning (< 5-10mins) talks - purpose(s), topics, and presenters
      • Needs for full-length (> 15min) talks?
      • Brainstorming current or future challenges - useful topics, presenters?
    • Bootcamps - purpose, needs, and presenters
    • Compiling package-specific information on the wiki - e.g., overview, relevant programming info, future goals
    • Tabulating metadata across packages (as started by Brian O'Meara) - methods, supported formats and analysis methods, visualization capabilities
    • Describing internal representation of data from test files in each package on the wiki
    • Reading list - suggestions (both for possible gaps and for recommended reading)
    • Other preparations we or others can facilitate or help with?
  6. Documentation & testing (SP, SK) (5mins)
    • R documentation and vignette writing - who has experience, and would be willing to help train the users? Examples to draw from?
    • Collection of data for testing and validation (such as tree files for testing)
  7. Source code repository (or repositories) (HL) (5mins)
    • Survey of current source code repository and versioning setup for participating packages
    • Assessing need for help with publicly accessible repository
    • Assessing need(s) for NESCent-run repository
    • Code branching needs
    • Other source code-related preparation needs
  8. IT logistics (HL)
    • Computers - will we need loaners?
    • Network access - will anyone need wired network?
    • File share needs
  9. Other homework? (5mins)
  10. Q & A (10mins)

Attending (this might not be a complete list; please add yourself if you have been missed, or remove yourself if you have been added erroneously):

Michael Alfaro, Marguerite Butler, Ben Bolker, Richard Desper, Joe Felsenstein, Luke Harmon, Andrew Hipp, Gene Hunt, Steve Kembel, Damien de Vienne, Wayne Maddison, Peter Midford, Brian O’Meara, Emmanuel Paradis, Samantha Price, Brian Sidlauskas, Stacey Smith, Peter Waddell, Todd Vision

Synopsis

1. Introduction - why are we here?

To agree on goals for the hackathon - get into meaningful subgroups - interact. There will be at least one more teleconference to finalise goals before the hackathon.

2. Priorities and subgroups

  • Data representation - Standardizing data formats in R will be quite useful. It would be good to sketch out what this might look like beforehand: rooting, branch lengths, labels, data at tips and nodes, etc. Start this off over the email list and wiki – please collaboratively edit.
  • Sub-groups – please follow the link to create new groups, move yourself etc.
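
As a starting point for that discussion, the following is a minimal, hypothetical illustration of the fields a tree representation would need to carry for the items listed above: an edge table, optional branch lengths, tip and node labels, and a rooting flag. It is written in Python rather than R purely to show the shape of the data. The class name `TreeSketch` and all field names are assumptions made for this example and are not taken from any existing package.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TreeSketch:
    """Hypothetical container for a tree (not a real package API)."""
    # Each edge is (parent_node, child_node); tips are numbered 1..n_tips,
    # internal nodes are numbered from n_tips + 1 upward.
    edges: List[Tuple[int, int]]
    tip_labels: List[str]
    edge_lengths: Optional[List[float]] = None  # None if there are no branch lengths
    node_labels: Optional[List[str]] = None     # None if internal nodes are unlabelled
    rooted: bool = True

    def n_tips(self) -> int:
        return len(self.tip_labels)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means no problem was found."""
        problems = []
        if self.edge_lengths is not None and len(self.edge_lengths) != len(self.edges):
            problems.append("edge_lengths must have one entry per edge")
        children = [child for _, child in self.edges]
        if len(set(children)) != len(children):
            problems.append("a node appears as the child of more than one edge")
        missing = set(range(1, self.n_tips() + 1)) - set(children)
        if missing:
            problems.append(f"tips never reached by an edge: {sorted(missing)}")
        return problems


# The tree ((A:1,B:1):1,C:2); with tips 1..3, root 4, and one internal node 5.
tree = TreeSketch(
    edges=[(4, 5), (5, 1), (5, 2), (4, 3)],
    tip_labels=["A", "B", "C"],
    edge_lengths=[1.0, 1.0, 1.0, 2.0],
)
print(tree.validate())
```

Data at tips and nodes is deliberately left out of this sketch, since whether it belongs inside the tree object is one of the questions to settle.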

3. Preparations by participants

  • Lightning talks - is there a need for quick talks at the start of the hackathon to let everyone know what you are working on? It was generally agreed that, in order to minimize the time spent on presentations, everyone should aim to get as much of this information as possible onto the wiki. To that end, we have set up a wiki page intended for everyone to populate with overviews of each package, in particular documenting future directions and known problems. The decision of whether to do any lightning talks will be deferred until the next conference call.
  • Brainstorming sessions - Do we need them? A suggestion was made to have an optional session that would include some of the remote participants. No final decision made.
  • Bootcamps - intensive tutorials. Are there technical pieces of knowledge that people need, or do you know everything that you need to know before you arrive? It was generally agreed that we might need bootcamps on:
    • documentation and vignette writing, including Sweave
    • programming S4
  • In addition, we may also need bootcamps on:
    • version control?
    • interacting with Mesquite (from within R)?
    • interacting with CIPRES (from within R)?
These need to be discussed on the wiki.

4. Documentation and testing

There will be several 'end-users' at the hackathon (Michael Alfaro, Samantha Price, Brian Sidlauskas, Stacey Smith, Amy Zanne) who have familiarity with comparative methods in R and varying degrees of coding experience - they will be interacting with the programmers to help write documentation and test code. We want to work out the best way of utilising their talents - is it writing vignettes, writing documentation, or testing? What homework will the end-users have to do to achieve these aims? Please discuss these issues here.

5. Source code repository (or repositories)

Do existing packages have repositories? Does anyone use Google Code or SourceForge? No-one is currently using them, but is it a good idea to start?

6. Misc. discussion

  • Are people interested in getting R to talk to CIPRES? King and Butler are interested - anyone else?
  • Can Mesquite (Java) serve as a calculation engine for missing info in R? It was generally agreed by the end-users that this would be useful. Wayne Maddison would like to explore this in preparation for the hackathon – please use the mailing list to let him know if you are interested in working with him on it.
  • Sub-groups - do we need to include someone who can work on the post-Nexus formats (e.g. Rutger Vos)?

7. IT logistics

  • Loaner computers - please let us know if you need one
  • Shared file space - we will set up a WebDAV so that we can share documents (separate from the source code repositories and subversion etc.).

Action Items

  1. begin work on crafting data standards: what do we want, what formats should we consider
  2. convert test set of sample trees and datasets to package internal representation for each.
  3. write package overviews, including future goals
  4. send help files for functions not already available here to Brian O'Meara
  5. learn how to write documentation, including vignettes, for R. This may include learning LaTeX, though there are various programs to make using this easier.
  6. start emailing each other for discussions
  7. miscellaneous:
    • investigate R->Java (Mesquite) process (see rJava)
    • look into linking to CIPRES, perhaps inviting additional participants for this
    • start prioritizing which new methods to add
    • think about source code repositories (rforge? r-forge? sourceforge? Google Code?).