Phylohackathon 1/Proposal

An agreed-upon proposal was developed by Todd Vision, Hilmar Lapp, Arlin Stoltzfus, Aaron J. Mackey, Mark Holder, and Rutger Vos through teleconferences and discussion on a mailing list.

Introduction

NESCent is interested in sponsoring hackathons in order to promote the following objectives:

  • interoperability of software across disciplines or domains
  • development and adoption of standards for data exchange
  • greater usability of evolutionary informatics tools (software, databases)

In keeping with these objectives, NESCent proposes a hackathon for late 2006 to work on support for execution and intercommunication of phylogenetic analysis software within the Bio* framework.

Suggestions

The following suggestions were submitted in response to the action item resulting from the Aug 2, 2006, conference call.

Below, we synthesize a proposal based on these suggestions that is in keeping with NESCent's mission.

Proposal

Goal

The goal of the proposed hackathon is to create interfaces (APIs) and implementations in widely used toolkit libraries so that by using those toolkits, researchers and developers of analysis pipelines can effectively execute important evolutionary analysis programs, pass input to those programs, parse their output back into the toolkit's object model, and possibly store the results of that parse for later retrieval.

More specifically, we choose the Bio* framework of toolkits (BioPerl, Biojava, Biopython, Bioruby, Bioprolog), which includes the most widely used programming toolkits in the life sciences in several programming languages. The Bio* framework also already defines BioSQL as its interoperable persistence model and interface for storing objects.

Background

As an example, BioPerl, the most advanced toolkit for 'gluing' together analysis programs into pipelines or workflows, defines various wrappers around common sequence analysis programs in the Bio::Tools::Run hierarchy (the 'bioperl-run' package). For programmers of sequence analysis workflows this framework has been very useful, as it provides programmatic access to the command-line parameters, input data, and output of the program to be executed through an object-oriented API that is consistent with the rest of the toolkit. For example, sequences or alignments are provided through setter or getter methods as Bio::SeqI or Bio::SimpleAlign objects. The program is executed through

$run_instance->run()

Parameters, as well as input and output objects, are (or should be) documented in the "Synopsis" section of the module's documentation.
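
The wrapper pattern described above (set parameters and input through an object, call run(), and get parsed output back) can be sketched independently of any one toolkit. The following is a minimal illustration in Python and is not BioPerl's Bio::Tools::Run API: the class name ProgramRunner and its methods set_param, build_command, and parse are hypothetical, and the Python interpreter itself stands in for an analysis program so that the example runs without any external tool installed.

```python
import subprocess
import sys


class ProgramRunner:
    """Hypothetical wrapper: holds parameters and input, runs a program, parses its output."""

    def __init__(self, executable, base_args=()):
        self.executable = executable
        self.base_args = list(base_args)
        self.params = {}
        self.input_text = ""

    def set_param(self, name, value):
        # Parameters are set through a method rather than by editing a command string.
        self.params[name] = value

    def build_command(self):
        # Turn the stored parameters into "--name value" command-line arguments.
        cmd = [self.executable, *self.base_args]
        for name, value in sorted(self.params.items()):
            cmd += [f"--{name}", str(value)]
        return cmd

    def parse(self, raw_output):
        # A real wrapper would build toolkit objects (trees, alignments) here;
        # this sketch only splits the output into non-empty lines.
        return [line for line in raw_output.splitlines() if line]

    def run(self):
        # Feed the input on stdin, capture stdout, and fail loudly on a non-zero exit.
        result = subprocess.run(
            self.build_command(),
            input=self.input_text,
            capture_output=True,
            text=True,
            check=True,
        )
        return self.parse(result.stdout)


if __name__ == "__main__":
    # Stand-in "analysis program": upper-cases whatever it reads from stdin.
    script = "import sys; print(sys.stdin.read().upper())"
    runner = ProgramRunner(sys.executable, ["-c", script])
    runner.set_param("model", "JC")  # hypothetical parameter, ignored by the stand-in
    runner.input_text = ">seq1\nacgt\n"
    print(runner.run())
```

In a real wrapper, a subclass per program would override build_command and parse, so that callers only ever see the toolkit's own objects.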

Notably, as can be seen in the package documentation, support for phylogenetic analysis programs, let alone other evolutionary analysis programs, is largely absent.

Similarly, while storing sequences, features, and annotation in BioSQL, the Bio* interoperable persistence layer, is well supported in most of the toolkits, storing phylogenetic trees is not supported in any; in fact it has not even been demonstrated that the BioSQL relational model can accommodate the data structure.
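
To make concrete what storing a tree relationally could involve, here is a minimal sketch using an adjacency-list layout in SQLite. It is not the BioSQL schema and does not settle whether BioSQL can accommodate trees: the tables tree and node, their columns, and the helper functions store_tree and leaves_below are all hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical schema: each node row points at its parent (NULL for the root).
SCHEMA = """
CREATE TABLE tree (tree_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE node (
    node_id       INTEGER PRIMARY KEY,
    tree_id       INTEGER NOT NULL REFERENCES tree(tree_id),
    parent_id     INTEGER REFERENCES node(node_id),
    label         TEXT,
    branch_length REAL
);
"""


def store_tree(conn, name, root):
    """Store a tree given as nested (label, branch_length, children) tuples; return its tree_id."""
    tree_id = conn.execute("INSERT INTO tree (name) VALUES (?)", (name,)).lastrowid

    def insert(node, parent_id):
        label, length, children = node
        node_id = conn.execute(
            "INSERT INTO node (tree_id, parent_id, label, branch_length) VALUES (?, ?, ?, ?)",
            (tree_id, parent_id, label, length),
        ).lastrowid
        for child in children:
            insert(child, node_id)

    insert(root, None)
    return tree_id


def leaves_below(conn, node_id):
    """Return the sorted labels of all leaves in the clade rooted at node_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE clade(node_id) AS (
            SELECT ?
            UNION ALL
            SELECT n.node_id FROM node n JOIN clade c ON n.parent_id = c.node_id
        )
        SELECT n.label FROM node n JOIN clade c ON n.node_id = c.node_id
        WHERE NOT EXISTS (SELECT 1 FROM node k WHERE k.parent_id = n.node_id)
        ORDER BY n.label
        """,
        (node_id,),
    ).fetchall()
    return [label for (label,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Toy tree equivalent to (A:0.1,(B:0.3,C:0.4):0.2);
    example = (None, None, [("A", 0.1, []), (None, 0.2, [("B", 0.3, []), ("C", 0.4, [])])])
    tree_id = store_tree(conn, "example", example)
    root_id = conn.execute(
        "SELECT node_id FROM node WHERE tree_id = ? AND parent_id IS NULL", (tree_id,)
    ).fetchone()[0]
    print(leaves_below(conn, root_id))
```

An adjacency list keeps inserts simple but relies on recursive queries for clade lookups; other encodings, such as nested sets, trade this the other way and would be a design decision for the subgroup.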

Organization of the Hackathon

We propose to organize the hackathon around subgroups of 3-5 participants each, with no more than 20-25 participants in total. Each subgroup would be tasked with attacking a problem in one of the toolkit libraries that has been identified as being in line with the goals of the hackathon.

Scope of problems to be attacked

Each problem should be scoped narrowly enough that the subgroup attacking it can make significant progress on solving it over the duration of the hackathon. Significant progress means that all API or model (object or relational) definitions have been completed, and that reference implementations either are in place or can be easily adapted from similar methods.

The idea is that those parts that are intellectually most challenging and benefit most from focused time and effort, as well as from frequent and direct interactions, are completed at the meeting itself, whereas the work that continues after the conclusion of the hackathon is limited to testing, fixes, improving documentation, changing hard-coded parameters to options, and filling in missing methods by copying and pasting from similar reference implementations.

Use cases and the selection of problems

There are different approaches to selecting the problems and thereby defining the subgroups, and each will cause a different bias in and benefit from the selection.

  • Participant-elected: participants are nominated based on their expertise in the targeted toolkits or projects and elect the problem they want to work on. This approach typically provides for the most enthusiasm of the participating developers. However, community impact of the solutions is not guaranteed, let alone maximized.
  • Community-elected: the community of users (i.e., phylogeneticists and evolutionary biologists) is solicited to submit (and prioritize) problems to be addressed; participants are nominated by the user community or selected based on their commitment to solve the submitted problems. The impact and enthusiasm this approach generates hinges on the intuition, foresight, and involvement of the submitters. It tends to suffer from the fact that 10 solicited researchers often have 11 different opinions, none of which may have intrinsic generality.
  • Use case-driven: problems are defined by a gap analysis of use cases identified a priori as being important (e.g., due to the frequency with which they are encountered, or the 'hotness' factor of the research they enable). While the problems selected by this approach are not necessarily the 'coolest', their solution by definition maximizes impact for the end user, provided the driving use cases were well chosen. This in turn will foster recognition and provide reward for the participating developers.

Here, we propose an approach that is primarily use case-driven. The organizers will identify and develop the use cases a priori. The use cases, as well as the problems based on them, will be subjected to feedback from both the user and the developer communities. It should be noted that although the two communities are not identical, many developers of relevant source code are also researchers.

Subgroups and tasks

We propose the following subgroups and tasks. Participants will be nominated and contacted by/through the respective developer communities, and by the organizers themselves. Self-nominations of participants are welcome.

  1. Bioperl: complete the support of phylogenetic and evolutionary analysis programs in Bio::Tools::Run
    • List of tools to be supported: to be identified
    • Participants: T.B.D.
  2. BioSQL: demonstrate that the model encompasses phylogenetic trees (and character matrices too?). If necessary extend the model to reach this goal. Add object-relational mapping for Bioperl Bio::Tree::TreeI and Bio::Tree::NodeI objects to Bioperl-db.
    • Participants: T.B.D.
  3. Biopython: establish object model for phylogenetic trees (there seems to be a NEXUS parser with associated classes?). Establish interface for executing external analysis programs (using Bio.Application as a basis? is there something more appropriate?). Create implementations for the same set of programs as for Bioperl.
    • Participants: T.B.D.
  4. Biojava: create object model for phylogenetic trees (there seems to be no specific support yet except for the org.biojava.bio.taxa package). Establish interface for executing external analysis programs (based on the org.biojava.bio.program package and sub-packages). Create implementations for the same set of programs as for Bioperl.
    • Participants: T.B.D.
  5. Bioruby: create object model for phylogenetic trees (there doesn't seem to be any yet). Establish interface for executing external analysis programs (this may not be needed). Create implementations for the same set of programs as for Bioperl (there is no support yet for phylogenetic analysis programs).
    • Participants: T.B.D.
  6. Blipkit: create support for phylogenetic trees (what does this equate to in Prolog? use the phylo_db module as a basis? There is also a New Hampshire format parser.). Establish interface for executing external analysis programs and create implementations for the same set of programs as for Bioperl.
    • Participants: T.B.D.
  7. CIPRES/BioCORBA: create proof-of-concept for utilizing Bioperl's Bio::Tree::Tree/Node model (and the ability to programmatically execute phylogenetic analysis programs?) from within programs written using the CIPRES framework through a CORBA bridge.
    • Participants: T.B.D.
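
Several of the tasks above call for an object model for phylogenetic trees. As a rough indication of the minimum such a model involves, the following Python sketch defines a node class and a parser for simple New Hampshire (Newick) strings. It does not reproduce any existing Bio* class: the names Node and parse_newick are hypothetical, and the parser assumes unquoted labels and no comments.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a label, a branch length, and child nodes."""
    label: Optional[str] = None
    branch_length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children

    def leaf_labels(self):
        # Collect leaf labels in left-to-right order.
        if self.is_leaf():
            return [self.label]
        return [lab for child in self.children for lab in child.leaf_labels()]


def parse_newick(text):
    """Parse a simple Newick string (unquoted labels, optional branch lengths)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # consume '('
            node.children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                node.children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional label, terminated by any structural character.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        node.label = text[start:pos].strip() or None
        # Optional branch length after ':'.
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node.branch_length = float(text[start:pos])
        return node

    root = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return root


if __name__ == "__main__":
    tree = parse_newick("(A:0.1,(B:0.3,C:0.4):0.2);")
    print(tree.leaf_labels())
```

A production model would also need to cover rooting, node annotations such as support values, and quoted labels, which this sketch leaves out.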

There are additional possibilities for subgroups, some of which may be chosen to replace subgroups in the initial proposal above.

Planning Steps