- 2 Aug 06 - Arlin, Aaron, Mark, Hilmar, Todd
Suggestions for a Hackathon
There are many novel, powerful phylogenetic analysis tools (MUSCLE, BAli-Phy, HyPhy, newer PAML & MrBayes models) that have marginal or no programmatic "support" in BioPerl (or any other programming toolkit). By "support" I mean specifically:
1) the ability to fully interact and execute the tool via an (OO) programmatic interface
- local execution via command-line/config-file interface
- remote execution via RPC (web service, CORBA, or otherwise)
- both simple- and advanced-usage parameter handling
- robust error handling
2) an ability to obtain analysis results from the tool in a pre-parsed, object oriented manner, via:
- extended output parsers, robust to format changes, novel output "fields" and expert usage; this would likely benefit from the use of nested/swappable grammars for both efficiency and maintenance/evolvability, which is an approach not yet taken by most BioPerl parsers
- extended object models beyond the traditional singular Sequence, Alignment, and Tree objects, by which analysis parameters and results may be accessed (see #3a below)
3) an ability to "pipeline" tool input and output via common object (and associated serialization format) reuse, which in the field of phyloinformatics I propose to occur via:
- a composite sequences/alignment(s)/tree(s) object that tightly "binds" and manages relationships between components and tool parameters/results (e.g. PAML clade-specific site variation models defined along branches of the actual associated Tree object); this is a reference to the ongoing "Bio::CDAT" project
- a standardization on the use of Nexus-formatted text representations as the "raw" interchange format, until such a time when a new standard emerges (be it phyloXML or otherwise).
- potentially a relational schema extension to BioSQL to provide more structured and indexable storage than raw Nexus files would permit
- extending the Bio::*IO subsystems to target Nexus and/or relational schema(s)
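To make point 3 more concrete, here is a minimal Python sketch of a composite object that binds an alignment to a tree, checks that the two agree, and writes both out as Nexus text. It is only an illustration under stated assumptions: the class name PhyloDataset, its fields, and the hard-coded DATATYPE=DNA are hypothetical, and it is not the Bio::CDAT design or any existing toolkit API. Per-tip annotations are validated but not serialized here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PhyloDataset:
    """Hypothetical composite of aligned sequences, one tree, and annotations."""
    sequences: dict   # taxon label -> aligned sequence
    newick: str       # tree whose tip labels must match the sequence labels
    tip_params: dict = field(default_factory=dict)  # tip label -> {name: value}

    def tree_taxa(self):
        # Tip labels follow "(" or "," in Newick (simple unquoted labels only).
        return re.findall(r"[(,]\s*([^(),:;\s]+)", self.newick)

    def validate(self):
        # The "binding": every component must refer to the same set of taxa.
        if set(self.tree_taxa()) != set(self.sequences):
            raise ValueError("tree tips and sequence labels differ")
        if len({len(s) for s in self.sequences.values()}) != 1:
            raise ValueError("sequences are not aligned to a common length")
        unknown = set(self.tip_params) - set(self.sequences)
        if unknown:
            raise ValueError(f"parameters refer to unknown taxa: {sorted(unknown)}")

    def to_nexus(self, tree_name="tree1"):
        # Serialize alignment and tree into TAXA, CHARACTERS and TREES blocks.
        self.validate()
        taxa = sorted(self.sequences)
        nchar = len(self.sequences[taxa[0]])
        lines = [
            "#NEXUS",
            "BEGIN TAXA;",
            f"  DIMENSIONS NTAX={len(taxa)};",
            "  TAXLABELS " + " ".join(taxa) + ";",
            "END;",
            "BEGIN CHARACTERS;",
            f"  DIMENSIONS NCHAR={nchar};",
            "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",  # assumption: DNA data
            "  MATRIX",
        ]
        lines += [f"    {t} {self.sequences[t]}" for t in taxa]
        lines += [
            "  ;",
            "END;",
            "BEGIN TREES;",
            f"  TREE {tree_name} = {self.newick.rstrip(';')};",
            "END;",
        ]
        return "\n".join(lines) + "\n"
```

Because validation runs before serialization, a tree and an alignment that have drifted apart are rejected before they reach the next tool in a pipeline.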
CORBA and other "service-oriented architectures" (PISE, more generic Web Services, etc.) provide the means for multiple programming toolkits to reuse a common service architecture, but come at a significant burden in installation and usage complexity (which may be appropriate for an institutional-level service-provider installation, but not for the average scientist "hacker"). However, the BioPerl "Bio::Tools::Run" hierarchy of command-line tool "wrappers" already provides the foundation for much of the desired functionality mentioned above, including the ability to optionally invoke remote "web services" where available.
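As a rough illustration of what a command-line "wrapper" with the properties listed under point 1 might look like (local execution, simple and advanced parameter handling, error handling), here is a minimal Python sketch. The names CommandLineTool and ToolError are hypothetical; this is not the Bio::Tools::Run interface, only a generic pattern under those assumptions.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised when the wrapped program cannot be started or exits non-zero."""


class CommandLineTool:
    """Hypothetical generic wrapper; not a BioPerl or other toolkit API."""

    def __init__(self, executable, defaults=None):
        self.executable = executable
        # Simple usage relies on the defaults; advanced usage overrides them.
        self.params = dict(defaults or {})

    def set_params(self, overrides):
        self.params.update(overrides)
        return self

    def build_command(self, *inputs):
        cmd = [self.executable]
        for flag, value in self.params.items():
            cmd.append(flag)
            if value is not None:  # a value of None means a bare switch
                cmd.append(str(value))
        cmd.extend(inputs)
        return cmd

    def run(self, *inputs, timeout=None):
        cmd = self.build_command(*inputs)
        try:
            done = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            raise ToolError(f"could not run {cmd[0]}: {exc}") from exc
        if done.returncode != 0:
            raise ToolError(
                f"{cmd[0]} exited with {done.returncode}: {done.stderr.strip()}")
        return done.stdout
```

For a quick check without any phylogenetics software installed, the Python interpreter itself can stand in for the tool: CommandLineTool(sys.executable, {"-c": "print('ok')"}).run() returns the program's standard output, and a missing executable or a non-zero exit surfaces as a ToolError instead of being silently ignored.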
Therefore, I propose that a NESCent-sponsored hackathon would initially aim to achieve a more thorough instantiation of the existing BioPerl framework (efforts #1 and #2 above) for phyloinformatic use cases, and use the opportunity to brainstorm ideas about #3. This proposal leaves out most other programmatic toolkits (BioJava, BioPython, BioRuby, etc.), but only because I am not aware of the state of phyloinformatic support in those toolkits. However, there is opportunity in this proposal for BioSQL extension work (and thus impact on the other toolkits via the OBDA).
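Effort #2 asks for output parsers that stay usable when a tool adds or renames fields. The following is a minimal Python sketch of one way to get that tolerance, assuming a made-up "name = value" report format. The field names (lnL, kappa, omega) and the helper parse_report are illustrative only, not the output format or API of PAML, BioPerl, or any other tool. Recognized fields are converted to numbers; anything else is kept as raw text rather than causing a failure. It does not attempt the nested/swappable grammar idea.

```python
from dataclasses import dataclass, field

# Fields this sketch knows how to convert; everything else is preserved as text.
KNOWN_FIELDS = {"lnL": float, "kappa": float, "omega": float}


@dataclass
class AnalysisResult:
    values: dict = field(default_factory=dict)  # recognized, typed fields
    extra: dict = field(default_factory=dict)   # novel or unparseable fields


def parse_report(text):
    result = AnalysisResult()
    for line in text.splitlines():
        key, sep, raw = line.partition("=")
        if not sep:
            continue  # banner or free-text line: skip instead of failing
        key, raw = key.strip(), raw.strip()
        convert = KNOWN_FIELDS.get(key)
        if convert is None:
            result.extra[key] = raw  # unknown field: keep it for expert users
            continue
        try:
            result.values[key] = convert(raw)
        except ValueError:
            result.extra[key] = raw  # known field, unexpected formatting
    return result
```

Keeping unrecognized fields in a separate dictionary means a new release of the tool degrades the parsed result gracefully instead of breaking every script that depends on the parser.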
I have three points to make, regarding i) technology choices, ii) the users we are targeting, and iii) our development strategy.
First, I will just reiterate the two main ideas that came up in our telecon:
- develop interfaces to the CIPRES services in BioPerl, BioJava, whatever
- fill holes in evolutionary analysis options for BioPerl
Either way, if we want people to sit down and work on code together, this is going to take some serious preparation (individual reading, and group teleconferences to develop a clear understanding).
Second, I would like to stress the benefits of reaching out to a wider community than just those involved in the tree-of-life project. There are a huge number of applications of evolutionary analysis in genomics, of interest to a much larger group of scientists than just the ones doing the tree of life. Besides, I like to imagine the day when we can say to the anti-evolution forces in the US that genome annotation, personalized medicine and rational drug design are based on phylogenetic methods developed by evolutionary biologists decades ago, and that these methods are better than alternatives because it is more accurate to treat biological entities as things that diverged by an evolutionary process (as opposed to applying generic pattern-finding methods borrowed from computer science, as though biological things just fell out of the sky or were created).
Third, I would like to see us focus our development efforts so that, at the end of the day, we can carry out some specific kind of analysis, beginning with some pre-existing input data and ending with a result. It does not have to be a completely novel analysis, it could be something that has been done dozens of times before (in fact, this is a good place to start). Lately I have been working on a list of evolutionary informatics use-cases here:
This is for the benefit of a group of people interested in BioPerl (we are in the planning stages, with no funding), but there is nothing BioPerl-specific about it. Regardless of the technology (Bio::Run or CIPRES or something else) and regardless of the target audience (tree-of-life or genomics), I still think it is advisable to focus on a specific use case with the goal of applying and testing the software with it.
Of course I need to make these more specific so that we can use them to test software. For instance, for the first one about sequence evolution, I should find a recently published paper that includes an analysis of phylogeny and dN/dS for some sequence family (doesn't matter which), and that has supplementary data files with sequences and so on. Then we can set the goal of reproducing this published analysis using an entirely script-driven approach.
I don't want to come across as apathetic about this, but I do not have very strong opinions about the specific focus or a specific use-case for a hackathon. Not being a Perl programmer, I'm probably the least qualified to comment on the priorities/needs of BioPerl, and I've had a really hard time coming up with a catchy idea for the hack-a-thon. I think improving BioPerl/CIPRES interactions would be a great thing (certainly for CIPRES and quite possibly for both communities). However, CIPRES software has not reached a point of robustness or richness such that we could invite a group of programmers in and expect a week-long programming session focussed entirely on CIPRES to be the fun, fast-paced experience that I think Todd has in mind.
I do like Aaron's suggestions for improving support for using HyPhy and PAML from external scripting languages.
If no one objects I'll be something of a lurker on these discussions. I'll chime in as thoughts occur to me, and I'd like to participate in the hack-a-thon (of course I understand if you choose not to invite me if the hack-a-thon goes in a BioPerl direction and you want to keep the group small).
I couldn't see Arlin's use-cases on the www.molevol.org TWiki because I didn't have a log-in, but I do have one general comment. The idea of tackling distinct, simple use-cases is appealing for several reasons. I think that it is also really important to leave time for pseudocoding or brainstorming about other use-cases. It is so easy to make decisions in one context that don't work at all when you want to add another analysis.
I'm also a little concerned that I may have sounded too negative during our teleconference. If one asks "What could CIPRES give to BioPerl that could not be done by extending BioPerl?" then the answer is "not much right now" (and I think that it is important not to oversell CIPRES).
If the question is "Is CIPRES the best route for incorporating phylogenetic inference tools into a pipeline designed for very high throughput bioinformatics analyses -- a context in which an estimated tree is produced in a second or two?," then again the answer is negative. Researchers who just want to incorporate some of the most obvious effects of phylogenetic history into their analyses will gravitate to cheap-and-dirty tree estimation procedures.
CIPRES' main software goal is building a tool that infers trees for datasets of the size being produced by the tree-of-life groups. We are trying to tackle this by building a library that engages the computational phylogenetics community and makes it easier for them to work collaboratively. Most of our programmers and users care a lot about phylogenies and phylogenetic analyses, so we tend to target algorithms that run for hours or days (hence our inability to provide the actual computations as a general-access web service).
Given that Perl in general is so good at piecing together programs (and that BioPerl is much older and more mature than CIPRES), it sounds a bit depressing for CIPRES if one couches the questions in terms of what CIPRES can deliver to BioPerl (this was how much of our conference call sounded to me). Ultimately, one of the best things that CIPRES could do for BioPerl is broaden its user community by allowing programmers in C, Java, and Python to use BioPerl implementations (with Rutger's code doing the adapting between BioPerl and CIPRES APIs). I think that this is actually a big contribution. I've seen lots of people (myself included) translate implementations from one language to another rather than deal with inter-language RPC. As a result our field has a lot of fragile and partially redundant implementations rather than a few well-tested and heavily used implementations. I think that it will be pretty exciting to see menu items for BioPerl analyses appearing in Mesquite menus.