Family Alignment Documentation

From Phyloinformatics
Revision as of 10:18, 14 December 2006 by (talk) (Analysis of Gene & Protein Families from Fungal Genomes)


These pages describe approaches to characterizing sequence families in a set of related organisms. There are many ways to do this, so our examples, though robust and proven in practice, are only a selection from the many possible approaches. The focus is on approaches that use a Bio* toolkit (BioJava, BioPerl, BioPython, BioRuby), since these packages offer the user different workflow possibilities and practical, reusable code, but home-grown solutions are also discussed.

All of our examples are descriptions of bioinformatic workflows, or pipelines. A workflow comprises a set of analytical applications, which perform the analyses themselves, and a set of scripts that hand data to those applications and collect their results. One will also frequently encounter critical filters (e.g. in English, "only write those sequences to a file that match the query sequence with a p < 0.0001" or "get all nucleotide sequences > 250 bp in length from the input file"); these filtering steps may be performed by an application or by a script.
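As an illustration of a filter performed by a script, the second filter quoted above ("get all nucleotide sequences > 250 bp in length from the input file") can be sketched in a few lines of Python. This is a minimal sketch that uses only the standard library rather than a Bio* toolkit; the function names `parse_fasta` and `longer_than` and the FASTA input format are assumptions made for the example and do not come from the use cases themselves.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is a (header, sequence) pair; this layout is an assumption
# made for the sketch, not a Bio* toolkit data structure.
Record = Tuple[str, str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longer_than(records: Iterable[Record], min_length: int = 250) -> Iterator[Record]:
    """Keep only records whose sequence is strictly longer than min_length."""
    for header, seq in records:
        if len(seq) > min_length:
            yield header, seq


if __name__ == "__main__":
    example = [">short", "ACGT", ">long", "A" * 200, "C" * 100]
    for header, seq in longer_than(parse_fasta(example)):
        print(header, len(seq))
```

Because both functions are generators, the filter can sit between any two steps of a workflow without holding the whole input file in memory. In practice the hand-written parser would normally be replaced by the sequence reader of whichever Bio* toolkit is in use.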

The characterized gene family, whether protein or nucleotide, is not necessarily an endpoint but is frequently the starting point for detailed phylogenetic analyses. For example, one use case describes the analysis of the rate of silent substitutions within and between families. Another use case describes the task of comparing a species tree to gene family trees from those same species, and how one might resolve discrepancies between those structures (Reconcile Trees Documentation).

Analysis of Gene & Protein Families from Fungal Genomes

Based on the work of Jason Stajich (

Jason Stajich is working on use case 3.1, family creation, with a specific focus on family characterization, exclusive of annotation. There are a number of possibilities, unstated within the use case, which Jason will address in the course of coding.

  • The user might be interested in characterizing all possible families in the input sequences or only one given family, exemplified by some single sequence.
  • The workflow could be enacted by a single script or by several scripts run in sequence. In the case of multiple scripts, one needs to define inputs and outputs such that the scripts can be chained together.
  • Any given step, such as alignment, could be executed by one of a number of different programs (e.g. ClustalW or T-Coffee). It would not be easy to allow the user to choose any executable for a given step, so some reasonable decisions need to be made.
    • Example: Tribe will be selected as the application that will be used to cluster sequences into families.
    • Furthermore there are open questions around speed and sensitivity that can't be resolved during the Hackathon. It is not clear that the choice of applications is optimal for a given user's preferences.
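Two of the possibilities listed above, characterizing all families versus a single family exemplified by one sequence, and chaining steps through agreed inputs and outputs, can be sketched as follows. This is a hedged illustration in Python, not the BioPerl scripts described here: the `Families` mapping, the step functions and `run_pipeline` are all hypothetical names, and the clustering step itself (e.g. Tribe) is assumed to have already produced the family assignments.

```python
from functools import partial
from typing import Callable, Dict, List, Optional

# Hypothetical shared data format: family id -> member sequence ids.
# Agreeing on one format for every step is what makes chaining possible.
Families = Dict[str, List[str]]
Step = Callable[[Families], Families]


def drop_singletons(families: Families) -> Families:
    """Remove families that contain only one sequence."""
    return {fid: members for fid, members in families.items() if len(members) > 1}


def select_families(families: Families, seed: Optional[str] = None) -> Families:
    """Return every family, or only the families that contain the seed sequence."""
    if seed is None:
        return dict(families)
    return {fid: members for fid, members in families.items() if seed in members}


def run_pipeline(data: Families, steps: List[Step]) -> Families:
    """Feed the output of each step into the next one."""
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    # Made-up cluster assignments, standing in for the output of a clustering step.
    clusters = {"fam1": ["a", "b"], "fam2": ["c"], "fam3": ["d", "e", "f"]}
    steps = [drop_singletons, partial(select_families, seed="e")]
    print(run_pipeline(clusters, steps))
```

Each step takes and returns the same structure, so steps can be reordered, dropped, or split into separate scripts that exchange the same data through files. Leaving `seed` unset corresponds to the "all possible families" case.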

These scripts will end up as designated "BioPerl scripts" in a dedicated directory in the BioPerl distribution.

Genome Annotation Applications


Analysis of Silent Substitutions Within a Genome or Across Genomes

Based on the work of Amy Bouck (