Difference between revisions of "Family Alignment Documentation"

From Phyloinformatics
Jump to: navigation, search
m (BioPerl & HyPhy)
Line 11: Line 11:
==User Stories==
==User Stories==
===Analysis of Gene & Protein Families from Fungal Genomes===
===Analysis of Gene
Based on the work of Jason Stajich (see http://fungal.genome.duke.edu/ and [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1679813 A Fungal Phylogeny based on 42 Genomes]). This user story starts with a large number of genomes as nucleotide sequence, and the aim is to identify orthologous protein sequences in those genomes and cluster them into protein families. The general problem and a solution is described in [[Media:Stajich_UserStory.pdf|this presentation]].
====Genome annotation Applications====
* Prediction programs
** [http://selab.wustl.edu/cgi-bin/selab.pl?mode=software#trnascan tRNAscan]
** [http://homepage.mac.com/iankorf/ SNAP]
** [http://augustus.gobics.de/ AUGUSTUS]
** [http://www.genezilla.org/ Genezilla]
** [http://www.tigr.org/~salzberg/glimmer.html Glimmer]
** [http://genes.mit.edu/GENSCANinfo.html Genscan]
** GLEAN (A. Mackey and D. Roos, unpublished)
* Similarity and Alignment tools
** [http://www.bx.psu.edu/miller_lab/ BLASTZ]
** [http://www.ncbi.nlm.nih.gov/BLAST/download.shtml BLASTN]
** [http://blast.wustl.edu WU-BLAST]
** [http://www.ebi.ac.uk/~guy/exonerate/ exonerate] (est2genome and protein2genome models)
* [http://www.sanger.ac.uk/Software/Rfam/ RFam]
* [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome Genome Sequences]
====Computational Steps====
[[Jason Stajich]] performed the following general steps:
# Generate probable protein sequence matches to the genomes
#* By running gene prediction software (SNAP, AUGUSTUS, Glimmer, Genscan) on genomic sequence.
#* By aligning protein sequence to genomic sequence using [http://www.ebi.ac.uk/~guy/exonerate/beginner.html exonerate (protein2genome)].
#* By aligning EST sequence to genomic sequence using [http://www.ebi.ac.uk/~guy/exonerate/beginner.html exonerate (est2genome)].
#* By using BLASTN or BLASTZ with appropriate query sequences (e.g. cDNA nucleotide sequences).
# Collect all coordinates of all possible protein-to-genome matches as [http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml GFF].
# Use GLEAN as ''combiner'' to unify predictions and create a set of predicted proteins.
# Use both known and predicted proteins in a FASTA ''all vs. all'' comparison to find predicted orthologs.
# Use both known and predicted proteins with TribeMCL to create protein families.
===Analysis of Silent Substitutions Within a Genome or Across Genomes===
Based on the work of Amy Bouck (http://visionlab.bio.unc.edu/), and described in [[Media:Distr_of_Ks.pdf|her presentation]]. The purpose of this work was to determine the rate of silent substitutions (Ks) amongst paralogs in a given gene family within a single genome. Note that the same techniques could be used to study Ks within a gene family taken from different genomes.
These measurements of Ks are revealing when one studies genomes that are formed by whole genome duplication (WGD, frequently encountered in plant species). Each duplication will result in paralogy, and multiple duplications results in paralogs with different ages over evolutionary time. The distribution of Ks in these paralogs should give us estimates of when the duplications have occured, and studying many gene families simultaneously should increase the confidence of these estimates.
* [http://www.ncbi.nlm.nih.gov/BLAST/download.shtml BLASTN]
* [http://www.ncbi.nlm.nih.gov/BLAST/download.shtml BLASTX]
* [http://www.ebi.ac.uk/research/cgg/tribe/ TribeMCL]
* [http://www.igs.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html T_Coffee]
* [http://www.ebi.ac.uk/Wise2/ Estwise]
* [http://evolution.genetics.washington.edu/phylip.html Phylip]
* [http://evolution.genetics.washington.edu/phylip/doc/retree.html ReTree]
* [http://abacus.gene.ucl.ac.uk/software/paml.html PAML]
* [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene UniGene]
* [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=protein Genbank Protein]
====Computational Steps====
[[Amy Bouck]] performed the following general steps:
# Create ''collection of paralogs'' for a given family and species
## Identify paralog sequences as nucleotide using BLASTN and Unigene
## Use TribeMCL to cluster paralogs
# Create collection of ''in-frame coding sequences'' for a given family and species
## Identify protein-coding sequences in Unigene using BLASTX and Genbank Protein
## Identify best 3 hits to a given protein and a given Unigene
## Align each of these best hits to a given Unigene using Estwise
## Generate in-frame coding sequences
# Use T_Coffee to align ''collection of paralogs'' and ''in-frame coding sequences''
# Estimate Ks for given alignment using PAML
## 2 member gene families are used as direct inputs to PAML
## Larger gene families are analyzed by Phylip to create gene family trees, then rooted by ReTree
## Ks for larger gene families are estimated by PAML
# Visualize and analyze results
==Hackathon Contributions==
The BioPerl group has worked on use case 3.1, sequence family alignment, and has written a set of [http://bioperl.org BioPerl] scripts that can be used for this purpose. Note that these scripts collect and cluster sequences but do not annotate them. There are a number of design decisions, unspecified within the use case, which [[Jason Stajich|Jason]] has addressed in the course of coding:
* The user might be interested in characterizing all possible families in the input sequences or only one given family, exemplified by some single sequence.
* The workflow could be enacted by one script, or one or more scripts in a workflow. In the case of multiple scripts one needs define inputs and outputs carefully such that the scripts could be chained together.
* Any given ''step'', such as aligning at a given point, could be executed by one of a number of different programs (e.g. clustalw or T_Coffee). It would not be easy to allow the user to choose any executable for a given step so some reasonable decisions have been made.
** Example: Tribe will be selected as the application that will be used to cluster sequences into families.
* There are open questions around speed and sensitivity that can't be resolved during the Hackathon. It is not clear that the choice of applications is optimal for a given user's preferences.
The scripts use the following applications in the given order:
# BLAST (m9 option)
# TribeMCL
# Clustalw
# Phylip
In order to enable these scripts the following Bioperl ''wrappers'' needed to be created or updated:
* {{CPAN|Bio::Tools::Run::TribeMCL}}
* {{CPAN|Bio::Tools::Run::Phylo::Phylip::Base}}
* {{CPAN|Bio::Tools::Run::Phylo::PAML::Baseml}}
* {{CPAN|Bio::Tools::Run::Alignment::Clustalw}}
These scripts will be placed in a dedicated directory in the BioPerl distribution.
The BioRuby group contributed to this use case by creating or modifying the BioRuby modules for running and parsing output from the following applications:
* Phylip
* T_Coffee
The BioRuby group also created a new module for reading and writing MSF format, a common multiple alignment format.
===BioPerl & HyPhy===
The following wrapper modules were created that could be used to run HyPhy from within BioPerl and address the family alignment problem:
* {{CPAN|Bio::Tools::Run::Phylo::Hyphy::SLAC}}
* {{CPAN|Bio::Tools::Run::Phylo::Hyphy::FEL}}
* {{CPAN|Bio::Tools::Run::Phylo::Hyphy::REL}}
* {{CPAN|Bio::Tools::Run::Phylo::Hyphy::Modeltest}}
* Bio::Tools::Run::Phylo::Hyphy::GABranch ''In progress''
* Bio::Tools::Run::Phylo::Hyphy::GARD ''In progress''

Revision as of 20:44, 13 June 2007


This Phyloinformatic Hackathon page describes use cases concerning the characterization of sequence families in a set of related organisms. There are many ways to do this so our examples, though robust and proven by practice, are selected from a multitude of possible examples. The focus is on approaches that use a Bio* toolkit ([BioJava, BioPerl, BioPython, BioRuby) since these packages offer the user different workflow possibilities and practical, re-useable code but home-grown solutions are also discussed.

All of our examples are descriptions of bioinformatic workflows or pipelines. A workflow is comprised of a set of analytical applications, performing the analyses themselves, and some set of scripts that hand data to applications and take results from these same applications. One will also frequently encounter some set of critical filters (e.g. as paraphrases, "only write those sequences to a file that match the query sequence with a p < .0001" or "get all nucleotide sequences > 250 bp in length from the input file"), and these filtering steps may be performed by an application or by a script.

The characterized gene family, either as protein or nucleotide, is not necessarily an endpoint but is frequently the starting point for detailed phylogenetic analyses. For example one use case describes the analysis of the rate of silent substitutions within and between families. Another use case describes the task of comparing a species tree to gene family trees from those same species, and how one might resolve discrepancies between those structures (Reconcile Trees Documentation).

We also describe a set of BioPerl scripts created during the Hackathon that can be used for sequence family creation and alignment.

User Stories

===Analysis of Gene