Phylogenetic Footprinting Documentation

Introduction

This Phyloinformatics Hackathon page describes how to discover and characterize conserved sequence features shared among related genomes or genomic sequences. Footprinting and shadowing are similar, but they typically differ in the number of sequences analyzed and in how closely related those sequences are.

Phylogenetic Footprinting

Phylogenetic footprinting seeks to find regulatory sequences or specific features of non-coding DNA by analyzing DNA within and around aligned, orthologous genes. The assumption is that mutation is better tolerated outside of critical sequence regions, so sequences conserved between species are likely to be critical. The species being studied need to be sufficiently diverged that mutation has acted appreciably, yet sufficiently related that the relationships between RNA- or DNA-binding proteins and their corresponding bound sequences have not significantly changed.

Phylogenetic Shadowing

Phylogenetic shadowing differs in that the group of species in question ideally contains both closely related and more divergent species. The idea is that by combining data from various pairwise comparisons one should observe enough mutation that conserved regions can be identified even if there is no exact alignment between the most divergent pairs. Thus part of the result from this type of analysis is a shadow, a conserved region shared by all species that lacks base-pair definition.

Applications

Computational Steps

  1. Identify and extract orthologous sequences through genome synteny or by some other means.
  2. Align orthologous sequences.
  3. Obtain a tree corresponding to the alignment.
  4. Calculate total tree length of the alignment on the tree.
  5. Implement the tree-length calculation in a sliding-window fashion.
  6. Identify regions that significantly deviate from the average tree-length.
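Steps 4 to 6 above can be sketched in plain Perl without BioPerl. This is only an illustration under stated assumptions: it uses the Fitch parsimony count (minimum number of changes on a fixed tree) as a stand-in for tree length, it treats only windows shorter than average as conserved, and the toy alignment, the tree, and the names fitch, window_lengths, and conserved_windows are all hypothetical. It is not how the Clustalw-based method described later is implemented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fitch small parsimony for one alignment column on a rooted binary tree.
# $node is a taxon name (leaf) or a two-element array ref (internal node).
# Returns the node's state set (hash ref) and the number of changes below it.
sub fitch {
    my ($node, $aln, $col) = @_;
    if (!ref $node) {
        my %set = (substr($aln->{$node}, $col, 1) => 1);
        return (\%set, 0);
    }
    my ($left,  $lcost) = fitch($node->[0], $aln, $col);
    my ($right, $rcost) = fitch($node->[1], $aln, $col);
    my %shared = map { $_ => 1 } grep { $right->{$_} } keys %$left;
    return (\%shared, $lcost + $rcost) if %shared;
    my %union = (%$left, %$right);
    return (\%union, $lcost + $rcost + 1);    # disjoint sets: one more change
}

# Steps 4-5: per-column lengths, then the total for each sliding window.
sub window_lengths {
    my ($aln, $tree, $window) = @_;
    my ($first) = values %$aln;
    my @cost = map { (fitch($tree, $aln, $_))[1] } 0 .. length($first) - 1;
    my @lengths;
    for my $start (0 .. @cost - $window) {
        my $sum = 0;
        $sum += $_ for @cost[$start .. $start + $window - 1];
        push @lengths, $sum;
    }
    return @lengths;
}

# Step 6: start indices of windows at least $pct percent shorter than the mean.
sub conserved_windows {
    my ($pct, @lengths) = @_;
    return () unless @lengths;
    my $mean = 0;
    $mean += $_ for @lengths;
    $mean /= scalar @lengths;
    my $cutoff = $mean * (1 - $pct / 100);
    return grep { $lengths[$_] <= $cutoff } 0 .. $#lengths;
}

# Hypothetical gap-free alignment with a conserved block in columns 4-7.
my %aln = (
    human => 'AAAATTTTAAAA',
    chimp => 'AAAATTTTGGGG',
    mouse => 'CCCCTTTTCCCC',
    rat   => 'CCCCTTTTTTTT',
);
my $tree = [ [ 'human', 'chimp' ], [ 'mouse', 'rat' ] ];

my @lengths = window_lengths(\%aln, $tree, 4);
my @hits    = conserved_windows(50, @lengths);
print "window lengths: @lengths\n";
print "conserved window starts (0-based): @hits\n";
```

With this toy input, the windows overlapping the invariant block (starts 2, 3, and 4) fall below the cutoff. A real analysis would need to handle gaps and would normally use branch lengths estimated from the data instead of a parsimony count.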

Hackathon Contributions

BioPerl

These steps can be carried out with Clustalw. The BioPerl group has provided a new method, footprint(), in the Bio::Tools::Run::Alignment::Clustalw module of the bioperl-run package; it can do either phylogenetic footprinting or phylogenetic shadowing. The BioPerl group has also fixed the run() method in FootPrinter.pm, also in bioperl-run, so that module can now be used for phylogenetic footprinting as well.

Examples of both approaches are shown below. You'll first need to install BioPerl, bioperl-run, and clustalw.

Download BioPerl and bioperl-run here:

http://www.bioperl.org/wiki/Getting_BioPerl

There are INSTALL files inside both packages, or follow the instructions here:

http://www.bioperl.org/wiki/Installing_BioPerl

Download clustalw here:

http://www.ebi.ac.uk/clustalw/

The installation details are in the package.

Using the Clustalw module

The Bio::Tools::Run::Alignment::Clustalw module is found in the bioperl-run package. There are two ways of using its footprint() method. One way is to use a tree file as input; these are the *.dnd files that Clustalw creates. For example:

<perl>
#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;
use Bio::Tools::Run::Alignment::Clustalw;

my $factory = Bio::Tools::Run::Alignment::Clustalw->new;

# Tree file written by Clustalw
my $treefilename = "test.dnd";
my $window_size  = 10;    # size of the sliding window
my $diff         = 10;    # % difference from the total tree length

my @alns = $factory->footprint($treefilename, $window_size, $diff);
</perl>

The second way is to use an array of Sequence objects as input:

<perl>
#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;
use Bio::Tools::Run::Alignment::Clustalw;

# Read all sequences from the input file into an array
my @seqarray;
my $in = Bio::SeqIO->new(-file => "test.fa");
while (my $seq = $in->next_seq) {
    push @seqarray, $seq;
}

my @params  = ('ktuple' => 2);
my $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);

my $window_size = 10;    # size of the sliding window
my $diff        = 10;    # % difference from the total tree length

my @alns = $factory->footprint(\@seqarray, $window_size, $diff);
</perl>

Used this way, the code runs Clustalw and aligns all the supplied sequences.

With either form of input, the method slices out of the alignment those regions whose tree length differs significantly from the overall average tree length. It returns an array of SimpleAlign objects.

There are two optional arguments:

  • The 2nd argument specifies the size of the sliding window (default 5 bp).
  • The 3rd argument specifies the % difference from the total tree length needed for a window to be considered a footprint (default 33).

Using the FootPrinter module

You can also use BioPerl to run the FootPrinter application. First install BioPerl and bioperl-run (directions above), then download and install FootPrinter:

http://www.mcb.mcgill.ca/~blanchem/FootPrinter2.1.tar.gz

Note that you must set the environment variable FOOTPRINTER_DIR. For example, in csh or tcsh:

 >setenv FOOTPRINTER_DIR /usr/local/bin/

In sh or bash, the equivalent is:

 >export FOOTPRINTER_DIR=/usr/local/bin/

You can test your installation using the files "tree_of_life" and "example.fasta" included in the distribution.

Here is example code showing how to use Bio::Tools::Run::FootPrinter using these same files as input:

<perl>
#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;
use Bio::Tools::Run::FootPrinter;

my @params = ( html => 1, tree => "tree_of_life" );

my $fp = Bio::Tools::Run::FootPrinter->new(@params, -verbose => 1);

# Read the input sequences
my $sio = Bio::SeqIO->new(-file => "example.fasta");
my @seq;
while (my $seq = $sio->next_seq) {
    push @seq, $seq;
}

my @results = $fp->run(@seq);

# Print each footprint as start, end, and sequence
foreach my $result (@results) {
    print "***************\n" . $result->seq_id . "\n";
    foreach my $feature ($result->sub_SeqFeature) {
        print $feature->start . "\t" . $feature->end . "\t" . $feature->seq->seq . "\n";
    }
}
</perl>

Note that the FootPrinter results are stored in what Bioperl terms features. In this context you can simply think of a feature as a sequence with a location. All the features are described with start and end positions located on the input sequences.
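To make the coordinate convention concrete: BioPerl feature locations are 1-based and inclusive, so a feature from 5 to 10 covers six bases. The following is a minimal sketch in plain Perl that needs no BioPerl; the sequence, the hash layout, and the helper name feature_seq are made up for illustration and are not part of the FootPrinter module.

```perl
use strict;
use warnings;

# Return the subsequence covered by a feature with 1-based, inclusive
# start/end coordinates (the convention BioPerl locations use).
sub feature_seq {
    my ($seq, $feature) = @_;
    my ($start, $end) = @{$feature}{qw(start end)};
    die "bad coordinates\n"
        if $start < 1 || $end < $start || $end > length($seq);
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq      = 'ACGTTTGACCAGTA';    # hypothetical input sequence
my @features = ({ start => 5, end => 10 }, { start => 1, end => 4 });

# Same tab-separated layout as the FootPrinter example: start, end, sequence
for my $f (@features) {
    print join("\t", $f->{start}, $f->{end}, feature_seq($seq, $f)), "\n";
}
```

Here the first feature yields TTGACC and the second yields ACGT.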

Many parameters are available; see the documentation included with FootPrinter (manual.html) for details. To use a given set of parameters from within your script, put them in the @params array. For example:

<perl>
my @params = (
    size                     => 10,
    max_mutations_per_branch => 4,
    sequence_type            => 'upstream',
    subregion_size           => 30,
    position_change_cost     => 5,
    triplet_filtering        => 1,
    pair_filtering           => 1,
    post_filtering           => 1,
    inversion_cost           => 1,
    max_mutations            => 4,
    html                     => 1,
    tree                     => "tree_of_life"
);
</perl>

BioRuby

The BioRuby group addressed this use case by adding functionality to calculate total tree length to the Bio::Tree class.
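Total tree length is taken here to mean the sum of all branch lengths in the tree. As a rough illustration of that calculation, written in Perl to match the examples above and not using the BioRuby API, the sketch below sums the branch lengths in a Newick string such as the contents of a *.dnd file. The tree and the function name total_tree_length are hypothetical, and the regular expression assumes that taxon labels contain no colons.

```perl
use strict;
use warnings;

# Sum every ":<number>" branch length found in a Newick string.
# Assumes labels contain no colons; negative lengths are summed as given.
sub total_tree_length {
    my ($newick) = @_;
    my $total = 0;
    while ($newick =~ /:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)/g) {
        $total += $1;
    }
    return $total;
}

my $newick = "(human:0.1,chimp:0.2,(mouse:0.3,rat:0.4):0.5);";    # hypothetical tree
printf "total tree length: %.2f\n", total_tree_length($newick);
```

For the example tree this prints a total of 1.50. A proper tree parser is the safer choice for real data, since quoted labels or comments in a Newick file would confuse a regular expression.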