UseCases

From Phyloinformatics



Introduction

This page describes a variety of use cases conceived prior to the hackathon or presented during the hackathon.

Gap analysis and identification of toolkit-specific targets during the hackathon

Critical Use Cases

On day 1 of the hackathon, we clarified, re-factored, and prioritized the use cases, resulting in a list of five. A request for additional use cases (not mentioned in the original list) resulted in the addition of the NEXUS support case. Here is the set of six use cases chosen as targets for the hackathon:

  1. Family alignment: identify homologues, generate family alignment, evaluate models (3.2, 3.18)
  2. Reconcile trees (3.19) and Determine concordance between two or more phylogenies (3.15)
  3. Phylogenetic footprinting/shadowing (3.10)
  4. Morphological characters: infer tree (3.12) and calculate support values (3.13)
  5. Estimate divergence times (3.17)
  6. NEXUS: lack of support for (and adherence to) standards

Gap Analysis

Then we broke out into toolkit-specific groups (BioPerl, BioJava, BioSQL, BioPython, BioRuby) to carry out a gap analysis. Each group was tasked with identifying opportunities to build support for critical use cases using their toolkit.

See the Targets page for the results.

Critical Use Case Documentation

Phyloinformatics Documentation

Use Cases

2nd ToDo List (11/14/2007 EvoInfo workgroup 2nd meeting)

  • Add at least one test case and reference to each use case
  • Identify interoperability problems by manually testing each case
  • The "moving target" Challenge: old problems are being solved, new questions and methods continuously emerge. How to keep our Use Case list updated?

1st ToDo List:

  • Collect test datasets (Status: an initial group of test datasets found)
  • Re-factor 19 use cases as in the target selection above. It seems these use cases can be re-factored into following "use-case modules":
    • Obtain family tree (pre-requisite for almost all cases; covers cases 1, 2)
    • Calculate evolutionary rates (covers cases 2, 4, 10, 18)
    • Calculate tree congruence (covers cases 9, 15)
    • Comparative analysis using a tree (covers cases 3, 8, 16, 17, 18, 19)
    • Population analysis (covers cases 6 and 7)

Introduction: In developing software, we want results that will be *useful*. So, we start by asking "what does the user want to *do*?", with the aim of identifying specific problems in the domain of interest, ranging from every-day chores to complex tasks that cannot be accomplished with existing software. From studying these problems, sometimes called "use cases", we identify design criteria, then consider how to build a system. If the process works, we will end up with software that solves the problems posed by the use cases.

Here, the domain is evolutionary analysis, and our focus is not to support end-users directly (by building applications software), but to support them indirectly by providing a library. Below it is assumed that the developer and the analyst are the same person, i.e., the "user" writes software to facilitate an evolutionary analysis task, then applies this software to the specific task.

Family alignment: identify homologues, generate family alignment

  • Background: This is a core step in many analyses, although there is a school of thought that alignment and tree inference are inseparable.
  • Key challenges/Problems identified:
    • coordinate data on sequences (aa and nt)
    • allow flexibility in search and alignment (e.g. gap parameters)
    • filter data (e.g., restrict to certain taxa groups or remove divergent sequences)
    • remove paralogs from datasets generated from blasting genomes ("phylome")
    • standardize taxa names/ids as a compromise between human readability and format restrictions. Many datasets use uninformative abbreviations, inconsistent fields among OTUs, long names that would be corrupted during subsequent alignment and phylogeny reconstruction, and machine-dangerous characters such as spaces and punctuation, and they provide no information to trace the data source (e.g., GenBank accession)
  • Preconditions: user has a query sequence representing a protein-coding gene.
  • Steps:
    • Method 1. User downloads a pre-computed sequence family from a database. Example: Pfam PF00046.
    • Method 2. User searches a database using a sequence profile. Example: Prosite #PS00032.
    • Method 3. User searches a database based on structural profiles. Example: HOMSTRAD Kinase.
    • Method 4. User searches a database using a query sequence:
      1. read amino acid sequence
      2. identify homologs via database search (e.g., blast)
      3. download data on homologs
      4. filter dataset according to taxa limits or sequence divergence
      5. compute protein alignment
      6. download DNA sequences
      7. compute codon alignment according to the protein alignment
      8. user runs script with desired parameters
      9. user visualizes output with alignment viewer or formatter
  • References:

Rokas, Williams, King and Carroll. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798-804 (23 October 2003). (Note: yeast genome phylogeny using 106 orthologs)
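Step 7 of Method 4 (compute a codon alignment according to the protein alignment) can be sketched as follows. This is a minimal illustration, not the method of any particular toolkit: the function name thread_codons and the toy sequences are made up for the example, and the sketch assumes that each unaligned coding sequence starts in frame and is not checked against the protein by translation.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Return the codon-alignment row that matches one aligned protein row.

    Assumption: `cds` is the in-frame coding sequence of the same protein,
    so residue k corresponds to nucleotides 3k..3k+2.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = []
    pos = 0  # index of the next unused nucleotide in the CDS
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)  # a protein gap becomes a three-column gap
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy data, invented for illustration only.
protein_alignment = {"seqA": "MK-L", "seqB": "MKTL"}
cds = {"seqA": "ATGAAACTG", "seqB": "ATGAAGACCCTG"}

codon_alignment = {name: thread_codons(row, cds[name])
                   for name, row in protein_alignment.items()}
for name, row in codon_alignment.items():
    print(name, row)
```

A production script would also verify that each codon translates to the aligned residue under the appropriate genetic code before trusting the result.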

Sequence family evolution: infer sequence family tree and detect positive selection

  • Background: a very common type of analysis, uses only sequence data and statistical models.
  • Key challenges:
    • multiple programs needed (tree-finders do not do the dN/dS calc)
    • coordination (use tree from aa alignment, but dN/dS from nt alignment)
    • lack of developed interfaces (PAML's is lame; could make a flexible one for HyPhy)
  • Preconditions: user has a protein sequence alignment sequences.[ aln | fas | nex | msf ] and a corresponding set of nucleotide sequences. Alternatively, the user has a coding region sequence (CDS) alignment with a genetic code for each sequence.
  • Steps:
  1. user writes script to carry out automatable steps:
    1. read amino acid sequence alignment
    2. read corresponding codon sequences
    3. select a subset of alignment as desired
    4. infer tree from selected subset of amino acid sequence alignment
    5. compute dN/dS from selected subset of codon alignment
  2. user runs script with desired parameters
  3. user uploads output to spreadsheet for analysis or visualization
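The coordination problem noted above (tree from the amino acid alignment, dN/dS from the nucleotide alignment) means that any subset selected in one alignment has to be selected in the other as well. The sketch below shows one way to keep the two in step, assuming the codon alignment was built so that protein column i corresponds to nucleotide columns 3i to 3i+2. The function name select_columns and the toy rows are hypothetical.

```python
def select_columns(protein_row, codon_row, columns):
    """Pick the same alignment columns from a protein row and its codon row.

    `columns` holds 0-based protein column indices; each one maps to the
    three nucleotide columns 3*i, 3*i+1 and 3*i+2 of the codon row.
    """
    if len(codon_row) != 3 * len(protein_row):
        raise ValueError("codon row must be three times as long as protein row")
    sub_protein = "".join(protein_row[i] for i in columns)
    sub_codons = "".join(codon_row[3 * i:3 * i + 3] for i in columns)
    return sub_protein, sub_codons


# Toy rows, invented for illustration: keep protein columns 0 and 3.
protein_subset, codon_subset = select_columns("MK-L", "ATGAAA---CTG", [0, 3])
print(protein_subset, codon_subset)
```

The protein subset would then go to the tree-inference program and the codon subset to the dN/dS program.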

Modeling character evolution: kinase inhibitor sensitivities

  • Background: We can obtain sequences for a large family comprising several hundred human kinases, along with inhibitor data. We wish to understand inhibitor sensitivity from a phylogenetic perspective.
  • Key challenges:
    • integrating sequence, phylogeny, structure and activity data (e.g., name conflicts)
    • modeling IC50 evolution as a continuous character
    • visualization and data management issues due to large family size
  • Preconditions: user has kinase sequence data; user has IC50 data for a large set of kinases in a parseable text format. User knows the binding pocket residue positions for at least one sequence.
  • Steps:
  1. user writes script to compute alignment, get likelihood of IC50 data:
    1. read amino acid sequences
    2. compute alignment
    3. optionally read binding pocket residue positions
    4. optionally get subset of alignment corresponding to binding pocket sites
    5. infer tree from chosen (whole or subset) alignment
    6. read IC50 data
    7. re-scale IC50 data as needed
    8. compute likelihood of IC50 data given tree
  2. user runs script with desired parameters to get likelihood based on whole alignment
  3. user runs script with desired parameters to get likelihood based on binding pocket subset
  4. user uploads output to tool for visualizing ancestral states
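The step "re-scale IC50 data as needed" is not specified further here. One common choice, shown below as an assumption rather than as the intended method, is to convert IC50 values to pIC50 (the negative base-10 logarithm of the molar concentration), which puts the data on a scale better suited to continuous-character models. The kinase names and values are invented for illustration.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM is 1e-9 M, so -log10(x * 1e-9) = 9 - log10(x).
    return 9.0 - math.log10(ic50_nm)


# Hypothetical measurements in nM, keyed by hypothetical kinase names.
ic50_nm = {"kinase_A": 1.0, "kinase_B": 1000.0, "kinase_C": 250.0}
pic50 = {name: pic50_from_nm(value) for name, value in ic50_nm.items()}
for name, value in sorted(pic50.items()):
    print(name, round(value, 2))
```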

Functional inference: identifying "functional" sites by "evolutionary trace" and related methods

  • Background: There are several contrast methods to detect sites whose patterns of evolution change from one phylogeny-based or class-based subset to another. Such sites are candidates for specificity-determining sites where "function" has shifted during evolution. The most popular approach is "Evolutionary Trace", but the most sophisticated is the evolutionary approach of Gu.
  • Key challenges:
    • need to develop interfaces to functional shift software
    • most methods are intended to be interactive, so an interactive interface is favorable
    • need to develop methods for visualizing results
  • Preconditions: User has sequence data for a family of homologous proteins or protein-coding genes.
  • Steps:
  1. user writes script to compute alignment and tree by desired methods
  2. user runs script with desired parameters to compute alignment and tree
  3. user develops graphical interface to multiple functional inference methods (CHAIN chain2004, ET (Evolutionary Trace) ET2006, DIVERGE DIVERGE2003)
  4. user interactively analyses sequence family for evidence of functional shifts.

Structural genomics: identifying Giardia protein targets for structural characterization

  • Background: Giardia lamblia is a single-celled eukaryote that can cause life-threatening intestinal disruption in humans and thus is a target for development of antimicrobial drugs. Rational drug design strategies benefit from 3D protein structures. Thus, we wish to identify the Giardia proteins that are most likely targets for structural characterization prior to rational drug design. The best targets will be essential to Giardia (so that inhibitors will be lethal), dissimilar to human proteins (so that inhibitors will not cross-react), and readily crystallized (so that the structure can be determined by x-ray crystallography).
  • Approach: start with the whole annotated genome; find homologs, and add data on essentiality in other organisms (worm, yeast, fly); create alignments and trees for every family; identify orthologies and paralogies; implement metrics for essentiality, distinctiveness, and crystallizability.
  • Key challenges:
    • genome-wide analysis with thousands of different families
    • integrate external sources of data of various types

Human variation: handicapping population SNPs as potential disease alleles

  • Background: Humans differ from each other, mainly by single-nucleotide differences called SNPs. Because SNPs are readily characterized by sequencing, they provide a basis for analyzing the significance of human genetic differences with respect to disease risks and other individual health issues (e.g., response to drugs).
 This is outside the scope of the current hackathon but will be addressed in a future one --Tjv 10:26, 21 October 2006 (EDT)
  • Key challenges:
  • Preconditions: user has online access to SNP data and sequence databases
  • Steps:
  1. user writes script to populate a SNP database with sequence family information
    1. for each SNP, identify genic context
    2. for SNPs in a gene
      1. get GO terms
      2. get expression data
      3. identify homologs (e.g., BLASTP for coding, BLASTN for non-coding)
      4. compute alignment & phylogeny
      5. compute parameters for site-specific model
  2. user runs script, creating database
  3. user queries database
    1. sample query
    2. another query

Population analysis: to be determined

  • Background: Weigang has suggested that we should pay more attention to this subject, given its importance. In the SNPs case above, we treat the SNPs as a collection of separate site changes that can be linked to the genome (gene, protein) sequence but have no other structure. But most population analyses introduce the complications of having multiple populations, diploidy, and linkage effects. Pedigrees and coalescents used in population analysis may be treated in some ways differently from trees.
 This is outside the scope of the current hackathon but will be addressed in a future one --Tjv 10:27, 21 October 2006 (EDT)

Examples of common target problems in population analysis would be

  • determining whether two population samples are from the same underlying population
  • computing the coalescence time given a population sample
  • determining the extent of linkage disequilibrium between two loci

Those unfamiliar with these cases might refer to the user manual for Arlequin, or look at Bio::PopGen, Bio::Variation and Bio::Pedigree::Marker::variation.
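As a concrete instance of the third problem listed above, linkage disequilibrium between two biallelic loci is often summarized by D = p(AB) - p(A)p(B) and by r^2 = D^2 / (p(A)(1 - p(A)) p(B)(1 - p(B))). The sketch below computes both from a sample of phased haplotypes. It assumes phase is already known, which sidesteps the diploidy complication mentioned in the background; the function name and the sample are made up for the example.

```python
def linkage_disequilibrium(haplotypes, allele1, allele2):
    """Return (D, r_squared) for two loci from phased two-locus haplotypes.

    Each haplotype is a two-character string: the allele at locus 1 followed
    by the allele at locus 2. `allele1` and `allele2` name the reference
    allele at each locus.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    d = p12 - p1 * p2
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    # r^2 is undefined when either locus is monomorphic in the sample.
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared


# Invented sample: the two loci are perfectly associated.
sample = ["AB"] * 4 + ["ab"] * 4
print(linkage_disequilibrium(sample, "A", "B"))
```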

Molecular evolution: differences in data sets and methods in intron studies

  • Background: Conflicting analyses of intron evolution lead to questions that are hard to answer. In one approach, Rogozin et al. rogozin2003 apply Dollo parsimony to 684 KOGs families (orthologs only, one each per family from 8 complete genomes), and the results suggest a large number of introns ancestral to plants, animals and fungi. In another approach, Qiu et al. qui2004 apply a Bayesian time-asymmetric model to data for 10 large gene families (orthologs and paralogs, using all available data), and the results suggest a very low probability of presence for introns in deep branches of the trees. Several ML approaches have also been formulated, including Roy and Gilbert rg2005a, Nguyen et al. nguyen2005, and Csuros (also see http://www.iro.umontreal.ca/~csuros/introns/ ).

Do the studies really conflict? If so, how much is due to different methods, and how much to different data sets?

  • Key challenges:
    • data management issues arising from multiple data sets, methods
    • various reconstruction methods including unusual Bayesian time-asymmetric model
    • need for automated method to identify taxonomically defined nodes, e.g., PFA ancestor
  • Preconditions: user has access to data sets.
  • Steps:
  1. user develops database schema for reconstruction results
  2. user writes script to apply Bayesian 2-state model to KOGs data, output to database
  3. user runs script to apply Bayesian model to KOGs data and populate database with results
  4. user writes script to identify Plant-Animal-Fungal ancestor (and other relevant ancestors) in Qiu et al families
  5. user runs script to identify Plant-Animal-Fungal ancestor (and other relevant ancestors) in Qiu et al families
  6. user writes script to apply Dollo parsimony to Qiu, et al data, output to database
  7. user runs script to apply Dollo to Qiu, et al data and populate database with results
  8. user queries database to resolve differences
    1. what fraction of intron sites (expected fraction) populated with introns in PFA ancestor?
    2. how many instances (expected instances) of parallel gain?
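The Dollo step and the ancestor query can be illustrated on a toy tree. Under Dollo parsimony a character arises exactly once and may be lost many times, so an intron is inferred present at every node on a path between the most recent common ancestor of all intron-bearing taxa and any of those taxa. The sketch below applies that rule to one intron position. The nested-tuple tree encoding, the function names and the four-taxon tree are assumptions made for the example, not the data or code of the cited studies.

```python
def leaves(node):
    """Set of leaf names below a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def dollo_states(tree, present):
    """Map each node (as a frozenset of its leaves) to 0 or 1 under Dollo."""
    present = set(present)
    states = {}

    def visit(node, at_or_below_origin):
        below = leaves(node)
        if not at_or_below_origin and present and present <= below:
            # The single gain sits at the deepest node that still contains
            # every intron-bearing taxon.
            is_leaf = isinstance(node, str)
            at_or_below_origin = is_leaf or not any(
                present <= leaves(child) for child in node)
        states[frozenset(below)] = int(at_or_below_origin
                                       and bool(below & present))
        if not isinstance(node, str):
            for child in node:
                visit(child, at_or_below_origin)

    visit(tree, False)
    return states


# Toy tree and toy intron pattern, invented for illustration.
tree = (("plant", ("animal", "fungus")), "outgroup")
states = dollo_states(tree, {"plant", "animal"})
pfa_ancestor = frozenset({"plant", "animal", "fungus"})
print("intron present in plant-animal-fungal ancestor:", states[pfa_ancestor])
```

Repeating this over all intron positions and averaging the ancestral states would give the fraction asked for in query 1, for the Dollo method only.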

Tree of life: whole-genome phylogeny and horizontal transfer

  • datasets with large numbers of taxa and large numbers of characters
  • anastomosing trees

Phylogenetic footprinting/shadowing

  • Background: Conservation of orthologous sequences is a signal of functionality. Identification of functional genome elements based on sequence conservation is most often used for
    • Identifying gene structure (e.g., splice junctions, start codon position, ORF/exon identification)
    • Identifying non-protein coding functional elements (e.g., microRNA, promoters, and enhancers)
  • Approach:
  1. Identify and extract orthologous sequences (through genome synteny)
  2. Align orthologous sequences
  3. Obtain a tree (maybe at the same time as the alignment)
  4. Calculate total tree length of the alignment on the tree
  5. Implement the tree-length calculation in a sliding-window fashion
  6. Identify regions that deviate significantly from the average tree length
  • Challenges:
    • Most applications have used the simplified, non-tree-based approach of comparing a pair of sequences (e.g., Human-Mouse). Use of multiple-species comparisons, a more powerful approach, is rare.
    • Alignment uncertainty.
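Steps 4 and 5 of the approach can be sketched with parsimony tree length standing in for "total tree length". That substitution is an assumption of this example, since the text does not say whether parsimony or model-based branch lengths are intended. The sketch counts the minimum number of changes per alignment column with Fitch's algorithm on a fixed rooted binary tree, then sums the counts in sliding windows. The tree, the sequences and the function names are invented for illustration, and gap characters would simply be treated as another state.

```python
def fitch_length(tree, column):
    """Minimum number of state changes for one column on a rooted binary tree.

    `tree` is a nested tuple of leaf names; `column` maps leaf name to state.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {column[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # no shared state, so at least one change is required
        return left | right

    state_set(tree)
    return changes


def window_lengths(tree, alignment, window):
    """Sum of per-site parsimony lengths in each window of `window` sites."""
    n_sites = len(next(iter(alignment.values())))
    per_site = [
        fitch_length(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    ]
    return [sum(per_site[start:start + window])
            for start in range(n_sites - window + 1)]


# Toy tree and alignment, invented for illustration.
tree = (("human", "mouse"), ("chicken", "frog"))
alignment = {
    "human": "ACGTAC",
    "mouse": "ACGTTC",
    "chicken": "ACGAGC",
    "frog": "ACGACC",
}
print(window_lengths(tree, alignment, 3))
```

Windows with unusually low totals are candidates for conserved, possibly functional, regions; deciding what counts as a significant deviation (step 6) needs a statistical model that this sketch does not provide.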

Bayesian supertrees: build composite estimates of phylogeny from bipartition frequencies

  • Background: Supertrees are trees whose topology is derived from a set of partially overlapping, smaller, "source" trees. Different methods of supertree construction are available. A recently suggested approach is to build supertrees by defining specific priors from bipartition frequencies in source trees.
  • Key challenges:
    • handle large numbers of trees, transform trees in different formats
    • communicate in non-standard ways with MrBayes?
  • Preconditions: user has a set of source trees.
  • Steps:
    1. user writes script to carry out automatable steps:
      1. parse tree files
      2. extract bipartition frequencies from source trees
      3. create MrBayes commands to specify priors
      4. run MC3 analysis
      5. post-analysis (sumt)
    2. user runs script with desired parameters
    3. user visualizes output with tree viewer

(by Weigang Qiu, original at molevol.org)
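The step "extract bipartition frequencies from source trees" can be sketched as below. To keep the example short it assumes that the trees are already parsed into nested tuples of leaf names and that all source trees share the same taxon set; real supertree inputs overlap only partially and would need extra handling. Each non-trivial bipartition is stored as the side that does not contain a fixed anchor taxon, so the same split is recognized regardless of rooting. All names and trees are invented for illustration, and the translation of frequencies into MrBayes prior commands is not shown.

```python
from collections import Counter


def leaf_set(node):
    """Set of leaf names below a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaf_set(child)
    return result


def bipartitions(tree):
    """Non-trivial splits of a tree, each as the side lacking the anchor taxon."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # arbitrary but consistent reference taxon
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if anchor in side:
            side = all_leaves - side
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(frozenset(side))
        for child in node:
            walk(child)

    walk(tree)
    return splits


def bipartition_frequencies(trees):
    """Fraction of source trees that contain each bipartition."""
    counts = Counter()
    for tree in trees:
        counts.update(bipartitions(tree))
    return {split: count / len(trees) for split, count in counts.items()}


# Toy source trees, invented for illustration.
source_trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), ("D", ("C", "E"))),
    (("A", "C"), ("B", ("D", "E"))),
]
for split, freq in bipartition_frequencies(source_trees).items():
    print(sorted(split), round(freq, 2))
```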

Inferring a phylogenetic tree on morphological characters

  • Motivation: Morphological (or any phenotypic) character change represents the observation of phenotype evolution. Can we infer the phylogeny of a set of taxa based on their observed phenotypes?
  • Precondition: user has a spreadsheet-type matrix of character states (one character per column) and taxa (rows)
  • Steps:
    1. choose program to run analysis with (PAUP, TNT, Phylip, MrBayes)
    2. format character state matrix for program (e.g., create NEXUS file)
    3. pass commands, parameters, and analysis options (e.g., branch&bound, heuristic, #iterations, etc) to program, e.g., by encoding it into the NEXUS file
    4. run analysis, e.g., maximum parsimony, as specified in parameters or NEXUS file
  • Results: tree (e.g., in Newick format) that can be visualized (e.g., using TreeView)
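Step 2 (format the character state matrix, e.g., as a NEXUS file) can be sketched as follows. The function writes a minimal DATA block with DATATYPE=STANDARD. It assumes binary 0/1 states and taxon names that are single tokens without spaces or punctuation; other symbols would need a SYMBOLS declaration and other names would need quoting. The function name and the matrix are invented for illustration, and program-specific command blocks (step 3) are left out.

```python
def to_nexus(matrix, missing="?", gap="-"):
    """Format {taxon: character-state string} as a minimal NEXUS DATA block."""
    taxa = list(matrix)
    n_char = len(matrix[taxa[0]])
    if any(len(states) != n_char for states in matrix.values()):
        raise ValueError("all taxa need the same number of characters")
    width = max(len(name) for name in taxa)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"    DIMENSIONS NTAX={len(taxa)} NCHAR={n_char};",
        f"    FORMAT DATATYPE=STANDARD MISSING={missing} GAP={gap};",
        "    MATRIX",
    ]
    for name in taxa:
        # Pad names so that the character columns line up.
        lines.append(f"    {name.ljust(width)}  {''.join(matrix[name])}")
    lines += ["    ;", "END;"]
    return "\n".join(lines) + "\n"


# Toy matrix: rows are taxa, columns are characters (invented values).
matrix = {"taxon_one": "0110", "taxon_two": "01?1", "outgroup": "0000"}
print(to_nexus(matrix))
```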

Calculating support values for a phylogenetic tree inferred from morphological characters

  • Motivation: An index of relative support among the nodes within a phylogeny is often desired, but morphological data does not typically meet the requirements of the bootstrap (a large number of characters selected randomly from the universe of all possible characters). Bremer support values (Bremer 1988), also known as decay indices, offer one alternative to the bootstrap. How can one calculate Bremer support?
  • Precondition: user has completed a tree inference and has obtained a consensus of most parsimonious (equally optimal) topologies and knows the length (# steps) of each most parsimonious tree
  • Approach One (as executed via TreeRot (Sorenson 1999) in PAUP*):
    1. Identify all N nodes within the consensus topology.
    2. Define a converse constraint for each of the N nodes
    3. Run the original phylogenetic inference N times, once for each of the converse constraints, and record the length of the shortest tree(s) that do not contain each of the nodes in the original consensus topology.
    4. The Bremer support for each node = the length of the shortest tree(s) that do not contain that node - the length of the shortest tree(s) from the unconstrained analysis.
  • Approach Two (as executed in TNT (Goloboff et al. 2000)):
    1. Assemble a large number of possible topologies of varying treelengths, typically by running a heuristic search and saving trees that are suboptimal by a specified number of steps or less.
    2. Determine Bremer support for all nodes in the phylogeny from this single set of trees. Support for each node = the length of the shortest suboptimal tree that does not contain that node - the length of the overall shortest tree.
    3. Approach Two tends to inflate support values, particularly large support values, but is significantly faster than Approach One.
  • Possible modifications:
    1. One may wish to run multiple analyses for each of the converse constraints to ensure that the shortest trees have been found. TreeRot defaults to 20 repetitions of the analysis for each of the N nodes.
    2. One may wish to calculate Partitioned Bremer Support (e.g. Baker and DeSalle 1997, Baker et al. 1998) to determine the relative contribution to the support for each node provided by each of several data partitions. TreeRot can calculate Partitioned Bremer support.
  • Results: Support values (decay indices) for each node in the original phylogeny.
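The arithmetic of Approach Two can be sketched as below: for each node of the shortest tree, find the shortest saved tree that lacks that node and subtract the length of the shortest tree. The sketch assumes rooted trees given as nested tuples and tree lengths already computed by the search program. The function names, topologies and lengths are all invented for illustration; this is not the TNT or TreeRot implementation.

```python
def clades(tree):
    """Internal clades of a rooted tree as frozensets, excluding the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    root = walk(tree)
    found.discard(root)  # every tree contains the all-taxa clade
    return found


def bremer_support(best_tree, best_length, suboptimal):
    """Bremer support per clade of `best_tree`.

    `suboptimal` is a list of (tree, length) pairs from the search. A clade
    gets None when every saved tree still contains it, meaning the saved
    set was not deep enough to measure its support.
    """
    support = {}
    for clade in clades(best_tree):
        lengths = [length for tree, length in suboptimal
                   if clade not in clades(tree)]
        support[clade] = min(lengths) - best_length if lengths else None
    return support


# Toy trees with made-up lengths (number of steps).
best = ((("A", "B"), "C"), "D")
saved = [
    (((("A", "C"), "B"), "D"), 12),
    ((("A", "B"), ("C", "D")), 13),
    (((("B", "C"), "A"), "D"), 12),
]
for clade, value in bremer_support(best, 10, saved).items():
    print(sorted(clade), value)
```

The same subtraction applies in Approach One, except that each constrained search supplies the shortest length directly.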

Inferring a phylogenetic tree or supertree on multiple datasets

  • Motivation: Especially in synthetic research, a frequent task is to combine different data (character state) matrices from the literature and merge those into one matrix, or perform a super-tree analysis.
  • Precondition: user has multiple spreadsheet-type matrices of character states (one character per column) and taxa (rows)
  • Steps:
    1. reconcile columns of data matrices through an ontology
    2. merge multiple matrices into a single matrix
    3. subsequent steps as in the case of a single matrix
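Step 2 (merge multiple matrices into a single matrix) can be sketched as a simple concatenation that fills in missing data for taxa absent from a source matrix. The sketch assumes step 1 has already been done, so the source matrices describe distinct, non-overlapping characters. The function name and the matrices are invented for illustration.

```python
def merge_matrices(matrices, missing="?"):
    """Concatenate character matrices, padding absent taxa with `missing`.

    Each matrix maps taxon name to a string of character states; all rows
    within one matrix are assumed to have the same length.
    """
    taxa = sorted({taxon for matrix in matrices for taxon in matrix})
    merged = {taxon: "" for taxon in taxa}
    for matrix in matrices:
        n_char = len(next(iter(matrix.values())))
        for taxon in taxa:
            merged[taxon] += matrix.get(taxon, missing * n_char)
    return merged


# Toy matrices from two hypothetical studies with partly different taxa.
study_one = {"A": "01", "B": "11"}
study_two = {"B": "0", "C": "1"}
for taxon, states in merge_matrices([study_one, study_two]).items():
    print(taxon, states)
```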

Determine concordance between two or more phylogenies

  • Motivation: Can we determine co-evolution (divergence, e.g., speciation) between the taxa in one clade versus the taxa in another clade, e.g., because of species interaction? A frequently encountered example is host-parasite interaction.
  • Precondition: user has two or more phylogenies (trees), and a table that matches (associates) leaves in one tree to those in another tree.
  • Steps:
    1. align trees by running Component (R.Page) to minimize the number of switch/shift events (e.g., number of host-switches)
  • Results: phylogenetic scale of concordance (e.g., family level, species level?), time-wise order of shift events (e.g., do parasite divergence events follow host divergence events)

Determine correlation of a phylogeny with other variables

  • Motivation: Can we hypothesize certain variables, e.g., ecological, climate, or geographical, to be (causally or not) correlated with a given phylogeny?
  • Precondition: user has a phylogeny (tree), and optionally variables in a table (one variable per column, taxa in rows)
  • Steps:
    1. possibly obtain ecological or geographical (lat/long) data through GBIF for all or a subset of the taxa in the tree (species-based data) or for the specimens (specimen-based data)
    2. run Phylocom (continuous and at most one discrete variable), CAIC (continuous and at most one discrete variable), or Discrete (only discrete variables) (note: all of these except Phylocom need a bifurcating tree)
  • Results: phylogenetic independent contrasts analysis result, sister taxa results for subsequent analysis in statistical software

Estimate divergence times

  • Motivation: The user wishes to estimate the time of occurrence of a past evolutionary event. Typically this is a divergence of organismal taxa (e.g., time of divergence of cats and dogs), but it could be another type of event (e.g., time of ancestral hemoglobin-myoglobin duplication).
  • Background: Most evolutionary analyses lack an absolute time scale. However, the lengths of branches on an inferred phylogeny represent the accumulated change in a time-dependent process, thus a phylogeny with branch-lengths contains information about the relative timing of events. The key problem is to relate this information on relative timing to an absolute time scale by taking advantage of calibration points from fossil or other paleontological or archeological dating. For instance, if our phylogeny contains a cat-dog (felidae-canidae) divergence, which is estimated from fossil data to be 60 MYA (million years ago), then we can use this to assign tentative dates to other nodes in the tree. See Time Tree for practical examples; see MCMCTree yang2006 for a calibration algorithm.
  • Key challenges: Data on calibration points are embedded in text representations such as research papers; calibration points have uncertainty; inferred dates have additional uncertainty; techniques for simulating interpolated dates, and for generating composite date estimates from disparate data sources are not well developed.
  • Precondition: user has a phylogeny (tree) with branch lengths, and a table of species and calibration points as ranges of time (upper and lower bounds). In addition, molecular data (i.e. sequence alignments, maybe clock-like) might be used to constrain simulations or derive additional relative date estimates.
  • Steps:
    1. Possibly use Phylomatic for branch-length adjustments
    2. run r8s sanderson2003 or multidivtime with penalty parameters
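The simplest way to relate relative timing to an absolute time scale is a strict molecular clock: if node heights on an ultrametric tree are proportional to time, one calibrated node fixes the scale for all others. The sketch below does only that, carrying the lower and upper calibration bounds through as a range. It is far cruder than the rate-smoothing and Bayesian methods in r8s or multidivtime, which relax the clock assumption. The node names, heights and the 55 to 65 MYA bounds are invented for illustration.

```python
def calibrate(node_heights, calibration_node, age_bounds):
    """Scale relative node heights to absolute ages under a strict clock.

    `node_heights` maps node name to its height (distance to the tips) on an
    ultrametric tree; `age_bounds` is the (lower, upper) age of the
    calibration node. Returns {node: (lower_age, upper_age)}.
    """
    lower, upper = age_bounds
    reference = node_heights[calibration_node]
    if reference <= 0:
        raise ValueError("calibration node must have positive height")
    return {node: (height / reference * lower, height / reference * upper)
            for node, height in node_heights.items()}


# Hypothetical heights in substitutions per site and a hypothetical
# calibration range in million years ago (MYA).
heights = {"cat-dog": 0.30, "other-node": 0.15}
ages = calibrate(heights, "cat-dog", (55.0, 65.0))
for node, (low, high) in ages.items():
    print(node, low, high)
```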

Determine genome-wide distribution of Ks (silent site substitutions)

  • Motivation: There is recent and growing interest in analysing the distribution of Ks between paralogous gene sequences within a genome. The distribution of Ks can be used to study the occurrence of past whole genome duplications and as a means of examining broad-scale patterns in gene and genome evolution. This analysis can be done for any genome for which an extensive EST or full-length cDNA collection is available.
  • Precondition: user has a large dataset of clustered ESTs (unigenes), full-length cDNAs, or CDS.
  • Steps:
    1. Identify paralogs/gene families
    2. For each gene family, in-frame, multiple alignments are created and gene trees are generated
    3. Pairwise Ks is estimated for each node in the gene tree (trees are "halved" to prevent double-counting gene duplication events)
    4. Ks values are collated, analysed and plotted.

Tree reconciliation (orthology analysis)

  • Motivation: The user wishes to reconcile a gene (or protein) tree (for a specific gene or protein family) with a species tree (for the set of species comprising the gene family members) in order to identify events of gene duplication, gene transfer and gene loss. This information is useful for problems of functional inference, e.g., the function of the gene is more likely to shift following a duplication.
  • Background: The phylogeny inferred for a set of genes, Pg, even if it is perfect, does not necessarily match the phylogeny Ps for the species carrying these genes, because duplication, loss and transfer of genes may occur for individual gene copies separately from the whole genome. For instance, tetrapods have alpha and beta hemoglobins, as well as myoglobins, so the tree of globin genes (or proteins) for 4 tetrapod species (e.g., human, rat, frog, chicken) will have at least 12 terminal branches (not 4) and typically will have more than 12 due to lineage-specific duplications. The lineage-specific loss of an ancestrally duplicated gene, or the horizontal transfer of a gene from one species to another, also contributes to discordance between the gene tree and the species tree. The reconcile tree problem is to take Pg and Ps as input, and generate Pr, the reconciled tree with annotated duplications, losses, and transfers. A related problem to reconciling a gene tree with a species tree is to infer a species tree Ps from multiple gene trees Pg1, Pg2, etc.
  • Key Challenges: in practice, gene tree(s) and/or species tree are often unresolved (most software can only deal with fully resolved input trees); most software requires rooted input trees, but the midpoint rooting method commonly used for this is problematic; current algorithms allow for duplication and loss, but not horizontal transfer; current software uses a variety of input and output formats
  • Precondition: user has gene tree and species tree in Newick (or NHX or Schreiber) format. Species names or identifiers in species tree have to be substrings of gene/sequence names
  • Steps: for available software, see TreeReconciliation
  1. adjust input format for gene tree(s), species tree, sequences, etc
  2. customize software (e.g., softparsmap) for data input (e.g., remove requirement for GI Number for sequences, allow sequence names to consist of integers and numbers)
  3. run software
  4. parse output if necessary
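The core of duplication inference in many reconciliation programs is the LCA mapping: each gene-tree node is mapped to the smallest species-tree clade that contains the species of all genes below it, and a node is called a duplication when it maps to the same clade as one of its children. The sketch below implements only that rule, for fully resolved rooted trees, without losses or transfers. It assumes trees given as nested tuples and gene names of the form species_gene, which is one way to satisfy the substring precondition above; the names and trees are invented for illustration and this is not the code of softparsmap or any other listed program.

```python
def leaf_names(node):
    """Set of leaf names below a node; leaves are strings, clades are tuples."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaf_names(child)
    return result


def species_clades(species_tree):
    """All clades of the species tree (including leaves), smallest first."""
    found = []

    def walk(node):
        found.append(frozenset(leaf_names(node)))
        if not isinstance(node, str):
            for child in node:
                walk(child)

    walk(species_tree)
    return sorted(found, key=len)


def find_duplications(gene_tree, species_tree, species_of):
    """Return the sorted leaf lists of gene-tree nodes inferred as duplications."""
    clades = species_clades(species_tree)
    duplications = []

    def lca(species_set):
        # Clades are nested or disjoint, so the first hit is the smallest.
        return next(clade for clade in clades if species_set <= clade)

    def mapping(node):
        if isinstance(node, str):
            return lca({species_of(node)})
        child_maps = [mapping(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:
            duplications.append(sorted(leaf_names(node)))
        return here

    mapping(gene_tree)
    return duplications


# Toy example: alpha and beta globins in two species.
species_tree = ("frog", ("chicken", ("human", "rat")))
gene_tree = (("human_alpha", "rat_alpha"), ("human_beta", "rat_beta"))
print(find_duplications(gene_tree, species_tree,
                        lambda name: name.split("_")[0]))
```

In the toy example the root of the gene tree is reported as a duplication, matching the alpha/beta globin split described in the background, while the two within-paralog nodes are treated as speciations.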

References

<biblio>

  1. qui2004 pmid=15014153
  2. rg2005a pmid=15687506
  3. rogozin2003 pmid=12956953
  4. chain2004 pmid=15273306
  5. ET2006 pmid=16894615
  6. DIVERGE2003 pmid=12868604
  7. nguyen2005 pmid=16389300
  8. yang2006 pmid=16177230
  9. sanderson2003 pmid=12538260
  10. sorenson1999 M.D. Sorenson (1999). Treerot. Version 2. Boston University. Boston, MA.
  11. baker1997 Baker, R.H., and R. DeSalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian Drosophilids. Systematic Biology 46:654-673.
  12. baker1998 Baker, R. H., X. B. Yu, and R. DeSalle. 1998. Assessing the relative contribution of molecular and morphological characters in simultaneous analysis trees. Molecular Phylogenetics and Evolution 9:427-436.
  13. bremer1988 Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795-803.
  14. swofford2002 Swofford, D. L. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4.0. Sinauer Associates, Sunderland, Massachusetts.
  15. goloboff2003 Goloboff, P., J.S. Farris and K.C. Nixon. 2003. T.N.T. Tree Analysis Using New Technology. version 1.0.

</biblio>

Sample Data Sets

Major phylogenetic data repositories:

Data sets that could be used in the use cases above
Name | Taxa | Gene/Domain | Data Type | Link | Notes
PF00046 | 138 metazoan | Homeobox | Protein | PFam | Pre-computed Pfam seed alignment
PS00032 | 776 metazoan | Homeobox | Protein | Prosite | Pre-computed profile-based family
Kinase | 111 taxa | Amino acid kinase | Protein | Homstrad | Pre-computed structure-based family
Whippo | 8 Artiodactyl (ungulates) and 4 Cetacean (whales) | 17 datasets | Protein, DNA, non-coding, mitochondrial, character | BioQuest, whippo1 from SysBio, whippo2 from SysBio | Expert compiled datasets, with outgroup.
Borrelia | 7 Borrelia burgdorferi (Lyme disease bacteria) strains | ospA | Protein, Codon | Protein alignment, Codon alignment | Expert compiled datasets, with a codon alignment. Good for testing gene orthology/paralogy, as well as for testing positive selection
Intron | 27 taxa | Mn/Fe-SOD | Protein, Codon, Binary character | Protein alignment, Codon alignment, Intron character matrix | Expert compiled datasets, with codon alignment. Good for character reconstruction, testing gene orthology/paralogy
Eutherian | 11 eutherian orders | 4 mitochondrial and 4 nuclear genes | Nucleotide | Springer_etal SysBio | Expert compiled datasets. Good for testing tree inconsistency among genes
Multiple protein coding | Various | 7738 protein families | Nucleotide/Protein | 58.5MB gzipped tarball in a custom format (format description) | Automatically generated and hand-curated sequence alignments useful for studying the evolution of protein-coding sequences
Interferon | 6 Eutherian Mammals | Interferon gene families | Nucleotide/Coding | Downloadable NEXUS alignments | Hand curated alignments of members of the interferon gene families in 6 eutherian species. Useful for the study of gene family evolution and gene conversion ([1])