PhyloSoC:Estimation of divergence time priors from fossil occurrence data
This project is part of the 2007 Phyloinformatics Summer of Code which is part of the Google Summer of Code project. This web page will serve as the central resource for information related to the project "Estimation of divergence time priors from fossil occurrence data" that is being developed by Mike Nowak with mentorship by Derrick Zwickl.
- This page started --MikeNowak 10:00, 17 May 2007 (EDT)
Paleobiologists and molecular biologists have been asking the same question for many years: How long ago did this group of species share a common ancestor? Paleobiologists employ the fossil record to address this problem, while molecular biologists use “molecular clock” analyses of DNA sequence data. Molecular clock methods require fossil data to calibrate rate. Although the methods for estimating divergence times have been developed considerably to relax the assumption of constant rate, the improvement of methods to calibrate rates through fossil data has been given very little attention. Molecular divergence time estimation has nevertheless become ubiquitous in systematic biology, although it has drawn considerable well-founded criticism for the "illusion of precision" that many methods provide. I propose to develop a free software tool to foster the integration of methods developed by analytical paleobiologists and molecular biologists, who are both attempting to estimate divergence times from different sources of data. I envision that this will be a multi-functional tool assisting with both the retrieval of data from a paleobiological database and the analysis of fossil occurrence data to construct realistic prior distributions that can be implemented in existing Bayesian MCMC molecular divergence time software.
Estimating the time since the most recent common ancestor of two species was once considered a problem limited to the field of paleobiology. Zuckerkandl and Pauling (1965) were the first to suggest that the amount of DNA sequence divergence between two species could be employed as a metric for the time since their most recent common ancestor. The accumulation of mutations in a DNA sequence was likened to the ticking of a “molecular clock”. But simply counting mutations was not sufficient for estimating the absolute divergence time between two species, because there was no way of knowing how fast the “molecular clock” was ticking (i.e. the mutation rate). The fossil record is the only source of information to calibrate rate to units of absolute time.
Since the initial conception of Zuckerkandl and Pauling (1965), the methods for estimating divergence times have been developed considerably to relax the assumption of constant rate, but the improvement of methods to calibrate rates through fossil data has been given very little attention. Molecular divergence time estimation has nevertheless become ubiquitous in systematic biology, although it has drawn considerable well-founded criticism for the "illusion of precision" that many methods provide (Graur and Martin 2004). Much of the error in divergence time estimation is thought to be due to unrealistic calibrations provided by a literal interpretation of the fossil record. This problem stems from the fact that a literal interpretation of the fossil record can only provide a minimum age for a divergence, while in molecular divergence time analyses dates from the fossil record are often applied as the exact age of divergence. In response to this perceived misuse of the fossil record by molecular biologists, Foote et al. (1999) and Tavaré et al. (2002) developed generally applicable analytical methods to quantify uncertainty in fossil dates based on the temporal distribution of fossil occurrences. These methods use a probabilistic model of fossil preservation in conjunction with a model of cladogenesis to estimate a divergence time for a group of species. Although they address exactly the same biological problem as molecular biologists (i.e. dating divergences), these authors did not integrate their analytical paleobiological methods with those of molecular divergence time estimation. Recently developed software employing Bayesian Markov chain Monte Carlo (MCMC) methods shows promise for facilitating this integration through the application of prior distributions on divergence times (Yang and Rannala 2006, Drummond et al. 2006).
Although both the analytical framework for constructing prior distributions that incorporate the error inherent in the fossil record and the software for implementing these prior distributions have been developed, integration is still hindered by the fact that there is no publicly available software to actually calculate prior distributions from a given dataset. Furthermore, the process of collecting the necessary occurrence data on a per-taxon scale from the paleobiological literature is extremely time consuming. Fortunately, these data are available in a public database that has been developed as the primary repository of all paleobiological data (the Paleobiology Database, http://paleodb.org/).
I propose to develop a free software tool to foster the integration of methods developed by analytical paleobiologists and molecular biologists, who are both attempting to estimate divergence times from different sources of data. I envision that this software will function in two ways:
- Automate retrieval of fossil occurrence data from the Paleobiology Database.
This will be accomplished by querying the MySQL database for occurrence data belonging to the taxonomic group (i.e. node) of interest.
- Analysis of fossil occurrence data to construct realistic prior distributions that can be implemented in existing Bayesian MCMC molecular divergence time software.
With occurrence data on hand from the Paleobiology Database, I plan to first implement the method developed by Foote et al. (1999) to calculate the divergence time priors. When this is up and running, I plan to expand the software to include more complex models of fossil preservation and cladogenesis in the calculations of divergence time priors.
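As a rough illustration of how the two functions could connect, the C++ sketch below reduces a set of occurrence records to one observed stratigraphic range per lineage, which is the kind of summary the prior calculations would start from. The comma-separated record layout (`taxon,oldMa,youngMa`) and the names `Occurrence`, `StratRange`, `parseOccurrence` and `calcStratRanges` are assumptions made for this example; they are not the PBDB download format or the project's actual classes.

```cpp
#include <algorithm>
#include <cassert>
#include <exception>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one fossil occurrence of a taxon in a time bin.
// Ages are in millions of years before present (Ma), so oldMa >= youngMa.
struct Occurrence {
    std::string taxon;
    double oldMa = 0.0;    // older boundary of the bin
    double youngMa = 0.0;  // younger boundary of the bin
};

// Observed stratigraphic range of one lineage.
struct StratRange {
    double firstMa;  // oldest bin boundary among its occurrences
    double lastMa;   // youngest bin boundary among its occurrences
};

// Parse one line of the assumed layout "taxon,oldMa,youngMa".
// Returns false for missing fields, non-numeric ages, or reversed ages.
bool parseOccurrence(const std::string& line, Occurrence& out) {
    std::istringstream in(line);
    std::string oldField, youngField;
    if (!std::getline(in, out.taxon, ',') ||
        !std::getline(in, oldField, ',') ||
        !std::getline(in, youngField)) {
        return false;
    }
    try {
        out.oldMa = std::stod(oldField);
        out.youngMa = std::stod(youngField);
    } catch (const std::exception&) {
        return false;  // std::stod throws on non-numeric input
    }
    return !out.taxon.empty() && out.oldMa >= out.youngMa;
}

// Group occurrences by taxon and keep the widest observed age span.
std::map<std::string, StratRange> calcStratRanges(
        const std::vector<Occurrence>& occs) {
    std::map<std::string, StratRange> ranges;
    for (const Occurrence& o : occs) {
        assert(o.oldMa >= o.youngMa);
        auto it = ranges.find(o.taxon);
        if (it == ranges.end()) {
            ranges.emplace(o.taxon, StratRange{o.oldMa, o.youngMa});
        } else {
            it->second.firstMa = std::max(it->second.firstMa, o.oldMa);
            it->second.lastMa = std::min(it->second.lastMa, o.youngMa);
        }
    }
    return ranges;
}
```

Each downloaded line would be passed through `parseOccurrence`, rejected records would be skipped or reported, and the surviving records handed to `calcStratRanges`. The resulting ranges are only minimum estimates of true lineage durations, which is why a preservation model is needed on top of them.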
Timeline of Goals
May 17 - Start
- Set up project wiki
- Started the project blog.
- Working on reading in and parsing occurrence data from the PBDB
- Design class structure
- Parser class
- Occurrence class
- Lineage class
- NodeOfInterest class
- Explore parser class code written by Derrick
- Augment parser class to add error handling
- Explore occurrence class code
- Complete Parser and Occurrence classes
- Add Comments
- Compile and debug
- Test with example PBDB dataset
- Write Lineage header and class
- Start writing NodeOfInterest header and class
- Complete Lineage and NodeOfInterest classes
- Compile, debug and test with example PBDB dataset
- Construct lookup table for PBDB 10My bins to absolute time.
- Write CalcStratRange() function in Lineage Class
- Write functions for calculating origination and extinction rates for a given node of interest
- Use origination and extinction rates with Raup's (1985) method to calculate the likelihood of the age of the MRCA given these rates and the observed extant diversity
- Rearrange the equations of Foote et al. (1999) to suit our specific problem (i.e. known extant diversity, and an estimate of standing diversity at one point in the past)
- Cannibalize Derrick's old code for MCMC and build it to suit this problem
- Do the same for Derrick's old code for fitting distributions by least squares
- Work on coding likelihood calculations
- Set up the PBDB query system
- Determine the necessary options to make available to user
- Figure out how to use libcurl to make HTTP requests
- Have basic working prototype
- Query PBDB
- Parse data
- Perform analysis under most basic models
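The likelihood step in the list above (age of the MRCA given origination and extinction rates and the observed extant diversity) can be sketched as follows. This uses the standard constant-rate birth-death formula for the size of a clade founded by a single lineage, which I assume is the kind of calculation meant by Raup's (1985) method; it has not been checked against that paper, it ignores fossil preservation, and it does not condition on clade survival. The names `probNLineages`, `AgeLikelihood` and `likelihoodCurve` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Probability that a clade founded by one lineage has exactly n >= 1 living
// lineages after time t (in My), under a constant-rate birth-death process
// with origination rate lambda and extinction rate mu (per lineage per My).
// Note: very large (lambda - mu) * t would overflow std::exp.
double probNLineages(int n, double t, double lambda, double mu) {
    assert(n >= 1 && t > 0.0 && lambda > 0.0 && mu >= 0.0);
    double alpha = 0.0;  // probability the clade is extinct by time t
    double beta = 0.0;   // ratio of the geometric distribution of clade size
    if (std::fabs(lambda - mu) < 1e-12) {
        // Limiting case of equal origination and extinction rates.
        alpha = beta = lambda * t / (1.0 + lambda * t);
    } else {
        const double e = std::exp((lambda - mu) * t);
        alpha = mu * (e - 1.0) / (lambda * e - mu);
        beta = lambda * (e - 1.0) / (lambda * e - mu);
    }
    return (1.0 - alpha) * (1.0 - beta) * std::pow(beta, n - 1);
}

struct AgeLikelihood {
    double ageMa;       // candidate age of the MRCA
    double likelihood;  // normalized to sum to 1 over the grid
};

// Evaluate the likelihood of each candidate MRCA age on an evenly spaced
// grid from minAgeMa to maxAgeMa (inclusive), given nExtant living lineages.
std::vector<AgeLikelihood> likelihoodCurve(int nExtant, double lambda,
                                           double mu, double minAgeMa,
                                           double maxAgeMa, int steps) {
    assert(steps >= 2 && minAgeMa > 0.0 && maxAgeMa > minAgeMa);
    std::vector<AgeLikelihood> curve;
    curve.reserve(steps);
    double total = 0.0;
    for (int i = 0; i < steps; ++i) {
        const double age =
            minAgeMa + i * (maxAgeMa - minAgeMa) / (steps - 1);
        const double like = probNLineages(nExtant, age, lambda, mu);
        curve.push_back(AgeLikelihood{age, like});
        total += like;
    }
    if (total > 0.0) {
        for (AgeLikelihood& p : curve) p.likelihood /= total;
    }
    return curve;
}
```

A curve produced by `likelihoodCurve` could then be passed to a distribution-fitting step to obtain a parametric prior, with the oldest fossil occurrence supplying `minAgeMa`. With no extinction the formula reduces to the pure-birth (Yule) case, which is a convenient sanity check.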
Midterm evaluation completed.
- Refine user interface to allow more flexibility in data download
- Start coding the more complex analysis options
- Provide the ability for the user to delete certain lineages or occurrences
- This will be a simple command line query to allow erroneous entries in the PBDB to be ignored by the user
- Decide the best estimate of preservation probability to use
- Write the code to implement this calculation and use this to establish a prior on the number of lineages in the first bin of the clade of interest
- Integrate the preservation probability into the equations calculating the likelihood of a missing fossil record for a given clade
- Write the code to enable the user to manually edit the dataset from the Paleobiology Database.
- Finish the code to allow users to manually edit PBDB datasets.
- Putting everything together and improving the user interface.
- Running several data sets to test the limits of the program.
- Perform analysis under complex models.
- Test application by naive users.
- Upload code for final evaluation.
- Deadline for final program evaluation.
The following references are relevant to the goals of this project:
Drummond, A. J., S. Y. W. Ho, M. J. Phillips, A. Rambaut. 2006. Relaxed phylogenetics and dating with confidence. PLoS Biology 4:0699-0710.
Foote, M., J. P. Hunter, C. M. Janis, J. J. Sepkoski. 1999. Evolutionary and Preservational Constraints on Origins of Biologic Groups: Divergence Times of Eutherian Mammals. Science 283:1310-1314.
Graur, D. and W. Martin. 2004. Reading the entrails of chickens: molecular timescales of evolution and the illusion of precision. Trends in Genetics 20(5):242-247.
Tavaré, S., C. R. Marshall, O. Will, C. Soligo, R. Martin. 2002. Using the fossil record to estimate the age of the last common ancestor of extant primates. Nature 416:726-729.
Yang, Z. and B. Rannala. 2006. Bayesian estimation of species divergence times under a molecular clock using fossil calibrations with soft bounds. Molecular Biology and Evolution 23: 212-226.
Zuckerkandl, E. and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. In: Bryson V., Vogel H.J. (eds), Evolving genes and proteins. Academic Press, New York, pp. 97-166.