PhyloSoC: Phylogenetic and Haplotype Displays for GBrowse
As genomic data from an increasing number of species become known and annotated, the demand to efficiently manage and display the data will increase. The Genome Browser is an intuitive way to display information for one organism by stacking different tracks of information along the sequence in a genome. Through the Google Summer of Code (GSoC) program, it is proposed that genetic haplotypes, multiple sequence alignments and phylogenetic trees be added as new information tracks to the browser. The aim is for genomic data from a number of tracks and a number of species to be woven together efficiently and intuitively, giving greater insight into the phylogeny of different organisms.
Main Objectives / Project Plan
- Understand the existing and related code. (Weeks 1-2)
- Split the single large image produced by the program into a number of tracks (e.g. Genes, SNPs, GC content, etc.), where each track is loaded only when requested. Lincoln D. Stein has already started on the implementation and will work together with the student. (Weeks 2-5)
- Work on a track for multiple alignments. It should tackle duplicates, inversions, gaps and deletions. Webb Miller’s Multiz program as well as the MAF (multiple alignment format) will be investigated. (Weeks 6-9)
- Create a track to view a phylogenetic tree in line with the genome browser. It would be ideal to paint the tree based on the organisms’ multiple sequence alignments. The ATV format is seen frequently, so it will be looked at as a potential format to be used. (Weeks 11-12)
Individuals and Organizations involved
Mentoring Organization: National Evolutionary Synthesis Center (NESCent), Phyloinformatics Group
Student: Hisanaga Mark Okada
Mentor: Lincoln D. Stein
All code for the Genome Browser as well as changes made through this project will reside on the Generic Model Organism Database Toolkit (GMOD) site on Sourceforge.
- setting up the Genome Browser environment so that the stable version (v 1.68) runs
- playing around with the unstable head version
- trying to get to know how the data is processed and stored for this system
- getting my feet wet with the existing code base
- trying out the tutorials to get myself comfortable with the file system
- frantically trying to get the latest version of gbrowse to run. I had many problems with this (DBI, DBD::mysql, mysql) and spent a number of days trying to get around them. It more or less works now, and I would feel relieved if I weren't so behind schedule.
- familiarising myself with the genome browser
- working on debugging gbrowse_run, a currently unstable version of gbrowse
Although my progress isn't exactly tangible, I'm getting better at understanding the code, and in our last conversation Lincoln and I worked on locating the possible bug in gbrowse_run. Until the newest version is stable I cannot make changes to the code.
Last week Lincoln and I were troubleshooting the bugs that exist in GBrowse and I'm happy to report that it is finally running, or at least the bulk of it is.
- doing some minor fixes to stabilise the gbrowse_run script:
  - making sure the data glyphs are drawn on separate levels (bumping)
  - making sure the data files are read correctly (some minor data label differences between the older and newer version)
- trying to make GBrowse render the Overview track. I'm following the old code from the last stable version but it's a bit tricky as some things aren't structured the same way
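The "bumping" mentioned in the list above, where overlapping glyphs are pushed onto separate levels, can be illustrated with a simple greedy layout. This is only a minimal Python sketch of the general idea, not the code GBrowse or Bio::Graphics actually uses; the function name `assign_rows` and the plain `(start, end)` tuples are hypothetical.

```python
def assign_rows(features):
    """Greedily place (start, end) intervals on the lowest row where they fit.

    Returns a list of row indices, one per input feature, in input order.
    """
    row_ends = []                      # rightmost end coordinate drawn on each row
    rows = [None] * len(features)
    # Visit features left to right so each row only needs to remember its last end.
    order = sorted(range(len(features)), key=lambda i: features[i][0])
    for i in order:
        start, end = features[i]
        for row, last_end in enumerate(row_ends):
            if start > last_end:       # no overlap with the last feature on this row
                row_ends[row] = end
                rows[i] = row
                break
        else:
            row_ends.append(end)       # every existing row is occupied: open a new one
            rows[i] = len(row_ends) - 1
    return rows
```

For example, `assign_rows([(1, 10), (5, 15), (12, 20), (16, 30)])` places the first and third features on row 0 and the second and fourth on row 1.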
Last week I was working on the 'overview' track of the Genome Browser, which displays a zoomed-out view of the current window. It is not yet implemented, but a lot (if not all) of the code from the previous and current versions can be reused. Lincoln and I discussed the call structure to be used and I will continue with that this week. I have also been looking into / working on the 'remote rendering', in which the Genome Browser delegates different tracks of data to be rendered on multiple servers using the LWP::Parallel library.
I started planning how to set up the gbrowse_render script, which will create the gbrowse images remotely. I've started implementing that around Lincoln's skeleton code. I will continue with that this week.
I was working on the GBrowse remote rendering, which allows image rendering/processing on different servers. It works now and hopefully it's usable enough. It seems a little too simple.
- worked on the Overview bar for GBrowse. It works, but the code is ugly
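The remote rendering idea described above, handing each track to a server and collecting the results concurrently, can be sketched as follows. This is a Python analogue using the standard library's thread pool, not the LWP::Parallel code used in GBrowse. The server URLs, the `track` query parameter, and the function names are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_track(server_url, track):
    """Ask one (hypothetical) render server for one track image."""
    query = urlencode({"track": track})        # assumed parameter name
    with urlopen(f"{server_url}?{query}", timeout=30) as response:
        return response.read()


def render_remotely(assignments, fetch=fetch_track, max_workers=4):
    """Request every track from its assigned server concurrently.

    assignments maps track name -> server URL. Returns track name -> result,
    where a failed request is stored as the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            track: pool.submit(fetch, url, track)
            for track, url in assignments.items()
        }
        for track, future in futures.items():
            try:
                results[track] = future.result()
            except Exception as error:         # one bad server should not fail all tracks
                results[track] = error
    return results
```

Passing the `fetch` function in as a parameter keeps the fan-out logic separate from the network call, so it can be exercised with a stand-in function before any render server exists.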
- went through the tutorials on Graphics and Glyphs on the bioperl.org website
Last week I was working on the visualisation track that is responsible for the DNA alignment / cladogram of different species. Combining the different data types, using the language of GFF3 and making use of the huge code repository has proven to be much more complicated than anticipated. I hope I can get it working in time for the meeting and the newsletter.
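For readers unfamiliar with GFF3, each feature line has nine tab-separated columns, the last of which holds `key=value` attributes separated by semicolons. The following Python sketch shows one way to read such a line. It is not the parser used in this project (GBrowse relies on BioPerl), and the function name `parse_gff3_line` is hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # "." marks an undefined score in GFF3.
    record["score"] = None if record["score"] == "." else float(record["score"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # Reserved characters are percent-encoded in GFF3 attributes.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record
```

Alignment features carry a `Target` attribute naming the aligned sequence and its coordinates, which a track like this one would read from the returned `attributes` dictionary.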
Last week I was
- Working on the cladogram / DNA-alignment viewing track. It displays a cladogram on one side of the viewing window and draws either the DNA alignments or a simple histogram representing the conservation score, depending on the zoom level. The alignment data it reads is in the GFF3 format as described on the Sequence Ontology webpage. Although the core of the track is made, there are a few kinks to fix.
- The conservation scale bar graph still looks a bit empty. This is probably because my data set is extremely small and I cannot find real-world data that uses GFF3 alignment data. It is currently a log scale to accommodate the extremely small scores.
- I am also working on my presentation for this week's get together in NC.
- Lincoln has listed a few alternatives that I should look into to render the conservation scores. Namely they are his DenseFeature code and the wiggle data.
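The log scale mentioned above for the conservation bar graph can be illustrated with a short Python sketch. It is not the track's actual drawing code; the function name `log_bar_heights`, the 50-pixel maximum height, and the `1e-6` floor are assumptions chosen for the example.

```python
import math


def log_bar_heights(scores, max_height=50, floor=1e-6):
    """Map positive scores to bar heights in pixels on a log10 scale.

    Scores at or below `floor` get height 0; the largest score gets max_height.
    """
    if not scores:
        return []
    clipped = [max(score, floor) for score in scores]   # avoid log10 of zero
    low = math.log10(floor)
    high = math.log10(max(clipped))
    if high == low:                  # every score is at or below the floor
        return [0] * len(scores)
    return [
        round(max_height * (math.log10(value) - low) / (high - low))
        for value in clipped
    ]
```

With scores of `1e-6`, `1e-4` and `1e-2`, this gives heights of 0, 25 and 50 pixels. On a linear scale with the same maximum, the middle score would be only about half a pixel tall, which is why very small scores are easier to see on a log scale.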
GSoC / Genome Browser
- Google Summer of Code
- Phyloinformatics Summer of Code 2007
- Generic Model Organism Database Toolkit (GMOD) - Genome Browser
- Sourceforge - GMOD
Tutorials / Data Descriptions
- Bioperl Wikipages. Very useful information on Bio::Graphics, Bio::Graphics::Glyphs, and Bio::Tree
- Sequence Ontology webpage explanation of the GFF3 format
Other Genome Browsers
- UCSC Genome Browser for H. sapiens. Very cool pairwise and multi-species conservation tracks.
- The Nexplorer Family Plot. Basically what I'm trying to do when the browser is zoomed in close enough to see the bases.
Papers of Interest
- Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller and David Haussler (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 15, 1034–1050.