PhyloSoC:Phylogenetic and Haplotype Displays for GBrowse


Introduction

As genomic data from an increasing number of species become known and annotated, the demand to manage and display the data efficiently will increase. The Genome Browser is an intuitive way to display information for one organism by stacking different tracks of information along the sequence in a genome. Through the Google Summer of Code (GSoC) program, it is proposed that genetic haplotypes, multiple sequence alignments and phylogenetic trees be added as new information tracks to the browser. It is envisioned that genomic data across a number of tracks and a number of species can be efficiently and intuitively woven together, giving greater insight into the phylogeny of different organisms.

Main Objectives / Project Plan

  1. Understand the existing and related code. (Weeks 1-2)
  2. Split the single large image produced by the program into a number of tracks (e.g. Genes, SNPs, GC content, etc.), where each track is loaded only when requested. Lincoln D. Stein has already started on the implementation and will work together with the student. (Weeks 2-5)
  3. Work on a track for multiple alignments. It should handle duplications, inversions, gaps and deletions. Webb Miller’s Multiz program as well as the MAF (multiple alignment format) will be investigated. (Weeks 6-9)
  4. Create a track to view a phylogenetic tree in line with the genome browser. It would be ideal to paint the tree based on the organisms’ multiple sequence alignments. The ATV format is seen frequently, so it will be looked at as a potential format to use. (Weeks 11-12)
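
As a rough illustration of the MAF format mentioned in objective 3, here is a minimal Python sketch that groups the "s" (sequence) lines of a MAF file into alignment blocks. It is not part of GBrowse (which is written in Perl). The names MafSequence, parse_maf and SAMPLE are made up for this example, and the sample data is invented. It assumes the usual MAF layout, in which an "a" line starts a block and each "s" line carries source, start, size, strand, source size and aligned text.

```python
from dataclasses import dataclass


@dataclass
class MafSequence:
    src: str       # source name, e.g. "speciesA.chr1" (hypothetical)
    start: int     # zero-based start in the source sequence
    size: int      # number of non-gap bases in this row
    strand: str    # "+" or "-"
    src_size: int  # total length of the source sequence
    text: str      # aligned bases, with "-" marking gaps


def parse_maf(lines):
    """Return a list of blocks; each block is a list of MafSequence rows."""
    blocks = []
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if fields[0] == "a":
            # an "a" line opens a new alignment block
            current = []
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            current.append(
                MafSequence(src, int(start), int(size), strand, int(src_size), text)
            )
        # other line types ("i", "e", "q") are ignored in this sketch
    return blocks


# Invented toy data, for illustration only.
SAMPLE = """\
##maf version=1
a score=10.0
s speciesA.chr1 100 8 + 1000 ACGT-ACGT
s speciesB.chr2 200 7 - 2000 ACG--ACGT
"""

blocks = parse_maf(SAMPLE.splitlines())
for row in blocks[0]:
    print(row.src, row.strand, row.text)
```

A multiple alignment track would still need to deal with inversions (rows on the "-" strand) and gaps when mapping these rows onto reference coordinates; the sketch only reads the data.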

Individuals and Organizations involved

Mentoring Organization: National Evolutionary Synthesis Center (NESCent), Phyloinformatics Group

Student: Hisanaga Mark Okada

Mentor: Lincoln D. Stein

Code

All code for the Genome Browser, as well as changes made through this project, will reside on the Generic Model Organism Database Toolkit (GMOD) site on SourceForge.

Progress

May 29

  • setting up the Genome Browser environment so that the stable version (v 1.68) runs
  • playing around with the unstable head version
  • trying to get to know how the data is processed and stored for this system

June 5

  • getting my feet wet with the existing code base,
  • trying out the tutorials to get myself comfortable with the file system,
  • and frantically trying to get the latest version of gbrowse to run. I had many problems with this (DBI, DBD::mysql, mysql) and spent a number of days trying to get around them. It more or less works now, and I would feel relieved if I weren't so behind schedule.

June 12

  • familiarising myself with the genome browser
  • got the HEAD CVS version of gbrowse_drag working, a really really cool version of gbrowse, put together by Lincoln, that uses drag and drop for the data tracks. I've been looking through the JavaScript and Scriptaculous, and this is definitely something to look forward to
  • working on debugging gbrowse_run, a currently unstable version of gbrowse

June 19

Although my progress isn't exactly tangible, I'm getting a better understanding of the code and, in our last conversation, Lincoln and I worked on narrowing down where the possible bug in the code (gbrowse_run) is. Until the newest version is stable I cannot make changes to the code.

June 26

Last week Lincoln and I were troubleshooting the bugs that exist in GBrowse, and I'm happy to report that it is finally running, or at least the bulk of it is.

July 4

  • doing some minor fixes to stabilise the gbrowse_run script:
  1. making sure the data glyphs are drawn on separate levels (bumping)
  2. making sure the data files are read correctly (some minor data label differences between the older and newer version)
  3. trying to make GBrowse render the Overview track. I'm following the old code from the last stable version but it's a bit tricky as some things aren't structured the same way
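
For readers unfamiliar with the term, "bumping" means moving overlapping glyphs onto separate levels so that they do not draw on top of each other. The following Python sketch shows one simple greedy way to do this. It is only an illustration of the idea, not the algorithm used by GBrowse or Bio::Graphics, and the function name bump and its padding parameter are made up for this example.

```python
def bump(features, padding=0):
    """Assign each (start, end) feature to the lowest row where it fits.

    Features are processed in order of start position. A feature fits in a
    row if it starts after the end of the last feature placed in that row
    (plus optional padding).
    """
    row_ends = []  # row_ends[i] = end of the last feature placed in row i
    layout = []
    for start, end in sorted(features):
        for row, last_end in enumerate(row_ends):
            if start > last_end + padding:
                row_ends[row] = end  # reuse this row
                break
        else:
            # no existing row has room, so open a new one
            row = len(row_ends)
            row_ends.append(end)
        layout.append((start, end, row))
    return layout


layout = bump([(1, 10), (5, 15), (12, 20)])
for start, end, row in layout:
    print(f"feature {start}-{end} drawn on level {row}")
```

Here the second feature overlaps the first and is pushed to level 1, while the third starts after the first has ended and can go back on level 0.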

July 10

Last week I was working on the 'overview' track of the Genome Browser, which displays a zoomed-out view of the current window. It is not yet implemented, but a lot (if not all) of the code from the previous and current versions can be reused. Lincoln and I discussed the call structure to be used and I will continue with that this week. I have also been looking into / working on the 'remote rendering', in which the Genome Browser delegates different tracks of data to be rendered on multiple servers using the LWP::Parallel library.
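
The idea behind remote rendering can be sketched as follows. GBrowse itself does this in Perl with LWP::Parallel; the Python below is only an analogy using a thread pool. The server names are hypothetical, and render_remotely is a stand-in that returns a string instead of making a real HTTP request, so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def render_remotely(server, track):
    """Stand-in for an HTTP request to a render server (hypothetical)."""
    return f"{track} image from {server}"


def render_tracks(assignments, fetch=render_remotely):
    """Render every track in parallel; return results keyed by track name.

    `assignments` maps a track name to the server that should render it.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        # submit all requests first so they run concurrently
        futures = {
            track: pool.submit(fetch, server, track)
            for track, server in assignments.items()
        }
        # then collect each result, waiting for it if necessary
        return {track: future.result() for track, future in futures.items()}


images = render_tracks({
    "Genes": "render1.example.org",
    "SNPs": "render2.example.org",
})
for track, image in images.items():
    print(track, "->", image)
```

The point of the design is that the browser waits roughly as long as the slowest track rather than the sum of all tracks.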

July 17

I started planning how to set up the gbrowse_render script, which is to create the gbrowse images remotely. I've started implementing that around Lincoln's skeleton code and will continue with it this week.

July 24

I was working on the GBrowse remote rendering, which allows image rendering/processing on different servers. It works now and is hopefully usable enough, although it seems a little too simple.

July 30

  • worked on the Overview bar for GBrowse; it works, but the code is ugly
  • went through the tutorials on Graphics and Glyphs on the bioperl.org website

August 7

Last week I was working on the visualisation track that is responsible for the DNA alignment / cladogram of different species. Combining the different data types, using the language of GFF3 and making use of the huge code repository has proven to be much more complicated than anticipated. I hope I can get it working soon, in time for the meeting and the newsletter.

August 14

Last week I was

  • Working on the cladogram / DNA-alignment viewing track. It displays a cladogram on one side of the viewing window and, depending on the zoom level, draws either the DNA alignments or a simple histogram representing the conservation score. The alignment data it reads is in the GFF3 format as described on the Sequence Ontology webpage. Although the core of the track is made, there are a few kinks to fix.
  • The conservation scale bar graph still looks a bit empty. This is probably because my data set is extremely small and I cannot find real-world data that uses GFF3 alignment data. It currently uses a log scale to accommodate the extremely small scores.
  • I am also working on my presentation for this week's get together in NC.
  • Lincoln has listed a few alternatives that I should look into for rendering the conservation scores, namely his DenseFeature code and the wiggle data.
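
The two display decisions described above (alignment versus histogram depending on zoom, and a log scale for very small scores) can be sketched in Python as follows. This is an illustration only, not the track's actual code. The function names, the one-base-per-pixel threshold, the 50-pixel bar height and the score range are all assumptions made for the example.

```python
import math


def choose_view(bases_per_pixel, threshold=1.0):
    """Show the alignment when zoomed in far enough, else the histogram."""
    return "alignment" if bases_per_pixel <= threshold else "histogram"


def log_bar_height(score, min_score, max_score, max_height=50):
    """Map a positive score onto a bar height in pixels using a log scale.

    min_score and max_score must both be positive; scores outside that
    range are clamped to it, and non-positive scores get a height of zero.
    """
    if score <= 0:
        return 0
    score = min(max(score, min_score), max_score)
    span = math.log10(max_score) - math.log10(min_score)
    if span == 0:
        return max_height
    fraction = (math.log10(score) - math.log10(min_score)) / span
    return round(fraction * max_height)


print(choose_view(0.5), choose_view(200))
for s in (1e-6, 1e-3, 1.0):
    print(s, log_bar_height(s, min_score=1e-6, max_score=1.0))
```

On a linear scale a score of 0.001 would be invisible next to a score of 1.0; on the log scale it reaches half the bar height.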

Links

GSoC / Genome Browser

Tutorials / Data Descriptions

Other Genome Browsers

Papers of Interest