PhyloSoC: Manipulating NGS data for population genetic analysis




[1] Reading/installing/practicing MongoDB for sequence database

[2] Reading/installing/practicing git for committing code

[3] Set-up account on GitHub for sharing code

[4] Decide between Python and Java as the language for proposed GUI and text formatting

[5] Join mailing lists, introduction to GSoC community

[6] Determine name for program

[7] Update Project Description and Timeline with input from Jeremy and Brad

[8] Install BioPython onto work computer and laptop

[9] Background research on Galaxy

[10] Consider a blog for progress of project over the summer

ACCOMPLISHED (as of May 23):

[1] Moderate grasp of MongoDB; skill will likely improve considerably once the coding period begins

[2] Rudimentary grasp of git; not sure how much more I need to know. Able to commit code to GitHub from both my machines

[3] Obtained account, established repository for GSoC project

[4] Leaning toward Python, as I’d like to learn it and it seems as functional as Java

[5] Joined the Google student list and the NESCent list, and sent an introductory email

[6] Using "lociNGS" currently, reserve the right to think of something better.

[7] Updated and perpetually in flux.

[8] Included installing the latest Python (2.7.1), ActiveTcl-8.5, gfortran, NumPy and SciPy, in addition to BioPython. Up and running on both machines.

[9] Progress will continue through Utility Coding period (see below)

[10] Started a blog at

CODING PERIOD: May 23 - July 17 (8 weeks): Utility Coding

Week1(May 24) –

[1] input loci (FASTA format) into MongoDB database; use Biopython (Bio.Seq, Bio.Alphabet?, Bio.SeqRecord, Bio.SeqIO)

ACCOMPLISHED (as of May 26):

[1] Input locus files (metadata only) to MongoDB. Current database structure:

     Database = test_database
           Collection = loci
                 Documents = Locus1, Locus2, Locus3, Locus4, Locus5
                       Attributes = _id, length, name, path, SNPs, individuals[array]

This should allow quick retrieval of any of these attributes, either by querying on another attribute or by fetching the whole document. Because path is one of the attributes, the primary FASTA files remain accessible, so sequences can be read from disk rather than stored in the database (next on the to-do list: confirm this works and code it).
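As a minimal sketch of the metadata-only approach, the following builds one locus document from a FASTA file. It assumes each FASTA header is an individual's name and that the sequences are aligned to equal length. `parse_fasta` is a stdlib stand-in for `Bio.SeqIO.parse` so the example runs without dependencies; `locus_document` and its SNP definition (an alignment column with more than one of A/C/G/T) are illustrative assumptions, not the project's actual code.

```python
import os


def parse_fasta(path):
    """Yield (header, sequence) pairs; a stdlib stand-in for Bio.SeqIO.parse."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                # Keep only the first word of the header as the identifier.
                header, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def locus_document(path):
    """Build the metadata document for one locus; sequences stay on disk."""
    records = list(parse_fasta(path))
    seqs = [seq.upper() for _, seq in records]
    # Assumption: a SNP is an alignment column with more than one of A/C/G/T.
    snps = sum(1 for column in zip(*seqs) if len(set(column) & set("ACGT")) > 1)
    return {
        "name": os.path.splitext(os.path.basename(path))[0],
        "length": max((len(s) for s in seqs), default=0),
        "path": path,
        "SNPs": snps,
        "individuals": [header for header, _ in records],
    }
```

The returned dictionary has the same attributes as the structure listed above (MongoDB adds `_id` on insert). With a current PyMongo it could presumably be stored with `collection.insert_one(doc)`.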

Week2(May 31) –

[1] Need to retrieve sequences from FASTA files referenced in MongoDB (use Bio.SeqIO)

[2] Input SAM formatted data to MongoDB database. New attribute: number of reads per locus within an individual. This will require giving each individual in the individuals[array] contained within each locus document its own set of attributes. Will also need to be able to access the sequences from the SAM files; find/write SAM file parser

ACCOMPLISHED (as of June 3):

[1] Added a category to the database for the path to the locus FASTA files. Currently able to access those files and read them as SeqRecord objects using SeqIO in Biopython (as necessary). Will continue to work on retrieving sequences from SAM/BAM files for output of all reads used to call a locus.

[2] Finished a script that counts all the reads that align to each locus in the reference and updates the "" category in the database with the count. Database structure (for one locus) is currently:

{ "_id" : ObjectId("4de65e73318a122d51000001"),
  "locusNumber" : "2",
  "SNPs" : 28,
  "locusFasta" : "RAIL_2.fasta",
  "length" : 287,
  "individuals" : { "R01" : 4,
                    "R02" : 12,
                    "R03" : 3 },
  "path" : "fasta/RAIL_2.fasta" }

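The read-counting step could look roughly like the sketch below. It assumes one SAM file per individual and that each reference name in the SAM file is a locus name; the function names are hypothetical, and the update document is only built here, not sent to a database.

```python
from collections import Counter


def count_reads_per_locus(sam_lines):
    """Count mapped alignment records per reference (locus) name."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # a SAM alignment record has 11 mandatory fields
        flag, locus = int(fields[1]), fields[2]
        if flag & 0x4 or locus == "*":
            continue  # FLAG bit 0x4 marks an unmapped read
        # Note: secondary/supplementary alignments are not filtered out here.
        counts[locus] += 1
    return counts


def read_count_update(individual, count):
    """MongoDB-style update document setting one individual's read count."""
    return {"$set": {"individuals.%s" % individual: count}}
```

For example, `read_count_update("R01", 4)` yields `{"$set": {"individuals.R01": 4}}`, which matches the nested "individuals" layout shown above.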
Week3(June 7) –

[1] Input demographic (and quality) data to MongoDB database; create example files; find/write .qual file parser (focus on 454 data to start with, add other NGS platform quality files with available time)

[2] Retrieving sequences from a SAM/BAM file on demand

ACCOMPLISHED (as of June 13):

[1] Demographic data successfully loaded into database using fake example files

[2] Saved path to SAM/BAM files in database, accessible via script.

[3] Began next week's goal of outputting summary tables containing information about individuals

Week4(June 14) –

[1] Output summary table (text) for all individuals; export from MongoDB number of loci, associated demographic data and platform

[2] Begin output locus table for an individual; export from MongoDB length of each locus, number of reads used to call the locus, number of individuals from dataset that also have that locus.

[3] Script that will find loci that contain at least one individual from a given subset of populations - important!!

JUNE 17 – 21: Attend Evolution 2011, Norman, OK

Week5(June 21) –

[1] Output locus table for an individual; export from MongoDB length of each locus, number of reads used to call the locus, number of individuals from dataset that also have that locus.
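A locus table of this kind could be assembled as in the sketch below. It operates on plain dictionaries shaped like the locus documents in the database ("locusNumber", "length", and an "individuals" mapping of individual name to read count). The function names are hypothetical, and the last column counts every individual with the locus, including the one being queried.

```python
def locus_table(loci, individual):
    """Rows of (locus, length, reads for this individual, individuals with the locus)."""
    rows = []
    for doc in loci:
        reads = doc.get("individuals", {}).get(individual)
        if reads is None:
            continue  # this individual has no reads at this locus
        rows.append((doc["locusNumber"], doc["length"], reads, len(doc["individuals"])))
    return rows


def format_table(rows):
    """Render the rows as tab-delimited text with a header line."""
    header = "locus\tlength\treads\tindividuals"
    return "\n".join([header] + ["\t".join(str(v) for v in row) for row in rows])
```

In practice `loci` would presumably be the documents returned by a query on the loci collection rather than a hand-built list.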

ACCOMPLISHED (as of June 26):

Week4[1] Able to output demographic data of all individuals to screen

Week4[2]/Week5[1] Able to output locus information to screen

Week4[3] Began script for finding loci that contain a subset of populations (instead of a subset of individuals). Decided to add a "populations" collection to db.loci, in order to query against populations directly instead of cross-checking every individual on each population query.
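The idea of storing populations alongside each locus can be sketched as follows. This is an illustration under stated assumptions: `population_of` is a hypothetical mapping from individual name to population name, "populations" is stored as a list on each locus document, and a locus qualifies only if it has at least one individual from every requested population.

```python
def add_populations(locus, population_of):
    """Return a copy of a locus document with a sorted 'populations' list added."""
    pops = {population_of[ind] for ind in locus["individuals"] if ind in population_of}
    updated = dict(locus)
    updated["populations"] = sorted(pops)
    return updated


def loci_with_populations(loci, wanted):
    """Keep loci that have at least one individual from every population in `wanted`."""
    wanted = set(wanted)
    return [doc for doc in loci if wanted <= set(doc.get("populations", []))]
```

With the list stored in MongoDB, the same filter would likely be a single query using the `$all` operator, e.g. `{"populations": {"$all": ["north", "south"]}}` (population names invented for the example). If "any of the populations" is what is wanted instead, `$in` would be the counterpart.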

Week6(June 28) –

[1] Finish adding "populations" collection to db.loci.

[2] Perform data conversions between output formats (Python scripts); output formats = IMa2, migrate (in that order); reuse previously written Perl scripts? [Conversion to NEXUS format is already available with Biopython.]

ACCOMPLISHED (as of July 4):

[1] Added "populations" collection to db.loci.

[2] Finished IMa2 format conversion script.

Week7(July 5) –

[1] Migrate format for loci.

Week8(July 12) –

[1] Output sequence reads for a given individual/locus pair; pull these from the MongoDB database, which will hopefully be easy to do
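Since the database stores the path to each individual's SAM/BAM file, one way to do this is to look up that path and then filter the file by locus. The sketch below covers only the filtering step for text SAM input. It assumes reference names in the SAM file are locus names; the function names are hypothetical.

```python
def reads_for_locus(sam_lines, locus):
    """Yield (read name, sequence) for mapped reads aligned to `locus`."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # a SAM alignment record has 11 mandatory fields
        name, flag, ref, seq = fields[0], int(fields[1]), fields[2], fields[9]
        if flag & 0x4 or ref != locus or seq == "*":
            continue  # unmapped, other locus, or sequence not stored
        # Sequences are emitted as stored in the SAM file; reverse-strand
        # reads (FLAG bit 0x10) are not converted back to read orientation.
        yield name, seq


def to_fasta(reads):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in reads)
```

Calling `to_fasta(reads_for_locus(open(sam_path), "RAIL_2"))` would then give the reads for one individual/locus pair, where `sam_path` stands for the path retrieved from the database.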

CODING PERIOD: July 18 - August 15 (4 weeks): GUI Functionality

Week9(July 18) –

[1] Wrapping utilities with Galaxy

Week10(July 25) –

[1] Wrapping utilities with Galaxy

Week11(August 1) –

[1] Compiling utilities into stand-alone GUI

Week12(August 8) –

[1] Compiling utilities into stand-alone GUI