Difference between revisions of "PhyloSoC:phyloGeoRef"

m (Refrences:)
(Improving visualization)
 
(7 intermediate revisions by the same user not shown)
Line 9: Line 9:
 
==Code==
 
 
http://github.com/kdiverson/phyloGeoRef
 
 +
 +
==Ideas for GSoC 2011==
 +
===Improvements for GSoC 2011===
 +
 +
3 main goals -- probably won't be able to address them all in the coding period
 +
*Support NeXML
 +
*Improve visualization
 +
*Support database integration
 +
 +
The first two are the most important. I think adding NeXML support and improving visualization are both doable during the coding period. Supporting NeXML is fairly straightforward in terms of what needs to be accomplished: the library should be able to read in a NeXML file, parse it, and then use functions already built into the library to construct the KML file.
 +
 +
Visualization is more open ended. One concrete goal is to have the branch lengths of the trees accurately represent evolutionary distance (see the implementation suggestion below). Once this is accomplished, it would be nice to color subtrees based on user-provided data, for example to highlight all members of the Proteobacteria. The user would provide a list of Proteobacteria as a CSV file, and phyloGeoRef would color the branches leading to the listed bacteria. The same could be done for any user-provided data in a CSV file. As another example, given a list of genes and the species that carry them, phyloGeoRef could highlight those species on the tree. These are just a few of the many possibilities for visualization. I am open to any other visualization ideas students have or think would be useful.
 +
 +
===Supporting NeXML===
 +
NeXML is becoming a standard format for representing phylogenetic information, so phyloGeoRef should support it. The key things to address are parsing the NeXML data and representing it on a phylogenetic tree and in Google Earth. Parsing the data will be the most important part: phyloGeoRef already has the infrastructure to hold this information, so the extraction is what needs work.
 +
 +
===Improving visualization===
 +
This is a very important part of the project, as the end product is a visualization of the phylogenetic tree. There are a few ways of improving the current output. A good implementation is suggested here: http://iphylo.blogspot.com/2007/06/earth-not-flat-official.html
 +
 +
It would also be nice to display information about the tree such as bootstrap values and confidence. Right now there is a place for "metadata" but this is quite vague and it would be nice to have specific information available. For example, if a user clicks on a node a bubble should pop up with information about that species and the sample site from which it was collected. Other information can be displayed as well. This could come from a CSV file associated with the data or included in the NeXML file.
 +
 +
===Supporting database integration===
 +
This is of least importance right now, but time permitting it would be nice to add functions that could extract data from phylogenetic trees stored in a database such as TreeBase. TreeBase supports NeXML, so this should work well once NeXML support is implemented.
  
 
==Project Plan==
 
Line 15: Line 38:
 
Deliverables: pseudocode for the project -- what information needs to be extracted and how to store it; set up GitHub account and integrate with Eclipse or other IDE; join mailing lists; study KML files; list of relevant Java libraries
 
  
===Developing parsers===
+
===Approach and organization: ===
General idea - parsers will extract data from input files and store it as outlined below. Supported file types: newick, nexus, neXML, phylip, phyloXML and more as time/demand permits.
+
Start small and expand as needed. In this case I'm starting with the basic ability to take a tree file and a delimited file.
 +
Take a select few types of files to completion, then add more
 +
 
 +
A lot of the functions I need for this project have already been written and with a little modification they can be easily implemented in my library. This will allow me to be a little more creative and expand the scope of the library and what the library can do, beyond the original intention.
  
=== Week 1: ===
+
===Week 1:===
Tasks: Combine, extend and write parsers for tree files, geographical data files and metadata files such as csv and tsv. Parsers from the java evolutionary biology library, degree and gvSIG open source geospatial libraries in addition to others will be used. These libraries already implement parsers for phylogenetic trees and geographical data. The main task would be to make sure these parsers work with the rest of the library and extend them as necessary.
+
Define tree parser
Deliverables: tree-file parsers
 
  
=== Week 2: ===
+
Define csv parser
Tasks: Implement the tree class. Generic tree classes and phylogenetic tree classes for Java already exist so I would extend these classes to accommodate adding geographical information about the nodes so the branches can be mapped in 3D space and other tagged elements can be placed on the branches. Possible class outline:
 
  
Tree object:
+
Define delimited file parser (TSV and user-defined delimiters)
    node x:
 
        latitude, longitude, altitude
 
        user defined element
 
        ...
 
        children
 
  
Alternatively, it may be better to only store parent child relationships in the tree class and add the other information later -- moving the data from a tree class to the geoPhylo class outlined below.
+
Start storeTree class
Deliverables: Java tree class
 
  
===Data storage===
+
===Week 2:===
General idea- the storage system requests certain information from the parsers such as tree structure, location information and metadata and is flexible enough to pass certain requested information for output, but not necessarily all of it. The bulk of this is done with the tree class outlined above.
+
Finish storeTree class:
 +
Phylogeny array parser
 +
Define function to add latitude, longitude and altitude data to each node. This will probably change a little once the actual algorithm is implemented.
 +
Define functions to parse shapefiles using GeoTools library
 +
Define function to add metadata to the tree -> should be an extension of, or similar to, the csv parser already implemented.
  
=== Week 3: ===
+
===Week 3:===
Tasks: Code data storage structure. Data will be stored in a geoPhylo class whose instances have attributes such as parent child relationships and latitude and longitude coordinates. Additionally, each node or leaf could have information stored within it. This information would be extracted from the tree class outlined above. Simple class outline:
+
Implement the algorithm from Janies et al.: nodeAltitude = a + ((n-1)*b), where a is the altitude of a node with only terminal leaves (based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node, and n is the number of nodes from the current node to the leaf.
  
geoPhylo class:
+
===Week 4:===
root
+
Implement latitude and longitude algorithm. Leaves will get their lat/long data from the coord file.
node instance
 
branch instance
 
  
node class:
+
===Week 5:===
    parent
+
Define KML writer using the JAK library. Most of the KML writer functions exist in this library, so my job will be to make sure the data is populated correctly. A major part of this will be telling KML how to draw the trees, i.e. where to put branches (line or pathway) and nodes (points).
    children
 
    latitude, longitude, altitude
 
    user defined element
 
    visual information -- color, size, etc. These will have default values unless the user specifies otherwise.
 
  
branch class:
+
===Week 6:===
    beginning latitude, longitude, altitude -> could be in the form of a KML pathway element
+
Work on KML writer -- get a tree in KML
    end latitude, longitude, altitude
 
  
Deliverable: java geoPhylo class
+
===Week 7: July 4th-10th===
 +
*Improve KML writer and the appearance of the tree.
 +
*Write function for trees with multiple leaves of the same species.
 +
*Outline algorithms for large trees and trees with globally distributed leaves
  
=== Week 4: ===
+
===Week 8: July 11-17th===
Tasks: develop interface for structure. Once the class structure is down, this is simply extracting that data. The key is to present it to the KML writer in an efficient and useful way. May not take an entire week.
+
*Implement large tree and globally distributed tree algorithms
Deliverable: functions that are called by the parsers that deliver the data to the storage structure.
 
  
Output
+
===Week 9: July 18-24th===
General idea - once data is stored combine it in a way the user requests i.e. what they want to see on the tree and display it.
+
*Add support for other file types such as neXML and possibly an SQL interface.
  
=== Weeks 5 and 6: ===
+
===Week 10: July 25th-31st===
Tasks: Write algorithm to determine geospatial information. Calculating geospatial information is no trivial task. This information will vary based on the type of output the user requests, such as KML or shapefile. The algorithm needs to calculate how the tree should look in 3D space, keeping visual clarity in mind. This requires knowing how many nodes, branches and leaves are on the tree and where leaves are anchored to the map. The simplest way would be a cone shape, with the base made up of the leaf coordinates and then equally spaced nodes until the root is reached at the peak. This will not work for really large trees, as the map would be far too cluttered. Turning some of the branches on or off, or being able to collapse them, would make the map easier to understand. The basic algorithm from Janies et al. gives the altitude for any given node as a + ((n-1)*b), where a is the altitude of a node with only terminal leaves (this can be an arbitrary number or based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node, and n is the number of nodes from the current node to the leaf. I will also look at the algorithms implemented in GeoPhyloBuilder for node positioning. These include the envelope centroid method, the mean position method, and, for bifurcating trees, the DAVA centroid and MCP centroid methods. I will also try to extend the 2D spatial functions in the JTS Topology Suite. 3D mapping is fairly new to me, so it may take me some time to come up with an optimal solution. I will rely heavily on algorithms already implemented in other geospatial applications.
+
*Continue neXML and SQL support
Deliverable: geospatial algorithm
 
  
=== Week 7: ===
+
===Week 11: August 1st-9th===
Tasks: Implement the tree searching algorithm. This will be an extension of the existing tree walking functions. The main feature that needs to be added is a way to collect the geospatial, user defined and style elements contained in each node.
+
*Testing and bugfixing
Deliverable: tree searching function
 
  
=== Week 8: ===
+
===Week 12: August 10th-16th===
Tasks: Code KML writer and shapefile writer. This will probably be an extension of the Writer class specifically designed for KML and shapefile output. For KML files, styles will need to be defined here. These could be user defined or have a default value (both would be the best option). The open source GeoTools package has a shapefile writer that can be extended to fit the needs of this library.
+
*Documentation
Deliverables: KML and shapefile writer
 
  
 
==References:==
 

Latest revision as of 13:25, 31 March 2011

phyloGeoRef: a geo-referencing library implemented in Java

Author

Kathryn Iverson

Abstract

The goal of this project is to create a geo-referencing library for parsing and displaying geographical and phylogenetic information. The main aim of the project is to bring together geographical and phylogenetic information in a way that is usable and useful to the user. Given an input of a tree file and geographical information the library will return a tree with phylogenetic and geographical information in a format such as KML or shapefile or a georeferenced phylogenetic format such as neXML.

Code

http://github.com/kdiverson/phyloGeoRef

Ideas for GSoC 2011

Improvements for GSoC 2011

3 main goals -- probably won't be able to address them all in the coding period

  • Support NeXML
  • Improve visualization
  • Support database integration

The first two are the most important. I think adding NeXML support and improving visualization are both doable during the coding period. Supporting NeXML is fairly straightforward in terms of what needs to be accomplished: the library should be able to read in a NeXML file, parse it, and then use functions already built into the library to construct the KML file.

Visualization is more open ended. One concrete goal is to have the branch lengths of the trees accurately represent evolutionary distance (see the implementation suggestion below). Once this is accomplished, it would be nice to color subtrees based on user-provided data, for example to highlight all members of the Proteobacteria. The user would provide a list of Proteobacteria as a CSV file, and phyloGeoRef would color the branches leading to the listed bacteria. The same could be done for any user-provided data in a CSV file. As another example, given a list of genes and the species that carry them, phyloGeoRef could highlight those species on the tree. These are just a few of the many possibilities for visualization. I am open to any other visualization ideas students have or think would be useful.
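The CSV-driven highlighting idea can be sketched roughly as follows. This is only an illustration, not code from phyloGeoRef: the class name `SubtreeHighlighter`, both method names and the two colors are hypothetical, the list is assumed to hold one taxon name in the first column of each line, and only leaves are marked (propagating the color up to the branches of a subtree is left out).

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: decides which leaves to highlight from a user-supplied list.
final class SubtreeHighlighter {
    // KML writes colors as aabbggrr hex strings (alpha, blue, green, red).
    static final String DEFAULT_COLOR = "ffffffff";   // opaque white
    static final String HIGHLIGHT_COLOR = "ff0000ff"; // opaque red

    /** Collects the first column of every non-blank CSV line as a taxon name. */
    static Set<String> parseTaxonList(String csvText) {
        Set<String> names = new HashSet<>();
        for (String line : csvText.split("\\R")) {
            String first = line.split(",", -1)[0].trim();
            if (!first.isEmpty()) {
                names.add(first);
            }
        }
        return names;
    }

    /** Maps each leaf name to the highlight color if it is listed, else the default. */
    static Map<String, String> colorLeaves(List<String> leafNames, Set<String> highlighted) {
        Map<String, String> colors = new LinkedHashMap<>();
        for (String leaf : leafNames) {
            colors.put(leaf, highlighted.contains(leaf) ? HIGHLIGHT_COLOR : DEFAULT_COLOR);
        }
        return colors;
    }
}
```

The resulting map could then be consulted when styles are written for each leaf; matching here is by exact name, so real data would probably need some normalization first.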

Supporting NeXML

NeXML is becoming a standard format for representing phylogenetic information, so phyloGeoRef should support it. The key things to address are parsing the NeXML data and representing it on a phylogenetic tree and in Google Earth. Parsing the data will be the most important part: phyloGeoRef already has the infrastructure to hold this information, so the extraction is what needs work.
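As a rough idea of what the extraction step might look like, the sketch below pulls parent-child relationships out of NeXML text using only the JDK's DOM parser. It assumes that tree topology is stored in `edge` elements carrying `source` and `target` attributes, which is my reading of the NeXML schema; the class and method names are hypothetical and not part of phyloGeoRef.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader: extracts only the tree topology from a NeXML document.
final class NexmlEdgeReader {
    /** Returns child node id -> parent node id for every edge element found. */
    static Map<String, String> readParentOfChild(String nexmlText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // NeXML elements live in an XML namespace
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(nexmlText.getBytes(StandardCharsets.UTF_8)));
            // "*" matches the element whatever namespace prefix the file uses.
            NodeList edges = doc.getElementsByTagNameNS("*", "edge");
            Map<String, String> parentOf = new LinkedHashMap<>();
            for (int i = 0; i < edges.getLength(); i++) {
                Element edge = (Element) edges.item(i);
                parentOf.put(edge.getAttribute("target"), edge.getAttribute("source"));
            }
            return parentOf;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse NeXML text", e);
        }
    }
}
```

A map like this would be enough to rebuild the tree with the library's existing tree classes; branch lengths, OTU labels and metadata would need further passes over the same document.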

Improving visualization

This is a very important part of the project, as the end product is a visualization of the phylogenetic tree. There are a few ways of improving the current output. A good implementation is suggested here: http://iphylo.blogspot.com/2007/06/earth-not-flat-official.html

It would also be nice to display information about the tree such as bootstrap values and confidence. Right now there is a place for "metadata" but this is quite vague and it would be nice to have specific information available. For example, if a user clicks on a node a bubble should pop up with information about that species and the sample site from which it was collected. Other information can be displayed as well. This could come from a CSV file associated with the data or included in the NeXML file.
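In KML, the pop-up bubble corresponds to a Placemark's `description`, which Google Earth shows in a balloon when the placemark is clicked. The sketch below builds such a placemark as a plain string so that it stays self-contained; the class and method names are hypothetical, and the actual library would more likely go through its KML writer.

```java
import java.util.Locale;

// Hypothetical helper: one clickable KML point per tree node.
final class NodeBalloonSketch {
    /** Escapes the characters that would break XML element text. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a Placemark whose description appears in the balloon on click. */
    static String nodePlacemark(String species, String details,
                                double latitude, double longitude, double altitudeMeters) {
        // KML lists coordinates as longitude,latitude,altitude (in that order).
        String coordinates = String.format(Locale.ROOT, "%.6f,%.6f,%.1f",
                longitude, latitude, altitudeMeters);
        return "<Placemark>\n"
             + "  <name>" + escapeXml(species) + "</name>\n"
             + "  <description>" + escapeXml(details) + "</description>\n"
             + "  <Point>\n"
             + "    <altitudeMode>absolute</altitudeMode>\n"
             + "    <coordinates>" + coordinates + "</coordinates>\n"
             + "  </Point>\n"
             + "</Placemark>\n";
    }
}
```

The `details` string is where per-node information such as bootstrap support or sample-site notes from a CSV or NeXML file could be placed.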

Supporting database integration

This is of least importance right now, but time permitting it would be nice to add functions that could extract data from phylogenetic trees stored in a database such as TreeBase. TreeBase supports NeXML, so this should work well once NeXML support is implemented.

Project Plan

Bonding period

Tasks: Gather sample tree files, KML files and other files used by the library; set up development environment; learn about geographical data files; catalog Java tree and geospatial libraries

Deliverables: pseudocode for the project -- what information needs to be extracted and how to store it; set up GitHub account and integrate with Eclipse or other IDE; join mailing lists; study KML files; list of relevant Java libraries

Approach and organization:

Start small and expand as needed. In this case I'm starting with the basic ability to take a tree file and a delimited file.

Take a select few types of files to completion, then add more.

A lot of the functions I need for this project have already been written and with a little modification they can be easily implemented in my library. This will allow me to be a little more creative and expand the scope of the library and what the library can do, beyond the original intention.

Week 1:

Define tree parser

Define csv parser

Define delimited file parser (TSV and user-defined delimiters)

Start storeTree class
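A delimited-file parser with a user-defined delimiter could start out as small as the sketch below. It is an assumption-laden illustration rather than the library's parser: the class name is hypothetical, the delimiter is treated as a literal string, and quoted fields containing the delimiter are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical parser for CSV, TSV or any other single-delimiter text.
final class DelimitedParser {
    /** Splits each non-blank line on the given literal delimiter and trims the fields. */
    static List<String[]> parse(String text, String delimiter) {
        // Quote the delimiter so characters such as "|" are not read as regex syntax.
        String splitter = Pattern.quote(delimiter);
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(splitter, -1); // -1 keeps trailing empty fields
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim();
            }
            rows.add(fields);
        }
        return rows;
    }
}
```

Calling `parse(text, ",")` would cover the CSV case and `parse(text, "\t")` the TSV case, so one function can back all three parsers listed for this week.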

Week 2:

Finish storeTree class:

  • Phylogeny array parser
  • Define function to add latitude, longitude and altitude data to each node. This will probably change a little once the actual algorithm is implemented.
  • Define functions to parse shapefiles using the GeoTools library
  • Define function to add metadata to the tree; this should be an extension of, or similar to, the CSV parser already implemented.

Week 3:

Implement the algorithm from Janies et al.: nodeAltitude = a + ((n-1)*b), where a is the altitude of a node with only terminal leaves (based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node, and n is the number of nodes from the current node to the leaf.
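One way to read the formula is sketched below. Two points are my own assumptions rather than anything stated in the plan: n is taken as the number of edges down to the deepest leaf under the node (so a node with only terminal leaves has n = 1 and altitude a), and leaves themselves sit at altitude 0. The `Node` class is a hypothetical stand-in for the library's tree class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of nodeAltitude = a + ((n-1)*b).
final class AltitudeSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        double altitude;

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Assigns altitudes bottom-up and returns n for the given node,
     * where n is 0 for a leaf and 1 for a node with only terminal leaves.
     */
    static int assignAltitudes(Node node, double a, double b) {
        if (node.children.isEmpty()) {
            node.altitude = 0.0; // assumption: leaves are pinned to the ground
            return 0;
        }
        int n = 0;
        for (Node child : node.children) {
            n = Math.max(n, assignAltitudes(child, a, b) + 1);
        }
        node.altitude = a + (n - 1) * b;
        return n;
    }
}
```

For the tree ((A,B)X,C)root with a = 1000 and b = 500, this places X at 1000 and the root at 1500, so each additional level above the leaves adds b to the altitude.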

Week 4:

Implement latitude and longitude algorithm. Leaves will get their lat/long data from the coord file.
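The plan does not fix which positioning algorithm is used for internal nodes. As one simple possibility, the sketch below places each internal node at the mean position of the leaves beneath it, in the spirit of the mean position method named among GeoPhyloBuilder's approaches in the earlier plan. The class, its `Node` type and the coordinate map are hypothetical, and plain averaging of longitudes is an assumption that breaks down for leaves on both sides of the 180° meridian.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: leaves take coordinates from the coord file, internal nodes the mean.
final class MeanPositionSketch {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        double latitude;
        double longitude;

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Sets latitude/longitude on every node and returns the number of leaves
     * at or below the given node. leafCoords maps leaf name -> {latitude, longitude}.
     */
    static int assignPositions(Node node, Map<String, double[]> leafCoords) {
        if (node.children.isEmpty()) {
            double[] coord = leafCoords.get(node.name);
            if (coord == null) {
                throw new IllegalArgumentException("No coordinates for leaf " + node.name);
            }
            node.latitude = coord[0];
            node.longitude = coord[1];
            return 1;
        }
        double latSum = 0.0;
        double lonSum = 0.0;
        int leaves = 0;
        for (Node child : node.children) {
            int k = assignPositions(child, leafCoords);
            // Weighting each child by its leaf count gives the mean over all descendant leaves.
            latSum += child.latitude * k;
            lonSum += child.longitude * k;
            leaves += k;
        }
        node.latitude = latSum / leaves;
        node.longitude = lonSum / leaves;
        return leaves;
    }
}
```

The map of leaf coordinates would come from the parsed coord file; trees with globally distributed leaves, which the plan treats separately, would need a different averaging scheme.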

Week 5:

Define KML writer using the JAK library. Most of the KML writer functions exist in this library, so my job will be to make sure the data is populated correctly. A major part of this will be telling KML how to draw the trees, i.e. where to put branches (line or pathway) and nodes (points).
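To make the "branches as lines" part concrete, the sketch below writes a branch as a KML LineString between two 3D positions and wraps placemarks in a minimal KML document. It deliberately uses plain strings instead of the JAK API so that it runs without extra dependencies; the class and method names are hypothetical, and the branch name is assumed to contain no XML special characters.

```java
import java.util.Locale;

// Hypothetical sketch: tree branches as KML LineStrings.
final class KmlBranchSketch {
    /** Formats one position; KML expects longitude,latitude,altitude. */
    static String coordinate(double latitude, double longitude, double altitudeMeters) {
        return String.format(Locale.ROOT, "%.6f,%.6f,%.1f", longitude, latitude, altitudeMeters);
    }

    /** Draws a branch between two positions given as {latitude, longitude, altitude}. */
    static String branchPlacemark(String name, double[] from, double[] to) {
        return "<Placemark>\n"
             + "  <name>" + name + "</name>\n"
             + "  <LineString>\n"
             + "    <altitudeMode>absolute</altitudeMode>\n"
             + "    <coordinates>"
             + coordinate(from[0], from[1], from[2]) + " "
             + coordinate(to[0], to[1], to[2])
             + "</coordinates>\n"
             + "  </LineString>\n"
             + "</Placemark>\n";
    }

    /** Wraps already-built placemarks in a minimal KML 2.2 document. */
    static String document(String placemarks) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n"
             + "<Document>\n"
             + placemarks
             + "</Document>\n"
             + "</kml>\n";
    }
}
```

With `altitudeMode` set to `absolute`, the altitude computed for each node is used as is, so a branch from a ground-level leaf to its parent rises off the map. Whether absolute or ground-relative altitudes suit the trees better is a choice the writer would still have to make.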

Week 6:

Work on KML writer -- get a tree in KML

Week 7: July 4th-10th

  • Improve KML writer and the appearance of the tree.
  • Write function for trees with multiple leaves of the same species.
  • Outline algorithms for large trees and trees with globally distributed leaves

Week 8: July 11-17th

  • Implement large tree and globally distributed tree algorithms

Week 9: July 18-24th

  • Add support for other file types such as neXML and possibly an SQL interface.

Week 10: July 25th-31st

  • Continue neXML and SQL support

Week 11: August 1st-9th

  • Testing and bugfixing

Week 12: August 10th-16th

  • Documentation

References:

Janies, D., Hill, A.W., Habib, F., Guralnick, R., Waltari, E., Wheeler, W.C., 2007. Genomic analysis and geographic visualization of the spread of avian influenza (H5N1). Syst. Biol. 56, 321–329.