From Phyloinformatics
Revision as of 18:22, 16 May 2010 by Kdiverson (talk)

phyloGeoRef: a geo-referencing library implemented in Java


Kathryn Iverson


The goal of this project is to create a geo-referencing library for parsing and displaying geographical and phylogenetic information. The main aim is to bring the two kinds of information together in a form that users can readily work with. Given a tree file and geographical information as input, the library will return a tree annotated with both phylogenetic and geographical information, either in a geospatial format such as KML or shapefile or in a georeferenced phylogenetic format such as neXML.


Project Plan

Bonding period

Tasks: Gather sample tree files, KML files and other files used by the library; set up the development environment; learn about geographical data files; catalog Java tree and geospatial libraries.

Deliverables: pseudocode for the project (what information needs to be extracted and how to store it); GitHub account set up and integrated with Eclipse or another IDE; mailing lists joined; KML files studied; list of relevant Java libraries.

Developing parsers

General idea - parsers will extract data from input files and store it as outlined below. Supported file types: Newick, NEXUS, neXML, PHYLIP, phyloXML and more as time/demand permits.

Week 1

Tasks: Combine, extend and write parsers for tree files, geographical data files and metadata files such as CSV and TSV. Parsers from the Java Evolutionary Biology Library and from the deegree and gvSIG open source geospatial libraries, among others, will be used. These libraries already implement parsers for phylogenetic trees and geographical data, so the main task is to make sure the parsers work with the rest of the library and to extend them as necessary.

Deliverables: tree-file parsers
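As an illustration of the metadata side of this step, the following is a minimal sketch of a CSV/TSV reader that maps taxon labels to coordinates. The class name `CoordinateCsvParser`, the nested `LatLon` type and the assumed column order (taxon, latitude, longitude, with one header line) are hypothetical and not part of any of the libraries named above. Quoted fields are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: reads "taxon,latitude,longitude" rows into a map.
final class CoordinateCsvParser {

    /** Latitude/longitude pair in decimal degrees. */
    static final class LatLon {
        final double latitude;
        final double longitude;

        LatLon(double latitude, double longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    /**
     * Parses delimited text whose first line is a header and whose remaining
     * lines hold a taxon label, a latitude and a longitude (assumed layout).
     * Pass "," for CSV or "\t" for TSV.
     */
    static Map<String, LatLon> parse(String text, String delimiter) {
        Map<String, LatLon> byTaxon = new LinkedHashMap<>();
        String[] lines = text.split("\\r?\\n");
        for (int i = 1; i < lines.length; i++) {   // index 0 is the header
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;                          // tolerate blank lines
            }
            String[] fields = line.split(Pattern.quote(delimiter));
            if (fields.length < 3) {
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + " has fewer than 3 fields");
            }
            double lat = Double.parseDouble(fields[1].trim());
            double lon = Double.parseDouble(fields[2].trim());
            byTaxon.put(fields[0].trim(), new LatLon(lat, lon));
        }
        return byTaxon;
    }
}
```

The resulting map could then be used to attach coordinates to the leaves of a tree read by one of the existing tree-file parsers.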

Week 2

Tasks: Implement the tree class. Generic tree classes and phylogenetic tree classes already exist for Java, so I would extend them to hold geographical information about the nodes. This allows the branches to be mapped in 3D space and other tagged elements to be placed on the branches. Possible class outline:

Tree object:

   node x:
       latitude, longitude, altitude
       user defined element

Alternatively, it may be better to store only parent-child relationships in the tree class and add the other information later, moving the data from the tree class to the geoPhylo class outlined below.

Deliverables: Java tree class
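The tree class outline above could look roughly like the following sketch. `GeoTreeNode` and its field names are hypothetical. A real implementation would more likely extend an existing phylogenetic tree class than define a standalone type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical node type; name and fields are illustrative only.
final class GeoTreeNode {
    final String label;
    double latitude;             // decimal degrees
    double longitude;            // decimal degrees
    double altitude;             // units left to the caller
    Object userElement;          // user-defined element, may be null

    private GeoTreeNode parent;  // null for the root
    private final List<GeoTreeNode> children = new ArrayList<>();

    GeoTreeNode(String label) {
        this.label = label;
    }

    /** Attaches a child, records its parent and returns the child. */
    GeoTreeNode addChild(GeoTreeNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    GeoTreeNode getParent() {
        return parent;
    }

    /** Read-only view so callers cannot bypass addChild. */
    List<GeoTreeNode> getChildren() {
        return Collections.unmodifiableList(children);
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}
```

If only parent-child relationships are kept in the tree class, the coordinate and user-element fields would be dropped here and carried by the storage class instead.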

Data storage

General idea - the storage system requests information from the parsers, such as tree structure, location information and metadata, and is flexible enough to pass on only the information requested for output rather than all of it. The bulk of this is done with the tree class outlined above.

Week 3

Tasks: Code the data storage structure. Data will be stored in a geoPhylo class whose instances have attributes such as parent-child relationships and latitude and longitude coordinates. Additionally, each node or leaf could have information stored within it. This information would be extracted from the tree class outlined above. Simple class outline:

geoPhylo class:

   root node instance
   branch instance

node class:

   latitude, longitude, altitude
   user defined element
   visual information -- color, size, etc. These will have default values unless the user specifies otherwise.

branch class:

   beginning latitude, longitude, altitude -> could be in the form of a KML pathway element
   end latitude, longitude, altitude

Deliverable: Java geoPhylo class
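The geoPhylo, node and branch outline above might translate into Java along the following lines. This is only a sketch: the type names `GeoPhylo`, `GeoNode`, `GeoBranch` and `GeoPoint`, the default colour and size values, and the choice to derive branches from the node structure instead of storing them are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Immutable position: decimal degrees plus an altitude. */
final class GeoPoint {
    final double latitude, longitude, altitude;

    GeoPoint(double latitude, double longitude, double altitude) {
        this.latitude = latitude;
        this.longitude = longitude;
        this.altitude = altitude;
    }
}

/** Node with a position, optional user element and default visual settings. */
final class GeoNode {
    final GeoPoint position;
    final List<GeoNode> children = new ArrayList<>();
    Object userElement;          // user-defined element, may be null
    String color = "ffffffff";   // assumed default: opaque white in KML aabbggrr order
    double size = 1.0;           // assumed default scale

    GeoNode(GeoPoint position) {
        this.position = position;
    }

    GeoNode addChild(GeoNode child) {
        children.add(child);
        return child;
    }
}

/** Branch running from a parent position (start) to a child position (end). */
final class GeoBranch {
    final GeoPoint start, end;

    GeoBranch(GeoPoint start, GeoPoint end) {
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical storage class: holds the root and derives the branches. */
final class GeoPhylo {
    final GeoNode root;

    GeoPhylo(GeoNode root) {
        this.root = root;
    }

    /** Returns one branch per parent-child edge, parents before descendants. */
    List<GeoBranch> branches() {
        List<GeoBranch> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(GeoNode node, List<GeoBranch> out) {
        for (GeoNode child : node.children) {
            out.add(new GeoBranch(node.position, child.position));
            collect(child, out);
        }
    }
}
```

Deriving branches on demand keeps the start and end coordinates consistent with the nodes. Storing branch instances directly, as the outline suggests, would be an equally valid design.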

Week 4

Tasks: Develop the interface for the storage structure. Once the class structure is settled, this is simply a matter of extracting the data. The key is to present it to the KML writer in an efficient and useful way. This may not take an entire week.

Deliverable: functions that are called by the parsers that deliver the data to the storage structure.

Output

General idea - once the data is stored, combine it in the way the user requests (i.e. what they want to see on the tree) and display it.

Weeks 5 and 6

Tasks: Write the algorithm to determine geospatial information. Calculating geospatial information is no trivial task, and the information will vary with the type of output the user requests, such as KML or shapefile. The algorithm needs to calculate how the tree should look in 3D space while keeping visual clarity in mind. This requires knowing how many nodes, branches and leaves are on the tree and where the leaves are anchored to the map.

The simplest layout would be a cone shape, with the base made up of the leaf coordinates and equally spaced nodes above it until the root is reached at the peak. This will not work for very large trees, as the map would be too cluttered. Turning some of the branches on or off, or being able to collapse them, would make the map easier to understand.

The basic algorithm from Janies et al. is: altitude for any given node = a + ((n-1)*b), where a is the altitude of a node with only terminal leaves (this can be an arbitrary number or based on the size of the tree), b is the altitude of nodes that are ancestral to more than one node and n is the number of nodes from the current node to the leaf. I will also look at the algorithms implemented in GeoPhyloBuilder for node positioning. These include the envelope centroid method, the mean position method and, for bifurcating trees, the DAVA centroid and MCP centroid methods. I will also try to extend the geospatial functions in the JTS Topology Suite, which covers 2D spatial functions.

3D mapping is fairly new to me, so it may take me some time to come up with an optimal solution. I will rely heavily on algorithms already implemented in other geospatial applications.

Deliverable: geospatial algorithm
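The altitude formula quoted above can be sketched as a post-order traversal. The sketch makes two assumptions that the plan does not state: leaves sit at altitude 0, and n is taken as the number of edges on the longest path from the node down to a leaf, so a node with only terminal leaves has n = 1 and altitude a. Under that reading, b acts as the altitude added per additional level. The names `AltNode` and `AltitudeCalculator` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal node: only what the altitude calculation needs.
final class AltNode {
    final List<AltNode> children = new ArrayList<>();
    double altitude;

    AltNode add(AltNode child) {
        children.add(child);
        return this;   // returns this node so calls can be chained
    }
}

final class AltitudeCalculator {

    /**
     * Assigns altitude = a + (n - 1) * b to every internal node, visiting
     * children first. Returns n for the given node (0 for a leaf), where n
     * is ASSUMED to be the longest edge count from the node to a leaf.
     */
    static int assign(AltNode node, double a, double b) {
        if (node.children.isEmpty()) {
            node.altitude = 0.0;   // assumption: leaves are anchored at ground level
            return 0;
        }
        int deepestChild = 0;
        for (AltNode child : node.children) {
            deepestChild = Math.max(deepestChild, assign(child, a, b));
        }
        int n = deepestChild + 1;
        node.altitude = a + (n - 1) * b;
        return n;
    }
}
```

For the tree ((A,B),C) with a = 1000 and b = 500, the node joining A and B gets n = 1 and altitude 1000, and the root gets n = 2 and altitude 1500. Latitude and longitude of internal nodes would come from a separate positioning method such as those listed above.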

Week 7

Tasks: Implement the tree searching algorithm. This will be an extension of the existing tree walking functions. The main feature to add is a way to collect the geospatial, user-defined and style elements contained in each node.

Deliverable: tree searching function
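One possible shape for such a tree walk is an iterative pre-order traversal that returns every node together with the data it carries. The `WalkNode` type and its fields stand in for whatever node class the library ends up using and are purely illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node carrying the three kinds of data the walk collects.
final class WalkNode {
    final String label;
    final double latitude, longitude, altitude;   // geospatial elements
    Object userElement;                           // user-defined element, may be null
    String style;                                 // style element, may be null
    final List<WalkNode> children = new ArrayList<>();

    WalkNode(String label, double latitude, double longitude, double altitude) {
        this.label = label;
        this.latitude = latitude;
        this.longitude = longitude;
        this.altitude = altitude;
    }

    WalkNode addChild(WalkNode child) {
        children.add(child);
        return child;
    }
}

final class TreeWalker {

    /** Returns all nodes in pre-order: each parent before its children. */
    static List<WalkNode> preOrder(WalkNode root) {
        List<WalkNode> visited = new ArrayList<>();
        Deque<WalkNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            WalkNode node = stack.pop();
            visited.add(node);
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return visited;
    }
}
```

A writer could then iterate over the returned list and read the coordinates, user element and style from each node. An explicit stack avoids deep recursion on very large trees.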

Week 8

Tasks: Code the KML writer and shapefile writer. This will probably be an extension of the Writer class designed specifically for KML and shapefile output. For KML files, styles will need to be defined here. These could be user defined or have default values (supporting both would be best). The open source GeoTools package has a shapefile writer that can be extended to fit the needs of this library.

Deliverables: KML and shapefile writer
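To give a rough idea of the KML side, the sketch below builds a KML document in which each branch is a Placemark containing a LineString with absolute altitudes. The class `KmlBranchWriter` and its methods are hypothetical. Styles and the shapefile output are left out, and a real writer would more likely use an XML library than string concatenation. Note that KML lists coordinates as longitude,latitude,altitude.

```java
// Hypothetical helper that renders branches as KML text.
final class KmlBranchWriter {

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one branch as a Placemark with a two-point LineString. */
    static String placemark(String name,
                            double lat1, double lon1, double alt1,
                            double lat2, double lon2, double alt2) {
        return "  <Placemark>\n"
             + "    <name>" + escape(name) + "</name>\n"
             + "    <LineString>\n"
             + "      <altitudeMode>absolute</altitudeMode>\n"
             // KML order is longitude,latitude,altitude
             + "      <coordinates>" + lon1 + "," + lat1 + "," + alt1
             + " " + lon2 + "," + lat2 + "," + alt2 + "</coordinates>\n"
             + "    </LineString>\n"
             + "  </Placemark>\n";
    }

    /** Wraps already rendered placemarks in a KML 2.2 document. */
    static String document(String... placemarks) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<kml xmlns=\"http://www.opengis.net/kml/2.2\">\n");
        sb.append("<Document>\n");
        for (String p : placemarks) {
            sb.append(p);
        }
        sb.append("</Document>\n");
        sb.append("</kml>\n");
        return sb.toString();
    }
}
```

A call such as `KmlBranchWriter.document(KmlBranchWriter.placemark("root-A", 0.0, 0.0, 1500.0, 10.0, 20.0, 0.0))` yields a string that can be saved with a `.kml` extension and opened in a KML viewer.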


Janies, D., Hill, A.W., Habib, F., Guralnick, R., Waltari, E., Wheeler, W.C., 2007. Genomic analysis and geographic visualization of the spread of avian influenza (H5N1). Syst. Biol. 56, 321–329.