PhyloSoC:Export ontology-based phenotype descriptions to EOL

From Phyloinformatics

Mentors

Jim Balhoff (primary)

Chris Mungall

Matt Yoder

Cyndy Parr

Project description

Short abstract

This project involves developing a system that maps phenotypic data from an OBD database to the EOL transfer schema. This means determining which phenotypic information can be used and creating human-readable segments of text that can be integrated into an Encyclopedia of Life page.

Progress

Details regarding the development of this project will be posted on the Ontophenotype blog.

Documentation

Using the executable jar

The executable jar can be downloaded from ontophenotype-eol.jar. No other external references are required.

The jar can also be obtained by pulling and building the project from github.

The jar can be run from the command line with the following set of parameters:

java -jar ontophenotype-eol.jar -i input_file_path -o export_file_path [-links true|false] [-group true|false]
  • -i - Mandatory parameter. The full path of the JSON file with the phenotype annotations for taxa. Such files can contain any number of taxa and can be obtained by requesting an export in JSON format from the Phenoscape Knowledgebase phenotype annotations to taxa section found at http://kb.phenoscape.org/taxon_annotations
  • -o - Mandatory parameter. The full path of the file where the export file will be written. The file should have the .xml extension.
  • -links - Optional parameter. Boolean value indicating whether links to anatomy term definitions should be included in the export file.
  • -group - Optional parameter. Boolean value indicating whether anatomy terms should be grouped into anatomical systems in the export file.

Both optional parameters default to true. Running the jar with only the two mandatory parameters produces an export file that contains links to anatomy term definitions and groups anatomy terms into anatomical systems.
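The option handling described above can be sketched in a few lines of Java. This is only an illustration of the documented behaviour (two mandatory paths, two optional booleans that default to true); the class name ExportOptions and its fields are hypothetical and are not taken from the actual source of ontophenotype-eol.jar.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the documented options; not the real code in the jar. */
final class ExportOptions {
    final String inputPath;
    final String outputPath;
    final boolean links;
    final boolean group;

    private ExportOptions(String inputPath, String outputPath, boolean links, boolean group) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.links = links;
        this.group = group;
    }

    /** Reads "<flag> <value>" pairs; -i and -o are mandatory, -links and -group default to true. */
    static ExportOptions parse(String[] args) {
        Map<String, String> values = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-") || i + 1 >= args.length) {
                throw new IllegalArgumentException("Expected '<flag> <value>' near: " + args[i]);
            }
            values.put(args[i], args[i + 1]);
        }
        if (!values.containsKey("-i") || !values.containsKey("-o")) {
            throw new IllegalArgumentException("Both -i and -o are mandatory");
        }
        // Boolean.parseBoolean treats anything other than "true" (case-insensitive) as false.
        boolean links = Boolean.parseBoolean(values.getOrDefault("-links", "true"));
        boolean group = Boolean.parseBoolean(values.getOrDefault("-group", "true"));
        return new ExportOptions(values.get("-i"), values.get("-o"), links, group);
    }

    public static void main(String[] args) {
        ExportOptions options = parse(new String[] {"-i", "taxa.json", "-o", "export.xml", "-group", "false"});
        System.out.println("links=" + options.links + ", group=" + options.group);
    }
}
```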

Project plan

Community Bonding Period

  • Subscribe to recommended mailing lists
  • Start github repository and check git settings and connectivity
  • Start project blog

Week 1 (May 23 - May 29)

  • Write a parser for the Phenoscape data in the JSON format.

Week 2 (May 30 - June 5)

  • If available, use phenotype annotations to taxa in the OWL format.
  • Given two or more taxa with phenotype annotations, find the anatomical structures they have in common.
  • Given a starting taxon and other taxa, find the phenotype data that differentiates the first taxon from the others. In other words, find what makes a taxon anatomically particular compared to others.

This will be done using only the information in the provided taxa files.
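A minimal sketch of the two comparisons, assuming each taxon has already been reduced to a set of anatomical entity names. The class and method names are hypothetical, and the entity names in main are made up for illustration; they are not actual Phenoscape annotations.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: compares taxa represented as sets of entity names. */
final class TaxonComparison {
    /** Entities annotated in every one of the given taxa (set intersection). */
    static Set<String> common(List<Set<String>> taxa) {
        if (taxa.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> result = new LinkedHashSet<>(taxa.get(0));
        for (Set<String> taxon : taxa.subList(1, taxa.size())) {
            result.retainAll(taxon);
        }
        return result;
    }

    /** Entities annotated in the focal taxon but in none of the others (set difference). */
    static Set<String> distinguishing(Set<String> focal, Collection<Set<String>> others) {
        Set<String> result = new LinkedHashSet<>(focal);
        for (Set<String> other : others) {
            result.removeAll(other);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up entity sets, for illustration only.
        Set<String> first = Set.of("dorsal fin", "barbel", "adipose fin");
        Set<String> second = Set.of("dorsal fin", "barbel");
        System.out.println("common: " + common(List.of(first, second)));
        System.out.println("distinguishing: " + distinguishing(first, List.of(second)));
    }
}
```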

It would be interesting to present such information on an EOL page. For example, in a section about a species' morphology, a phrase like "This species is set apart from the others in the [species_genus] by the lack of [phenotype_entity]." could be generated automatically. Another example, for a genus page: "A common denominator for this genus is the presence of [phenotype_entity] or the absence of [phenotype_entity]".
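Filling such a phrase can be sketched as plain placeholder substitution. The class name SentenceTemplate is hypothetical, the template in main is adapted from the species example above, and the values are placeholders rather than actual Phenoscape findings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: fills [placeholder] slots in a sentence template. */
final class SentenceTemplate {
    /** Replaces each [key] with its value; placeholders without a value are left untouched. */
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("[" + entry.getKey() + "]", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("species_genus", "Ictalurus");
        values.put("phenotype_entity", "adipose fin"); // placeholder value, not a real finding
        System.out.println(fill(
            "This species is set apart from the others in [species_genus] by the lack of [phenotype_entity].",
            values));
    }
}
```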

Week 3 (June 6 - June 12)

Milestone week.

  • If necessary, finish any method left over from last week.
  • Cover all the methods implemented so far with tests using the Ictalurus genus and the Ictalurus australis and Ictalurus punctatus taxa.
  • Clean up all the code so far and push it to github.

Week 4 (June 13 - June 19)

At this point, I have extracted phenotypic traits and those can serve as a starting point for the project module that builds human-readable text for EoL. For this week, I've set the following tasks:

  • Experiment with different methods for building text from phenotype annotations. (Additional information on these annotations would be useful here. Subtask: find out how I could obtain such information.)
  • Begin writing code for this module.

Week 5 (June 20 - June 26)

  • Figure out whether it is possible to obtain common names (rather than scientific names) for at least some of the phenotype terms, so that non-specialist readers can easily understand the text generated with these terms.

Week 6 (June 27 - July 3)

Milestone week.

This week I will work towards obtaining an intermediate result. This will be represented by an EoL export file. I can outline the following subtasks:

  • At this point, I should have at least a simple method that generates text from phenotype descriptions. I will map the taxa information and this text to an element of the EoL schema.
  • Having the mappings, I will generate an XML file. An example of this process is given here: http://wiki.eol.org:8081/display/dev/Creating+Content+Connectors+for+EOL . The examples presented on that page are written in PHP and Ruby, but I will provide a Java implementation.
  • Use an XML validator to validate the resulting file against the EoL Transfer Schema.
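The validation step could be done with the javax.xml.validation API that ships with the JDK. The sketch below is an assumption about how that might look, not the project's actual code: the class name SchemaCheck is hypothetical, and main uses a toy schema instead of the real EoL Transfer Schema, whose location is not given here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper: validates XML text against XSD text. */
final class SchemaCheck {
    /** Returns true if the XML validates against the XSD, false on any schema or validation error. */
    static boolean isValid(String xml, String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            // With no custom ErrorHandler, validation errors are thrown as SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Toy schema for illustration only; the real export would use the EoL schema.
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='taxon'><xs:complexType><xs:sequence>"
            + "<xs:element name='name' type='xs:string'/>"
            + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        System.out.println(isValid("<taxon><name>Ictalurus</name></taxon>", xsd));
        System.out.println(isValid("<taxon/>", xsd));
    }
}
```

Returning a boolean keeps the example short; a real exporter would probably report the SAXException message so that the offending element can be located.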

Here are a few more details on how I intend to do the above-mentioned mappings.

For taxa, I will at first use only the required elements:

dc:identifier
dwc:ScientificName

For text generated from phenotype descriptors, I will build a <dataObject> with the following information:

<dataType>
    http://purl.org/dc/dcmitype/Text
</dataType>
<subject>
    http://rs.tdwg.org/ontology/voc/SPMInfoItems#Morphology
</subject>
<dc:description xml:lang="">
    actual text
</dc:description>
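
A minimal sketch of how such a fragment could be produced in Java, using only string building and XML escaping. The class and method names are hypothetical. The fragment is not a standalone document: it assumes the dc prefix is declared on an enclosing element of the export file, and the language code is passed in because the value of xml:lang is left open above.

```java
/** Hypothetical helper: writes a Morphology text dataObject fragment. */
final class DataObjectWriter {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Builds the dataObject fragment for one generated description. */
    static String morphologyDataObject(String language, String description) {
        return "<dataObject>\n"
            + "  <dataType>http://purl.org/dc/dcmitype/Text</dataType>\n"
            + "  <subject>http://rs.tdwg.org/ontology/voc/SPMInfoItems#Morphology</subject>\n"
            + "  <dc:description xml:lang=\"" + escape(language) + "\">"
            + escape(description) + "</dc:description>\n"
            + "</dataObject>\n";
    }

    public static void main(String[] args) {
        // "en" and the description text are example values only.
        System.out.print(morphologyDataObject("en", "actual text"));
    }
}
```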

Week 7 (July 4 - July 10)

For this week, I plan to return to the part of the project that deals with information extraction from the Phenoscape data.

  • Build a resource that enables mappings between a phenotype quality and a possible use in a sentence.
  • Use that resource in the text generation module.
  • Experiment with adding a SPMInfoItems#DiagnosticDescription subject in the EoL export file.

Week 8 (July 11 - July 17)

  • Simplify the structure of the XML file that describes patterns for using phenotype quality terms in simple sentences.
  • Update the parser to work with the new structure.
  • Write tests for parsing and text generation subtasks.

Week 9 (July 18 - July 24)

  • Use synonyms for the words described in the XML (left over from last week).
  • Find a way to combine multiple phenotype quality terms in a single sentence. (While adding more quality terms to the patterns file, I realized that many of these terms occupy the same place in a sentence, so generating a separate, similar sentence for each of them would be redundant.)
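
One simple way to combine quality terms that share the same position in a sentence is to join them into a single enumeration. The sketch below assumes a single fixed pattern ("The <entity> is <qualities>."); the class name and the pattern are hypothetical and do not reflect the actual patterns file.

```java
import java.util.List;

/** Hypothetical helper: merges several quality terms into one sentence. */
final class QualityCombiner {
    /** Joins terms as "a", "a and b", or "a, b and c". */
    static String joinTerms(List<String> terms) {
        if (terms.isEmpty()) {
            return "";
        }
        if (terms.size() == 1) {
            return terms.get(0);
        }
        String head = String.join(", ", terms.subList(0, terms.size() - 1));
        return head + " and " + terms.get(terms.size() - 1);
    }

    /** Produces one sentence for an entity and all qualities that fit the same pattern. */
    static String describe(String entity, List<String> qualities) {
        if (qualities.isEmpty()) {
            throw new IllegalArgumentException("At least one quality is required");
        }
        return "The " + entity + " is " + joinTerms(qualities) + ".";
    }

    public static void main(String[] args) {
        // Made-up entity and qualities, for illustration only.
        System.out.println(describe("dorsal fin", List.of("elongated", "curved", "rounded")));
    }
}
```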

Week 10 (July 25 - July 31)

  • Register as an EOL content partner and create a resource description. Provide the export file for harvest and preview the content on an EOL page.
  • Add a method to provide a single export file for multiple taxa.
  • Implement a method to combine multiple export files into a single one.

Week 11 (August 1 - August 7)

  • Link anatomical entities to their definition page on the Phenoscape Knowledgebase and write them in the export file.
  • Add a link to the source of the data, the taxon’s page from the Phenoscape Knowledgebase.
  • Begin code review.
  • Group anatomical entities into anatomical systems with the help of the Teleost Anatomy Ontology, and provide export files whose data object text has proper HTML formatting.
  • Complete the required information for the EOL content partner registration and the resource description.
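
The grouping step can be sketched as follows, assuming the entity-to-system assignment has already been looked up (in the project this comes from the Teleost Anatomy Ontology; here it is simply passed in as a map). The class name, the HTML layout, and the example systems are hypothetical. Links to term definitions are left out because the URL pattern is not given here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: groups entities by anatomical system and renders simple HTML. */
final class SystemGrouping {
    /** Groups entity names by system; both systems and entities end up in alphabetical order. */
    static Map<String, List<String>> group(Map<String, String> entityToSystem) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(entityToSystem).entrySet()) {
            grouped.computeIfAbsent(entry.getValue(), key -> new ArrayList<>()).add(entry.getKey());
        }
        return grouped;
    }

    /** Renders one paragraph per system; names are assumed to contain no HTML special characters. */
    static String toHtml(Map<String, List<String>> grouped) {
        StringBuilder html = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            html.append("<p><b>").append(entry.getKey()).append("</b>: ")
                .append(String.join(", ", entry.getValue())).append("</p>");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Made-up assignment, for illustration only.
        Map<String, String> entityToSystem = Map.of(
            "barbel", "sensory system",
            "dorsal fin", "skeletal system",
            "pectoral fin", "skeletal system");
        System.out.println(toHtml(group(entityToSystem)));
    }
}
```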

Week 12 (August 8 - August 14)

  • Review code and comments.
  • Prepare documentation.
  • Provide a final export file for harvest in EOL.

Week 13 (August 15 - August 21)

  • Provide executable jar.
  • Provide instructions for using the jar.