PhyloSoC:Enhance the searching functionality of Phylr

From NESCent Informatics Wiki

Abstract: PhyloWS is an emerging standard for interacting with phylogenetic data via web services. Phylr is an initial implementation of the SRU search in PhyloWS. The goals of this project are to enhance the Java code that translates between the SRU server and a Lucene index so that it fully conforms to the PhyloWS specs, to modify the XSL stylesheets to present a more intuitive interface to users, and to write a new Java adapter that translates between the SRU server and a relational database.

Student: Dazhi Jiao

Mentor(s): Ryan Scherle (primary), Lucie Chan

Project Blog: Phylr GSoC blog

Source Code: Phylr GoogleCode site

Issues: Google Code issues


Details of implementation and timeline

(updated 9/1/09) The programming for this project was broken into three parts: (1) improve the Lucene adapter and implement full support for the finder tasks "Find trees by metadata" and "Find matrices by metadata" in the PhyloWS/REST specification; (2) implement an SRU adapter for a relational database that supports the two finder tasks; (3) modify the client-side XSLT to improve the user experience.

The overall planned structure of the system is shown in the following figure. The upper part is the Lucene adapter and the lower part is the relational database adapter. (*The actual implementation is a little different, as recorded below.*) The system is described in detail below:

Phylr-struct.jpg

Lucene Adapter

Most of the architecture of this system has been implemented in the current Phylr implementation.

The Lucene indexer (org.nescent.phylr.lucene.NexmlIndexer) works on NeXML files stored on the file system. It reads a file that defines the searchable fields and their corresponding XPaths (an example can be found at http://code.google.com/p/phylr/source/browse/trunk/phylr-gsoc/src/test/resources/predicates.txt), uses each XPath to extract metadata from the NeXML files, and indexes the extracted data under the matching field.
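The field-to-XPath step can be illustrated with a minimal sketch that uses only the JDK XPath API. The class name XPathFieldExtractor, the field names, the XPaths, and the simplified namespace-free XML are all assumptions made for illustration; they do not reproduce predicates.txt or the actual NexmlIndexer code, and the Lucene indexing call itself is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Phylr.
final class XPathFieldExtractor {

    /**
     * Evaluates each field's XPath against the XML document and collects the
     * text of every matching node, keyed by field name.
     */
    static Map<String, List<String>> extract(String xml, Map<String, String> fieldToXPath) {
        try {
            // Parse once, then run every XPath against the same DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, List<String>> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : fieldToXPath.entrySet()) {
                NodeList nodes = (NodeList) xpath.evaluate(
                        entry.getValue(), doc, XPathConstants.NODESET);
                List<String> values = new ArrayList<>();
                for (int i = 0; i < nodes.getLength(); i++) {
                    values.add(nodes.item(i).getTextContent().trim());
                }
                fields.put(entry.getKey(), values);
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not extract fields", e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for a NeXML file (real NeXML is namespaced).
        String xml = "<nexml><trees label=\"Primates\">"
                + "<tree id=\"t1\" label=\"Tree A\"/><tree id=\"t2\" label=\"Tree B\"/>"
                + "</trees></nexml>";
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("tree.label", "//tree/@label");        // assumed field and XPath
        mapping.put("trees.label", "/nexml/trees/@label"); // assumed field and XPath
        System.out.println(extract(xml, mapping));
    }
}
```

In a real indexer, each extracted value would then be added to the index under its field name, and the XPaths would need a namespace context for the NeXML namespace.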

The previously implemented org.oclc.os.SRW.lucene.SRWLuceneDatabase is in charge of forwarding work to the different classes and putting the pieces together. So in GSoC, all I did was customize the classes that SRWLuceneDatabase uses. All of these classes are in the package org.nescent.phylr.lucene. Please note that the properties file TreeIndex.props tells the system which customized classes to load.
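Loading a customized class that is named in a properties file usually comes down to a reflection call. The sketch below shows that general pattern only; the RecordResolver interface, the EchoResolver class, and the property key are hypothetical and are not the actual SRW interfaces or the keys used in TreeIndex.props.

```java
import java.util.Properties;

// Hypothetical sketch of property-driven class loading, not the SRW code.
final class ResolverLoader {

    /** Assumed stand-in for a record resolver contract. */
    interface RecordResolver {
        String resolve(String id);
    }

    /** Trivial implementation used only to demonstrate loading. */
    static final class EchoResolver implements RecordResolver {
        @Override
        public String resolve(String id) {
            return "record:" + id;
        }
    }

    /** Looks up a class name under the given key and instantiates it reflectively. */
    static RecordResolver load(Properties props, String key) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        try {
            Class<?> cls = Class.forName(className);
            return (RecordResolver) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load resolver " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // "recordResolver.class" is an assumed key name for illustration.
        props.setProperty("recordResolver.class", EchoResolver.class.getName());
        System.out.println(load(props, "recordResolver.class").resolve("t1"));
    }
}
```

Keeping the class name in configuration means a different resolver can be swapped in without recompiling the database class.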

So here I've implemented org.nescent.phylr.lucene.DCRecordResolver and org.nescent.phylr.lucene.NexmlRecordResolver. They are called to resolve the records in the results to the right format: DCRecordResolver generates XML in Dublin Core format, and NexmlRecordResolver simply grabs the NeXML file (which is stored in a Lucene index field) and inserts it into a record.
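As a rough illustration of what resolving a result to Dublin Core can look like, the sketch below turns a map of stored field values into dc: elements. It is not the DCRecordResolver code: the class name, the record wrapper element, and the assumption that field names map one-to-one onto Dublin Core element names are all hypothetical. Only the Dublin Core namespace URI is standard.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Phylr DCRecordResolver.
final class DublinCoreRecordSketch {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Writes one dc:<name> element per value; the wrapper element is a placeholder. */
    static String toDublinCore(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<record xmlns:dc=\"").append(DC_NS).append("\">");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                xml.append("<dc:").append(entry.getKey()).append('>')
                   .append(escape(value))
                   .append("</dc:").append(entry.getKey()).append('>');
            }
        }
        return xml.append("</record>").toString();
    }

    /** Escapes the characters that are unsafe in XML element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title", Arrays.asList("Trees & matrices"));
        fields.put("creator", Arrays.asList("A", "B"));
        System.out.println(toDublinCore(fields));
    }
}
```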

Relational Database Adapter

The architecture of the system is similar to the Lucene adapter. All classes are in the package org.nescent.phylr.relational.

I've tried to make the classes generic and reusable, and to push the queries out of the code (they are put in the properties file).

The database is a BioSQL database with the PhyloDB extension. I migrated data from a TreeBase database into this database. The steps of the migration are described in the migration SQL script.

More details on the individual pieces of the adapter are given below:

SRWRelationalDatabase org.nescent.phylr.relational.SRWRelationalDatabase was implemented. It extends ORG.oclc.os.SRW.SRWDatabase and overrides the necessary methods, such as getQueryResult. Like SRWLuceneDatabase, this class takes a properties file that can be configured by the user. In the properties file, the user can define which RecordResolver is used and what the pieces of the queries look like. See BioSQL.props for an example.

CQL/SQL Translator org.nescent.phylr.relational.BioSQLQueryTranslator was implemented. The class translates CQL to SQL based on the database schema. I've tried to make this class generic: it uses SQL pieces defined in the properties file, each of which corresponds to a field, so the overall SQL is built by putting the pieces together in this class.
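The idea of assembling SQL from per-field pieces can be sketched as follows. This is a simplified illustration, not the BioSQLQueryTranslator code: the Node, Clause, and Bool classes are a hypothetical stand-in for a parsed CQL query, the property keys and SQL fragments are invented, and each fragment is assumed to contain exactly one ? placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of piece-based CQL-to-SQL translation.
final class SqlPieceTranslator {

    /** Marker type for nodes of a (simplified) parsed CQL query. */
    interface Node {}

    /** A search clause such as dc.title = "Primates". */
    static final class Clause implements Node {
        final String index;
        final String term;
        Clause(String index, String term) { this.index = index; this.term = term; }
    }

    /** Two sub-queries joined by a CQL boolean: and, or, not. */
    static final class Bool implements Node {
        final String op;
        final Node left;
        final Node right;
        Bool(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
    }

    /**
     * Builds a WHERE expression with ? placeholders and appends the terms to
     * bind, in placeholder order, to params.
     */
    static String toWhere(Node node, Properties pieces, List<String> params) {
        if (node instanceof Clause) {
            Clause clause = (Clause) node;
            String piece = pieces.getProperty(clause.index);
            if (piece == null) {
                throw new IllegalArgumentException("Unsupported index: " + clause.index);
            }
            params.add(clause.term);
            return piece;
        }
        Bool bool = (Bool) node;
        String left = toWhere(bool.left, pieces, params);
        String right = toWhere(bool.right, pieces, params);
        return "(" + left + " " + sqlOperator(bool.op) + " " + right + ")";
    }

    private static String sqlOperator(String cqlBoolean) {
        switch (cqlBoolean.toLowerCase()) {
            case "and": return "AND";
            case "or":  return "OR";
            case "not": return "AND NOT"; // CQL "a not b" means a and-not b
            default: throw new IllegalArgumentException("Unsupported boolean: " + cqlBoolean);
        }
    }

    public static void main(String[] args) {
        Properties pieces = new Properties();
        pieces.setProperty("dc.title", "tree.name = ?");            // assumed fragment
        pieces.setProperty("dc.identifier", "tree.identifier = ?"); // assumed fragment
        Node query = new Bool("and",
                new Clause("dc.title", "Primates"),
                new Clause("dc.identifier", "T100"));
        List<String> params = new ArrayList<>();
        System.out.println(toWhere(query, pieces, params) + " " + params);
    }
}
```

Binding the search terms through placeholders rather than concatenating them into the SQL keeps user input out of the statement text.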

QueryResult, RecordResolver, and RecordIterator These classes follow the same structure as in the Lucene adapter. One thing to point out is that I used the Commons DbUtils package to handle the database connections and queries. The results are processed in BioSQLResultSetHandler.

Implementation Timeline

Dates Plan Weekly Report
~ 5.25
  • Created accounts, joined mailing lists.
  • Setup source code sandbox. Built the SRU server, and deployed locally.
  • Walked through nexmlLuceneIndexer.
  • Played with SAX based nexml parser.
5.25 – 5.31
  • Adding tasks to google code issues.
  • Testing the local deployment with OCLC SRU tester, and setup a weekly process to do it.

Develop indexer to index matrix nexml

  • Setting up nexmlLuceneIndexer after getting nexml files from Ryan.
  • Modify the class org.nescent.dbhack.phylr.mdextractor.MdExtractor to use SAX based parser instead of XPath.
  • Add code to org.nescent.dbhack.phylr.mdextractor.MdExtractor to parse matrices nexml files.

Extend the BasicLuceneQueryTranslator

  • Post wiki page that list all the current CQL syntax supported by the BasicLuceneQueryTranslator.
  • Identify the syntax to be added to BasicLuceneQueryIndexer with Ryan.
  1. Built project, and deployed it on a local server
  2. Tested the local SRU deployment with OCLC SRU service tester.
  3. Installed maven, and used it to compile/execute the nexmlLuceneIndexer. Also generated eclipse project files using maven.
  4. Downloaded the nexml Java parser from SVN, and built a jar file. Downloaded nexml files from SVN. These nexml files were used for testing nexmlLuceneIndexer and the SRU server.
  5. Added nexml.jar file to a local maven repository. We might eventually create an internal maven repository, so that jar files can be put in a central place.
  6. Wrote code to parse tree files. I found there is no easy way to parse nex:external data. The answers I got from the nexml server indicate that nex:external is only a hack and should not be supported. I am in discussion with Ryan and Lucie on this issue.
  7. Wrote code to parse matrix files. Ran into the same problem as in #6.
  8. On wiki page, added CQL syntax analysis of the current SRU implementation.
6.1 - 6.7
  • Figure out nexml issues with Ryan and Lucie. Decide whether it's necessary to update the nexml to use annotations.
  • Complete code to use nexml java parser.

Extend the BasicLuceneQueryTranslator

  • Implement the processing of CQL syntax based on the decision made in previous week
  • Run nexmlLuceneIndexer to index sample nexml files

Build DCRecordGenerator for retrieving trees and matrices.

  • Implement DCRecordGenerator. It will query the Lucene Index. Write code to generate result record in Dublin core from the lucene results.
  • Modify BasicLuceneRecordResolver to call DCRecordGenerator for record when the recordSchema is "DC"
  • Discussed with Ryan, and decided to update nexml to use annotations
  • Created a new project, merged oclcSRW and oclcLuceneAdaptor, using maven
  • Setting up local maven repository
  • Created build script for the new project
  • Implemented DCRecordResolver
  • Modified BasicLuceneRecordResolver
6.8 – 6.14 Project setup
  • Complete build script, for packaging, running, deploying the application
  • Check in the project to svn repository
  • create maven internal repository on google code

Nexml

  • Send questions to nexml mailing list on how to use the meta tags for the current files
  • finish the parser

Build NexmlRecordGenerator for retrieving trees and matrices.

  • Create the class NexmlRecordGenerator. Write a method to find the tree based nexml file, and read the content.
  • Wrap the content of the nexml in the response xml.
  • Complete build script, for packaging, running, deploying the application
  • Check in the project to svn repository
  • Create the class NexmlRecordResolver. Modified configuration file to load the correct resolver class for nexml and dc
  • Wrap the content of the nexml in the response xml.
6.15 - 6.21 Build NexmlRecordGenerator for retrieving trees and matrices
  • Write another method to parse matrices based nexml. Maybe I could reuse the code in the nexmlLuceneIndexer.
  • Write code to find the right characters that are requested in the matrices.
  • Write code to generate a new nexml, with only information requested
  • Write code to wrap the new nexml into response xml (reuse code created before)
  • Created maven internal repository on google code
  • Completed build process. deployable war file can be automatically built by maven now.
  • Created a nexml parser testing file, to generate nexml with //meta elements
  • converted nexml file that has <external> tag to <meta> tags. Break up metadata into list of Dublin Core and Prism records
  • Reviewed cql predicates to nexml xpath mappings (created by Rutger Vos), modified xpaths to make them valid
  • Created org.nescent.phylr.lucene.NexmlIndexer. Implemented methods to use xpath to extract information from nexml, and create lucene index fields.
  • Created org.nescent.phylr.lucene.NexmlIndexerTest, and added a testing step to the build process of the maven-managed project
6.22 - 6.28
  • Continue discussion on mailing lists about nexml/cql predicates mapping
  • finish NexmlIndexer and nexml indexing process
  • write classes to modify nexml, or extract tree nodes/matrices
6.29 - 7.5 Implement tasks for client side XSLT customization (milestone: client side XSLT)
  • Write documentation on lucene based phylr service
  • Implement XSLT to transform result page when recordSchema=nexml
  • Implement XSLT for a better search form
  • Write documentation on lucene based phylr service
  • Implement XSLT to transform result page when recordSchema=nexml
7.6 midterm evaluation No
7.6-7.12 Implement CQL/SQL translator for relational database
  1. Created classes RecordResolver, BasicRelationalRecordResolver. The methods are to be completed with real data.
  2. Implemented SRWRelationalDatabase. It uses JDBC to connect to a database and run queries.
  3. Implemented RelationalRecordIterator, RelationalQueryResult. They are also based on JDBC.
  4. Created classes CQLQueryTranslator, BasicSQLTranslator with blank methods. The methods still need to be implemented.
  1. Created classes RecordResolver, BasicRelationalRecordResolver. The methods are to be completed with real data.
  2. Implemented SRWRelationalDatabase. It uses JDBC to connect to a database and run queries.
  3. Implemented RelationalRecordIterator, RelationalQueryResult. They are also based on JDBC.
  4. Created classes CQLQueryTranslator, BasicSQLTranslator with blank methods. The methods still need to be implemented.
7.13-7.19
  1. Set up a BioSQL database, using PostgreSQL.
  2. Map the TreeBase schema to BioSQL, and import the TreeBase dump into the BioSQL database.
  1. Set up a BioSQL database, using PostgreSQL.
  2. Created a mapping from the TreeBase schema to BioSQL.
7.20-7.26
  1. Import the TreeBase dump into the BioSQL database.
  2. Implement SRWBioSQLDatabase, using the BioJava ORM API for the BioSQL database.
  1. Migrated data from the TreeBase dump into the BioSQL database.
7.27-8.2
  1. Implement SRWBioSQLDatabase, BioSQLRecordIterator, BioSQLQueryResult
  2. Implement BasicBioSQLTranslator. It should translate a CQL query to a BioJava call to retrieve objects
  1. Implemented SRWBioSQLDatabase, BioSQLRecordIterator, BioSQLQueryResult
  2. Implemented BioSQLDCRecordResolver. It creates a Record from a query result from the BioSQL database
  3. Implemented BioSQLQueryTranslator. It builds a SQL query from a CQL query. The query statements are configurable.
  4. Implemented SQL statements to query the BioSQL database
  5. Implemented support for several diagnostic messages
8.3 - 8.9
  1. Setting up testing server, and database
  2. Implement database connection pooling
  3. Start to document
No
8.10 Stop coding; write report and documentation. No

Implementation

CQL features

Relations

Relation Supported? Notes
= Y
< N
> N
<= N
>= N
<> N
scr Y same as =
exact Y Added for IN Harmony and currently works only on that database. Will eventually work on all.
all Y
any Y
within Y Works, but has unexpected behavior if performed on fields that have multiple values or that store data in unusual ways (TOKENIZED). Support will be added to make this work better and to make it work as expected for dates.
encloses N should be implicit for our date searches

Booleans

Boolean Supported? Notes
and Y
or Y
not Y
prox N will need for text collections, must define supported modifiers

Modifiers

We don't have an immediate need for modifiers, but we will eventually want to support much of the bibliographic profile.

Wildcards, etc.

We need to support wildcards. Anchors can be omitted for now.
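For the relational adapter, one possible way to support CQL wildcards is to rewrite a term into a SQL LIKE pattern. The sketch below assumes the standard CQL masking characters (* for any sequence of characters, ? for a single character, backslash as the escape) and assumes the query uses LIKE ? ESCAPE '\'. The class is hypothetical, and the ^ anchor is deliberately not handled, so it passes through as an ordinary character.

```java
// Hypothetical sketch: CQL masking characters to a SQL LIKE pattern.
final class CqlWildcards {

    /** Converts a CQL term with * and ? wildcards into a LIKE pattern. */
    static String toLikePattern(String term) {
        StringBuilder pattern = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\' && i + 1 < term.length()) {
                // CQL escape: the next character is taken literally.
                appendLiteral(pattern, term.charAt(++i));
            } else if (c == '*') {
                pattern.append('%'); // any sequence of characters
            } else if (c == '?') {
                pattern.append('_'); // exactly one character
            } else {
                appendLiteral(pattern, c);
            }
        }
        return pattern.toString();
    }

    /** Escapes characters that LIKE would otherwise treat as special. */
    private static void appendLiteral(StringBuilder pattern, char c) {
        if (c == '%' || c == '_' || c == '\\') {
            pattern.append('\\');
        }
        pattern.append(c);
    }

    public static void main(String[] args) {
        System.out.println(toLikePattern("prim*"));
        System.out.println(toLikePattern("h?mo"));
    }
}
```

The Lucene adapter would not need this conversion, since Lucene's query syntax uses the same * and ? characters for wildcards.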

CQL context set

field support details
resultSetId yes Re-fetching a result set is optimized, and combining a result set with other result sets or with other search clauses is supported.
serverChoice index supported
allRecords supported
anyIndexes supported

Relation support is noted above (in general support).

None of the modifiers/qualifiers are supported.

DC context set

  • All* of the DC context set is supported.

Bib context set

None of the bib context set is supported, but we will gradually start using this as needed.

Sorting

Do we want to support the new sortby feature? (This is more practical than the old XPath entry in the regular SRU parameters). What types of sort do we need?