PhyloSoC:Enhance the searching functionality of Phylr
Abstract: PhyloWS is an emerging standard for interacting with phylogenetic data via web services. Phylr is an initial implementation of the SRU search in PhyloWS. The goals of this project are to enhance the Java code that translates between the SRU server and a Lucene index so that it fully conforms to the PhyloWS specs, to modify the XSL stylesheets to present a more intuitive interface to users, and to write a new Java adapter that translates between the SRU server and a relational database.
Student: Dazhi Jiao
Project Blog: Phylr GSoC blog
Source Code: Phylr GoogleCode site
Issues: Google Code issues
Details of implementation and timeline
(updated 9/1/09) The programming for this project was broken into three parts: (1) improve the Lucene adapter and implement full support for the finder tasks "Find trees by metadata" and "Find matrices by metadata" in the PhyloWS/REST specification; (2) implement an SRU adapter for a relational database that supports the two finder tasks; and (3) modify the client-side XSLT to improve the user experience.
The overall planned structure of the system is shown in the following figure. The upper part is the Lucene adapter and the lower part is the relational database adapter. (*The actual implementation is a little different, as recorded below*). Details of the system are described below:
Most of the architecture of this system has been implemented in the current Phylr implementation.
The Lucene indexer (org.nescent.phylr.lucene.NexmlIndexer) works on NeXML files stored on the file system. It reads a file that defines the searchable fields and their corresponding XPaths (an example can be found at http://code.google.com/p/phylr/source/browse/trunk/phylr-gsoc/src/test/resources/predicates.txt), evaluates each XPath against the NeXML files to extract the metadata, and indexes the extracted values under the corresponding fields.
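To illustrate the field-to-XPath extraction step described above, here is a minimal sketch that uses only the JDK's javax.xml.xpath API. The class name `FieldExtractor`, the field names, and the simplified, namespace-free XML in the usage note are all hypothetical; the real NexmlIndexer and the predicates file format may differ, and the actual Lucene indexing call is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Phylr.
class FieldExtractor {
    /** Evaluates each field's XPath against the XML and collects the text of all matching nodes. */
    static Map<String, List<String>> extract(String xml, Map<String, String> fieldToXPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldToXPath.entrySet()) {
            try {
                // The document is re-parsed for every field; fine for a sketch, wasteful for real use.
                NodeList nodes = (NodeList) xpath.evaluate(
                        entry.getValue(),
                        new InputSource(new StringReader(xml)),
                        XPathConstants.NODESET);
                List<String> values = new ArrayList<>();
                for (int i = 0; i < nodes.getLength(); i++) {
                    values.add(nodes.item(i).getTextContent());
                }
                result.put(entry.getKey(), values);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("Bad XPath for field " + entry.getKey(), e);
            }
        }
        return result;
    }
}
```

Given a map such as `taxon -> //otu/@label`, the returned map holds one list of extracted values per field, which is the shape an indexer would then hand to Lucene field by field.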
The previously implemented org.oclc.os.SRW.lucene.SRWLuceneDatabase is responsible for forwarding work to the appropriate classes and putting the pieces together. For GSoC, therefore, all I did was customize the classes that SRWLuceneDatabase uses. All of these classes are in the package org.nescent.phylr.lucene. Please note that the properties file TreeIndex.props tells the system which customized classes to load.
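The idea of naming the customized class in a properties file can be sketched as follows. This is only an illustration of the general reflection pattern, not the SRWLuceneDatabase code: the interface `Resolver`, the class `EchoResolver`, the loader, and the property key passed in are all hypothetical, and the real keys in TreeIndex.props are not shown here.

```java
import java.util.Properties;

// Hypothetical interface and implementation, used only for this illustration.
interface Resolver {
    String resolve(String id);
}

class EchoResolver implements Resolver {
    public String resolve(String id) {
        return "<record id=\"" + id + "\"/>";
    }
}

class ResolverLoader {
    /** Instantiates the class whose name is stored under the given property key. */
    static Resolver load(Properties props, String key) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        try {
            // Look the class up by name and call its no-argument constructor.
            Class<?> cls = Class.forName(className);
            return (Resolver) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }
}
```

With this pattern, swapping one resolver for another only requires changing a line in the properties file rather than recompiling the database class.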
I implemented org.nescent.phylr.lucene.DCRecordResolver and org.nescent.phylr.lucene.NexmlRecordResolver. They are called to resolve each record in the results to the right format: DCRecordResolver generates Dublin Core XML, while NexmlRecordResolver simply grabs the NeXML file (which is stored in a Lucene index field) and inserts it into a record.
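As a rough picture of what resolving a record to Dublin Core involves, the sketch below turns a map of stored field values into a small XML fragment. It is not the DCRecordResolver code: the class name `DcRecordSketch` and the `<record>` wrapper element are hypothetical, and the map keys are assumed to already be valid Dublin Core element names such as `title` or `creator`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the wrapper element is an assumption, not the SRW record format.
class DcRecordSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one dc:* element per stored value, in map iteration order. */
    static String toDublinCore(Map<String, List<String>> fields) {
        StringBuilder xml = new StringBuilder("<record xmlns:dc=\"" + DC_NS + "\">");
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            for (String value : entry.getValue()) {
                xml.append("<dc:").append(entry.getKey()).append('>')
                   .append(escape(value))
                   .append("</dc:").append(entry.getKey()).append('>');
            }
        }
        return xml.append("</record>").toString();
    }
}
```

A NeXML resolver is simpler still, since the stored NeXML text can be placed into the record unchanged.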
Relational Database Adapter
The architecture of the system is similar to the Lucene adapter. All classes are in the package org.nescent.phylr.relational.
I tried to make the classes generic and reusable, and to push the queries out of the code (they are kept in the properties file).
More details on the individual pieces of the adapter are given below:
SRWRelationalDatabase org.nescent.phylr.relational.SRWRelationalDatabase was implemented. It extends ORG.oclc.os.SRW.SRWDatabase and overrides the necessary methods, such as getQueryResult. Like SRWLuceneDatabase, this class takes a properties file that can be configured by the user. In the properties file, the user can define which RecordResolver to use and what the pieces of the queries should look like. See BioSQL.props for an example.
CQL/SQL Translator org.nescent.phylr.relational.BioSQLQueryTranslator was implemented. The class translates CQL to SQL, based on the database schema. I tried to make this class generic. It uses SQL pieces defined in the properties file, where each SQL piece corresponds to a field. The overall SQL statement is then built in this class by putting the pieces together.
QueryResult, RecordResolver, and RecordIterator These classes follow the same structure as in the Lucene adapter. One thing to point out is that I used the Apache Commons DbUtils package to facilitate the database connections and queries. The results are processed in BioSQLResultSetHandler.
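The "SQL pieces per field" idea can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the BioSQLQueryTranslator itself: the class name `SqlPieceAssembler`, the property keys (`select`, `where.<field>`), and the table and column names in the usage note are hypothetical, and CQL parsing is skipped, so the caller supplies the clauses directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: assembles a parameterized SQL statement from per-field fragments.
class SqlPieceAssembler {
    private final Properties pieces;
    /** Bind parameters collected in the order their placeholders appear. */
    final List<String> parameters = new ArrayList<>();

    SqlPieceAssembler(Properties pieces) {
        this.pieces = pieces;
    }

    /** Returns the SQL fragment configured for one field and records the term as a bind parameter. */
    String clause(String field, String term) {
        String piece = pieces.getProperty("where." + field);
        if (piece == null) {
            throw new IllegalArgumentException("Unsupported field: " + field);
        }
        parameters.add(term);
        return "(" + piece + ")";
    }

    /** Joins two fragments with AND or OR, mirroring a CQL boolean node. */
    String combine(String left, String operator, String right) {
        String op = operator.toUpperCase(Locale.ROOT);
        if (!op.equals("AND") && !op.equals("OR")) {
            throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
        return "(" + left + " " + op + " " + right + ")";
    }

    /** Prepends the configured SELECT piece to the assembled WHERE clause. */
    String select(String where) {
        return pieces.getProperty("select") + " WHERE " + where;
    }
}
```

For example, with `select=SELECT t.tree_id FROM tree t` and `where.dc.title=t.name = ?`, a CQL query on `dc.title` becomes a prepared statement whose term is bound as a parameter rather than concatenated into the SQL text. Keeping the fragments in the properties file means a different schema needs new properties, not new Java code.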
|5.25 – 5.31||
Develop indexer to index matrix NeXML
Extend the BasicLuceneQueryTranslator
|6.1 - 6.7||
Extend the BasicLuceneQueryTranslator
Build DCRecordGenerator for retrieving trees and matrices.
|6.8 – 6.14|| Project setup
Build NexmlRecordGenerator for retrieving trees and matrices.
|6.15 - 6.21|| Build NexmlRecordGenerator for retrieving trees and matrices
|6.22 - 6.28||
|6.29 - 7.5|| Implement tasks for client side XSLT customization (milestone: client side XSLT)
|7.6-7.12|| Implement CQL/SQL translator for relational database
|7.13-7.19|| 1. Set up a BioSQL database, using PostgreSQL.
2. Map the TreeBase schema to BioSQL, and import the TreeBase dump into the BioSQL database.
1. Set up a BioSQL database, using PostgreSQL. 2. Created a mapping from the TreeBase schema to BioSQL
||Migrated data from the TreeBase dump into the BioSQL database|
|| Implemented SRWBioSQLDatabase, BioSQLRecordIterator, BioSQLQueryResult
|8.3 - 8.9||
|8.10||Stop coding; write the report and documentation.||No|
|scr||Y||same as =|
|exact||Y||Added for IN Harmony and only currently works on that database. Will eventually work on all.|
|within||Y||Works, but has unexpected behavior if performed on fields with multiple values, or on fields that store data in unusual ways (TOKENIZED). Support will be added to make this work better and to make it work as expected for dates.|
|encloses||N||should be implicit for our date searches|
|prox||N||will need for text collections, must define supported modifiers|
We don't have an immediate need for modifiers, but we will eventually want to support much of the bibliographic profile.
We need to support wildcards. Anchors can be omitted for now.
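One possible way to handle the wildcards is sketched below, assuming the standard CQL masking characters: `*` for any run of characters, `?` for exactly one character, and backslash to escape the next character. The class name `CqlWildcards` is hypothetical, and the sketch matches against plain strings with a regular expression instead of going through the search backend. Anchors are not handled, so a `^` is simply treated as a literal character.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: converts a CQL term with masking characters into a regex.
class CqlWildcards {
    /** Builds a regex in which * and ? act as wildcards and everything else is literal. */
    static Pattern toPattern(String term) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\' && i + 1 < term.length()) {
                // A backslash escapes the next character, so emit that character literally.
                regex.append(Pattern.quote(String.valueOf(term.charAt(++i))));
            } else if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** True if the whole value matches the wildcard term. */
    static boolean matches(String term, String value) {
        return toPattern(term).matcher(value).matches();
    }
}
```

Here `CqlWildcards.matches("prim*", "primates")` is true, while `CqlWildcards.matches("h?mo", "hmo")` is false because `?` must consume one character. A Lucene-backed server could instead pass `*` and `?` through to Lucene's own wildcard query syntax.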
CQL context set
|resultSetId||yes||Re-fetching a resultset is supported and optimized. Combining a resultset with other resultsets or with other search clauses is also supported.|
Relation support is noted above (in general support).
None of the modifiers/qualifiers are supported.
DC context set
- All* of the DC context set is supported.
Bib context set
None of the bib context set is supported, but we will gradually start using this as needed.
Do we want to support the new sortby feature? (This is more practical than the old XPath entry in the regular SRU parameters). What types of sort do we need?