Catalog of Metadata
- Many different standards exist for the structure and content of metadata documents, and this already large number is growing. Furthermore, since many of these metadata documents are not exactly interesting reading, there is a strong tendency for researchers and implementors to go out on their own and create a new standard rather than spend the time and effort necessary to discover, let alone digest, these typically cumbersome, dense documents. This of course simply exacerbates the situation: as these fringe projects become successful, so do their corresponding metadata standards, adding to the already large pile of documents. By providing a simple catalog of existing metadata standards and a mechanism for searching them by keyword or phrase, it is hoped that the proliferation of metadata standards (at least for our target community) will be reduced because a) a list of standards will be readily available in the catalog, and b) simple search, browse, and keywording/tagging mechanisms will assist interested parties in discovering standards of direct or related interest.
- Several components need to be adapted or implemented for a successful outcome to be reached, though perhaps the most important is the identification of crucial terms to use when indexing the metadata documents (i.e. metadata about the metadata). Something as simple as Dublin Core (http://dublincore.org/) is expected to provide considerable assistance for discovering relevant documents within the system. The major components to be configured / adapted / implemented are:
- An information retrieval indexing service, such as the Apache Lucene-based SOLR system, that provides fuzzy search capabilities
- A set of tools that can parse various formats of metadata documents and generate index entries for the indexing service
- A web interface for searching and browsing the catalog, and a management interface for editing references to metadata documents.
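As a rough illustration of how the second component might feed the first, the sketch below pulls Dublin Core elements out of an XML metadata record, serializes them as a SOLR XML update message, and builds a fuzzy query string in Lucene syntax (a trailing `~`). The index field names (`dc_title`, `dc_subject`, and so on), the choice of Dublin Core elements, and the function names are assumptions made for this example only; an actual schema would be decided during the project.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical subset of Dublin Core elements to index (an assumption).
DC_FIELDS = ("title", "creator", "subject", "description", "identifier")


def extract_dc_entry(xml_text):
    """Collect Dublin Core element values from an XML record.

    Returns a dict mapping an assumed index field name (e.g. "dc_title")
    to the list of non-empty values found anywhere in the document.
    """
    root = ET.fromstring(xml_text)
    entry = {}
    for name in DC_FIELDS:
        values = [
            el.text.strip()
            for el in root.iter("{%s}%s" % (DC_NS, name))
            if el.text and el.text.strip()
        ]
        if values:
            entry["dc_" + name] = values
    return entry


def to_solr_add_xml(doc_id, entry):
    """Serialize one index entry as a SOLR XML <add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    # Sort field names so the output is stable across runs.
    for field, values in sorted(entry.items()):
        for value in values:
            ET.SubElement(doc, "field", name=field).text = value
    return ET.tostring(add, encoding="unicode")


def fuzzy_query_params(term, field="dc_subject"):
    """Build URL query parameters for a fuzzy search on a single term."""
    return urlencode({"q": "%s:%s~" % (field, term), "wt": "json"})


SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Ecological Metadata Language</dc:title>
  <dc:subject>ecology</dc:subject>
  <dc:subject>metadata</dc:subject>
</record>"""

if __name__ == "__main__":
    print(to_solr_add_xml("eml", extract_dc_entry(SAMPLE)))
    print(fuzzy_query_params("ecolgy"))
```

The update message would then be posted to the indexing service, and the query parameters appended to its search URL; both endpoints depend on how the service is deployed and are left out here.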
- Apart from the usual issues encountered when combining several libraries and applications developed for different purposes, it is expected that some interesting intellectual challenges will arise as well. One immediate issue is that there is no guarantee a consistent set of terms can be identified for the meta-metadata index, especially when considering the complexity of some scientific metadata standards such as EML (http://cvs.ecoinformatics.org/cvs/cvsweb.cgi/eml/) and FGDC (http://www.fgdc.gov/metadata/csdgm/). Building an efficient content extraction framework that operates over several different types of content (e.g. HTML, plain text, MS Word documents) with minimal reconfiguration requires skill with both low-level document parsing techniques and higher-level knowledge extraction tools, the latter of which are likely to require some reconfiguration to operate effectively with scientific rather than general literature.
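One way to picture an extraction framework that handles several content types with little reconfiguration is a registry mapping each content type to a text extractor, followed by a single keyword step shared by all formats. The sketch below covers only HTML and plain text using the Python standard library; MS Word documents would need a third-party parser and are not shown. The keyword step is a plain term-frequency count standing in for a real knowledge extraction tool such as KEA, and the stopword list, registry layout, and function names are assumptions for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Gather visible text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(content):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(content)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def plain_to_text(content):
    """Collapse runs of whitespace in plain text."""
    return " ".join(content.split())


# Registry of extractors keyed by content type; supporting a new format
# means adding one entry here (hypothetical layout).
EXTRACTORS = {"text/html": html_to_text, "text/plain": plain_to_text}


def extract_text(content, mime_type):
    """Dispatch to the extractor registered for the given content type."""
    try:
        extractor = EXTRACTORS[mime_type]
    except KeyError:
        raise ValueError("no extractor registered for %s" % mime_type)
    return extractor(content)


# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = frozenset("a an and are as for in is of on or the to with".split())


def candidate_keywords(text, limit=5):
    """Rank candidate index terms by raw frequency (a stand-in for KEA)."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]
    return [word for word, _ in Counter(words).most_common(limit)]
```

Raw frequency counts tend to favor generic vocabulary, which is one reason the extraction tools are expected to need tuning for scientific literature.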
- Involved Toolkits or Projects
- Apache SOLR (http://lucene.apache.org/solr/) and Lucene (http://lucene.apache.org/java/docs/)
- Various low level parsing tools (e.g. XML and HTML parsers)
- Knowledge extraction tools such as KEA (http://www.nzdl.org/Kea/) and the Natural Language Toolkit (http://www.nltk.org/)
- A web framework such as Tomcat or Django
- Dave Vieglais (vieglais at ku edu)