User:Vieglais/SOC VTM

From Phyloinformatics

Vocabulary Term Mapping

Rationale
One common and significant hurdle encountered while working with (merging, extracting, indexing) multiple data sets is determining and mapping semantically equivalent terms. Metadata descriptions of data sets can provide crucial information about data element semantics, though in many cases determining the equivalence (or otherwise) of the terms described therein is a challenge exacerbated by the sheer number of terms involved (e.g. even a simple standard such as the Darwin Core has over a hundred different terms described). When mapping terms between two different documents there are N1 × N2 comparisons to be made, where N1 and N2 are the numbers of terms defined in documents 1 and 2 respectively. This large number of comparisons can be overwhelming, and it would be very helpful to have a tool that, given two sets of terms and their descriptions, returns a ranked list of matching candidate terms (i.e. for each term in document 1, find all potentially related terms in document 2 and order them by relevance). A simple tool such as this could provide considerable assistance when attempting to map semantically similar terms between metadata documents, and thus semantically similar or equivalent content in data sets.
Approach
Assume that one can obtain, by some means, a list of terms and their human-readable descriptions from each metadata document. The goal of this project is to take two such lists and, for each term + description in list 1, find all the related terms in list 2, rank them by relevance, and provide a summary of the rules that determined the matches and their relevance. Algorithms for determining similarity can range from simple word + thesaurus counting tools to the more sophisticated algorithms available in the various natural language processing toolkits.
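As a rough illustration of the simplest end of that range, the following Python sketch ranks candidates by cosine similarity over word counts and reports the shared words as a crude explanation of each match. It is only a sketch under stated assumptions: the function names, the stop-word list, and the two example term lists are hypothetical and not taken from any existing tool or metadata standard, and no thesaurus lookup is attempted.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "for", "to", "and", "or", "is", "at"}


def tokenize(text):
    """Split camelCase, lowercase, keep alphabetic words, drop stop words."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_matches(terms1, terms2):
    """For each term in terms1, rank the related terms in terms2.

    terms1, terms2: dict mapping term name -> human-readable description.
    Returns a dict mapping each term in terms1 to a list of
    (term2, score, shared_words) tuples, most relevant first.
    Candidates with no words in common are omitted.
    """
    # The term name is tokenized along with its description so that
    # similar names contribute to the score as well.
    vectors2 = {t: Counter(tokenize(t + " " + d)) for t, d in terms2.items()}
    ranked = {}
    for t1, d1 in terms1.items():
        v1 = Counter(tokenize(t1 + " " + d1))
        candidates = []
        for t2, v2 in vectors2.items():
            score = cosine(v1, v2)
            if score > 0:
                shared = sorted(v1.keys() & v2.keys())
                candidates.append((t2, score, shared))
        # Highest score first; ties broken alphabetically by term name.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        ranked[t1] = candidates
    return ranked


# Hypothetical example term lists (not quoted from any real standard).
list1 = {
    "DecimalLatitude": "The latitude of the location in decimal degrees",
    "Collector": "The name of the person who collected the specimen",
}
list2 = {
    "lat": "Geographic latitude in decimal degrees north of the equator",
    "recorded_by": "Person who collected or observed the specimen",
    "depth": "Depth below the water surface in metres",
}

if __name__ == "__main__":
    for term, matches in rank_matches(list1, list2).items():
        for candidate, score, shared in matches:
            print(f"{term} -> {candidate} ({score:.2f}) shared words: {shared}")
```

The nested loop makes the N1 × N2 cost explicit. A thesaurus-aware version could expand each token to its synonyms before counting, and the list of shared words is one possible form for the summary of rules that determined a match.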
Challenges
Automated information extraction is an ongoing area of research, although it is expected that considerable progress could be made on this project using existing open source tools. One significant challenge may be the availability of relevant vocabularies and thesauri for the topics of interest in a programmatically useful format such as SKOS (http://www.w3.org/2004/02/skos/). The relevance calculation algorithm is core to the effective functioning of this system, and will likely need considerable attention to achieve satisfactory results when operating on sophisticated technical descriptions from the various sciences for which metadata documents are available.
Involved Toolkits or Projects
Mentor
Dave Vieglais (vieglais at ku edu)