Alva Couch, Tufts University; Martin Seul, Yaping Xiao, Richard Hooper, CUAHSI
{acouch,mseul,yxiao,rhooper}@cuahsi.org
ABSTRACT
The CUAHSI Hydrologic Information System (HIS) Central Catalog is currently based
upon the structure of a data server (without the data).
This limits its utility and ability to utilize state-of-the-art catalog technologies.
We have undertaken a ground-up redesign of the catalog in order to:
ARCHITECTURE
[Figure: architecture diagram. Labels: data sources; Harvesting; raw metadata; searchable metadata; SOLR Search Engine (Linux).]
[Figure: architecture diagram, further labels: searchable documents describing data resources; matching documents; Metadata Document Store; SOLR query; results; data discovery clients; Query Front End; schema.xml; dataconfig.xml. Legend: blue: we provide; green: provided by SOLR; brown: external.]
LIMITATIONS
There have been significant difficulties in trying to update the existing catalog for
changes in metadata.
This led to many tables, views, and functions of unknown provenance and use.
CULTURE SHOCK!
OBJECTIVES
High modularity: catalogs for different data types do not interact or conflict.
Scalability to immense numbers of data sets and an unlimited number of new data
types.
[Figure: harvesting detail. Labels: data sources; Metadata Document Store; SOLR Data Import Handler.]
STRATEGIES
Use off-the-shelf catalog technologies.
Record raw metadata from the information source rather than fitting metadata into
existing schemas.
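The strategy of recording raw metadata can be illustrated with a minimal sketch. It assumes a purely in-memory stand-in for the Metadata Document Store; the names `RawMetadataStore`, `harvest`, and `example-source`, and the sample records, are hypothetical and do not come from the poster. The point is only that each record is kept verbatim, whatever its format, and that re-harvesting unchanged metadata adds nothing.

```python
import hashlib
import json
from datetime import datetime, timezone


class RawMetadataStore:
    """Hypothetical in-memory stand-in for the Metadata Document Store."""

    def __init__(self):
        self._docs = {}

    def put(self, source, raw_text):
        # Key by source plus content hash, so re-harvesting unchanged
        # metadata does not create a second copy.
        digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        key = f"{source}:{digest[:16]}"
        if key not in self._docs:
            self._docs[key] = {
                "source": source,
                "harvested_at": datetime.now(timezone.utc).isoformat(),
                "raw": raw_text,  # stored verbatim, not mapped to a schema
            }
        return key

    def get(self, key):
        return self._docs[key]

    def __len__(self):
        return len(self._docs)


def harvest(store, source, records):
    """Store each record exactly as the source supplied it."""
    return [store.put(source, record) for record in records]


# Illustrative records in two different formats (both invented).
store = RawMetadataStore()
records = [
    "<site><code>A1</code><name>Example Creek</name></site>",
    json.dumps({"variable": "discharge", "units": "m^3/s"}),
]
keys = harvest(store, "example-source", records)
keys_again = harvest(store, "example-source", records)  # no new documents
```

Deciding which parts of the raw text become searchable is left to a later indexing step, which is what keeps the harvester independent of any one schema.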
[Figure: query front end diagram. Labels: formulate SOLR query; SOLR determines content of result; SOLR XML output; select XSLT specification; XSLT transducer determines format of result; desired XML result.]
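The split between content (chosen by the SOLR query) and format (chosen by the XSLT specification) can be sketched as a function that builds a request URL. This is an illustration under assumptions, not the poster's front end: the base URL, the core name `catalog`, and the stylesheet name `sites.xsl` are hypothetical. It uses SOLR's own XSLT response writer (`wt=xslt` with the `tr` parameter), which must be enabled on the server; the poster does not say whether the transducer runs inside SOLR or in the front end.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url, text, stylesheet=None, rows=10):
    """Build a SOLR select URL; the query picks content, the stylesheet picks format."""
    params = {"q": text, "rows": rows}
    if stylesheet is None:
        params["wt"] = "xml"       # plain SOLR XML output
    else:
        params["wt"] = "xslt"      # SOLR's XSLT response writer
        params["tr"] = stylesheet  # stylesheet that shapes the result
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical endpoint and stylesheet names.
url = build_solr_query("http://localhost:8983/solr/catalog", "discharge", "sites.xsl")
parsed = parse_qs(urlsplit(url).query)
```

Requesting the same query with a different stylesheet changes only the format of the result, not which documents match.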
CHOICES
1. ElasticSearch.
2. MarkLogic.
3. SOLR: our choice for now.
[Poster labels separated from their panel: Linux; SQL; NoSQL; Tables; Write a harvester.]
CRITIQUE
SOLR simplifies search and discovery, at the expense of redundant storage and/or
redundant queries.
* What SOLR calls a join is not a join at all; it is a kind of filter that uses data from other
collections in filtering. In some circles this is called a join filter rather than a join.
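The difference between a join and a join filter can be shown with a small in-memory sketch. The function `join_filter` and the field names `id`, `site_id`, `name`, and `variable` are hypothetical. In SOLR's query syntax the corresponding request would look roughly like `{!join from=site_id to=id}variable:discharge`.

```python
def join_filter(from_docs, to_docs, from_field, to_field, predicate):
    """Return the to_docs whose to_field matches a from_doc that passes the predicate."""
    keys = {doc[from_field] for doc in from_docs if predicate(doc)}
    # Only documents from the "to" side come back; nothing is merged in.
    return [doc for doc in to_docs if doc[to_field] in keys]


# Illustrative data: two sites and three time series (all invented).
sites = [
    {"id": "s1", "name": "Example Creek"},
    {"id": "s2", "name": "Sample River"},
]
series = [
    {"site_id": "s1", "variable": "discharge"},
    {"site_id": "s1", "variable": "temperature"},
    {"site_id": "s2", "variable": "temperature"},
]

matches = join_filter(
    series, sites, "site_id", "id",
    lambda doc: doc["variable"] == "discharge",
)
```

A relational join would return combined rows carrying fields from both sides. Here the series documents only decide which site documents survive, and none of their fields appear in the result.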
CONCLUSIONS
One cannot eliminate complexity from a project, but one can move it around.
In adopting SOLR rather than SQL for the catalog, we moved complexity to where it
counts:
1. Capturing the proper metadata: by configuring the SOLR harvester.
Write a front end that translates raw metadata into useful forms.
[Poster labels separated from their panels: Windows; user format requirements; user data query.]
CRITIQUE (continued)
Thus, it may be necessary to record the same metadata in several places within
SOLR to enable different kinds of searches.
It is often necessary to precompute relationships between documents that one
needs.
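As a sketch of what recording the same metadata in several places can mean in practice, the following precomputes the site-to-series relationship in both directions before indexing. The function and field names (`series_documents`, `site_documents`, `site_name`, `variables`) are hypothetical, and the data is invented.

```python
def series_documents(sites, series):
    """Copy the site name into every series document so series can be searched by site."""
    by_id = {site["id"]: site for site in sites}
    return [{**row, "site_name": by_id[row["site_id"]]["name"]} for row in series]


def site_documents(sites, series):
    """Precompute, for each site, the list of variables measured there."""
    variables = {}
    for row in series:
        variables.setdefault(row["site_id"], set()).add(row["variable"])
    return [
        {**site, "variables": sorted(variables.get(site["id"], ()))}
        for site in sites
    ]


sites = [
    {"id": "s1", "name": "Example Creek"},
    {"id": "s2", "name": "Sample River"},
]
series = [
    {"site_id": "s1", "variable": "discharge"},
    {"site_id": "s1", "variable": "temperature"},
    {"site_id": "s2", "variable": "temperature"},
]

series_docs = series_documents(sites, series)
site_docs = site_documents(sites, series)
```

The site name is now stored once per series and the variable list once per site. That is the redundant storage the critique describes, and it is what lets each kind of search run against a single document type.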
ACKNOWLEDGEMENTS
Caleb Malchik, Yunjie Lu, Janeth Jepkogei (Tufts University students)
This work was supported in part by the CUAHSI Water Data Center
Cooperative Agreement (NSF Award 1248152).