
Rethinking the CUAHSI HIS Catalog

Alva Couch, Tufts University; Martin Seul, Yaping Xiao, Richard Hooper, CUAHSI
{acouch,mseul,yxiao,rhooper}@cuahsi.org

ABSTRACT

ARCHITECTURE

The CUAHSI Hydrologic Information System (HIS) Central Catalog is currently based
upon the structure of a data server (without the data). This limits its utility and
its ability to use state-of-the-art catalog technologies. We have undertaken a
ground-up redesign of the catalog in order to:

1. Enable new kinds of data discovery.
2. Support new data types other than time series at a location in space.
3. Enable quick response to new user needs.

[Architecture figure labels: Data sources; Harvesting; raw metadata; searchable
metadata; Metadata Document Store; searchable documents describing data resources;
SOLR Search Engine (Linux); Matching documents; SOLR query results]

There have been significant difficulties in trying to update the existing catalog for:
Changes in metadata.

LIMITATIONS

CULTURE SHOCK!

[Architecture figure labels: data discovery clients; WOFS and OGC; query results;
Query Front End. Color key: blue, we provide; green, provided by SOLR; brown, external.]

CONFIGURING THE HARVESTER

New data types other than time series.


Consistently acceptable performance.

Some reasons for the difficulties include:


The need to fit new metadata into a predetermined metadata schema.
Thus, the need to predict in advance what will be needed in the future.

schema.xml: what metadata we wish to collect; how searching should work.

dataconfig.xml: where to find metadata in XML or SQL sources.

This led to many tables, views, and functions of unknown provenance and use.

SQL performance tuning has proven ineffective at solving performance problems.

OBJECTIVES
High modularity: catalogs for different data types do not interact or conflict.
Scalability to immense numbers of data sets and an unlimited number of new data
types.

[Figure: data sources (including web services, XML, SQL, ...) -> SOLR Data Import
Handler -> Metadata Document Store]
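
The harvesting step can be illustrated with a minimal Python sketch. It is not the SOLR Data Import Handler itself; it only mimics the idea of declaring where each metadata field is found in a raw XML source and producing a flat, searchable document. The field names, paths, and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from catalog field names to paths in the raw XML.
# It plays a role similar to the one dataconfig.xml plays for the harvester.
FIELD_PATHS = {
    "site_name": "site/name",
    "variable": "variable/name",
    "units": "variable/units",
}

def harvest(raw_xml, field_paths=FIELD_PATHS):
    """Turn one raw XML metadata record into a flat dictionary of fields."""
    root = ET.fromstring(raw_xml)
    doc = {}
    for field, path in field_paths.items():
        node = root.find(path)
        # Skip fields that are missing or empty in this record.
        if node is not None and node.text:
            doc[field] = node.text.strip()
    return doc

# Made-up sample record, for illustration only.
RAW = """<series>
  <site><name>Example Creek</name></site>
  <variable><name>Discharge</name><units>cfs</units></variable>
</series>"""

document = harvest(RAW)
```

Under this approach, adding a new metadata field means adding one entry to the mapping rather than writing new harvesting code.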

CONFIGURING THE FRONT END

STRATEGIES
Use off-the-shelf catalog technologies.
Record raw metadata from the information source rather than fitting metadata into
existing schemas.

[Figure: Formulate SOLR query -> SOLR (determines content of result) -> SOLR XML
output -> XSLT transducer, using the selected XSLT specification (determines format
of result) -> Desired XML result]
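
A front end of this kind could formulate its SOLR request roughly as in the Python sketch below. The host, core name, field names, and stylesheet name are placeholders. The parameters `q`, `rows`, `facet`, `facet.field`, `wt`, and `tr` are standard SOLR request parameters, but whether `wt=xslt` is available depends on the SOLR version and configuration, so treat that part as an assumption.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/catalog/select"

def build_query(text, facet_fields=(), stylesheet=None, rows=10):
    """Build a SOLR select URL; the query picks content, the stylesheet picks format."""
    params = [("q", text), ("rows", str(rows))]
    if facet_fields:
        # Ask SOLR to return value counts for each requested field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    if stylesheet is not None:
        # Ask SOLR to transform its XML output with the named XSLT file.
        params.append(("wt", "xslt"))
        params.append(("tr", stylesheet))
    else:
        params.append(("wt", "xml"))
    return SOLR_SELECT + "?" + urlencode(params)

url = build_query("variable:Discharge", facet_fields=["units"], stylesheet="example.xsl")
```

The query text determines which documents come back, while the choice of stylesheet determines only the shape of the result, so the two can be configured independently.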

CRITIQUE

CHOICES

SOLR simplifies search and discovery, at the expense of redundant storage and/or
redundant queries.

Use a search engine specifically suited for creating catalogs.

SOLR metadata documents must be detailed enough to contain everything a searcher
might need, to assure good performance.

1. ElasticSearch.
2. MarkLogic.
3. SOLR (our choice for now).

Old approach (database)               | New approach (SOLR)
--------------------------------------+------------------------------------------------------
Windows                               | Linux
SQL                                   | NoSQL
Tables                                | Collections of metadata documents
Write a harvester                     | Configure built-in harvesters
Code in C# and XML                    | Configure in XML and XSLT
Write XML from SQL results            | Transform internal XML results to appropriate outgoing XML
Create indexes to improve performance | Specify search types; indexes are created appropriately
Normalize tables and use joins        | No joins: one collection per data type to be returned *
Change columns of tables              | Only whole documents can be changed
Design XML and SQL schemas            | New schema type just for SOLR data
Faceted search really difficult       | Faceted search built in

* What SOLR calls a join is not a join at all; it is a kind of filter that uses data from other
collections in filtering. In some circles this is called a join filter rather than a join.
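
The distinction can be shown with a small Python sketch over in-memory lists. This is an illustration of the two concepts only, not of how SOLR implements its join; the collections and field names are made up.

```python
# Made-up sample collections, for illustration only.
sites = [
    {"site_id": "s1", "state": "UT"},
    {"site_id": "s2", "state": "MA"},
]
series = [
    {"series_id": "t1", "site_id": "s1", "variable": "Discharge"},
    {"series_id": "t2", "site_id": "s2", "variable": "Discharge"},
    {"series_id": "t3", "site_id": "s1", "variable": "Temperature"},
]

def relational_join(series, sites):
    """A true join: each result combines fields from both collections."""
    by_id = {s["site_id"]: s for s in sites}
    return [{**t, **by_id[t["site_id"]]} for t in series if t["site_id"] in by_id]

def join_filter(series, sites, site_predicate):
    """A join filter: the other collection only decides which documents pass."""
    wanted = {s["site_id"] for s in sites if site_predicate(s)}
    return [t for t in series if t["site_id"] in wanted]

joined = relational_join(series, sites)
filtered = join_filter(series, sites, lambda s: s["state"] == "UT")
```

The joined results carry the `state` field from the sites collection, whereas the filtered results are unmodified series documents. Any site metadata a searcher needs in the result must therefore already be stored in the series documents.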

CONCLUSIONS
One cannot eliminate complexity from a project, but one can move it around.
In adopting SOLR rather than SQL for the catalog, we moved complexity to where it
counts:
1. Capturing the proper metadata: by configuring the SOLR harvester.

SOLR does everything else.

HOW YOU CAN HELP


Join us in experimenting with SOLR catalogs.
Find the queries that help users.

Write a front end that translates raw metadata into useful forms.

Potential choices included:

Windows

3. Formatting output: into formats that are in common use.

Faceted search to allow more efficient data discovery.


Quick adaptability in cataloging new data types or recording new metadata.

New approach (SOLR)

2. Searching the proper metadata: by specifying queries in ways that are
understandable to users.

[Front end figure inputs: user format requirements; user data query]

Old approach (database)

Thus, it may be necessary to record the same metadata in several places within
SOLR to enable different kinds of searches.
It is often necessary to precompute relationships between documents that one
needs.
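
As a rough illustration of both points, the Python sketch below builds one document per site from per-series records. It stores the site name twice, once for exact matching and once as lowercase words for keyword search, and precomputes the relationship "variables measured at this site". The field names and sample records are hypothetical.

```python
from collections import defaultdict

# Made-up per-series metadata records, for illustration only.
series = [
    {"site": "Example Creek", "variable": "Discharge"},
    {"site": "Example Creek", "variable": "Temperature"},
    {"site": "Sample Pond", "variable": "Temperature"},
]

def build_site_documents(series):
    """Build one self-contained document per site, with redundant fields."""
    variables_at = defaultdict(set)
    for record in series:
        variables_at[record["site"]].add(record["variable"])
    docs = []
    for site, variables in sorted(variables_at.items()):
        docs.append({
            "site_exact": site,                  # for exact match and faceting
            "site_words": site.lower().split(),  # same metadata, for keyword search
            "variables": sorted(variables),      # relationship computed ahead of time
        })
    return docs

site_docs = build_site_documents(series)
```

The cost is redundant storage and a rebuild whenever the source records change; the benefit is that each search needs only a single lookup against one collection.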

Explore catalogs for new kinds of data.

ACKNOWLEDGEMENTS
Caleb Malchik, Yunjie Lu, Janeth Jepkogei (Tufts University students)
This work was supported in part by the CUAHSI Water Data Center
Cooperative Agreement (NSF Award 1248152).
