A Social Model of Hydrologic Data Protection and Sharing

Rethinking the CUAHSI HIS Catalog
Alva Couch, Tufts University; Martin Seul, Yaping Xiao, Richard Hooper, CUAHSI
{acouch,mseul,yxiao,rhooper}@cuahsi.org
ABSTRACT
The CUAHSI Hydrologic Information System (HIS) Central Catalog is currently based
upon the structure of a data server (without the data).
This limits its utility and its ability to use state-of-the-art catalog technologies.
We have undertaken a ground-up redesign of the catalog in order to:
1. Enable new kinds of data discovery.
2. Support new data types other than time series at a location in space.
3. Enable quick response to new user needs.

LIMITATIONS
There have been significant difficulties in trying to update the existing catalog for:
Changes in metadata.
New data types other than time series.
Consistently acceptable performance.
Some reasons for the difficulties include:
The need to fit new metadata into a predetermined metadata schema.
Thus, the need to predict in advance what will be needed in the future.
This led to many tables, views, and functions of unknown provenance and use.
SQL performance tuning has proven ineffective at solving performance problems.

OBJECTIVES
High modularity: catalogs for different data types do not interact or conflict.
Scalability to immense numbers of data sets and an unlimited number of new data
types.
Faceted search to allow more efficient data discovery.
Quick adaptability in cataloging new data types or recording new metadata.

STRATEGIES
Use off-the-shelf catalog technologies.
Record raw metadata from the information source rather than fitting metadata into
existing schemas.

CHOICES
Use a search engine specifically suited for creating catalogs.
Potential choices included:
1. ElasticSearch.
2. MarkLogic.
3. SOLR (our choice for now).

ARCHITECTURE
[Figure: catalog architecture. Labeled components: data sources; Harvesting (raw
metadata in, searchable metadata out); Metadata Document Store (searchable documents
describing data resources); SOLR Search Engine (Linux); Query Front End; data
discovery clients (WOFS and OGC). Labeled flows: SOLR query, matching documents,
SOLR query results, query results. Legend: blue, we provide; green, provided by
SOLR; brown, external.]

CONFIGURING THE HARVESTER
[Figure: harvester configuration. schema.xml specifies what metadata we wish to
collect and how searching should work. dataconfig.xml specifies where to find
metadata in XML or SQL sources. The SOLR Data Import Handler reads data sources
(including web services, XML, SQL, ...) and fills the Metadata Document Store.]

CONFIGURING THE FRONT END
[Figure: front end configuration. A user data query is used to formulate a SOLR
query; SOLR determines the content of the result and returns SOLR XML output. User
format requirements are used to select an XSLT specification; the XSLT transducer
determines the format of the result and produces the desired XML result.]

CULTURE SHOCK!
Old approach (database)               | New approach (SOLR)
Windows                               | Linux
SQL                                   | NoSQL
Tables                                | Collections of metadata documents
Write a harvester                     | Configure built-in harvesters
Code in C# and XML                    | Configure in XML and XSLT
Write XML from SQL results            | Transform internal XML results to appropriate outgoing XML
Create indexes to improve performance | Specify search types; indexes are created appropriately
Normalize tables and use joins        | No joins: one collection per data type to be returned *
Change columns of tables              | Only whole documents can be changed
Design XML and SQL schemas            | New schema type just for SOLR data
Faceted search really difficult       | Faceted search built in
* What SOLR calls a join is not a join at all; it is a kind of filter that uses data from other
collections in filtering. In some circles this is called a join filter rather than a join.

CRITIQUE
SOLR simplifies search and discovery, at the expense of redundant storage and/or
redundant queries.
SOLR metadata documents must be detailed enough to contain everything a
searcher might need, to assure good performance.
Thus, it may be necessary to record the same metadata in several places within
SOLR to enable different kinds of searches.
It is often necessary to precompute relationships between documents that one
needs.

CONCLUSIONS
One cannot eliminate complexity from a project, but one can move it around.
In adopting SOLR rather than SQL for the catalog, we moved complexity to where it
counts:
1. Capturing the proper metadata: by configuring the SOLR harvester.
2. Searching the proper metadata: by specifying queries in ways that are
understandable to users.
3. Formatting output: into formats that are in common use.
SOLR does everything else.

HOW YOU CAN HELP
Join us in experimenting with SOLR catalogs:
Find the queries that help users.
Write a front end that translates raw metadata into useful forms.
Explore catalogs for new kinds of data.
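
As one possible starting point for experimenting with queries, the sketch below shows how a front end might formulate a faceted SOLR query and a SOLR "join" filter as plain request strings. It is a minimal illustration, not the catalog's actual front end: the host, the core name (catalog), and every field name (variable_name, organization, sample_medium, site_id) are hypothetical. Only the request parameters (q, fq, rows, wt, facet, facet.field) and the {!join from=... to=...} filter syntax are standard SOLR.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint: the host and the core name "catalog" are
# illustrative assumptions, not the actual CUAHSI catalog deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/catalog/select"


def build_faceted_query(text, facet_fields, filters=None, rows=10):
    """Return a SOLR select URL for a keyword search with facet counts.

    facet_fields and the keys of filters are hypothetical field names.
    Filter values are wrapped in double quotes but not otherwise escaped,
    so this sketch assumes they contain no quote characters.
    """
    params = [
        ("q", text),          # main query text
        ("rows", str(rows)),  # number of documents to return
        ("wt", "json"),       # response format
        ("facet", "true"),    # ask SOLR to compute facet counts
    ]
    # One facet.field parameter per field to be counted.
    params += [("facet.field", field) for field in facet_fields]
    # One fq (filter query) per facet value the user has already selected.
    params += [("fq", f'{name}:"{value}"')
               for name, value in (filters or {}).items()]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def join_filter(from_field, to_field, inner_query):
    """Return a SOLR join filter string.

    The filter keeps documents whose to_field value appears as the
    from_field value of some document matched by inner_query. It filters
    one set of documents using another; it does not merge their fields.
    """
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"


if __name__ == "__main__":
    print(build_faceted_query(
        "streamflow",
        ["variable_name", "organization"],
        {"sample_medium": "Surface Water"},
    ))
    print(join_filter("site_id", "id", "variable_name:streamflow"))
```

Each facet value a user selects becomes one more fq parameter, which is what makes drill-down discovery cheap to express. The join filter string could be passed as a further fq value, which illustrates the footnote's point that a SOLR join filters documents rather than combining them.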
ACKNOWLEDGEMENTS
Caleb Malchik, Yunjie Lu, Janeth Jepkogei (Tufts University students)
This work was supported in part by the CUAHSI Water Data Center
Cooperative Agreement (NSF Award 1248152).
