
The Million Book Project

The Mini-UL Digital Library Platform


Carnegie Mellon University
School of Computer Science
Raj Reddy
Eric Burns

What is the Million Book Project?

Free-to-read, open-platform digital library

Worldwide distribution and mirroring


Public domain works
Out of print but in copyright
Rare materials

Collaborative content acquisition

India

China

Over 30,000 books to date

USA / Carnegie Mellon (Hunt Library/SCS)

20 mini scanning centers, 3 mega scanning centers


Over 80,000 books to date

1200 books, technology contributor

Truly multi-lingual corpus

Several Indian languages


Mandarin Chinese
Most European languages

MBP offers unique systems challenges

Multiple deployments
  China
  India
  Partners in US
  Everyone wants autonomy and customization
  System-level solution must satisfy small and large data sets
  CMU must provide a framework for remote sites to extend

Human-intensive scanning process
  Error prone
    DC XML entered by hand
    Operator error on scanning devices
  Difficult to standardize
  Multiple QA passes required

Equipment budget is limited

Developing nations' networks are limited
  China, India output must be shipped to US

Core Problems

Multiple scanning centers, each with:

Common base requirements

Distinct values and goals


Limited connectivity
Varying IT infrastructure
Searching
Browsing
Viewing
File-system compatibility

Basic standard for acquiring and storing scanned books

Data preservation
Quality assurance
Flexibility
Openness

Fault-tolerant storage at all sites


Data movement via physical shipment
Standardized OS and base software

Our Solution: Mini-UL Embedded

Digital library on a CD
  OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
  Boots single systems or whole clusters
  Ensures standardization, eases upgrades
  To use new software, admins burn CD and reboot

Commodity PC and disk hardware spec
  Software RAID: Use low-end PC as network-attach storage
  Sub-$1000 PC = 1 TB NASD
    Barebones economy PC
    250 GB OEM disk x 4
  Add storage PCs as needs grow
    1 processor per storage unit

CD + PC(s) = Embedded digital library
  Black box approach
  Dump MBP-format books into upload bucket
  Easily search, browse, view, and download all books added

The MBP Book Format

Dir w/ five subdirectories:

OTIFF
  Original TIFF: exactly as scanned
  Eight-digit, zero-padded page numbers (00000123.tif)
  1-bit color at 600 DPI, lossless

PTIFF
  Processed TIFF: current best batch image processing
  Eight-digit zero-padded numbers match OTIFF

TXT
  ASCII, UTF-8, or UTF-16 text
  Numbers match OTIFF/PTIFF

HTML
  UTF-8 HTML w/ low-res JPEG images
  Numbers match OTIFF/PTIFF

[MARC|DC]
  Binary MARC record
  Dublin Core XML

Flexible: other format directories can be added

Internal storage format:
  OTIFF/PTIFF -> multipage
  TXT/HTML -> zip
  500 page book = 2001 files
  Converted at addition time to 5 files
  Speeds copying
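The file count above can be checked with a short sketch. It builds the expected file list for a book in the layout described on this slide. The TXT and HTML extensions, the metadata file name, and page numbering starting at 1 are assumptions for illustration, not part of the MBP specification.

```python
from pathlib import PurePosixPath

# Page-level directories and an assumed file extension for each.
PAGE_DIRS = {"OTIFF": "tif", "PTIFF": "tif", "TXT": "txt", "HTML": "html"}


def page_name(page: int, ext: str) -> str:
    """Eight-digit, zero-padded page file name, e.g. 00000123.tif."""
    return f"{page:08d}.{ext}"


def expected_files(book: str, pages: int,
                   meta_dir: str = "DC",
                   meta_file: str = "metadata.xml") -> list[str]:
    """List every file a book with `pages` pages should contain.

    meta_dir and meta_file are hypothetical names; the slide only says
    that the fifth directory is MARC or DC.
    """
    files = [
        str(PurePosixPath(book, directory, page_name(page, ext)))
        for directory, ext in PAGE_DIRS.items()
        for page in range(1, pages + 1)
    ]
    files.append(str(PurePosixPath(book, meta_dir, meta_file)))
    return files


if __name__ == "__main__":
    files = expected_files("book0001", 500)
    print(len(files))   # 4 page directories x 500 pages + 1 metadata file
    print(files[122])   # the OTIFF file for page 123
```

Under these assumptions a 500 page book has 4 x 500 + 1 = 2001 files, which the internal format reduces to two multipage TIFFs, two zip archives, and the metadata record.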

High-level Cluster Architecture

[Figure: Web traffic arrives at the Head Node (NASD 0), which connects over an internal network subnet to NASD 1, NASD 2, ..., NASD n. The NASDs are Network-Attach Storage Devices (SATA RAID PCs).]

Adding a Book

Head node has SMB share Upload

User moves one or more MBP-format books into Upload share

System automatically checks each book for completeness/correctness:
  All formats present
  Contiguous page numbers
  Metadata present and parseable
  Errors presented to user for correction

Converts to internal storage format

Assigns serial number

Moves to NASD node with most free space

Incremental search index
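A minimal sketch of the completeness checks and the placement rule, assuming the directory layout from the book format slide. The function names, the error messages, and the rule that any non-empty MARC directory is accepted (binary MARC is not parsed here) are illustrative assumptions, not the Mini-UL implementation, which is written in Perl.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

PAGE_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")
PAGE_RE = re.compile(r"^(\d{8})\.\w+$")  # e.g. 00000123.tif


def page_numbers(directory: Path) -> list[int]:
    """Sorted page numbers found in one format directory."""
    matches = (PAGE_RE.match(f.name) for f in directory.iterdir())
    return sorted(int(m.group(1)) for m in matches if m)


def check_metadata(book: Path) -> list[str]:
    """Metadata must be present; Dublin Core XML must also parse."""
    dc, marc = book / "DC", book / "MARC"
    errors = []
    if dc.is_dir():
        xml_files = sorted(dc.glob("*.xml"))
        if not xml_files:
            errors.append("DC: no XML file found")
        for f in xml_files:
            try:
                ET.parse(str(f))
            except ET.ParseError as exc:
                errors.append(f"DC/{f.name}: not parseable ({exc})")
    elif marc.is_dir():
        if not any(marc.iterdir()):
            errors.append("MARC: no record found")
    else:
        errors.append("missing metadata directory: MARC or DC")
    return errors


def check_book(book: Path) -> list[str]:
    """Return a list of problems; an empty list means the book passes."""
    errors = []
    ref_name, ref_nums = None, None
    for name in PAGE_DIRS:
        directory = book / name
        if not directory.is_dir():
            errors.append(f"missing directory: {name}")
            continue
        nums = page_numbers(directory)
        if not nums:
            errors.append(f"{name}: no page files")
            continue
        if nums != list(range(nums[0], nums[0] + len(nums))):
            errors.append(f"{name}: page numbers are not contiguous")
        if ref_nums is None:
            ref_name, ref_nums = name, nums
        elif nums != ref_nums:
            errors.append(f"{name}: page numbers do not match {ref_name}")
    return errors + check_metadata(book)


def choose_node(free_bytes: dict[str, int]) -> str:
    """Pick the NASD node that reports the most free space."""
    return max(free_bytes, key=free_bytes.get)
```

A caller would report the returned errors to the user and, once the list is empty, pass the free-space figures gathered from the NASD nodes to `choose_node`.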

Viewing a book

Users view original page images
  HTML, raw TXT as option

Intra-book searching
  Seeks to matching page
  Highlights token match
  Rapidly seek from one token match to the next
  Boolean queries, phrase matching

PaperSight ImageServer
  Convert 600 DPI 1-bit TIFF to ~96 DPI 8-bit GIF
  Real-time conversion performance is faster than human response
  Anti-aliased grey-scale image is ideal for monitor reading
  Significant reduction in bandwidth
  Conversion happens on hosting NASD node, not head
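The 1-bit to grey-scale conversion can be illustrated with a simple box filter: each output pixel is the average ink coverage of a square block of input pixels, which is what gives the anti-aliased look. This is a generic sketch, not PaperSight ImageServer's algorithm. An integer factor is also an assumption: 600 DPI to 96 DPI is a ratio of 6.25, so a factor of 6 would give 100 DPI.

```python
def downsample_to_gray(bitmap: list[list[int]], factor: int) -> list[list[int]]:
    """Reduce a 1-bit page image to 8-bit grey by block averaging.

    bitmap: rows of 0 (white paper) or 1 (black ink).
    Returns rows of grey values, 0 = black and 255 = white.
    Partial blocks at the right and bottom edges are dropped.
    """
    rows = len(bitmap) // factor
    cols = len(bitmap[0]) // factor if bitmap else 0
    block_area = factor * factor
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Count ink pixels in this factor x factor block.
            ink = sum(
                bitmap[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            )
            row.append(255 - round(255 * ink / block_area))
        out.append(row)
    return out


if __name__ == "__main__":
    tiny = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
    ]
    print(downsample_to_gray(tiny, 2))  # one dark-grey pixel, one white pixel
```

Each output pixel replaces factor x factor input pixels, so the pixel count drops by that ratio even though each pixel now carries 8 bits instead of 1.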

Browsing
Simple alphabetic browse
Keep list sizes small

The Missing Piece: Search


Searching the full text of tens of thousands of books is computationally intensive

Solution: parallelize
  Each NASD node indexes and searches content it stores
  Results are unified and sorted at head node
  NASD cluster architecture maintains parity between processors and storage
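The scatter-and-merge pattern described above can be sketched as follows. Each NASD node is simulated by an in-memory dictionary of book texts, and the score is a plain token count. Both are stand-ins chosen for illustration; the real system uses per-node indexes and its own scoring.

```python
from concurrent.futures import ThreadPoolExecutor


def search_node(node_books: dict[str, str], token: str) -> list[tuple[int, str]]:
    """Stand-in for one NASD node searching only the books it stores.

    Returns (score, book_id) pairs, where score is the number of
    occurrences of the token in the book text.
    """
    token = token.lower()
    hits = []
    for book_id, text in node_books.items():
        count = text.lower().split().count(token)
        if count:
            hits.append((count, book_id))
    return hits


def cluster_search(nodes: list[dict[str, str]], token: str) -> list[tuple[int, str]]:
    """Head-node role: query every node in parallel, then unify and sort."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        partials = list(pool.map(lambda node: search_node(node, token), nodes))
    merged = [hit for partial in partials for hit in partial]
    # Highest score first; book id breaks ties so the order is stable.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))


if __name__ == "__main__":
    nodes = [
        {"b1": "the cat sat", "b2": "a dog barked"},
        {"b3": "cat and cat"},
    ]
    print(cluster_search(nodes, "cat"))
```

Because each node searches only its own books, the per-node work depends on how much that node stores rather than on the size of the whole corpus.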

Search features

Fast! 0.1 sec per-token response in most cases (AMD 1400+).
  Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
  Search too slow? Increase machine count and redistribute data.

Joint bibliographic and full-text search with single query

Phrase matching, boolean queries, cross-page phrases

Context display for full-text matches

Rich scoring system:
  Metadata matches
  Token proximity scoring (multi-token queries only)

Direct-to-page matching
  Full text matches yield actual matching page, with highlighting

Full search API (Perl)
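One plausible reading of token proximity scoring is sketched below: a page scores higher the closer together the query tokens appear. The formula used here (number of distinct query tokens divided by the width of the smallest window containing all of them) is an assumption for illustration, not the scoring used by the Mini-UL search engine.

```python
def proximity_score(page_tokens: list[str], query: list[str]) -> float:
    """Score a page by how tightly the query tokens cluster.

    Returns 1.0 when all distinct query tokens are adjacent, smaller
    values as they spread out, and 0.0 when any token is absent or the
    query has fewer than two distinct tokens.
    """
    needed = set(query)
    if len(needed) < 2:
        return 0.0  # proximity applies to multi-token queries only
    have = dict.fromkeys(needed, 0)
    missing = len(needed)
    best = None
    left = 0
    for right, token in enumerate(page_tokens):
        if token in needed:
            have[token] += 1
            if have[token] == 1:
                missing -= 1
        # Shrink the window from the left while it still covers the query.
        while missing == 0:
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = page_tokens[left]
            if left_token in needed:
                have[left_token] -= 1
                if have[left_token] == 0:
                    missing += 1
            left += 1
    return 0.0 if best is None else len(needed) / best


if __name__ == "__main__":
    page = "the million book project".split()
    print(proximity_score(page, ["million", "book"]))
```

A score like this would be combined with the metadata-match component when ranking results.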

Customization
APIs provided for all major components:
  Search
  Book Reader
  Metadata processing and conversion

All HTML lives in read-write space on head node
  Development sites can create rich HTML hierarchies

Scripting is not limited to CD contents
  cgi-bin and site_perl can be extended

CD/core upgrades leave extensions untouched

Future Directions
Search engine in wider distribution
  GPL
  Perl CPAN

Phone Home capability
  Metadata + text
  Master site to search all sites
  Individual Mini-UL systems with slow but persistent links relay manifests

IIIT Hyderabad contributions
  MySQL-based metadata search

Separate search and storage clusters
  9 TB hardware RAID servers
  Multiple diskless search nodes

Embedded Digital Library Uses


Gives MBP sites foundation on which to build
  Allows convergence on standards as sites contribute new extensions to main distribution
  Gives basic search, browse, view, and audit capability to any site, regardless of development staff

Uses extend beyond MBP deployments
  Any site with archives of multi-page text documents can benefit
  Only requirements are a scanner and a PC
  Virtually no administration required

Questions?
