The Million Book Project

The Mini-UL Digital Library Platform

Carnegie Mellon University
School of Computer Science
Raj Reddy
Eric Burns
What is the Million Book Project?
Free-to-read, open-platform digital library
Worldwide distribution and mirroring

Public domain works
Out of print but in copyright
Rare materials
Collaborative content acquisition
India
20 mini scanning centers, 3 mega scanning centers
Over 80,000 books to date
China
Over 30,000 books to date
USA / Carnegie Mellon (Hunt Library/SCS)
1200 books, technology contributor
Truly multi-lingual corpus
Several Indian languages

Mandarin Chinese
Most European languages
MBP offers unique systems challenges
Multiple deployments
China
India
Partners in US
Human-intensive scanning process
Error prone
Difficult to standardize
Multiple QA passes required
Everyone wants autonomy and customization
DC XML entered by hand

Operator error on scanning devices
System-level solution must satisfy small and large data sets

CMU must provide a framework for remote sites to extend
Equipment budget is limited
Developing nations' networks are limited
China, India output must be shipped to US
Core Problems
Multiple scanning centers, each with:
Common base requirements
Distinct values and goals

Limited connectivity
Varying IT infrastructure
Searching
Browsing
Viewing
File-system compatibility
Basic standard for acquiring and storing scanned books
Data preservation
Quality assurance
Flexibility
Openness
Fault-tolerant storage at all sites

Data movement via physical shipment
Standardized OS and base software
Our Solution: Mini-UL Embedded
Digital library on a CD
OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
Boots single systems or whole clusters
Ensures standardization, eases upgrades
To use new software, admins burn CD and reboot
Commodity PC and disk hardware spec
Software RAID: Use low-end PC as network-attach storage
Sub-$1000 PC = 1 TB NASD
Barebones economy PC
250 GB OEM disk x 4
Add storage PCs as needs grow
1 processor per storage unit
CD + PC(s) = Embedded digital library
Black box approach
Dump MBP-format books into upload bucket

Easily search, browse, view, and download all books added
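
The hardware arithmetic on this slide (four 250 GB disks per storage PC, storage PCs added as needs grow) can be sketched as a small capacity estimate. This is only an illustration: the function names and the average book size are assumptions, not figures from the project, and the result is raw capacity, since the slide does not say how much space the software RAID reserves for redundancy.

```python
import math

DISKS_PER_NODE = 4   # from the slide: 250 GB OEM disk x 4
DISK_GB = 250        # from the slide


def raw_capacity_gb(node_count):
    """Raw (pre-RAID) capacity of a cluster of storage PCs, in GB."""
    return node_count * DISKS_PER_NODE * DISK_GB


def nodes_needed(book_count, avg_book_mb):
    """Storage PCs needed for a corpus, ignoring RAID overhead.

    avg_book_mb is a placeholder supplied by the caller, not a measured value.
    """
    total_gb = book_count * avg_book_mb / 1000
    return max(1, math.ceil(total_gb / raw_capacity_gb(1)))


print(raw_capacity_gb(1))          # one storage PC, raw GB
print(nodes_needed(80_000, 50))    # hypothetical 50 MB per book
```

Usable space will be lower than the raw figure once a redundant RAID level is chosen, so a real plan would scale the per-node capacity down accordingly.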
The MBP Book Format
Directory with five subdirectories:
OTIFF
Original TIFF: exactly as scanned
Eight-digit, zero-padded page numbers (00000123.tif)
1-bit color at 600 DPI, lossless
PTIFF
Processed TIFF: current best batch image processing
Eight-digit zero-padded numbers match OTIFF
TXT
ASCII, UTF-8, or UTF-16 text
Numbers match OTIFF/PTIFF
HTML
UTF-8 HTML w/ low-res JPEG images
Numbers match OTIFF/PTIFF
[MARC|DC]
Binary MARC record
Dublin Core XML
Flexible: other format directories can be added

Internal storage format:
OTIFF/PTIFF -> multipage

TXT/HTML -> zip
500 page book = 2001 files
Converted at addition time to 5 files
Speeds copying
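
The page-naming rule and the "500 page book = 2001 files" figure can be checked with a short sketch. It assumes one file per page in each of the four page formats plus a single metadata file, which is the reading that reproduces 2001. The file extensions for TXT and HTML, the metadata file name, and page numbering starting at 1 are assumptions made for illustration.

```python
PAGE_FORMATS = ("OTIFF", "PTIFF", "TXT", "HTML")

# Extensions other than .tif are assumed; the slide only shows 00000123.tif.
EXTENSIONS = {"OTIFF": "tif", "PTIFF": "tif", "TXT": "txt", "HTML": "html"}


def page_filename(page_number, extension):
    """Eight-digit, zero-padded page file name, e.g. 00000123.tif."""
    if not 1 <= page_number <= 99_999_999:
        raise ValueError("page number does not fit in eight digits")
    return f"{page_number:08d}.{extension}"


def expected_files(page_count):
    """Relative paths for one book: one file per page per format, plus metadata."""
    files = [
        f"{fmt}/{page_filename(n, EXTENSIONS[fmt])}"
        for fmt in PAGE_FORMATS
        for n in range(1, page_count + 1)
    ]
    files.append("DC/metadata.xml")  # hypothetical metadata file name
    return files


print(page_filename(123, "tif"))
print(len(expected_files(500)))
```

After conversion to the internal format the same book is stored as five files (two multipage TIFFs, two zip archives, and the metadata), which is why copying gets faster.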
High-level Cluster Architecture
[Diagram: web traffic reaches the head node (NASD 0), which connects over an internal network subnet to NASD 1, NASD 2, ..., NASD n, the Network-Attach Storage Devices (SATA RAID PCs)]
Adding a Book
Head node has SMB share Upload

User moves one or more MBP-format books into Upload share
System automatically checks each book for completeness/correctness:
All formats present

Contiguous page numbers
Metadata present and parseable
Errors presented to user for correction
Converts to internal storage format

Assigns serial number
Moves to NASD node with most free space
Incremental search index
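
The checks and the placement rule listed on this slide might look roughly like the sketch below. It is not the project's Perl code: the function names, the in-memory directory listing, and the error messages are hypothetical, and the "metadata parseable" check is left out because the slide does not describe it.

```python
PAGE_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")
METADATA_DIRS = ("MARC", "DC")


def page_numbers(filenames):
    """Sorted distinct page numbers taken from eight-digit file name stems."""
    numbers = set()
    for name in filenames:
        stem = name.split(".", 1)[0]
        if len(stem) == 8 and stem.isascii() and stem.isdigit():
            numbers.add(int(stem))
    return sorted(numbers)


def check_book(listing):
    """listing maps subdirectory name -> list of file names; returns error strings."""
    errors = []
    for directory in PAGE_DIRS:
        if not listing.get(directory):
            errors.append(f"missing format: {directory}")
    if not any(listing.get(d) for d in METADATA_DIRS):
        errors.append("missing metadata: MARC or DC")

    reference = page_numbers(listing.get("OTIFF", []))
    if reference:
        contiguous = list(range(reference[0], reference[0] + len(reference)))
        if reference != contiguous:
            errors.append("OTIFF page numbers are not contiguous")
    for directory in PAGE_DIRS[1:]:
        if listing.get(directory) and page_numbers(listing[directory]) != reference:
            errors.append(f"{directory} page numbers do not match OTIFF")
    return errors


def pick_node(free_bytes):
    """Choose the storage node with the most free space."""
    return max(free_bytes, key=free_bytes.get)
```

A book that returns an empty error list would then be converted, given a serial number, and moved to the node returned by `pick_node`.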
Viewing a book
Users view original page images
HTML, raw TXT as option
Intra-book searching

Seeks to matching page
Highlights token match
Rapidly seek from one token match to the next
Boolean queries, phrase matching
PaperSight ImageServer
Convert 600 DPI 1-bit TIFF to ~96DPI 8-bit GIF

Real-time conversion performance is faster than human response
Anti-aliased grey-scale image is ideal for monitor reading
Significant reduction in bandwidth
Conversion happens on hosting NASD node, not head
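
To show why a 1-bit scan turns into an anti-aliased grey-scale image when it is shrunk, here is a minimal box-averaging sketch. Box averaging is a stand-in technique chosen for illustration; the slides do not say which filter PaperSight ImageServer uses. The integer factor of 6 (600 DPI to 100 DPI) is also an assumption, picked to stay close to the ~96 DPI target.

```python
def downsample_to_grey(bitmap, factor=6):
    """Shrink a 1-bit image (rows of 0 = white, 1 = black) to 8-bit grey.

    Each factor x factor block becomes one pixel whose grey level reflects
    the share of black pixels in the block (0 = black, 255 = white).
    Rows or columns that do not fill a whole block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor if bitmap else 0
    block_area = factor * factor
    grey = []
    for by in range(height):
        row = []
        for bx in range(width):
            black = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(255 - (255 * black) // block_area)
        grey.append(row)
    return grey
```

Blocks that straddle the edge of a letter come out as intermediate greys, which is what makes the reduced image readable on a monitor.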
Browsing
Simple alphabetic browse
Keep list sizes small
The Missing Piece: Search

Searching the full text of tens of thousands of books is computationally intensive
Solution: parallelize
Each NASD node indexes and searches content it stores

Results are unified and sorted at head node
NASD cluster architecture maintains parity between processors and storage
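
The division of labour described here, where each NASD node searches only its own content and the head node unifies and sorts the results, can be sketched as a scatter-gather pattern. Everything named below is hypothetical: the `StorageNode` class, the occurrence-count ranking, and the use of threads in place of real network requests are stand-ins, not the project's Perl search engine.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def rank_key(hit):
    """Order hits by descending count, then by book and page."""
    count, book, page = hit
    return (-count, book, page)


class StorageNode:
    """Stand-in for one NASD node holding the text of its own pages."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = pages  # {(book_id, page_number): page text}

    def search(self, token):
        """Return locally sorted hits as (count, book_id, page_number)."""
        token = token.lower()
        hits = []
        for (book, page), text in self.pages.items():
            count = text.lower().split().count(token)
            if count:
                hits.append((count, book, page))
        return sorted(hits, key=rank_key)


def head_search(nodes, token, limit=10):
    """Query every node in parallel, then merge the pre-sorted result lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node = list(pool.map(lambda node: node.search(token), nodes))
    return list(islice(heapq.merge(*per_node, key=rank_key), limit))
```

Because each node returns an already sorted list, the head node only has to merge, so its work grows with the number of results rather than with the size of the corpus.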
Search features
Fast! 0.1 sec per-token response in most cases (AMD 1400+).

Joint bibliographic and full-text search with single query
Phrase matching, boolean queries, cross-page phrases
Context display for full-text matches
Rich scoring system:
Metadata matches
Token proximity scoring (multi-token queries only)
Direct-to-page matching
Full text matches yield actual matching page, with highlighting
Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
Search too slow? Increase machine count and redistribute data.
Full search API (Perl)
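
Token proximity scoring is only named on the slide, not defined. One plausible reading, sketched below purely as an illustration, rewards pages where all query tokens fall inside a short window of words. The formula (number of distinct tokens divided by the shortest window length) and the function name are assumptions, not the project's scoring rule.

```python
def proximity_score(text, query_tokens):
    """Score in (0, 1]: higher when all query tokens sit close together.

    Returns 0.0 for single-token queries (proximity applies to multi-token
    queries only) and when any token is missing from the text.
    """
    words = text.lower().split()
    needed = {token.lower() for token in query_tokens}
    if len(needed) < 2:
        return 0.0

    occurrences = [(i, w) for i, w in enumerate(words) if w in needed]
    best = None
    counts = {}
    left = 0
    for right_pos, right_word in occurrences:
        counts[right_word] = counts.get(right_word, 0) + 1
        # Shrink the window from the left while it still covers every token.
        while len(counts) == len(needed):
            left_pos, left_word = occurrences[left]
            span = right_pos - left_pos + 1
            if best is None or span < best:
                best = span
            counts[left_word] -= 1
            if counts[left_word] == 0:
                del counts[left_word]
            left += 1
    return 0.0 if best is None else len(needed) / best
```

An adjacent pair of tokens scores 1.0, and the score falls as more unrelated words separate the tokens.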
Customization
APIs provided for all major components:
Search
Book Reader
Metadata processing and conversion
HTML lives in read-write space on head node
Development sites can create rich HTML hierarchies
Scripting is not limited to CD contents
cgi-bin and site_perl can be extended
CD/core upgrades leave extensions untouched
Future Directions
Search engine in wider distribution
GPL
Perl CPAN
Phone Home capability
Individual Mini-UL systems with slow but persistent links relay manifests
Metadata + text
Master site to search all sites
IIIT Hyderabad contributions
MySQL-based metadata search
Separate search and storage clusters
9 TB hardware RAID servers
Multiple diskless search nodes
Embedded Digital Library Uses

Gives MBP sites a foundation on which to build
Allows convergence on standards as sites contribute new extensions to main distribution
Gives basic search, browse, view, and audit capability to any site, regardless of development staff
Uses extend beyond MBP deployments
Any site with archives of multi-page text documents can benefit
Only requirements are a scanner and a PC
Virtually no administration required
Questions?
