Big, Bigger Biggest: Tom Burton West Hathi Trust Project

Big, Bigger, Biggest
Large scale issues: phrase queries and common words; OCR
Tom Burton West

Hathi Trust Project
Lucid Imagination, Inc.

http://www.lucidimagination.com
Hathi Trust Large Scale Search

Challenges
Goal: Design a system for full-text search that will scale to between 5 million and 20 million volumes at a reasonable cost.
Challenges:
Must scale to 20 million full-text volumes
Very long documents compared to most large-scale search applications
Multilingual collection
OCR quality varies

Index Size, Caching, and Memory

Our documents average about 300 pages, which is about 700 KB of OCR text.
Our 5 million document index is between 2 and 3 terabytes.
About 300 GB per million documents
A large index means disk I/O is the bottleneck
Tradeoff: JVM memory vs. OS memory
Solr relies on OS memory (the disk I/O cache) for caching postings
Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming)
Fitting the entire index in memory is not feasible with a terabyte-size index

Response time varies with query

Response time (ms):
Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
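
Summary figures of this kind can be derived from a query log. The following is a minimal sketch, assuming response times in milliseconds are already available as a list. The function names and the sample values are hypothetical, and the nearest-rank percentile method is only one of several common definitions.

```python
import statistics


def nearest_rank_percentile(sorted_values, pct):
    """Return the value at integer percentile `pct` using the nearest-rank method."""
    if not sorted_values:
        raise ValueError("no values")
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def summarize(response_times_ms):
    """Average, median, 90th and 99th percentile of a list of response times."""
    ordered = sorted(response_times_ms)
    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": nearest_rank_percentile(ordered, 90),
        "p99": nearest_rank_percentile(ordered, 99),
    }


# Hypothetical sample: most queries are fast, one is very slow.
sample = [20, 35, 50, 60, 80, 90, 120, 200, 300, 9000]
summary = summarize(sample)
print(summary)
```

A single very slow query pulls the average far above the median, which is the same pattern the figures above show.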

Slowest 5% of queries

The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect the response time of other queries:
Cache pollution
Contention for resources
The slowest queries are phrase queries containing common words

Query processing
Phrase queries use the position index (Boolean queries do not).
The position index accounts for 85% of index size
The position list for a common word such as "the" can be many GB in size
This causes lots of disk I/O.
Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query
I/O from phrase queries containing common words pollutes the cache
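
The difference between the two query types can be illustrated with a toy in-memory positional index. This is a minimal sketch, not how Lucene stores or reads its index: the documents, function names, and the `positions_read` counter are all hypothetical, and the counter simply stands in for the position data a phrase query has to read.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Docs containing every term. Only doc IDs are consulted, never positions."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase(index, terms):
    """Docs where the terms occur consecutively, plus the number of positions read."""
    matches, positions_read = set(), 0
    for doc_id in boolean_and(index, terms):
        position_lists = [index[t][doc_id] for t in terms]
        positions_read += sum(len(p) for p in position_lists)
        later = [set(p) for p in position_lists[1:]]
        # A match needs term k+1 exactly k+1 positions after the first term.
        if any(all(start + k + 1 in s for k, s in enumerate(later))
               for start in position_lists[0]):
            matches.add(doc_id)
    return matches, positions_read


docs = {
    1: "the lives of the poets and the beat of the drum",
    2: "the beat generation and the lives of the poets",
    3: "beat the drum",
}
index = build_index(docs)
bool_hits = boolean_and(index, ["the", "beat"])
phrase_hits, positions_read = phrase(index, ["the", "beat"])
print(bool_hits, phrase_hits, positions_read)
```

Even in this tiny example most of the positions read belong to "the", which hints at why common words dominate the cost of phrase queries.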
Slow Queries
Slowest test query: "the lives and literature of the beat generation" took 2 minutes.
4 MB of data read for the Boolean query.
9,000+ MB read for the phrase query.

WORD        NUMBER OF   POSTINGS LIST  TOTAL TERM OCCURRENCES  POSITION LIST
            DOCUMENTS   (SIZE MB)      (MILLIONS)              (SIZE MB)
the         800,000     0.8            4,351                   4,351
of          892,000     0.89           2,795                   2,795
and         769,000     0.77           1,870                   1,870
literature  435,000     0.44
generation  414,000     0.41
lives       432,000     0.43
beat        278,000     0.28
TOTAL                   4.02                                   9,036
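
The totals can be checked with a few lines of arithmetic. This sketch uses only the values that appear in the table for this slide. The variable names are hypothetical, and the position-list sizes for the four less common words are not available here, so only the three common words are summed.

```python
# Postings list sizes (MB) for every word in the query.
postings_mb = {
    "the": 0.8, "of": 0.89, "and": 0.77, "literature": 0.44,
    "generation": 0.41, "lives": 0.43, "beat": 0.28,
}
# Position list sizes (MB) for the three common words only.
common_position_mb = {"the": 4351, "of": 2795, "and": 1870}
reported_position_total_mb = 9036

boolean_read_mb = round(sum(postings_mb.values()), 2)
common_words_mb = sum(common_position_mb.values())
common_share = common_words_mb / reported_position_total_mb

print(f"Boolean query reads about {boolean_read_mb} MB of postings")
print(f"Common words account for {common_words_mb} MB "
      f"({common_share:.1%}) of the position data")
```

Three words out of seven are responsible for nearly all of the position data the phrase query has to read.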
Why not use Stop Words?

The word "the" occurs more than 4 billion times in our 1 million document index.
Removing stop words ("the", "of", etc.) is not desirable for our use cases.
Couldn't search for many phrases:
"to be or not to be"
"the who"
"man in the moon" vs. "man on the moon"
Stop words in one language are content words in another language
German stop words "war" and "die" are content words in English
English stop words "is" and "by" are content words ("ice" and "village") in Swedish
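
The effect of stop word removal on these phrases can be shown directly. This is a minimal sketch, assuming a small illustrative English stop list and whitespace tokenization. Neither is taken from the actual system, and `remove_stop_words` is a hypothetical helper.

```python
# A small illustrative stop list (an assumption, not the list used in production).
STOP_WORDS = {"a", "an", "and", "be", "by", "in", "is", "not",
              "of", "on", "or", "the", "to"}


def remove_stop_words(query):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


hamlet = remove_stop_words("to be or not to be")
band = remove_stop_words("the who")
in_moon = remove_stop_words("man in the moon")
on_moon = remove_stop_words("man on the moon")

print(hamlet)   # nothing left to search for
print(band)     # the band name is reduced to a single word
print(in_moon == on_moon)  # the two phrases become indistinguishable
```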
CommonGrams
Ported the Nutch CommonGrams algorithm to Solr
Create bigrams selectively for any two-word sequence containing common terms
Slowest query: "the lives and literature of the beat generation" becomes:
the-lives
lives-and
and-literature
literature-of
of-the
the-beat
generation
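
The query-side token list above can be reproduced with a short sketch. This is a simplified illustration of the idea, not the Nutch or Solr implementation. It assumes whitespace tokenization, a hypothetical three-word common list, and a hyphen as the joining character to match the tokens shown on the slide.

```python
def common_grams_query(query, common_words, sep="-"):
    """Emit a bigram for each adjacent pair containing a common word.

    A word is emitted on its own only if it is not part of any bigram.
    """
    tokens = query.lower().split()
    n = len(tokens)

    def pair_is_common(i):
        return tokens[i] in common_words or tokens[i + 1] in common_words

    # Mark every token that participates in at least one bigram.
    in_bigram = [False] * n
    for i in range(n - 1):
        if pair_is_common(i):
            in_bigram[i] = in_bigram[i + 1] = True

    out = []
    for i in range(n):
        if i < n - 1 and pair_is_common(i):
            out.append(tokens[i] + sep + tokens[i + 1])
        elif not in_bigram[i]:
            out.append(tokens[i])
    return out


COMMON = {"the", "of", "and"}  # hypothetical common-word list
grams = common_grams_query(
    "The lives and literature of the beat generation", COMMON)
print(grams)
```

Every resulting token is far rarer than "the", "of", or "and" on their own, so the position lists that a phrase query must read shrink accordingly.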

Standard index vs. CommonGrams

Standard Index
WORD            TOTAL OCCURRENCES       NUMBER OF DOCS
                IN CORPUS (MILLIONS)    (THOUSANDS)
the             2,013                   386
of              1,299                   440
and             855                     376
literature                              210
lives                                   194
generation                              199
beat            0.6                     130
TOTAL           4,176

Common Grams
TERM            TOTAL OCCURRENCES       NUMBER OF DOCS
                IN CORPUS (MILLIONS)    (THOUSANDS)
of-the          446                     396
generation      2.42                    262
the-lives       0.36                    128
literature-of   0.35                    103
lives-and       0.25                    115
and-literature  0.24                    77
the-beat        0.06                    26
TOTAL           450
Comparison of Response time (ms)

                AVERAGE  MEDIAN  90th  99th   SLOWEST QUERY
Standard Index  459      32      146   6,784  120,595
Common Grams    68               71    2,226  7,800
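
The size of the improvement can be expressed as a ratio. This sketch uses only the average and slowest-query figures reported for the two indexes. The variable names are hypothetical.

```python
# Response times in milliseconds, as reported for each index.
standard = {"average": 459, "slowest": 120_595}
common_grams = {"average": 68, "slowest": 7_800}

avg_speedup = standard["average"] / common_grams["average"]
slowest_speedup = standard["slowest"] / common_grams["slowest"]

print(f"Average response time improved about {avg_speedup:.2f}x")
print(f"Slowest query improved about {slowest_speedup:.1f}x")
```

The slowest query benefits more than the average one, which is consistent with the worst cases being phrase queries that contain common words.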
Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and discovered additional common words to be added to our list.
We used the Solr Admin panel to run the slowest queries from our logs with the debug flag checked.
We discovered that words such as "l'art" were being split into two-token phrase queries.
We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
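
One way a single word can turn into a two-token phrase query is an analyzer that breaks on apostrophes, so that a French form such as l'art yields the tokens l and art. The sketch below contrasts two regular-expression tokenizers as stand-ins. They are hypothetical illustrations, not the Solr analyzers involved.

```python
import re


def split_on_punctuation(text):
    """Tokenizer that treats every non-word character as a break."""
    return re.findall(r"\w+", text.lower())


def keep_inner_apostrophes(text):
    """Tokenizer that keeps apostrophes occurring inside a word."""
    return re.findall(r"\w+(?:'\w+)*", text.lower())


split_tokens = split_on_punctuation("l'art")
whole_tokens = keep_inner_apostrophes("l'art")

print(split_tokens)  # two tokens, which a search engine may run as a phrase query
print(whole_tokens)  # one token, which needs no position data
```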
Other issues
We broke Solr temporarily
Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms
The Solr/Lucene index was limited to 2.1 billion unique terms
Patched: now it's 274 billion
Dirty OCR is difficult to remove without removing good words.
Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
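
The two limits line up with simple integer arithmetic. This sketch rests on two assumptions that the slide does not state: that the original limit was the largest signed 32-bit integer, and that the patched limit is that value multiplied by a term index interval of 128.

```python
# Assumption: the old limit is the largest signed 32-bit integer.
old_limit = 2**31 - 1
# Assumption: the patched limit scales by a term index interval of 128.
term_index_interval = 128
new_limit = old_limit * term_index_interval

unique_terms = 2_400_000_000  # "over 2.4 billion unique terms"

print(f"old limit: {old_limit:,}")   # about 2.1 billion
print(f"new limit: {new_limit:,}")   # about 274 billion
print("fits old limit:", unique_terms <= old_limit)
print("fits new limit:", unique_terms <= new_limit)
```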
