
Big, Bigger, Biggest

Large scale issues: Phrase queries and common words, OCR

Tom Burton West


Hathi Trust Project

Lucid Imagination, Inc.


http://www.lucidimagination.com

Hathi Trust Large Scale Search Challenges

Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
  Must scale to 20 million full-text volumes
  Very long documents compared to most large-scale search applications
  Multilingual collection
  OCR quality varies


Index Size, Caching, and Memory


Our documents average about 300 pages, which is about 700 KB of OCR.
Our 5 million document index is between 2 and 3 terabytes.
  About 300 GB per million documents
A large index means disk I/O is the bottleneck.
Tradeoff: JVM vs. OS memory
  Solr uses OS memory (disk I/O caching) for caching of postings
  Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming)
Fitting the entire index in memory is not feasible with a terabyte-size index.
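
As a rough check on the scaling goal, the per-million figure above can be projected linearly. This is only a sketch: the function name is made up for illustration, and because the slide also reports 2 to 3 terabytes at 5 million documents, the linear estimate is better read as a lower bound than as a measurement.

```python
# Linear projection from the slide's "about 300 GB per million documents".
GB_PER_MILLION_DOCS = 300  # figure quoted on the slide


def projected_index_gb(million_docs):
    """Return the projected index size in GB for a collection size in millions."""
    return GB_PER_MILLION_DOCS * million_docs


for millions in (5, 10, 20):
    size_tb = projected_index_gb(millions) / 1000
    print(f"{millions:>2} million volumes: ~{size_tb:.1f} TB")
```

Even the low estimate for 20 million volumes is several terabytes, which is why the slides treat the OS disk cache, not whole-index residency, as the lever for response time.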


Response time varies with query


Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
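
An average far above the median, as in the figures above, usually points to a small number of very slow queries. The sketch below shows one way such a summary could be computed from a query log. The sample timings are invented for illustration, and the nearest-rank percentile is an assumption; the slide does not say which percentile method was used.

```python
import statistics


def percentile(sorted_times, pct):
    """Nearest-rank percentile of an ascending list, for integer pct in 1..100."""
    rank = (pct * len(sorted_times) + 99) // 100  # integer ceiling of pct% of n
    return sorted_times[max(rank, 1) - 1]


def summarize(times_ms):
    """Return average, median, 90th and 99th percentile of response times."""
    ordered = sorted(times_ms)
    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }


# Hypothetical log: 98 fast queries and 2 very slow ones.
sample = [50] * 98 + [20_000, 60_000]
print(summarize(sample))
```

In this made-up sample, two slow queries out of a hundred are enough to pull the average far above the median, the same shape as the distribution reported on the slide.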


Slowest 5% of queries

The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect the response time of other queries:
  Cache pollution
  Contention for resources
The slowest queries are phrase queries containing common words.


Query processing
Phrase queries use the position index (Boolean queries do not).
The position index accounts for 85% of index size.
The position list for a common word such as "the" can be many GB in size.
This causes a lot of disk I/O.
Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query.
I/O from phrase queries containing common words pollutes the cache.
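
The difference between the two query types can be illustrated with a toy in-memory positional index. This is a minimal sketch, not how Lucene stores or reads its index; all function names and the sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Doc IDs containing every term; only the doc IDs are consulted."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase(index, terms):
    """Doc IDs where the terms occur consecutively; needs the position lists."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


docs = {1: "the beat generation", 2: "the generation that beat the odds"}
index = build_index(docs)
print(boolean_and(index, ["beat", "generation"]))  # both documents match
print(phrase(index, ["beat", "generation"]))       # only document 1 matches
```

The phrase check has to walk the position lists of every term, which is the part of the index that grows to gigabytes for a word like "the".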

Slow Queries
Slowest test query: "the lives and literature of the beat generation" took 2 minutes.
4 MB of data read for the Boolean query.
9,000+ MB read for the phrase query.

WORD         NUMBER OF    POSTINGS LIST   TOTAL TERM OCCURRENCES   POSITION LIST
             DOCUMENTS    (SIZE MB)       (MILLIONS)               (SIZE MB)
the          800,000      0.8             4,351                    4,351
of           892,000      0.89            2,795                    2,795
and          769,000      0.77            1,870                    1,870
literature   435,000      0.44            -                        -
generation   414,000      0.41            -                        -
lives        432,000      0.43            -                        -
beat         278,000      0.28            -                        -
TOTAL                     4.02                                     9,036
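
The totals in the table can be checked with a few lines of arithmetic. The sketch below uses only the figures that appear in the table; the variable names are chosen for illustration.

```python
# Postings-list sizes (MB) for every word in the query, from the table.
postings_mb = {"the": 0.8, "of": 0.89, "and": 0.77, "literature": 0.44,
               "generation": 0.41, "lives": 0.43, "beat": 0.28}
# Position-list sizes (MB); the table lists them only for the three common words.
position_mb = {"the": 4351, "of": 2795, "and": 1870}
TOTAL_POSITION_MB = 9036  # TOTAL row of the table

boolean_read_mb = round(sum(postings_mb.values()), 2)
common_position_mb = sum(position_mb.values())
common_share = common_position_mb / TOTAL_POSITION_MB

print(f"Boolean query reads about {boolean_read_mb} MB of postings")
print(f"'the', 'of' and 'and' account for {common_position_mb:,} MB "
      f"({common_share:.1%}) of the position data")
```

Almost all of the data read for the phrase query comes from the three common words, which is what motivates the treatment of common words on the following slides.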

Why not use Stop Words?


The word "the" occurs more than 4 billion times in our 1 million document index.
Removing stop words ("the", "of", etc.) is not desirable for our use cases.
Couldn't search for many phrases:
  "to be or not to be"
  "the who"
  "man in the moon" vs. "man on the moon"
Stop words in one language are content words in another language:
  German stop words "war" and "die" are content words in English
  English stop words "is" and "by" are content words ("ice" and "village") in Swedish
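
The phrase examples above can be reproduced with a toy stop-word filter. The stop list here is hypothetical and chosen only to cover the examples; it is not the list of any real analyzer.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"the", "of", "to", "be", "or", "not", "in", "on", "who"}


def remove_stop_words(query):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]


print(remove_stop_words("to be or not to be"))  # nothing left to search for
print(remove_stop_words("the who"))             # nothing left to search for
print(remove_stop_words("man in the moon"))
print(remove_stop_words("man on the moon"))     # same tokens as the line above
```

With this list, two of the phrases vanish entirely and the other two become indistinguishable.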

CommonGrams
Ported the Nutch CommonGrams algorithm to Solr
Create bigrams selectively for any two-word sequence containing common terms
Slowest query: "The lives and literature of the beat generation"
Resulting tokens:
  the-lives  lives-and  and-literature  literature-of  of-the  the-beat  generation
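
The token sequence above can be reproduced with a short sketch of the idea. This is not the Nutch or Solr code: the common-word list, the function names, and the "-" separator (taken from the slide's notation) are assumptions made for the example.

```python
COMMON_WORDS = {"the", "and", "of"}  # assumed common-word list for this example


def index_tokens(words, common=COMMON_WORDS, sep="-"):
    """Index side: keep every word, plus a bigram for each pair with a common word."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words) and (word in common or words[i + 1] in common):
            out.append(word + sep + words[i + 1])
    return out


def query_tokens(words, common=COMMON_WORDS, sep="-"):
    """Query side: prefer bigrams; keep a single word only if no bigram covers it."""
    is_pair = [words[i] in common or words[i + 1] in common
               for i in range(len(words) - 1)]
    out = []
    for i, word in enumerate(words):
        covered = (i > 0 and is_pair[i - 1]) or (i < len(is_pair) and is_pair[i])
        if not covered:
            out.append(word)
        if i < len(is_pair) and is_pair[i]:
            out.append(word + sep + words[i + 1])
    return out


query = "the lives and literature of the beat generation".split()
print(query_tokens(query))
```

The query side never looks up "the", "of", or "and" on their own; it looks up the much rarer bigrams instead, so the position lists that have to be read are far shorter.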


Standard index vs. CommonGrams


Standard Index

WORD         TOTAL OCCURRENCES      NUMBER OF DOCS
             IN CORPUS (MILLIONS)   (THOUSANDS)
the          2,013                  386
of           1,299                  440
and          855                    376
literature   -                      210
lives        -                      194
generation   -                      199
beat         0.6                    130
TOTAL        4,176

Common Grams

TERM             TOTAL OCCURRENCES      NUMBER OF DOCS
                 IN CORPUS (MILLIONS)   (THOUSANDS)
of-the           446                    396
generation       2.42                   262
the-lives        0.36                   128
literature-of    0.35                   103
lives-and        0.25                   115
and-literature   0.24                   77
the-beat         0.06                   26
TOTAL            450

Comparison of Response time (ms)

                 AVERAGE   MEDIAN   90th    99th    SLOWEST QUERY
Standard Index   459       32       146     6,784   120,595
Common Grams     68        -        71      2,226   7,800
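
The size of the improvement can be expressed as two ratios. The sketch below uses the TOTAL occurrence figures from the standard index and CommonGrams tables and the slowest-query times from the comparison above; the variable names are illustrative.

```python
# Total term occurrences for the example query's terms (millions).
standard_occurrences_m = 4176
commongrams_occurrences_m = 450

# Slowest query response time (ms).
standard_slowest_ms = 120_595
commongrams_slowest_ms = 7_800

occurrence_ratio = standard_occurrences_m / commongrams_occurrences_m
slowest_speedup = standard_slowest_ms / commongrams_slowest_ms

print(f"Term occurrences to process: {occurrence_ratio:.1f}x fewer")
print(f"Slowest query: {slowest_speedup:.1f}x faster")
```

The two ratios come from different measurements, so they are not expected to match; they only show that both the data volume and the worst-case latency drop by roughly an order of magnitude.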

Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and discovered additional common words to be added to our list.
We used the Solr Admin panel to run the slowest queries from our logs with the debug flag checked.
We discovered that words such as "l'art" were being split into two-token phrase queries.
We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
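
The effect described above can be imitated with two toy tokenizers. Neither is the Solr analyzer that was in use; both functions are hypothetical and only show how an apostrophe rule changes a one-word query into a two-token phrase.

```python
import re


def split_on_punctuation(text):
    """Hypothetical analyzer: lowercase, then split on any non-word character."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]


def keep_apostrophes(text):
    """Alternative: treat an apostrophe between word characters as part of the word."""
    return re.findall(r"\w+(?:'\w+)*", text.lower())


print(split_on_punctuation("l'art"))  # two tokens, so the query becomes a phrase
print(keep_apostrophes("l'art"))      # one token, so no phrase query is needed
```

Running a suspect query through each candidate analyzer in this way mirrors what the Solr Admin Analysis tool shows for a real field type.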

Other issues
We broke Solr temporarily
Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms.
The Solr/Lucene index size was limited to 2.1 billion unique terms.
Patched: now it's 274 billion.
Dirty OCR is difficult to remove without removing "good" words.
Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
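
One plausible reading of the two limits is sketched below. It assumes the old limit was the largest signed 32-bit integer and that the patched limit is that range multiplied by a term index interval of 128. The slide states neither, so both constants are assumptions that happen to match the quoted figures.

```python
INT32_MAX = 2**31 - 1         # largest signed 32-bit integer
ASSUMED_INDEX_INTERVAL = 128  # assumption: one term index entry per 128 terms

old_limit = INT32_MAX
new_limit = 2**31 * ASSUMED_INDEX_INTERVAL

print(f"old limit: {old_limit:,} (~{old_limit / 1e9:.1f} billion)")
print(f"new limit: {new_limit:,} (~{new_limit / 1e9:.1f} billion)")
```

Under these assumptions, an index with 2.4 billion unique terms exceeds the old limit by a few hundred million terms and sits far below the new one.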