Katta - Distributed Lucene Index in Production

1
Katta & Hadoop

Katta - Distributed Lucene Index in Production
Stefan Groschupf
Scale Unlimited, 101tec.
sg{at}101tec.com
foto by: belgianchocolate@flickr.com
2
Intro
• Business intelligence reports from event stream
• Existing event stream processing platform V1
• Build on top of oracle
• Scale problems
• Expensive
• Slow
• Hugh star schema
• New report expensiv to develop
• Expensive to keep old data

3
Goals
• Build next generation platform for event stream processing
• Faster report development - plugins
• Reduce total coast of ownership
• No license fees, open source based
• Commodity hardware
• Lower maintenance coasts
• Better scalable
• Better performance
• Cheap storage
4
Challenge
• Integrate system into big picture
• Log data via JMS
• Report WebApp uses jdbc
• Report developers do not know

Map Redcuce but SQL, XPath etc.
• Which format store data in?
• Which format process records in?
• Where store processing results in?

5
Challenges II
• Teleskop to Microscope - zoom to log record level
• One report - many mr jobs
• Job Scheduling
• Enterprise 24/7 monitoring - SNMP
• Work with open source releases cycles

6
Our Solution I
Monitor and manage

everything
Distributed index for
log message retrival
Web -
Console
Customer Userinterface
Files by organized 8%
7%
by day Aggregate data and 10%
35%
generate report data Katta
11%
29%
binary feed Hadoop MR Pig

Web Page
JMS MSG
JMS MSG
JMS MSG
DFS Database
Convert logs to
measures Store results of
pig queries
7
Our Solution II
SQL
Binary tree format xml > tuples text tuples
Schema
JMS DFS MR PIG DB

8
Katta
• Serving indexes the hadoop • Lightweight
distributed file system way
• Master fail over
• Index as index shards on many
servers • Fast*
• Replicate shards on different • Easy to integrate

servers for performance and fault-
tolerance • Plays well with hadoop clusters
• Apache Version 2 License

9
Contras
• No realtime updates like Solr,
Couch DB or Cassandra yet*
* though on roadmap
• Index serving tool, not indexer

10
What is a Katta index?

Katta Index
Lucene Index
Lucene Index
Lucene Index
• Folder with Lucene indexes
• Shard Indexes can be zipped

11
hadoop cluster or
single server
Overview
<REST API/> *
HDFS, NAS or shared

create index local filesystem
and copy to shared filesystem
fail over
command line
management
Secondary
Master
Master
java API Zookeeper Zookeeper
assign download
shards shards
server nodes in the
grid
Node Node Node Node
multicast query
shard replication
(plug-able policy) multicast query distributed ranking
plug-able selection
policy (custom load
balancing)
java client API

12
CLI
13
API
14
Lucene Queries
• title:"The Right Way" AND text:go
• te?t or test* or te*t
• mod_date:[20020101 TO 20030101]
• state:CA AND age:[1 TO 15] AND product:ipod
• state:CA AND age:[16 TO 21] AND product:ipod

15
Teleskop to Microscope
• Create Index from XML in MR
stage
• Deploy indexes in katta
• Merge indexes frequently together
• Find documents by key
• Find documents by query

16
XML to Lucene Document
<event id=”aKey” type=”sell”>

<product id=”ipod”/>
<user id=”stefan” state=”CA” age=”31”/>
</event>
/event/@id:aKey
/event/@type:sell
/event/product/@id:ipod
/event/user/@id:stefan
/event/user/@state:CA
/event/user/@ age:31
17
Range Queries
/event/product/@id:ipod AND /event/user/@state:CA AND
/event/user/@ age:[001 TO 010]
Counting results -> one network round trip

18
Range Queries Result Graph
60,000
45,000
30,000
15,000
0
01-10 11-20 21-30 31-40
19
Pros
• Easy reports can be generated • System scales
from katta index
• Scaling is cheap
• Complex reports generated with
many pig statements (>30 job) • We keep more data
• Zoom in data from complex • Report developing is easy

reports
20
Problems
• There was no cascading, hive or
jaql, pig was very young
• Develop against changing open

source project (hadoop, pig)
• Pig is/was slow (always text) and

(was) buggy
• Katta indexes need to merged

frequently
• Monitoring and management

21
Roadmap
• 0.1 released
• 0.2 Hadoop 0.17
• 0.3 Hadoop 0.18
• Performance improvements
• EC2 support
• Add realtime update support
• Not yet clear how exactly
• Might be similar to Dynamo

22
Thanks
katta.sourceforge.net
sg{at}101tec.com

Katta - Distributed Lucene Index in Production

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Katta - Distributed Lucene Index in Production

Uploaded by

Copyright:

Available Formats

1

Katta & Hadoop

• Existing event stream processing platform V1

• Build on top of oracle

• Hugh star schema

• New report expensiv to develop

• Expensive to keep old data

• Faster report development - plugins

• Reduce total coast of ownership

• No license fees, open source based

• Lower maintenance coasts

• Log data via JMS

• Report WebApp uses jdbc

• Report developers do not know

• Which format store data in?

• Which format process records in?

• Where store processing results in?

• One report - many mr jobs

• Enterprise 24/7 monitoring - SNMP

• Work with open source releases cycles

Monitor and manage

binary feed Hadoop MR Pig

JMS DFS MR PIG DB

• Replicate shards on different • Easy to integrate

• Apache Version 2 License

• Index serving tool, not indexer

What is a Katta index?

• Folder with Lucene indexes

• Shard Indexes can be zipped

HDFS, NAS or shared

java API Zookeeper Zookeeper

Node Node Node Node

java client API

• te?t or test* or te*t

• state:CA AND age:[1 TO 15] AND product:ipod

• state:CA AND age:[16 TO 21] AND product:ipod

• Deploy indexes in katta

• Merge indexes frequently together

• Find documents by key

• Find documents by query

XML to Lucene Document

<event id=”aKey” type=”sell”>

Counting results -> one network round trip

Range Queries Result Graph

• Zoom in data from complex • Report developing is easy

• Develop against changing open

• Pig is/was slow (always text) and

• Katta indexes need to merged

• Monitoring and management

• 0.2 Hadoop 0.17

• 0.3 Hadoop 0.18

• Add realtime update support

• Not yet clear how exactly

• Might be similar to Dynamo

You might also like