Stefan Groschupf
Scale Unlimited, 101tec.
sg{at}101tec.com
Photo by: belgianchocolate@flickr.com
Intro
• Business intelligence reports from event stream
• Scale problems
• Expensive
• Slow
Goals
• Build next generation platform for event stream processing
• Commodity hardware
• Better scalability
• Better performance
• Cheap storage
Challenge
• Integrate system into big picture
Challenges II
• Telescope to microscope: zoom in to log record level
• Job Scheduling
Our Solution I
[Diagram labels: files organized by day; aggregate data and generate report data; Katta; customer user interface]
[Pie chart: segments of 35%, 29%, 11%, 10%, 8% and 7%; segment labels not recoverable]
Our Solution II
[Diagram labels: SQL; binary tree format; XML > tuples; text tuples; schema]
Katta
• Serves indexes the way the Hadoop distributed file system serves files
• An index is split into index shards on many servers
• Lightweight
• Master fail over
• Fast*
Cons
• No real-time updates like Solr, CouchDB or Cassandra yet*
* though on roadmap
[Diagram labels: Lucene Index (x3); Hadoop cluster or single server]
Overview
[Architecture diagram; labels:]
• Management: command line, <REST API/> *
• Master and secondary master (fail over)
• Master: assign shards
• Server nodes in the grid: download shards
• Shard replication (pluggable policy)
• Client: multicast query, distributed ranking
• Pluggable selection policy (custom load balancing)
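The overview mentions a multicast query to the shard-serving nodes followed by a merged ranking. The sketch below illustrates that scatter-gather pattern only; it is not Katta code. The in-memory shards, the term-count score and the names `search_shard` and `distributed_search` are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(docs, term, limit):
    """Stand-in for one node searching one shard.

    docs maps doc_id -> text. The score is a plain term count, which is a
    placeholder for a real Lucene score.
    """
    hits = [(text.lower().split().count(term), doc_id)
            for doc_id, text in docs.items()]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(limit, hits)


def distributed_search(shards, term, limit=3):
    """Send the query to every shard in parallel and merge the partial top lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda docs: search_shard(docs, term, limit),
                            shards.values())
        merged = [hit for partial in partials for hit in partial]
    # Each shard returned at most `limit` hits, so the global top `limit`
    # is guaranteed to be contained in the merged list.
    return heapq.nlargest(limit, merged)


if __name__ == "__main__":
    # Hypothetical shards, invented for this example.
    shards = {
        "shard-0": {"a": "ipod sell ipod", "b": "zune sell"},
        "shard-1": {"c": "ipod ipod ipod", "d": "ipod"},
    }
    for score, doc_id in distributed_search(shards, "ipod", limit=2):
        print(doc_id, score)
```

A term count is comparable across shards by construction. Scores that depend on per-shard statistics would need extra work before merging, which is presumably what "distributed ranking" on the slide refers to.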
CLI
API
Lucene Queries
• title:"The Right Way" AND text:go
• mod_date:[20020101 TO 20030101]
Telescope to Microscope
• Create index from XML in MR stage
/event/@id:aKey
/event/@type:sell
/event/product/@id:ipod
/event/user/@id:stefan
/event/user/@state:CA
/event/user/@age:31
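As a rough illustration of how one XML event could be flattened into path-style fields like the ones above, here is a minimal sketch. The XML layout of the sample event is an assumption reconstructed from the field names on the slide, and `flatten_event` is a hypothetical helper, not part of Katta or Hadoop. In a MapReduce job this logic would sit in the map step, with the resulting fields handed to a Lucene document.

```python
import xml.etree.ElementTree as ET

# Assumed event layout, reconstructed from the field paths on the slide.
SAMPLE_EVENT = """
<event id="aKey" type="sell">
  <product id="ipod"/>
  <user id="stefan" state="CA" age="31"/>
</event>
"""


def flatten_event(xml_text):
    """Return (path, value) pairs such as ('/event/user/@state', 'CA')."""
    root = ET.fromstring(xml_text.strip())
    fields = []

    def walk(element, path):
        # Every attribute becomes one field named <element path>/@<attribute>.
        for name, value in element.attrib.items():
            fields.append((f"{path}/@{name}", value))
        # Recurse into child elements, extending the path.
        for child in element:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return fields


if __name__ == "__main__":
    for path, value in flatten_event(SAMPLE_EVENT):
        print(f"{path}:{value}")
```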
Range Queries
/event/product/@id:ipod AND /event/user/@state:CA AND /event/user/@age:[001 TO 010]
/event/product/@id:ipod AND /event/user/@state:CA AND /event/user/@age:[011 TO 020]
/event/product/@id:ipod AND /event/user/@state:CA AND /event/user/@age:[021 TO 030]
/event/product/@id:ipod AND /event/user/@state:CA AND /event/user/@age:[031 TO 040]
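The bucket queries above can be generated rather than written by hand. The sketch below builds the same four query strings. The helper names and the three-digit padding width are assumptions taken from the slide's notation. The padding matters because a term range over a text field is compared as strings, so an unpadded "9" would sort after "10". Depending on the Lucene version, characters such as `/` in field names may also need escaping, which this sketch does not attempt.

```python
def age_range_query(product, state, low, high, width=3):
    """Build one range query string in the slide's notation.

    Bounds are zero-padded so that string order matches numeric order.
    """
    lo = str(low).zfill(width)
    hi = str(high).zfill(width)
    return (f"/event/product/@id:{product} AND /event/user/@state:{state} AND "
            f"/event/user/@age:[{lo} TO {hi}]")


def bucket_queries(product, state, start=1, stop=40, step=10):
    """One query per age bucket: 1-10, 11-20, ... up to `stop`."""
    return [age_range_query(product, state, low, low + step - 1)
            for low in range(start, stop + 1, step)]


if __name__ == "__main__":
    for query in bucket_queries("ipod", "CA"):
        print(query)
```

Running one such query per bucket and counting the hits yields one number per age range, which is enough to draw a simple histogram report.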
[Bar chart: one bar per age range (01-10, 11-20, 21-30, 31-40); vertical axis from 0 to 60,000]
Pros
• Easy reports can be generated from the Katta index
• Complex reports are generated with many Pig statements (>30 jobs)
• System scales
• Scaling is cheap
• We keep more data
Problems
• There was no Cascading, Hive or Jaql yet, and Pig was very young
Roadmap
• 0.1 released
• Performance improvements
• EC2 support
Thanks
katta.sourceforge.net
sg{at}101tec.com