with
Copyright 2010 Sematext Int'l. All rights reserved.
Agenda
● Who I am
● What Why How
● Architecture Evolution
● Role of Flume and HBase + Flume HBase Sink
● Challenges
About Otis Gospodnetić
• Lucene/Solr/Nutch/Mahout committer
About Sematext
Consulting, development, support for:
● Big Data (Hadoop, HBase, Voldemort...)
● Search (Lucene, Solr, Elastic Search...)
● Web Crawling (Nutch)
● Machine Learning (Mahout)
What We Built
● Analytics for Search
● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
● Trending over time
● Comparisons of time periods
● Top N reports
● Various report filters
Report Example
Why We Built it
subliminal msg: go use this site
● We need it
  ● search-hadoop.com & search-lucene.com
● Search customers need it
  ● Want to know what their visitors are searching for
  ● Want to know how their search is behaving
  ● …
How We Built it
● JavaScript Beacons
● Metric Capture Web App
● Data Capture Mechanisms
● Custom Log4J Appender
● Flume Agents, Collectors, Sinks
● HBase
● MapReduce Aggregations
● Search Analytics Reporting Web App
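To make the "MapReduce Aggregations" step concrete, here is a minimal in-process sketch of the map and reduce phases that count queries per site per hour. It is not the actual job: the row key layout (`site|timestamp`) and the column name `raw:query` are assumptions made for illustration, and a real job would run on Hadoop against HBase rather than on a Python list.

```python
from collections import defaultdict

def map_event(row_key, columns):
    """Map phase: emit ((site, hour, query), 1) for one raw log row.

    Assumes a hypothetical row key of the form "site|YYYY-MM-DDTHH:MM:SS".
    """
    site, timestamp = row_key.split("|")
    hour = timestamp[:13]  # "YYYY-MM-DDTHH"
    yield (site, hour, columns["raw:query"]), 1

def reduce_counts(pairs):
    """Reduce phase: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical raw rows standing in for a scan of the raw logs table.
rows = [
    ("search-hadoop.com|2010-10-12T10:15:30", {"raw:query": "hbase"}),
    ("search-hadoop.com|2010-10-12T10:47:02", {"raw:query": "hbase"}),
    ("search-hadoop.com|2010-10-12T11:05:11", {"raw:query": "flume"}),
]
emitted = [pair for row_key, cols in rows for pair in map_event(row_key, cols)]
hourly_counts = reduce_counts(emitted)
```

The aggregated counts are the kind of data a reporting app could read directly, instead of rescanning raw logs for every report.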
What's Flume
● Distributed data/log collection service
● Scalable, configurable, extensible
● Centrally manageable, open source
● Agents get data from app, Collectors save it
● Abstractions: Source → Decorator(s) → Sink
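The Source → Decorator(s) → Sink chain can be illustrated with a small sketch. This models only the data flow, not Flume's actual API or configuration language; every name below (`tail_source`, `add_host`, `drop_empty`, `ListSink`, `run`) and the event shape are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Event = dict  # hypothetical event shape: {"body": str, ...}

def tail_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turn raw lines into events (stands in for tailing a log file)."""
    for line in lines:
        yield {"body": line.rstrip("\n")}

def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filter out events with an empty body."""
    return (event for event in events if event["body"])

def add_host(host: str) -> Callable[[Iterator[Event]], Iterator[Event]]:
    """Decorator factory: annotate each passing event with a host name."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "host": host}
    return decorate

class ListSink:
    """Sink: terminal stage that stores events (stands in for a collector or DB)."""
    def __init__(self) -> None:
        self.events: List[Event] = []

    def append_all(self, events: Iterable[Event]) -> None:
        self.events.extend(events)

def run(source, decorators, sink) -> None:
    """Wire the stages together: each decorator wraps the stream before the sink."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    sink.append_all(stream)

sink = ListSink()
run(tail_source(["q=hbase\n", "\n", "q=flume\n"]), [drop_empty, add_host("web1")], sink)
```

Because each stage only consumes and produces events, stages can be swapped or reordered without touching the others, which is the main appeal of the abstraction.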
What's HBase
● Scalable, reliable, distributed, column-oriented DB
● On top of HDFS
● MapReducable
High Level Architecture
Architecture #1
Architecture #1 - Getting Messy
Arch #2 – HBaseLog4JAppender
HBaseLog4JAppender Cons
● Doesn't help with reliable delivery
  ● e.g. when network or HBase down
● Non-centralized config with larger clusters
  ● e.g. changing destination table in HBase
  ● e.g. changing sampling rate
Architecture #3 – Flume OOTB
Arch #4 – Flume HBase Sink
FLUME-247 – Flume HBase sink
● Contributed by Sematext in September 2010
● Reviewed, pending commit
● Similar to FLUME-6 (basic example), but more flexible
● https://issues.cloudera.org/browse/FLUME-247
Walk-Through
● Start EC2 micro instance, configure logs-generation tool to simulate user actions
● User actions start getting logged to a log file
● Configure Flume Agent to "tail" the generated logs and send data to Flume Collector
● Collector processes log messages and sends them to HBase's "raw logs" table
● Later these logs are processed by the MapReduce job
Search Action → Metric Capture → Log File → Flume Agent → Flume Collector → Decorators → HBase Sink → HBase
● Decorator: processes Flume Collector log events and prepares them for HBase
● HBase sink: FLUME-247
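As a rough illustration of what "prepares them for HBase" could involve, the sketch below parses one log line into a row key and a set of column values, then writes them to an in-memory stand-in for the "raw logs" table. The log line format, the row key layout, the `raw:` column family and both function names are assumptions; the slides do not show the real decorator or the FLUME-247 sink interface.

```python
from urllib.parse import unquote_plus

def prepare_for_hbase(line: str):
    """Decorator step: turn one log line into (row_key, columns).

    Assumes a hypothetical format:
    "<timestamp> site=<site> q=<url-encoded query> hits=<n> latency_ms=<n>"
    """
    timestamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    row_key = f"{fields['site']}|{timestamp}"
    columns = {
        "raw:query": unquote_plus(fields["q"]),
        "raw:hits": fields["hits"],
        "raw:latency_ms": fields["latency_ms"],
    }
    return row_key, columns

raw_logs = {}  # in-memory stand-in for the HBase "raw logs" table

def hbase_sink(row_key: str, columns: dict) -> None:
    """Sink step: upsert the columns under the row key."""
    raw_logs.setdefault(row_key, {}).update(columns)

line = "2010-10-12T10:15:30 site=search-hadoop.com q=hbase+sink hits=42 latency_ms=120"
hbase_sink(*prepare_for_hbase(line))
```

Keeping the parsing in the decorator and the write in the sink mirrors the split shown in the flow above: the sink stays generic while the decorator carries the application-specific logic.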
Why Flume
● Reliable delivery
  ● e.g. queue msgs locally if destination unreachable
● Easy, centralized management via Web UI or console
● Good community, good progress
● But: more complex, more moving parts
● On Flume: slideshare.net/cloudera/inside-flume
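The "queue msgs locally if destination unreachable" idea can be sketched as a sink that buffers events and retries later. This is a simplified illustration, not Flume's implementation: the queue here lives in memory, and `BufferingSink`, `send` and `state` are hypothetical names.

```python
from collections import deque

class BufferingSink:
    """Queue events locally and drain the queue whenever the destination accepts them."""

    def __init__(self, send):
        self._send = send      # callable that raises ConnectionError when the destination is down
        self.pending = deque()

    def append(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        # Deliver in order; stop at the first failure and keep the rest queued.
        while self.pending:
            try:
                self._send(self.pending[0])
            except ConnectionError:
                return
            self.pending.popleft()

delivered = []
state = {"up": False}  # simulated destination availability

def send(event):
    if not state["up"]:
        raise ConnectionError("destination unreachable")
    delivered.append(event)

sink = BufferingSink(send)
sink.append("q=hbase")   # destination down: event stays queued
sink.append("q=flume")
queued_while_down = len(sink.pending)

state["up"] = True       # destination comes back
sink.flush()
```

An event is removed from the queue only after a successful send, so an outage delays delivery but does not lose or reorder events in this sketch.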
Why HBase
● Scalable raw search data storage
● MapReduce data input
● Scalable aggregate data storage
● Fast scans for time ranges, fast key lookups
● Easy storage and compute power expansion
● Good looking roadmap, community, progress
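Fast time-range scans and key lookups both follow from rows being stored sorted by key. The sketch below imitates that with a sorted key list; it is a toy model, not the HBase client API, and the `site|timestamp` key layout is an assumption chosen so that one site's rows for a time range sit next to each other.

```python
from bisect import bisect_left, insort

class SortedTable:
    """Toy table that keeps row keys sorted, like rows within an HBase table."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Key lookup."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Range scan over keys in [start, stop), found by binary search."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]

table = SortedTable()
table.put("search-hadoop.com|2010-10-11T23:59:00", "lucene")
table.put("search-hadoop.com|2010-10-12T08:00:00", "hbase")
table.put("search-hadoop.com|2010-10-12T17:30:00", "flume")
table.put("search-lucene.com|2010-10-12T09:00:00", "solr")

# All rows for one site on 2010-10-12: a contiguous slice of the sorted keys.
one_day = table.scan("search-hadoop.com|2010-10-12", "search-hadoop.com|2010-10-13")
```

With this key layout, a report for one site and one time period reads a contiguous block of rows instead of filtering the whole table.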
Challenges
● “HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn
● Data size. Solutions:
  ● Compression (4-5x smaller with lzo)
  ● Data pruning (variable levels)
● Query string distribution: very long-tail
● Lots of data to process, update, aggregate
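The slides do not say how pruning was done. One possible reading, given the long-tail query distribution, is to keep exact counts only for queries seen at least some minimum number of times and fold the rest into a single bucket. The sketch below shows that idea; the `min_count` threshold, the `(other)` bucket and the function name are assumptions.

```python
from collections import Counter

def prune_long_tail(counts, min_count):
    """Keep queries seen at least min_count times; sum the rest into "(other)"."""
    kept = Counter()
    pruned_total = 0
    for query, n in counts.items():
        if n >= min_count:
            kept[query] = n
        else:
            pruned_total += n
    if pruned_total:
        kept["(other)"] = pruned_total
    return kept

# Hypothetical query log: two frequent queries and two one-off tail queries.
counts = Counter(["hbase"] * 5 + ["flume"] * 3 + ["hbase sink lzo", "flume tail ec2"])
pruned = prune_long_tail(counts, min_count=2)
```

Raising `min_count` prunes more aggressively, which is one way to get "variable levels": total query volume is preserved while the number of stored rows shrinks.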
Work @ Sematext
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• otis@sematext.com