Search Analytics With Flume and HBase

Otis Gospodnetić ••• Sematext International
Copyright 2010 Sematext Int'l. All rights reserved.
Agenda
● Who I am
● What, Why, How
● Architecture Evolution
● Role of Flume and HBase + Flume HBase Sink
● Challenges
About Otis Gospodnetić
• Lucene/Solr/Nutch/Mahout committer
• Lucene in Action 1 & 2 co-author
• Lucene Consulting since 2005
• Sematext Int'l since 2007
About Sematext
Consulting, development, support for:
● Big Data (Hadoop, HBase, Voldemort...)
● Search (Lucene, Solr, Elastic Search...)
● Web Crawling (Nutch)
● Machine Learning (Mahout)
What We Built
● Analytics for Search
● Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
● Trending over time
● Comparisons of time periods
● Top N reports
● Various report filters
Report Example
Why We Built It
subliminal msg: go use this site
● We need it
  ● search-hadoop.com & search-lucene.com
● Search customers need it
  ● Want to know what their visitors are searching for
  ● Want to know how their search is behaving
  ● …
How We Built It
● JavaScript Beacons
● Metric Capture Web App
● Data Capture Mechanisms
● Custom Log4J Appender
● Flume Agents, Collectors, Sinks
● HBase
● MapReduce Aggregations
● Search Analytics Reporting Web App
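To make the "MapReduce Aggregations" step concrete, here is a minimal in-memory sketch of a map step and a reduce step that count queries per hour. The event fields (`timestamp`, `query`) and the function names are assumptions made for illustration; the real jobs run on Hadoop against HBase and are not shown in the slides.

```python
from collections import defaultdict
from itertools import chain


def map_event(event):
    """Map step: emit ((hour, normalized query), 1) for one raw search event."""
    hour = event["timestamp"][:13]  # e.g. "2010-10-12T09" from an ISO-8601 string
    yield (hour, event["query"].lower()), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical raw events, standing in for rows of the raw logs table.
raw_events = [
    {"timestamp": "2010-10-12T09:15:02", "query": "HBase"},
    {"timestamp": "2010-10-12T09:47:30", "query": "hbase"},
    {"timestamp": "2010-10-12T10:01:11", "query": "flume sink"},
]

hourly_counts = reduce_counts(
    chain.from_iterable(map_event(e) for e in raw_events)
)
```

Hourly counts like these could then feed reports such as query volume and trending over time.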
What's Flume
● Distributed data/log collection service
● Scalable, configurable, extensible
● Centrally manageable, open source
● Agents get data from app, Collectors save it
● Abstractions: Source → Decorator(s) → Sink
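The Source → Decorator(s) → Sink abstraction can be pictured as a chain of small stages. The sketch below is plain Python, not Flume's API: the event shape and every function name are hypothetical and only illustrate how events flow from a source, through decorators, into a sink.

```python
def list_source(lines):
    """Source: turn raw lines into events (hypothetical event shape)."""
    for line in lines:
        yield {"body": line}


def strip_decorator(events):
    """Decorator: trim surrounding whitespace from each event body."""
    for event in events:
        yield {**event, "body": event["body"].strip()}


def drop_empty_decorator(events):
    """Decorator: filter out events whose body is empty."""
    for event in events:
        if event["body"]:
            yield event


def run_pipeline(source, decorators, sink):
    """Wire source -> decorators -> sink and return the number of events delivered."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    count = 0
    for event in stream:
        sink(event)
        count += 1
    return count


collected = []  # stands in for a sink such as a file or a table
delivered_count = run_pipeline(
    list_source(["  q=hbase  ", "", "q=flume\n"]),
    [strip_decorator, drop_empty_decorator],
    collected.append,
)
```

Each decorator only sees a stream of events, so decorators can be added, removed, or reordered without touching the source or the sink.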
What's HBase
● Scalable, reliable, distributed, column-oriented DB
● On top of HDFS
● MapReducable
High Level Architecture
Architecture #1
Architecture #1 – Getting Messy
Architecture #2 – HBaseLog4JAppender
HBaseLog4JAppender Cons
● Doesn't help with reliable delivery
  ● e.g. when network or HBase down
● Non-centralized config with larger clusters
  ● e.g. changing destination table in HBase
  ● e.g. changing sampling rate
Architecture #3 – Flume OOTB (out of the box)
Architecture #4 – Flume HBase Sink
FLUME-247 – Flume HBase sink
● Contributed by Sematext in September 2010
● Reviewed, pending commit
● Similar to FLUME-6 (basic example), but more flexible
● https://issues.cloudera.org/browse/FLUME-247
Walk-Through
● Start EC2 micro instance, configure logs-generation tool to simulate user actions
● User actions start getting logged to a log file
● Configure Flume Agent to "tail" the generated logs and send data to Flume Collector
● Collector processes log messages and sends them to HBase's "raw logs" table
● Later these logs are processed by the MapReduce job
Search Action → Metric Capture → Log File → Flume Agent → Flume Collector → Decorators → HBase Sink → HBase
● Decorator: processes Flume Collector log events and prepares them for HBase
● HBase sink: FLUME-247
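As a rough illustration of what "prepares them for HBase" might involve, the sketch below parses one log line into a row key plus column values. The tab-separated log format, the `event` column family, and the row-key layout are all assumptions made for this example; the slides do not specify them, and the actual decorator and FLUME-247 sink are Java components.

```python
def to_hbase_put(log_line: str) -> dict:
    """Parse one hypothetical tab-separated log line into a row key and columns.

    Assumed line format: <epoch millis> TAB <session id> TAB <query> TAB <hit count>
    """
    ts_ms, session_id, query, hits = log_line.rstrip("\n").split("\t")
    # Zero-padded timestamp first, so rows sort chronologically by key.
    row_key = f"{int(ts_ms):013d}:{session_id}"
    return {
        "row": row_key,
        "columns": {
            "event:query": query,  # "family:qualifier" names are illustrative
            "event:hits": hits,
        },
    }


put = to_hbase_put("1286874902000\tsess42\thbase sink\t17\n")
```

The resulting dictionary is only a stand-in for whatever structure the sink writes to the "raw logs" table.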
Why Flume
● Reliable delivery
  ● e.g. queue msgs locally if destination unreachable
● Easy, centralized management via Web UI or console
● Good community, good progress
● But: more complex, more moving parts
● On Flume: slideshare.net/cloudera/inside-flume
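The "queue msgs locally if destination unreachable" idea can be sketched in a few lines. This is not Flume code: the class, the `send` callable, and the use of `ConnectionError` to signal an unreachable destination are assumptions, and the queue here lives in memory only, whereas a real agent would need durable storage to survive a restart.

```python
from collections import deque


class BufferingSender:
    """Keep messages in a local queue until the destination accepts them."""

    def __init__(self, send):
        self._send = send  # callable that raises ConnectionError when unreachable
        self._pending = deque()

    def submit(self, message):
        """Queue the message, then try to drain the queue in order."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Send queued messages oldest-first; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False
            self._pending.popleft()  # remove only after a successful send
        return True

    @property
    def pending(self):
        return len(self._pending)


# Simulated destination that is down at first, then comes back.
state = {"up": False}
received = []


def flaky_send(message):
    if not state["up"]:
        raise ConnectionError("destination unreachable")
    received.append(message)


sender = BufferingSender(flaky_send)
sender.submit("q=hbase")
sender.submit("q=flume")
queued_while_down = sender.pending

state["up"] = True
sender.flush()
```

A message is removed from the queue only after a successful send, so an outage delays delivery rather than dropping data.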
Why HBase
● Scalable raw search data storage
● MapReduce data input
● Scalable aggregate data storage
● Fast scans for time ranges, fast key lookups
● Easy storage and compute power expansion
● Good looking roadmap, community, progress
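Time-range scans are fast because HBase keeps rows sorted by row key, so a scan between a start key and a stop key reads one contiguous slice. The toy table below imitates that behavior in memory with `bisect`; the class and the time-prefixed key layout are assumptions for illustration, not the schema used in the talk.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._rows[key] = value

    def get(self, key):
        """Point lookup by exact row key."""
        return self._rows.get(key)

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
table.put("20101012T10:sess3", "flume sink")
table.put("20101012T09:sess1", "hbase")
table.put("20101012T09:sess2", "lucene")
table.put("20101013T08:sess4", "solr")

# All searches logged during the 09:00 hour on 2010-10-12.
nine_oclock = table.scan("20101012T09", "20101012T10")
```

Because the time component leads the key, one hour of data maps to one start/stop pair, while `get` covers the key-lookup case.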
Challenges
● “HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn
● Data size. Solutions:
  ● Compression (4-5x smaller with LZO)
  ● Data pruning (variable levels)
● Query string distribution: very long-tail
● Lots of data to process, update, aggregate
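One possible form of data pruning for a long-tail query distribution is to keep per-query counts only for queries seen often enough and fold the rest into a single bucket. The threshold, the `(other)` label, and the sample counts below are invented for illustration; the slides do not say how the pruning levels were actually defined.

```python
from collections import Counter


def prune_long_tail(counts, min_count, other_label="(other)"):
    """Keep queries seen at least min_count times; merge the rest into one bucket."""
    pruned = Counter()
    for query, count in counts.items():
        if count >= min_count:
            pruned[query] += count
        else:
            pruned[other_label] += count  # total volume is preserved
    return pruned


# Made-up counts: a few popular queries and a tail of rare ones.
query_counts = Counter({
    "hbase": 120,
    "flume": 45,
    "lzo codec": 2,
    "flume hbase sink ec2": 1,
    "hbsae": 1,
})

pruned = prune_long_tail(query_counts, min_count=10)
```

Total query volume stays the same after pruning, but the number of distinct rows to store and aggregate shrinks to the popular queries plus one bucket.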
Work @ Sematext
We are hiring world-wide!
Search & Data Analytics
Machine Learning & NLP
Biiig Data
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• otis@sematext.com