
Search Analytics with Flume & HBase

Otis Gospodnetić ••• Sematext International

Copyright 2010 Sematext Int'l. All rights reserved.
Agenda

Who I am

What Why How

Architecture Evolution

Role of Flume and HBase + Flume HBase Sink

Challenges

About Otis Gospodnetić
• Lucene/Solr/Nutch/Mahout committer

• Lucene in Action 1 & 2 co-author

• Lucene Consulting since 2005

• Sematext Int'l since 2007

About Sematext
Consulting, development, support for:

Big Data (Hadoop, HBase, Voldemort...)


Search (Lucene, Solr, Elasticsearch...)


Web Crawling (Nutch)


Machine Learning (Mahout)
What We Built

Analytics for Search

Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)

Trending over time

Comparisons of time periods

Top N reports

Various report filters

Report Example

Why We Built it
subliminal msg: go use this site

We need it

search-hadoop.com & search-lucene.com


Search customers need it

Want to know what their visitors are searching for

Want to know how their search is behaving

How We Built it

JavaScript Beacons

Metric Capture Web App

Data Capture Mechanisms

Custom Log4J Appender

Flume Agents, Collectors, Sinks

HBase

MapReduce Aggregations

Search Analytics Reporting Web App

What's Flume

Distributed data/log collection service

Scalable, configurable, extensible

Centrally manageable, open source


Agents get data from app, Collectors save it

Abstractions: Source → Decorator(s) → Sink
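The Source → Decorator(s) → Sink chain can be sketched in a few lines of plain Python. This is only an illustration of the abstraction, not Flume's API: the event shape (a dict with a "body" field) and the names tail_source, add_field and run_pipeline are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]  # hypothetical event shape, not Flume's event class


def tail_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw log lines into events."""
    for line in lines:
        yield {"body": line.rstrip("\n")}


def add_field(key: str, value: str) -> Callable[[Event], Event]:
    """Decorator factory: returns a step that annotates each event."""
    def decorate(event: Event) -> Event:
        return {**event, key: value}
    return decorate


def run_pipeline(source: Iterable[Event],
                 decorators: List[Callable[[Event], Event]],
                 sink: Callable[[Event], None]) -> int:
    """Push every event through the decorators in order, then into the sink."""
    count = 0
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink(event)
        count += 1
    return count


# Usage: the "sink" here is just a list standing in for a real destination.
collected: List[Event] = []
n = run_pipeline(tail_source(["q=hbase\n", "q=flume\n"]),
                 [add_field("host", "web-1")],
                 collected.append)
```

Each stage only knows about events, so sources, decorators and sinks can be swapped independently.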

What's HBase

Scalable, reliable, distributed, column-oriented DB

On top of HDFS

MapReducable

High Level Architecture

Architecture #1

Architecture #1 - Getting Messy

Arch #2 – HBaseLog4JAppender

HBaseLog4JAppender Cons

Doesn't help with reliable delivery

e.g. when network or HBase down


Non-centralized config with larger clusters

e.g. changing destination table in HBase

e.g. changing sampling rate

Architecture #3 – Flume OOTB

Arch #4 – Flume HBase Sink

FLUME-247 – Flume HBase sink

Contributed by Sematext in September 2010


Reviewed, pending commit


Similar to FLUME-6 (basic example), but more flexible


https://issues.cloudera.org/browse/FLUME-247

Walk-Through

Start an EC2 micro instance, configure a log-generation tool to simulate user actions

User actions start getting logged to a log file

Configure Flume Agent to "tail" the generated logs and send data to Flume Collector

Collector processes log messages and sends them to HBase's "raw logs" table

Later these logs are processed by a MapReduce job

Search Action → Metric Capture → Log File → Flume Agent → Flume Collector → Decorators → HBase Sink → HBase


Decorator: processes Flume Collector log events and prepares them for HBase

HBase sink: FLUME-247
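The decorator step, turning a collected log event into something an HBase sink can write, might look roughly like this. The log line format (tab-separated timestamp followed by key=value pairs), the "raw" column family and the row key layout are assumptions made for illustration; the slides do not specify them, and this is not the FLUME-247 code.

```python
from typing import Dict, Tuple


def to_hbase_put(line: str) -> Tuple[str, Dict[str, str]]:
    """Parse one tab-separated log line into (row key, column -> value).

    Assumed line format: <timestamp>\tkey=value\tkey=value...
    A field without '=' raises ValueError.
    """
    timestamp, *pairs = line.rstrip("\n").split("\t")
    fields = dict(pair.split("=", 1) for pair in pairs)
    # Hypothetical row key: timestamp first so rows sort by time.
    row_key = f"{timestamp}|{fields.get('q', '')}"
    # Hypothetical column family "raw" for the raw logs table.
    columns = {f"raw:{name}": value for name, value in fields.items()}
    return row_key, columns


row_key, columns = to_hbase_put(
    "2010-10-12T10:15:30Z\tq=flume hbase\thits=42\tlatency_ms=17\n"
)
```

The returned pair maps naturally onto one HBase put: one row key plus a set of family:qualifier cells.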

Why Flume

Reliable delivery

e.g. queue msgs locally if destination unreachable

Easy, centralized management via Web UI or console

Good community, good progress

But: more complex, more moving parts

On Flume: slideshare.net/cloudera/inside-flume
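The "queue msgs locally if destination unreachable" idea can be sketched as a small buffering wrapper around a send function. This is an in-memory illustration only, with hypothetical names (BufferingSink, send, state); it does not show how Flume itself implements reliable delivery, and a real agent would need a durable buffer.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class BufferingSink:
    """Keeps messages in a local queue until the destination accepts them."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()

    def append(self, message: str) -> None:
        self._pending.append(message)
        self.flush()

    def flush(self) -> None:
        """Deliver queued messages in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # destination unreachable: keep messages queued
            self._pending.popleft()

    @property
    def pending(self) -> int:
        return len(self._pending)


# Usage with a fake destination that can be switched on and off.
state: Dict[str, bool] = {"online": False}
delivered: List[str] = []


def send(message: str) -> None:
    if not state["online"]:
        raise ConnectionError("destination unreachable")
    delivered.append(message)


sink = BufferingSink(send)
sink.append("a")
sink.append("b")
queued_while_down = sink.pending  # both messages are still held locally
state["online"] = True
sink.flush()
```

A message is removed from the queue only after the send call returns, so a failure in the middle of a flush never drops it.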

Why HBase

Scalable raw search data storage

MapReduce data input

Scalable aggregate data storage

Fast scans for time ranges, fast key lookups

Easy storage and compute power expansion

Good looking roadmap, community, progress
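Fast time-range scans follow from HBase keeping rows sorted by row key: if the key starts with the time, a range of days is one contiguous slice. The sketch below imitates that with a sorted Python list and binary search; the key layout (day|query) and the sample counts are made up for illustration.

```python
import bisect
from typing import List, Tuple


def make_key(day: str, query: str) -> str:
    """Hypothetical row key: day first, so keys sort chronologically."""
    return f"{day}|{query}"


# Stand-in for a table: rows kept sorted by key, as HBase does.
rows: List[Tuple[str, int]] = sorted([
    (make_key("20101010", "hbase"), 3),
    (make_key("20101011", "flume"), 5),
    (make_key("20101011", "hbase"), 7),
    (make_key("20101012", "solr"), 2),
])
keys = [key for key, _ in rows]


def scan(start_day: str, stop_day: str) -> List[Tuple[str, int]]:
    """Return rows with start_day <= day < stop_day via two binary searches."""
    lo = bisect.bisect_left(keys, start_day)
    hi = bisect.bisect_left(keys, stop_day)
    return rows[lo:hi]


result = scan("20101011", "20101012")
```

The scan touches only the rows inside the range, which is why key design matters more than any secondary index here.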

Challenges

“HBase in a box” is like “dynamic equilibrium”, or “virtual reality”, or “jumbo shrimp” – search-hadoop.com/m/p68C12nb7Hn

Data size. Solutions:

Compression (4-5x smaller with LZO)

Data pruning (variable levels)

Query string distribution: very long-tail

Lots of data to process, update, aggregate
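One way to picture pruning a long-tail query distribution: count queries, keep those seen at least some minimum number of times, and fold the rest into a single bucket. This is a guess at what one pruning level could look like, not the method used in the product; the function name, the "(other)" bucket and the threshold are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def prune_long_tail(queries: Iterable[str], min_count: int) -> Dict[str, int]:
    """Keep queries seen at least min_count times; sum the rest as '(other)'."""
    counts = Counter(queries)
    kept = {q: n for q, n in counts.items() if n >= min_count}
    tail = sum(n for n in counts.values() if n < min_count)
    if tail:
        kept["(other)"] = tail  # total volume is preserved, detail is dropped
    return kept


log = ["hbase"] * 5 + ["flume"] * 3 + ["hbase sink", "flume agent", "lzo"]
pruned = prune_long_tail(log, min_count=2)
```

Raising min_count gives a more aggressive pruning level: fewer distinct rows to store and aggregate, at the cost of losing rare queries.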

Work @ Sematext

We are hiring world-wide!

Search & Data Analytics


Machine Learning & NLP
Biiig Data

Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• otis@sematext.com
