
Arinto Murdopo Josep Subirats Group 4 EEDC 2012

Outline
- Current problem
- What is Apache Flume?
- The Flume Model
  - Flows and Nodes
  - Agent, Processor and Collector Nodes
  - Data and Control Path
- Flume goals
  - Reliability
  - Scalability
  - Extensibility
  - Manageability
- Use case: Near Realtime Aggregator

Current Problem
Situation:
You have hundreds of services running on different servers, producing lots of large logs that should be analyzed together. You have Hadoop to process them.

Problem:
How do I send all my logs to the place where Hadoop runs? I need a reliable, scalable, extensible and manageable way to do it!

What is Apache Flume?
It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed.
Goals: reliability, scalability, extensibility, manageability.

Exactly what I needed!

The Flume Model: Flows and Nodes
A flow corresponds to a type of data source (server logs, machine monitoring metrics...).
Flows are comprised of nodes chained together (see slide 7).

The Flume Model: Flows and Nodes
In a node, data come in through a source...
  Examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
...are optionally processed by one or more decorators...
  Examples: wire batching, compression, sampling, projection, extraction...
...and then are transmitted out via a sink.
  Examples: Console, local files, HDFS, S3, other nodes...
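In Flume's dataflow language, a node's source, decorators and sink are chained with `|` and `{ decorator => sink }`. A minimal sketch (the log path, collector host and port are made-up values):

```
agent1 : tail("/var/log/app.log") | { batch(100) => agentSink("collector-host", 35853) } ;
```

Here `tail` is the source, `batch(100)` a decorator that groups events for the wire, and `agentSink` the sink that forwards them to another node.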

The Flume Model: Agent, Processor and Collector Nodes
Agent: receives data from an application.
Processor (optional): performs intermediate processing.
Collector: writes data to permanent storage.
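As a sketch, an agent tier and a collector tier might be wired together like this in Flume's configuration language (host names, port and HDFS path are hypothetical):

```
agent1 : tail("/var/log/web.log") | agentSink("collector1", 35853) ;
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/web", "data-") ;
```

The agent tails an application log and forwards events; the collector receives agent traffic on the same port and writes it to permanent storage.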

The Flume Model: Data and Control Path (1/2)
Nodes are in the data path.

The Flume Model: Data and Control Path (2/2)
Masters are in the control path.
- Centralized point of configuration; multiple masters coordinate through ZooKeeper (ZK).
- They specify sources, sinks and control data flows.
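Node configurations are pushed through the master, for example from the Flume shell; a sketch (the master host and node name are assumptions):

```
$ flume shell -c master-host
exec config agent1 'tail("/var/log/web.log")' 'agentSink("collector1", 35853)'
```

The master then propagates the new source/sink mapping to `agent1` over the control path, without restarting the node.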

Flume Goals: Reliability
Tunable failure recovery modes:
- Best Effort
- Store on Failure and Retry
- End to End
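In Flume these three modes surface as different agent sinks; a sketch (collector host and port are made-up values):

```
beAgent  : tail("/var/log/app.log") | agentBESink("collector1", 35853) ;
dfoAgent : tail("/var/log/app.log") | agentDFOSink("collector1", 35853) ;
e2eAgent : tail("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
```

`agentBESink` drops events if the collector is unreachable (best effort), `agentDFOSink` buffers them on local disk and retries (store on failure), and `agentE2ESink` holds events until an end-to-end acknowledgement confirms they reached permanent storage.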

Flume Goals: Scalability
- Horizontally Scalable Data Path
- Load Balancing
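One way the data path spreads agents across collectors and tolerates collector failures is with sink chains; a sketch using the failover syntax `< primary ? backup >` (host names are hypothetical):

```
agent1 : tail("/var/log/app.log") | < agentBESink("collector1", 35853) ? agentBESink("collector2", 35853) > ;
```

If `collector1` goes down, the agent fails over to `collector2`; assigning different agents different primary collectors spreads the load.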

Flume Goals: Scalability
- Horizontally Scalable Control Path

Flume Goals: Extensibility
Simple Source and Sink API
- Event streaming and composition of simple operations
Plug-in Architecture
- Add your own sources, sinks and decorators

Flume Goals: Manageability
- Centralized Data Flow Management Interface

Flume Goals: Manageability
Configuring Flume:

  node : tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;

Output Bucketing:

  /logs/web/2010/0715/1200/data-xxx.txt
  /logs/web/2010/0715/1200/data-xxy.txt
  /logs/web/2010/0715/1300/data-xxx.txt
  /logs/web/2010/0715/1300/data-xxy.txt
  /logs/web/2010/0715/1400/data-xxx.txt
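Date-bucketed paths like the ones above come from escape sequences in the collector sink's path; a sketch (namenode, port and file prefix are assumptions):

```
collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data-") ;
```

`%Y`, `%m`, `%d` and `%H` expand from each event's timestamp, so events land in per-hour directories automatically.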

Use Case: Near Realtime Aggregator

Conclusion
Flume is a distributed data collection service, suitable for enterprise settings with large amounts of log data to process.

Q&A
Any questions?

