Outline
Current Problem
What is Apache Flume?
The Flume Model
  Flows and Nodes
  Agent, Processor and Collector Nodes
  Data and Control Path
Flume Goals
  Reliability
  Scalability
  Extensibility
  Manageability
Use Case: Near-Realtime Aggregator
Current Problem
Situation:
You have hundreds of services running on different servers, producing large volumes of logs that need to be analyzed together. You have Hadoop to process them.
Problem:
How do I get all my logs to the place where Hadoop runs? I need a reliable, scalable, extensible and manageable way to do it!
The Flume Model: Flows and Nodes
A flow corresponds to one type of data source (server logs, machine-monitoring metrics, ...). Flows are composed of nodes chained together (see slide 7).
The Flume Model: Flows and Nodes
In a node, data comes in through a source, is optionally processed by one or more decorators, and is then transmitted out via a sink.
Source examples: Console, Exec, Syslog, IRC, Twitter, other nodes...
Decorator examples: wire batching, compression, sampling, projection, extraction...
Sink examples: Console, local files, HDFS, S3, other nodes...
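Concretely, a node's source/decorator/sink wiring can be written down in Flume's (OG) dataflow spec language. The sketch below is illustrative only: the node name, log path, batch size, collector hostname, and port are all made up.

```
# Hypothetical node spec: <node> : <source> | <sink> ;
# Decorators wrap a sink as { decorator => sink }.
# Here, a node tails a local log, batches events on the wire,
# and forwards them to a collector node (names/paths invented).
webLog : tail("/var/log/app/access.log") | { batch(100) => agentSink("collector01", 35853) } ;
```

The decorator sits between source and sink, so batching (or compression, sampling, ...) can be added or removed without changing either endpoint.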
Agent:
collect data from the machines that produce it and forward it downstream.
Processor (optional):
perform intermediate processing on the data.
Collector:
write data to permanent storage (e.g., HDFS).
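These roles can be sketched as a two-tier flow in the same spec language. The hostnames, port, and paths below are invented for illustration, not taken from a real deployment:

```
# Agent tier: tail a local log and ship events to a collector.
agent1 : tail("/var/log/httpd/access.log") | agentSink("collector01", 35853) ;

# Collector tier: receive events from agents and write them to HDFS.
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/logs/", "web-") ;
```

Many agents can point at one collector, which batches their combined stream into large files suitable for HDFS.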
Load Balancing
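One plausible illustration, assuming Flume's (OG) chain sinks: an agent can list several collectors so that traffic fails over between them, and across many agents the mappings spread load over the collector pool. The collector names and port here are assumptions:

```
# Chain sink: try collectorA first; on failure, fall back to collectorB.
agent1 : tail("/var/log/app.log") | agentE2EChain("collectorA:35853", "collectorB:35853") ;
```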
Plug-in Architecture
Add your own sources, sinks, and decorators.
Output Bucketing
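As a sketch of output bucketing: the collector sink's output path may contain escape sequences that are expanded per event, so data lands in time-based buckets. The HDFS path, file prefix, and port below are made up for illustration:

```
# %Y-%m-%d expands to the event's date, so each day's events
# are written under their own directory (output bucketing).
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/%Y-%m-%d/", "web-") ;
```

Bucketing by date (or host) keeps downstream Hadoop jobs simple: a job over one day's logs just reads one directory.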
Conclusion
Flume is a distributed data collection service, suitable for enterprise settings with large amounts of log data to process.
Q&A
Questions?
References
http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
http://www.slideshare.net/cloudera/inside-flume
http://www.slideshare.net/cloudera/flume-intro100715
http://www.slideshare.net/cloudera/flume-austin-hug-21711