Introduction
In the world of big data, most industry experts agree that Hadoop stands alone as the go-to tool for the efficient and cost-effective ingestion, analysis, and interpretation of the massive amounts of data that nearly every business finds itself swimming in. Business leaders have discovered that this data has a real and profound impact on the bottom line, and as a result, more and more IT departments are being tasked with creating and maintaining software to pull value out of that data; many are deciding to build their solutions on Hadoop. Hadoop's open source core, mature codebase, and vibrant developer community all work in concert to make it the de facto platform on which developers build and release mission-critical data storage and analysis software. It is every bit as innovative a technology as Linux was and continues to be, and for many of the same reasons: an open-source, easily extensible framework that allows individuals to solve problems important to them, and a community that encourages the sharing of those solutions with others. Even MapReduce jobs are often compared to Linux pipes: small, nimble, single-purpose tasks (like grep to filter, or awk/sed to modify, as in ETL) that can be chained together in many different ways to accomplish something very powerful. Either by using the framework itself or one of the multitude of ecosystem projects around its periphery, people use Hadoop as an extract, transform, load (ETL) platform (Sqoop, HParser), a NoSQL database (HBase), a file server, an OLAP replacement, a business intelligence framework, a real-time data ingestion service (Flume, Chukwa), a distributed coordination service (ZooKeeper), a real-time big data query engine (Impala, Drill), and a couple dozen other technologies that we don't have space to even introduce.
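The pipes comparison can be made concrete. Below is a minimal sketch: a plain Unix pipeline that filters and reshapes a small log file, followed (in comments) by a rough Hadoop Streaming equivalent. The log contents, HDFS paths, and streaming jar name are assumptions for illustration, not part of any particular deployment.

```shell
# A classic Unix pipeline: filter with grep, reshape with awk.
# Create a tiny sample access log (hypothetical data) to work on.
printf 'GET /a 200\nGET /b 404\nGET /c 404\n' > access.log

# Keep only lines ending in " 404", then print the request path.
grep ' 404$' access.log | awk '{print $2}'

# The same filter/transform steps can run over data in HDFS with
# Hadoop Streaming (jar name and paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/access -output /logs/404-paths \
#     -mapper "grep ' 404$'" \
#     -reducer "awk '{print \$2}'"
```

The point of the analogy holds in both forms: each stage does one small job, and the power comes from chaining them.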
For a long time in the Hadoop world, individuals have built and chained these pieces together in ways that solve new and novel problems, making Hadoop something of the Swiss Army knife of IT. Let's look at each of these uses in more detail and discuss how and why Hadoop is an excellent choice for the task.
If, like many organizations, you find yourself with a lot of unused disk space after adding more nodes for processing, you will see that Hadoop is a great place to store much of the data that you've been keeping on your more expensive NetApp or EMC file servers. Dedicated file servers do what they do incredibly well, but chances are that a lot of the data you're storing on them doesn't require the access throughput and volume that they provide. Data you access less frequently can be migrated to HDFS and stored there for about one-tenth of the cost, while still allowing your end users to interact with it via a variety of clients and protocols. The Hadoop ecosystem provides Fuse-DFS (which lets users mount HDFS as if it were a local filesystem), HttpFS (a RESTful HDFS interface), WebDAV, FTP, and many other access protocols and packages, with more being added every day. Even if there were no need or intention of ever running MapReduce on this data, just using Hadoop as a file server could save your organization a lot of cash.
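To sketch what that HTTP access looks like, the snippet below builds standard WebHDFS REST URLs of the kind an HttpFS gateway serves; the host, port, and paths are assumptions for illustration. Against a live cluster you would pass these URLs to curl or any HTTP client.

```shell
# Assumed HttpFS endpoint; substitute your cluster's host and port.
HTTPFS="http://httpfs-host:14000/webhdfs/v1"

# LISTSTATUS lists a directory and OPEN streams a file; both are
# standard WebHDFS operations. On a live cluster you would run, e.g.:
#   curl -s -L "<url>?op=...&user.name=..."
echo "${HTTPFS}/archive/logs?op=LISTSTATUS&user.name=hdfs"
echo "${HTTPFS}/archive/logs/app.log?op=OPEN&user.name=hdfs"
```

Because the interface is plain HTTP, any tool or script in your organization can read archived data without a Hadoop client installed.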
If you do decide to use HBase in your organization, be aware of one caveat: you rarely want to also run MapReduce on the same data, as the two processing stacks will fight each other for node resources. Most organizations that also expect to run MapReduce maintain two separate clusters, one for HBase and one for MapReduce.
Learn More
To learn more about how you can improve productivity, enhance efficiency, and sharpen your competitive edge, Global Knowledge suggests the following courses:
Cloudera Administrator Training for Apache Hadoop
Cloudera Developer Training for Apache Hadoop
Cloudera Essentials for Apache Hadoop
Visit www.globalknowledge.com or call 1-800-COURSES (1-800-268-7737) to speak with a Global Knowledge training advisor.