Introduction
In the world of big data, most industry experts agree that Hadoop stands alone as the go-to tool for the efficient and cost-effective ingestion, analysis, and interpretation of the massive amounts of data that nearly every business finds itself swimming in. Business leaders have discovered that this data has a real and profound impact on the bottom line, and as a result, more and more IT departments are being tasked with creating and maintaining software to pull value out of that data; many are deciding to build their solutions on Hadoop. Hadoop's open source core, mature codebase, and vibrant developer community all work in concert to make it the de facto platform on which developers build and release mission-critical data storage and analysis software. It is every bit as innovative a technology as Linux was and continues to be, and for many of the same reasons: an open-source, easily extensible framework that allows individuals to solve problems important to them, and a community that encourages the sharing of those solutions with others. Even MapReduce jobs are often compared to Linux pipes: small, nimble, single-purpose tasks (like grep to filter, or awk/sed to modify, as in ETL) that can be chained together in many different ways to accomplish something very powerful. Either by using the framework itself or one of the multitude of ecosystem projects around its periphery, people use Hadoop as an extract, transform, load (ETL) platform (Sqoop, HParser), a NoSQL database (HBase), a file server, an OLAP replacement, a business intelligence framework, a real-time data ingestion service (Flume, Chukwa), a distributed coordination service (ZooKeeper), a real-time big data query engine (Impala, Drill), and a couple dozen other technologies that we don't have space to even introduce.
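The pipes comparison can be made concrete. Below is a minimal sketch: a plain Unix pipeline that filters and reshapes a small log file, followed (in comments) by a rough Hadoop Streaming equivalent. The log contents, HDFS paths, and streaming jar name are assumptions for illustration, not part of any particular deployment.

```shell
# A classic Unix pipeline: filter with grep, reshape with awk.
# Create a tiny sample access log (hypothetical data) to work on.
printf 'GET /a 200\nGET /b 404\nGET /c 404\n' > access.log

# Keep only lines ending in " 404", then print the request path.
grep ' 404$' access.log | awk '{print $2}'

# The same filter/transform steps can run over data in HDFS with
# Hadoop Streaming (jar name and paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/access -output /logs/404-paths \
#     -mapper "grep ' 404$'" \
#     -reducer "awk '{print \$2}'"
```

The point of the analogy holds in both forms: each stage does one small job, and the power comes from chaining them.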
For a long time in the Hadoop world, individuals have built and chained these pieces together in ways that solve new and novel problems, making Hadoop something of the Swiss Army knife of IT. Let's look at each of these uses in more detail and discuss how and why Hadoop is an excellent choice for the task.
If, like many organizations, you find yourself with a lot of unused disk space after adding more nodes for processing, you will see that Hadoop is a great place to store much of the data that you've been keeping on your more expensive NetApp or EMC file servers. Dedicated file servers do what they do incredibly well, but chances are that a lot of the data you're storing on them doesn't require the access throughput and volume that they provide. Data you access less frequently can be migrated to HDFS and stored there for about one-tenth of the cost, while still allowing your end users to interact with it via a variety of clients and protocols. The Hadoop ecosystem provides Fuse-DFS (which lets users mount HDFS as if it were a local filesystem), HttpFS (a RESTful HDFS interface), WebDAV, FTP, and many other access protocols and packages, with more being added every day. Even if there were no need or intention of ever running MapReduce on this data, just using Hadoop as a file server could save your organization a lot of cash.
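To sketch what that HTTP access looks like, the snippet below builds standard WebHDFS REST URLs of the kind an HttpFS gateway serves; the host, port, and paths are assumptions for illustration. Against a live cluster you would pass these URLs to curl or any HTTP client.

```shell
# Assumed HttpFS endpoint; substitute your cluster's host and port.
HTTPFS="http://httpfs-host:14000/webhdfs/v1"

# LISTSTATUS lists a directory and OPEN streams a file; both are
# standard WebHDFS operations. On a live cluster you would run, e.g.:
#   curl -s -L "<url>?op=...&user.name=..."
echo "${HTTPFS}/archive/logs?op=LISTSTATUS&user.name=hdfs"
echo "${HTTPFS}/archive/logs/app.log?op=OPEN&user.name=hdfs"
```

Because the interface is plain HTTP, any tool or script in your organization can read archived data without a Hadoop client installed.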
If you do decide to use HBase in your organization, be aware of one caveat: you rarely want to also run MapReduce on the same data, as the two processing stacks will fight each other for node resources. Most organizations that also expect to run MapReduce maintain two separate clusters, one for HBase and one for MapReduce.
Learn More
To learn more about how you can improve productivity, enhance efficiency, and sharpen your competitive edge, Global Knowledge suggests the following courses:
Cloudera Administrator Training for Apache Hadoop
Cloudera Developer Training for Apache Hadoop
Cloudera Essentials for Apache Hadoop
Visit www.globalknowledge.com or call 1-800-COURSES (1-800-268-7737) to speak with a Global Knowledge training advisor.