
CTOlabs.com

White Paper: Hadoop for Intelligence Analysis


July 2011

A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data.

Inside:

Apache Hadoop and Cloudera
Intelligence Community Use Cases
Context You Can Use Today


Hadoop and related capabilities bring new advantages to intelligence analysts

Executive Summary
Intelligence analysis is all about dealing with Big Data: massive collections of unstructured information. The Intelligence Community already works with much more data than it can process, and it continues to collect more through new and evolving sensors, open-source intelligence, better information sharing, and continued human information gathering. More information is always better, but to make use of it, analysis needs to keep pace through innovations in data management.

The Apache Hadoop Technology


Apache Hadoop is a project operating under the auspices of the Apache Software Foundation (ASF). The Hadoop project develops open source software for reliable, scalable, distributed computing. Hadoop is an exciting technology that can help analysts and agencies make the most of their data: it can inexpensively store any type of information from any source on commodity hardware and allow fast, distributed analysis run in parallel across the servers in a Hadoop cluster. Hadoop is reliable, managing and healing itself; scalable, working as well with one terabyte of data across three nodes as it does with petabytes of data across thousands; affordable, costing much less per terabyte to store and process data than traditional alternatives; and agile, applying evolving schemas to data as it is read into the system.

Cloudera is the leading provider of Hadoop-based software and services. It provides Cloudera's Distribution including Apache Hadoop (CDH), the most popular way to implement Hadoop. CDH is an open system assembled from the most useful projects in the Hadoop ecosystem, bundled together and simplified for use. Because CDH is available for free download, it is a great place to start when implementing Hadoop, and Cloudera also offers support, management applications, and training to make the most of it.
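To make the programming model concrete, the sketch below shows the canonical word count job written against the Hadoop MapReduce Java API: the map step runs in parallel across the cluster's data nodes and the reduce step assembles the partial results into totals. The input and output paths are placeholders supplied at run time; this is a minimal illustration, not production code.

// A minimal sketch of a Hadoop MapReduce job (the classic word count),
// showing how analysis is split into map and reduce steps that run in
// parallel across a cluster. Input/output paths are hypothetical.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}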


Intelligence Community Use Cases


Given its work with Big Data, the Intelligence Community is a natural candidate for Hadoop. Most broadly, using Hadoop to manage data can lead to substantial savings. The Hadoop Distributed File System (HDFS) stores information for several cents per gigabyte per month, as opposed to traditional methods that cost dollars. HDFS achieves this by pooling commodity servers into a single hierarchical namespace, which works well with large files that are written once and read many times. With organizations everywhere on the lookout for efficient ways to modernize IT, this is an attractive feature: Hadoop can manage the same data for a fraction of the cost.

The unpredictable nature of intelligence developments also makes good use of Hadoop's scalability. Nodes can be added to or removed from a Hadoop cluster easily and seamlessly, and performance scales linearly, providing the agility to rapidly mobilize resources. Consider how fast Intelligence Community missions must shift: just a month after Bin Laden's death, the community was boresighted on operations in Libya. Who knows where the next focus area will be? What is needed is the ability to easily shift computational power from one mission to another.

Hadoop also provides the agility to deal with whatever form or source of information a project requires. Most of the data critical to intelligence is unstructured, such as text, video, and images, and Hadoop can take data straight from the file store without any special transformation. Because this complex data can then be stored and analyzed together, Hadoop can help intelligence agencies overcome information overload. Current technology and intelligence gathering methods produce far more data than analysts can reasonably hope to monitor on their own. For example, at the end of 2010, General Cartwright noted that it took 19 analysts to process the information gathered by a Predator drone, and that with new sensor technology being developed, dense data sensors meshing together video feeds to cover a city while simultaneously intercepting cell phone calls and e-mails, it would take 2,000 analysts to manually process the data gathered from a single drone. Algorithms are being developed to pull out and present to analysts what really matters from all of those video feeds and intercepts. Sorting through such an expanse of complex data is precisely the sort of challenge Hadoop was designed to tackle. The data also becomes more valuable when combined with contextual information such as coordinates, times, and the subjects identified in videos.
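As a small illustration of loading data without special transformation, the sketch below uses the HDFS Java FileSystem API to copy a local file into the cluster exactly as collected. The NameNode address and directory paths are hypothetical placeholders.

// A minimal sketch of writing a local file into HDFS with the Java FileSystem API.
// Cluster address and paths are hypothetical; the configuration key is
// fs.defaultFS on current releases (fs.default.name on older ones).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // hypothetical NameNode
    try (FileSystem fs = FileSystem.get(conf)) {
      // Copy raw collection output into the cluster as-is; no schema or
      // transformation is required before the data can be analyzed.
      fs.copyFromLocalFile(new Path("/data/incoming/report-2011-07-01.txt"),
                           new Path("/intel/raw/report-2011-07-01.txt"));
    }
  }
}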


Hadoop can then sort and search through this new multi-level, multi-media intelligence almost instantly, regardless of the amount and type of material generated.

Hadoop's low cost and speed open up other intelligence capabilities. Hadoop clusters have often been used as data sandboxes to test new analytics cheaply and quickly, to see whether they yield results and can be implemented widely. For analysts, this means they can test theories and algorithms that have a low probability of success without much wasted time or resources, allowing them to be more creative and thorough. This in turn helps prevent the failures of imagination blamed for misreading the intelligence before the September 11 attacks and several subsequent plots.

Hadoop is also well suited to evolving analysis techniques such as Social Network Analysis and textual analysis, both of which are being aggressively developed by intelligence agencies and contractors. Social Network Analysis uses human interactions, such as phone calls, text messages, meetings, and emails, to construct and decipher a social network, identifying leaders and the key nodes for linking members, linking groups, gaining exposure, and contacting other important members. Rarely are these members the figureheads in the media, or even the stated leadership of terrorist and criminal organizations. Social Network Analysis is helpful for identifying high-value targets to exploit or eliminate, but for sizable organizations it involves a tremendous amount of data, thousands of interactions of varying types among thousands of members and associates, making Hadoop an excellent platform. Some projects, such as Klout, which applies Social Network Analysis to social media to determine user influence, style, and role, already run on Hadoop.

Hadoop has also proven itself as a platform for textual analysis. Large text repositories such as chat rooms, newspapers, or email inboxes are Big Data, expansive and unstructured, and hence well suited to analysis with Hadoop. IBM's Watson, which beat human contestants on Jeopardy!, has been the most prominent example of the power of textual analysis, and it ran on Hadoop. Watson was able to look through and interpret libraries of text to form the right question for the answers presented in the game, from history to science to pop culture, faster and more accurately than the human champions it played against. Textual analysis has value beyond game shows, however, as it can be used on forums and correspondence to analyze sentiment and find hidden connections among people, places, and topics.
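To make the Social Network Analysis use case more concrete, the sketch below shows the kind of MapReduce job that could feed it: counting how often each pair of identities interacts, given comma-separated interaction records. The record format and class names are illustrative assumptions, not a description of any deployed system; a driver like the word count example above would submit the job.

// A minimal sketch of a MapReduce job supporting Social Network Analysis:
// count interactions per pair of identities. Input lines are assumed to be
// "partyA,partyB,timestamp" -- a hypothetical format for illustration only.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InteractionCount {

  // Map step: normalize each interaction into an unordered pair and emit a count of 1.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2) return;                    // skip malformed records
      String a = fields[0].trim(), b = fields[1].trim();
      // Order the pair so A->B and B->A count as the same relationship.
      pair.set(a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a);
      context.write(pair, ONE);
    }
  }

  // Reduce step: total interactions per pair; heavily weighted edges suggest key relationships.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
}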

For Further Reference


Many terms are used by technologists to describe the detailed features and functions provided in CDH. The following list may help you decipher the language of Big Data:

CDH: Cloudera's Distribution including Apache Hadoop. It contains HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper, and Hue. When most people say "Hadoop" they mean CDH.

HDFS: The Hadoop Distributed File System, a scalable means of distributing data that takes advantage of commodity hardware. HDFS replicates all data in a location-aware manner to lessen internal datacenter network load, which frees the network for more complicated transactions.

Hadoop MapReduce: The processing framework that breaks jobs down across the Hadoop datanodes and then reassembles the results from each into a coherent answer.

Hive: A data warehouse infrastructure that leverages the power of Hadoop. Hive puts structure on the data, provides tools that easily summarize queries, and gives users the ability to query using familiar methods (like SQL). Hive also allows MapReduce programmers to enhance their queries.

Pig: A high-level data-flow language that enables advanced parallel computation. Pig makes parallel programming much easier.

HBase: A scalable, distributed database that supports structured data storage for large tables. Used when you need random, realtime read/write access to your Big Data. It can host very large tables, billions of rows by millions of columns, atop commodity hardware. It is a column-oriented store modeled after Google's BigTable and is optimized for realtime data. HBase has replaced Cassandra at Facebook.
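For readers new to HBase, the sketch below shows what random, realtime access looks like from the HBase Java client API: a single cell is written and then read back by row key. The table name, column family, and qualifier are hypothetical.

// A minimal sketch of random read/write access with the HBase Java client.
// Table, column family, and qualifier names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("messages"))) {
      // Write one cell: row key "msg-0001", column family "m", qualifier "subject".
      Put put = new Put(Bytes.toBytes("msg-0001"));
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("subject"), Bytes.toBytes("status report"));
      table.put(put);

      // Read it back by row key -- a single-row lookup rather than a batch scan.
      Result result = table.get(new Get(Bytes.toBytes("msg-0001")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("m"), Bytes.toBytes("subject"))));
    }
  }
}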


Sqoop: A tool for efficiently moving data between SQL databases and Hadoop, supporting both import and export.

Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data.

Oozie: A workflow engine that enhances management of data processing jobs for Hadoop. It manages dependencies between HDFS, Pig, and MapReduce jobs.

ZooKeeper: A very high performance coordination service for distributed applications.

Hue: A browser-based desktop interface for interacting with Hadoop. It includes a file browser, a job tracker interface, a cluster health monitor, and many other easy-to-use features.

More Reading
For more use cases for Hadoop in the intelligence community, visit:

CTOvision.com - a blog for enterprise technologists with a special focus on Big Data.
CTOlabs.com - the repository for our research and reporting on all IT issues.
Cloudera.com - providing enterprise solutions around CDH plus training, services and support.


About the Author


Alexander Olesker is a technology research analyst at Crucial Point LLC, focusing on disruptive technologies of interest to enterprise technologists. He writes at http://ctovision.com. Alex is a graduate of the Edmund A. Walsh School of Foreign Service at Georgetown University with a degree in Science, Technology, and International Affairs. He researches and writes on developments in technology and government best practices for CTOvision.com and CTOlabs.com, and has written numerous white papers on these subjects. Alex has worked or interned in early childhood education, private intelligence, law enforcement, and academia, contributing to numerous publications on technology, international affairs, and security, and has lectured at Georgetown and in the Netherlands. He is also the founder and primary contributor of an international security blog that has been quoted and featured by numerous pundits and by the War Studies blog of King's College London. Alex is a fluent Russian speaker and proficient in French. Contact Alex at AOlesker@crucialpointllc.com.

For More Information


If you have questions or would like to discuss this report, please contact me. As an advocate for better IT in government, I am committed to keeping the dialogue open on technologies, processes and best practices that will keep us moving forward.

Contact: Bob Gourley, bob@crucialpointllc.com, 703-994-0549

All information/data © 2011 CTOlabs.com.

