
White Paper

UNDERSTANDING BIG DATA AND THE QFABRIC SYSTEM


QFabric System Enables a High-Performance, Scalable Big Data Infrastructure with Simplicity

Copyright 2012, Juniper Networks, Inc.

White Paper - Understanding Big Data and the QFabric System

Table of Contents
Executive Summary
Introduction
Big Data Use Cases in Healthcare
Apache Hadoop: The Big Player
The Network's Role in Hadoop
Data Reliability
Performance
Network Operation
QFabric System Support for Big Data
Big Data Infrastructure Strategy
Operational Simplicity at Scale
Data Reliability at Scale
Performance at Scale
Conclusion
About Juniper Networks

List of Figures
Figure 1: Big data and Hadoop cluster
Figure 2: The network's role in Hadoop
Figure 3: Comparison of a chassis switch and a QFabric System
Figure 4: An optimized mid-size Hadoop cluster with QFabric System


Executive Summary
In today's increasingly complex and volatile business environment, organizations need to constantly adopt innovative technologies to compete. Big data refers to a collection of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can provide new business insights, open new markets, and create new competitive advantage in many industries. According to the McKinsey Global Institute (MGI) [1], for example, retailers can realize the potential of a 60 percent increase in operating margins by fully harnessing big data. In the healthcare industry, big data can reduce costs and enhance patient outcomes in diagnosis and treatment by driving efficiency, transparency, and quality. IDC expects the big data technology and services market to grow from $3.2 billion in 2010 to $16.9 billion in 2015, making it one of the fastest growing areas in the overall information and communication technology (ICT) market [2].

Big data has recently become an issue for organizations due to the dramatic increase in data creation and data gathering, driven by a number of technological innovations. The rise of mobile users has increased the aggregation of user statistics in the enterprise and, if properly synthesized and analyzed, these same statistics can provide highly relevant and competitive business intelligence. The increasing use of sensors for everything from traffic patterns and purchasing behaviors to real-time inventory management is another good example of the significant increase in large data sets. Much of this data, gathered in real time, can provide unique and powerful intelligence, especially if it can be analyzed and acted upon quickly.

In contrast to structured data, which has historically been stored in data warehouses and analyzed with Structured Query Language (SQL) analysis tools, big data requires a flat, horizontally scalable database, often with unique query tools that work in real time (as opposed to time-delineated snapshots). IT organizations must invest in new technologies and architectures to best leverage and gain advantage from the power of these new massive real-time data streams. In short, the big data phenomenon raises a challenging question for CIOs and CTOs: What is the best big data infrastructure strategy?

Fortunately, as big data pilots launch and business cases solidify, there are a number of changes occurring in network architectures that can enhance and help integrate big data processing and insights. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, the underlying network foundation of big data projects should be considered in a new light. Network architectures can either enhance or inhibit the ability to easily launch, grow, and integrate big data initiatives from pilot projects to large-scale production.

Consider Apache Hadoop, the de facto big data platform. To manage and process data in a server cluster that must scale to thousands of servers, the performance and manageability of the network is critical. In fact, most data center infrastructures, especially those based on multitier networking, face operational and performance challenges in storing and analyzing big data in a mid-size or large size Hadoop cluster, which can interconnect thousands of servers. Organizations need a network solution that overcomes these issues and enables them to gain the considerable business intelligence benefits of big data analytics. Organizations can achieve simplified management, improved performance, and optimized data reliability by using the Juniper Networks QFabric family of products.

As a single logical switch, the QFabric System supports over 6,000 10GbE ports and provides a consistent, extremely low latency of less than 5 microseconds, even under load [3]. This paper provides an overview of big data use cases, discusses the network's role in Hadoop clusters, and describes how QFabric technology supports a high-performance and scalable big data infrastructure.

[1] For further information on big data use cases, see the McKinsey Global Institute article, "Big data: The next frontier for innovation, competition, and productivity," at: www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
[2] For further information concerning the IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, visit www.idc.com/getdoc.jsp?containerId=233485.
[3] For further information concerning the QFabric System, visit www.juniper.net/us/en/products-services/switching/qfx-series/qfabric-system/#literature.


Introduction
Big data refers to a volume of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can deliver new business insights, open new markets, and create competitive advantage. Compared to structured data, which historically has been stored in data warehouses and analyzed with SQL analysis tools, big data has three major attributes: variety, volume, and velocity.

First, big data extends beyond structured data and includes semi-structured or unstructured data of all varieties. This can include event data from active machine log files, text from social networking, data streams from financial data services, click streams from customers accessing web applications, activity data from machine-to-machine interchange, and even audio and video files.

Second, big data comes in large data sets, because its predictive and analytic power relies on having sufficient data points, covering everything from traffic patterns and purchasing behaviors to real-time inventory management. Organizations are awash with data, easily amassing hundreds of terabytes or even petabytes of information.

Third, organizations can maximize their data's business value by analyzing streaming data as it arrives. The rise of the security event manager (SEM) industry is at the heart of gathering, analyzing, and proactively responding to event data from active machine log files in real time, providing unique and powerful business intelligence.

Driven by a combination of technology innovation, ubiquitous social networking, and pervasive mobile devices, the rise of big data has created an inflection point for organizations as they look for innovative ways to do business effectively and economically.

Big Data Use Cases in Healthcare


According to a recent study [4], using big data to address spiraling healthcare costs and to make intelligent decisions by analyzing information across departments and providers will greatly improve business efficiency and, in the end, improve patient outcomes. Besides the clear economic benefit, big data is compelling in healthcare because of the sheer volume of raw data generated and the ability to rapidly analyze that data to enhance knowledge and improve the quality of patient care.

Improved Diagnosis and Treatment. Leveraging big data can greatly aid in diagnosing and treating illnesses, improving patient outcomes, and in many cases saving lives. Considering that up to 25 percent of all diagnoses are not supported by any data analytics, and that 100,000 Americans die as a direct result of medical errors every year, improving patient outcomes is a key focus for the industry.

Disease Trend Analysis. The global nature of society has had a profound impact on the spread of disease. Regional diseases can quickly develop into global pandemics, which is why the industry is leveraging big data to offer much needed disease trend analysis. The world's governments had genuine concern over the H1N1 influenza pandemic of 2009, and while the confirmed deaths (approximately 15,000) were fewer than feared, other pandemics such as West Nile virus, mad cow disease, and tuberculosis are not only a present-day concern but, unfortunately, a serious concern for the future as well. Big data can be used to predict and reduce future pandemics.

Predictive Analysis. How many lives would have been saved if the industry had been able to see the correlation between Vioxx and heart attacks more quickly? In congressional testimony regarding Vioxx, Dr. David Graham stated that, conservatively, 100,000 people have had heart attacks as a result of using Vioxx, leading to between 30,000 and 40,000 deaths.

The way big data can positively impact healthcare is clear. In fact, the McKinsey Global Institute report (refer to footnote 1) estimated that the healthcare sector could create more than $300 billion in value every year by implementing big data analytics. A significant amount of data is available and ready for analysis, so the time is right to make sure that big data delivers this predicted value.

[4] For further information on the article, "How Big Data Can Mend Our Broken Healthcare System," Ewing Marion Kauffman Foundation (April 2012), visit www.smartplanet.com/blog/business-brains/how-big-data-can-mend-our-broken-healthcare-system-study/23728.


Apache Hadoop: The Big Player


Apache Hadoop is a widely deployed platform for managing and processing big data. Many companies, such as IBM, MapR Technologies, and Cloudera, provide commercially licensed Hadoop software stacks with improved performance and/or value-added features. Hadoop includes the following two key functions:

• Hadoop Distributed File System (HDFS) is a large-scale distributed file system that consists of hundreds or thousands of servers, each storing part of the file system's data. HDFS not only provides high aggregate data bandwidth to handle large data sets, but it also addresses the frequent hardware failures of a large size deployment.
• MapReduce is a distributed analytics program designed for big data. It maps and processes input records in each server, collects and shuffles intermediate results through the network, and reduces them to a final result. The performance of each task (the execution time between the start and completion of a MapReduce task) can vary from minutes to hours, depending on data complexity and the size of the Hadoop cluster.

In a typical Hadoop cluster, as shown in Figure 1, each server may take any combination of the following four roles:

• The Client collects data from internal data sources within the data center or from external data sources on cloud services, submits MapReduce tasks to the Job Tracker, and delivers the analytics results to business applications.
• The Job Tracker schedules the MapReduce tasks and ultimately produces analytics results.
• The Name Node manages HDFS metadata by keeping track of the location of each file's data blocks throughout the cluster. The default data block size is 64 MB, but it can be set to 512 MB or 1 GB depending on network performance.
• The Data Node holds HDFS data blocks on its local drives and also executes tasks assigned by the Job Tracker.

The Hadoop cluster acts as a unified storage and analytics pool with hundreds or thousands of Data Nodes.
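The map, shuffle, and reduce phases described above can be illustrated with a minimal single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are illustrative assumptions. In a real cluster the map and reduce steps run on Data Nodes, and the shuffle step is the part that crosses the network.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(mapped_pairs):
    """Shuffle: group intermediate values by key.

    In a real cluster this grouping moves data between servers.
    """
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into one final result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Run map, shuffle, and reduce in sequence over all input records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical log lines standing in for machine event data.
    logs = ["error disk failed", "warning disk slow", "error network down"]
    print(run_job(logs))
```

Because every intermediate pair passes through the shuffle, the volume of intermediate data is what makes server-to-server bandwidth matter for job completion time.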
Hadoop has been tested and deployed by some of the world's largest data centers. For example, Yahoo! Inc. launched the Yahoo! Search Webmap in 2008, a Hadoop production application that ran on a Linux cluster of more than 10,000 cores with over 5 petabytes of raw disk space. The analytics result was used in every Yahoo! Web search query [5].
[Figure 1 diagram: Clients collect financial market data, social network data, video, and web log data (data collection); the Hadoop cluster of Name Node, Job Tracker, and Data Nodes provides storage and analytics; results are delivered to finance and multimedia applications. Legend: Hadoop network traffic; big data process flow.]

Figure 1: Big data and Hadoop cluster

[5] For further information on "Yahoo! Launches World's Largest Hadoop Production Application," visit http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahooworlds-largest-production-hadoop/.


The Network's Role in Hadoop


Hadoop can run on most data center networks. However, legacy network architectures are not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale demanded by big data. In fact, in a legacy multitier network design, the three main limiting factors that adversely impact the performance and operation of a mid-size to large size Hadoop cluster are network complexity, degraded bandwidth, and inconsistent latency. Fortunately, organizations can achieve better results and reduce risks by running Hadoop on a network designed for a world where server-to-server and server-to-storage traffic outweighs client-to-server traffic.

[Figure 2 diagram: a multitier network with core, distribution, and top-of-rack tiers (TOR 1 through TOR 10) serving Racks 1 to 10 of Clients, Name Nodes, Job Trackers, and Data Nodes. Block B1 is stored on a Data Node in Rack 1 and replicated to two Data Nodes in Rack 8. Legend: network hops; Hadoop data replication.]

Figure 2: The network's role in Hadoop

Data Reliability
A Hadoop cluster is built on a fleet of low-cost, high-capacity disk drives with a network-based data replication and fault tolerance system, simply because today's RAID technology does not provide the required cost-effective storage scalability. When configured for three replicas, HDFS places one replica on a Data Node in one rack and the other two replicas on two different Data Nodes in a single different rack, thereby preventing data loss in the event of a rack failure (see the three B1 replicas in Figure 2). This suboptimal data placement policy is a concession to the significantly reduced bandwidth of inter-rack communication compared with intra-rack server-to-server bandwidth.

HDFS relies on the network to maintain data reliability in the event of failures, which can be a disk drive, server, or network device failure, or a combination of these. When a disk drive fails, re-replicating its data to restore reliability can take hours. For example, it takes approximately 66 minutes to transfer 3 TB of data on a 1GbE network, without considering network latency, and 4 hours and 24 minutes when a server with four 3 TB disk drives fails. When a subset of Data Nodes loses connectivity with the Name Node, HDFS may become unreliable because the network does not have sufficient bandwidth to re-replicate large amounts of data.
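The three-replica placement described above (one copy in the writer's rack, two copies on different Data Nodes of one other rack) can be sketched as follows. This is a simplified model, not the HDFS implementation or its API: the function names, the `cluster` dictionary layout, and the rack and node labels are all hypothetical.

```python
import random


def place_replicas_default(cluster, writer_rack, rng=random):
    """Return three (rack, node) targets for one block.

    One replica goes to the writer's rack and two go to different
    nodes of a single other rack. `cluster` maps rack name to a list
    of node names (a hypothetical layout for illustration).
    """
    remote_racks = [
        rack for rack, nodes in cluster.items()
        if rack != writer_rack and len(nodes) >= 2
    ]
    if not cluster.get(writer_rack) or not remote_racks:
        raise ValueError("need a local node and a remote rack with two nodes")
    local_node = rng.choice(cluster[writer_rack])
    remote_rack = rng.choice(remote_racks)
    remote_nodes = rng.sample(cluster[remote_rack], 2)
    return [(writer_rack, local_node)] + [(remote_rack, n) for n in remote_nodes]


def survives_rack_failure(placement, failed_rack):
    """True if at least one replica lives outside the failed rack."""
    return any(rack != failed_rack for rack, _ in placement)


if __name__ == "__main__":
    cluster = {
        "rack1": ["dn1", "dn2", "dn3"],
        "rack8": ["dn4", "dn5", "dn6"],
        "rack9": ["dn7", "dn8"],
    }
    placement = place_replicas_default(cluster, "rack1")
    print(placement)
    print(all(survives_rack_failure(placement, rack) for rack in cluster))
```

Only one of the two remote copies has to cross the rack boundary from the writer's rack; the third copy can be forwarded inside the remote rack, which is how this policy limits inter-rack traffic while still surviving the loss of any single rack.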


Performance
In a multitier network, network performance measures such as latency and bandwidth depend on design considerations and devices. As a result, inter-rack server-to-server network performance varies widely from one network to another of similar size. Although the latency of a typical 10GbE top-of-rack (TOR) switch is around 1 microsecond, the latency of intermediate switches at the distribution and core tiers is significantly higher than that of a TOR switch. Also, a multitier network does not provide efficient inter-rack server-to-server bandwidth for a Hadoop cluster because of the compounded oversubscription introduced by switches at the distribution and core tiers.

For example, when each Data Node is configured with a 20 Gbps connection, the intra-rack server-to-server network bandwidth is 20 Gbps. The average inter-rack server-to-server network bandwidth between Rack 1 and Rack 2 (Figure 2) is 8 Gbps when the TOR switch operates at an oversubscription of 2.5:1, which means that total server communication bandwidth is 2.5 times the inter-rack network bandwidth. When the 4:1 oversubscription of a distribution switch is taken into account, the compounded oversubscription becomes 10:1, a common deployment scenario in most three-tier networks. As a result, the average inter-rack server-to-server bandwidth for replicating block B1 between racks can be as low as 2 Gbps.
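The oversubscription arithmetic above can be checked with a short calculation. The function name `inter_rack_bandwidth_gbps` is a hypothetical helper for illustration; the inputs (20 Gbps per server, 2.5:1 at the TOR tier, 4:1 at the distribution tier) are the figures given in the text.

```python
def inter_rack_bandwidth_gbps(server_link_gbps, oversubscription_ratios):
    """Return (average per-server inter-rack bandwidth, compounded ratio).

    Each tier's oversubscription ratio multiplies into the compounded
    ratio, and the server link speed is divided by that result.
    """
    compounded = 1.0
    for ratio in oversubscription_ratios:
        compounded *= ratio
    return server_link_gbps / compounded, compounded


if __name__ == "__main__":
    # TOR tier only: 20 Gbps at 2.5:1.
    bandwidth, ratio = inter_rack_bandwidth_gbps(20, [2.5])
    print(f"{ratio}:1 oversubscription gives {bandwidth} Gbps")

    # TOR tier (2.5:1) compounded with the distribution tier (4:1).
    bandwidth, ratio = inter_rack_bandwidth_gbps(20, [2.5, 4])
    print(f"{ratio}:1 oversubscription gives {bandwidth} Gbps")
```

The two calls reproduce the 8 Gbps and 2 Gbps figures: each additional oversubscribed tier divides the bandwidth available to a server for traffic leaving its rack.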

Network Operation
A multitier network significantly increases network management complexity. In a multitier network such as the one shown in Figure 2, the TOR switch interconnects servers within the rack. To meet the demand of an increasing number of servers, many intermediate switches are required to build a tree topology of hierarchical switches. Since each switch represents a management endpoint, network redesign is often required for performance assurance, high availability, and capacity planning as the size of the Hadoop cluster grows. Network management also becomes increasingly complex because of the large number of endpoints and the increased number of devices involved in server-to-server connectivity and performance troubleshooting. For example, network provisioning or troubleshooting between two B1 replicas may involve up to five devices, as shown in Figure 2.

In addition to storing and processing big data, the Hadoop cluster needs to collect data. Most unstructured data, such as event data from active machine log files, can be staged in file systems on an array of servers. However, structured data, such as purchasing transactions and real-time inventory tracking, commonly resides on a Fibre Channel (FC)-based disk array. To rapidly collect the combination of structured and unstructured data in real time, the client nodes (shown in Figure 2) need a converged network that supports direct, optimal access to data through the storage area network (SAN). However, a multitier network may not support such a rich set of storage protocols for rapid data collection.

In an ideal network, the replicated copies of data can be placed into different, unique racks, and the large-scale network can be simplified and managed as simply as a single physical switch.


QFabric System Support for Big Data


The Juniper Networks QFX3500 switch can be deployed for a small size Hadoop cluster as a high-performance, ultralow latency switch. The QFX3500 switch is a versatile, compact, high-density 10 Gbps platform in a 1 U form factor that runs the same Juniper Networks Junos operating system software as other Juniper switches, routers, and security platforms. The QFX3500 delivers feature-rich L2 and L3 connectivity to networked devices such as rack and blade servers, storage systems, and other switches in highly demanding, high-performance data center environments. The QFX3500 offers standards-based Fibre Channel over Ethernet (FCoE) ports to directly access data stored in the Fibre Channel-based SAN. When deployed with other components of the Juniper Networks QFabric System, the QFX3500 delivers a fabric-ready QFabric Node edge solution.

[Figure 3 diagram: a chassis switch (routing engine, fabric, I/O modules) beside the distributed QFabric switch, in which the QFabric Director acts as the route engine, the QFabric Interconnect as the fabric, and the QFabric Nodes as the I/O modules.]

Figure 3: Comparison of a chassis switch and a QFabric System

The QFabric System has the unique ability to support an entire data center (up to 6,144 10GbE ports) with a single converged Ethernet switch. As shown in Figure 3, similar to a standalone modular switch chassis that has three main functional components (line cards, switch fabric, and routing engines), the QFabric System is composed of three separate components in a distributed architecture [6]:

• QFabric Node: Line card component of a QFabric System, which acts as the entry and exit point into the fabric. Up to 128 QFabric Nodes can be interconnected in a single QFabric System.
• QFabric Interconnect: High-speed transport device for interconnecting QFabric Nodes. A QFabric System supports up to 4 QFabric Interconnects.
• QFabric Director: Device controller and services manager that delivers a common window for managing all components as a single device.

The QFabric System is environmentally conscious, allowing enterprises to optimize every facet of the data center network while consuming less power, requiring less cooling, and producing a fraction of the carbon footprint of multitier data center networks. To achieve performance and economies of scale in Hadoop, QFabric technology, with its simplified operation and consistent low latency, is an ideal solution for building a big data network infrastructure that meets different organizations' needs.
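The scale limits quoted above can be captured in a small planning sketch. The limits of 128 QFabric Nodes, 4 QFabric Interconnects, and 6,144 ports come from the text; the figure of 48 ports per node is derived here by dividing 6,144 by 128 and is an assumption of this sketch, as are the `FabricPlan` class and its method names.

```python
from dataclasses import dataclass

MAX_NODES = 128                        # stated in the text
MAX_INTERCONNECTS = 4                  # stated in the text
MAX_PORTS = 6144                       # stated in the text
PORTS_PER_NODE = MAX_PORTS // MAX_NODES  # derived assumption: 48


@dataclass(frozen=True)
class FabricPlan:
    """Hypothetical description of one QFabric System build-out."""
    nodes: int
    interconnects: int

    def validate(self):
        """Raise ValueError if the plan exceeds the stated limits."""
        if not 1 <= self.nodes <= MAX_NODES:
            raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
        if not 1 <= self.interconnects <= MAX_INTERCONNECTS:
            raise ValueError(
                f"interconnects must be between 1 and {MAX_INTERCONNECTS}"
            )

    def access_ports(self):
        """Total 10GbE access ports offered by the plan's nodes."""
        self.validate()
        return self.nodes * PORTS_PER_NODE


if __name__ == "__main__":
    print(FabricPlan(nodes=10, interconnects=2).access_ports())
    print(FabricPlan(nodes=128, interconnects=4).access_ports())
```

A fully populated plan of 128 nodes reaches the 6,144-port ceiling, which is the sense in which the whole fabric behaves like one very large chassis switch.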

[6] For further information, refer to The QFabric Architecture: Implementing a Flat Data Center Network, at www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf.


Big Data Infrastructure Strategy


As big data pilots launch and business cases solidify, there are a number of changes occurring in network architectures that can enhance and help integrate big data processing and insights. Just as big data applications represent new ways of collecting, making sense of, and taking action on business data, the underlying network foundation of big data projects should now be considered in a new light. Network architectures can either enhance or inhibit the ability to easily begin, grow, and integrate big data initiatives from pilot to large-scale production. Table 1 lists three Hadoop cluster deployment sizes and the network capabilities that represent the growing demands on big data infrastructure.

In a typical small size Hadoop cluster, the QFX3500 provides the following network capabilities for 20 servers in a single rack:

• 40 non-blocking 10GbE ports to interconnect 20 servers, each with dual 10 Gbps connections and NIC teaming, for any-to-any server bandwidth of up to 20 Gbps
• One network hop between any two servers and extremely low any-to-any latency of less than 1 microsecond
• Feature-rich L2 and L3 connectivity
• One network management platform powered by Junos OS

Table 1: Hadoop Deployments and Network Capability Powered by QFabric System

Hadoop Deployment | Small Size Hadoop | Mid-size Hadoop | Large Size Hadoop
HDFS size [7] | 240 terabytes | 2.4 petabytes | 24 petabytes
MapReduce processing capability | 240 cores | 2,400 cores | 24,000 cores

Equipment | Small Size Hadoop | Mid-size Hadoop | Large Size Hadoop
Servers [8] | 20 | 200 | 2,000
Network devices | 1 standalone QFX3500 | Large QFabric System: 10 QFabric Nodes, 2 QFabric Interconnects, 2 QFabric Directors | Large QFabric System: 100 QFabric Nodes, 4 QFabric Interconnects, 2 QFabric Directors
Deployment footprint | Up to 1 rack | 10 racks (average) | 100 racks and more

Network Summary | Small Size Hadoop | Mid-size Hadoop | Large Size Hadoop
Hops between servers | 1 | 1 | 1
Average inter-rack server communication bandwidth | NA | 8 Gbps | 8 Gbps
Intra-rack server-to-server communication bandwidth | 20 Gbps | 20 Gbps | 20 Gbps
Maximum server-to-server latency | 1 microsecond [9] | 5 microseconds | 5 microseconds
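The capacity rows of Table 1 follow directly from the per-server configuration in footnote 8 (two six-core CPUs and four 3 TB drives per server) and the 20-servers-per-rack layout used in this paper. The sketch below reproduces them; the `ServerSpec` class and `size_cluster` function are hypothetical names, and the storage figure is raw disk capacity before replication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Per-server hardware, taken from the paper's server footnote."""
    cpus: int = 2
    cores_per_cpu: int = 6
    drives: int = 4
    drive_tb: int = 3


def size_cluster(servers, spec=ServerSpec(), servers_per_rack=20):
    """Return raw HDFS capacity, core count, and rack count for a cluster."""
    racks = -(-servers // servers_per_rack)  # ceiling division
    return {
        "hdfs_raw_tb": servers * spec.drives * spec.drive_tb,
        "cores": servers * spec.cpus * spec.cores_per_cpu,
        "racks": racks,
    }


if __name__ == "__main__":
    for name, servers in [("small", 20), ("mid-size", 200), ("large", 2000)]:
        print(name, size_cluster(servers))
```

With one QFabric Node per rack, the rack count also gives the number of QFabric Nodes listed for the mid-size and large deployments.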

Operational Simplicity at Scale


As shown in Figure 4, one QFabric System, which consists of 10 QFabric Nodes interconnected by 2 QFabric Interconnects, can support a typical mid-size Hadoop cluster with 200 servers, each with two 10 Gbps connections to QFabric Nodes. The same QFabric System can easily scale up to a large size Hadoop cluster of 2,000 servers, which provides 10 times the storage and processing capacity of a mid-size Hadoop cluster, by just adding two additional QFabric Interconnects and 90 QFabric Nodes.

[7] 1 petabyte = 1,000 terabytes.
[8] Each server is a 2 RU server with two six-core CPUs, 288 GB RAM, four 3 TB hard drives, and a dual 10 Gbps NIC.
[9] 1 microsecond = 1/1000 millisecond.


[Figure 4 diagram: one QFabric System in which QFabric Interconnects link QFabric Nodes 1 through 10, one per rack (Racks 1 to 10), hosting Clients, Name Nodes, Job Trackers, and Data Nodes. Block B1 is replicated on Data Nodes in Racks 1, 8, and 9. Legend: network hops; Hadoop data replication.]

Figure 4: An optimized mid-size Hadoop cluster with QFabric System

With a QFabric System, the large size Hadoop cluster provides the same performance as a mid-size Hadoop cluster does:

• Intra-rack server communication bandwidth of 20 Gbps (average inter-rack server communication bandwidth is 8 Gbps)
• Feature-rich L2 and L3 connectivity to interconnect servers and other network devices such as routers and firewalls
• One network hop between any two servers and extremely low any-to-any latency of less than 5 microseconds
• Standards-based FCoE ports to directly access data stored in the FC-based SAN
• One network management platform powered by Junos OS

As all QFabric Nodes are part of one logical switch, network operations such as provisioning and troubleshooting are greatly simplified. For example, network operation between two B1 replicas involves only one logical device, as shown in Figure 4. In addition to this operational simplicity, the QFabric System also offers benefits in power, cooling, space, and CapEx in mid-size and large size Hadoop deployments.

The QFabric System allows a Hadoop cluster to collect data from an FC-based SAN through a converged network. When a QFabric Node is configured as an FCoE transit switch or FCoE gateway, the client nodes can use an Ethernet-based Converged Network Adapter (CNA) to rapidly collect data in the FC-based SAN. This saves the cost of investing in an FC host bus adapter (HBA) for each client node and greatly simplifies network management.

Data Reliability at Scale


Supported by the QFabric System, HDFS can introduce a new data placement policy that places three copies of data into three unique racks without affecting write performance, because inter-rack bandwidth is significantly improved. For example, the average inter-rack network bandwidth between two B1 replicas is improved to 8 Gbps. The new data placement policy will also improve read performance, since data copies can be read in parallel from three different racks. With the QFabric System, organizations can also consider increasing the block size to 256 MB or 512 MB for better performance.

High-performance networks can rapidly restore data reliability and reduce the risk of failure. When a 3 TB disk drive fails, data replication takes approximately 7 minutes on a 10GbE network, without considering network latency. Re-replication takes 27 minutes when a server with four 3 TB disk drives fails. In addition, the lossless 10 Gbps network architecture and high availability features provided by QFabric technology reduce the risk of network failure.
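The rack-unique placement policy described above can be sketched in the same simplified style. This is an illustration of the idea only, not an HDFS feature or API: the function names, the `cluster` dictionary layout, and the rack and node labels are hypothetical.

```python
import random


def place_replicas_unique_racks(cluster, writer_rack, copies=3, rng=random):
    """Return `copies` (rack, node) targets, each in a different rack.

    The first copy stays in the writer's rack; the others go to
    randomly chosen distinct racks. `cluster` maps rack name to a
    list of node names (a hypothetical layout for illustration).
    """
    other_racks = [
        rack for rack, nodes in cluster.items()
        if rack != writer_rack and nodes
    ]
    if not cluster.get(writer_rack) or len(other_racks) < copies - 1:
        raise ValueError("not enough racks for one copy per rack")
    racks = [writer_rack] + rng.sample(other_racks, copies - 1)
    return [(rack, rng.choice(cluster[rack])) for rack in racks]


def racks_holding_copies(placement):
    """Number of distinct racks that hold a copy of the block."""
    return len({rack for rack, _ in placement})


if __name__ == "__main__":
    cluster = {
        "rack1": ["dn1", "dn2"],
        "rack8": ["dn3", "dn4"],
        "rack9": ["dn5", "dn6"],
        "rack10": ["dn7", "dn8"],
    }
    placement = place_replicas_unique_racks(cluster, "rack1")
    print(placement)
    print(racks_holding_copies(placement))
```

With every copy in its own rack, a reader can fetch from three racks in parallel and the block survives the loss of any two racks, at the cost of two inter-rack transfers per write instead of one.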


Performance at Scale
The linear performance scalability of Hadoop allows organizations to predict and plan their infrastructure: by doubling the size of a cluster, organizations can process twice the amount of data in a given time or halve the execution time for a given amount of data. QFabric architecture supports such performance scalability with consistent, extremely low any-to-any latency and efficient inter-rack bandwidth. By eliminating the intermediate switches, the QFabric System operates at an oversubscription of 2.5:1, which means 400 Gbps of bandwidth within the rack and 160 Gbps of inter-rack bandwidth in a mid-size or large size Hadoop cluster that consists of 20 servers per rack, with each server employing two 10 Gbps NICs (shown in Table 1). The intra-rack server communication bandwidth is 20 Gbps, while the average inter-rack server-to-server communication bandwidth is 8 Gbps.
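The rack-level figures above follow from the server count, the NIC configuration, and the oversubscription ratio. The sketch below works through them; `rack_bandwidth` is a hypothetical helper name, and all inputs are the values given in the text.

```python
def rack_bandwidth(servers_per_rack, nics_per_server, nic_gbps, oversubscription):
    """Return per-server and aggregate rack bandwidth figures in Gbps."""
    per_server = nics_per_server * nic_gbps
    intra_rack_total = servers_per_rack * per_server
    inter_rack_total = intra_rack_total / oversubscription
    return {
        "per_server_gbps": per_server,
        "intra_rack_total_gbps": intra_rack_total,
        "inter_rack_total_gbps": inter_rack_total,
        "inter_rack_per_server_gbps": inter_rack_total / servers_per_rack,
    }


if __name__ == "__main__":
    # 20 servers per rack, two 10 Gbps NICs each, 2.5:1 oversubscription.
    for name, value in rack_bandwidth(20, 2, 10, 2.5).items():
        print(name, value)
```

Because the fabric adds no further oversubscribed tier, these per-rack figures stay the same whether the cluster has 10 racks or 100.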

Conclusion
Hadoop can run on most data center networks. However, legacy network architectures are not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale demanded by big data. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, using the Juniper Networks QFabric system as the underlying network foundation of big data projects should be considered in a new light. Network architectures can either enhance or inhibit the ability to easily initiate, grow, and integrate big data initiatives from pilot to large-scale production. Thus, organizations should further consider the following key questions: If a pilot is successful, how big will the cluster become? What is the easiest way to add compute capacity without adding complexity and cost to running a cluster at scale? Over the lifetime of the cluster, which Hadoop or other applications will be running on the cluster? How do we extend data outputs or inputs to legacy or other applications?

With the QFabric system, organizations can easily build small, mid-size, and large Hadoop clusters. Compared to the multitiered data center network approach, the QFabric system helps businesses simplify network operation, improve Hadoop performance, and optimize Hadoop data reliability, as shown in Table 2.

Table 2: Hadoop Benefits Comparing Multitier Network Approach and QFabric System

Hadoop Features | Multitiered Network | QFabric System
Operation at scale | Complexity grows as the size of the cluster grows | Simplified
Reliability at scale | Current data placement policy is constrained by limited inter-rack bandwidth | Optimized
Performance at scale | Suboptimal performance due to ad hoc design | Optimized

As big data trends evolve and data creation continues to grow, big data analytics demands an elastic data center infrastructure that can effectively collect and process big data and deliver actionable information in real time. The Juniper Networks QFabric system is a data center solution that offers a high-performance, scalable big data infrastructure with simplified management. With QFabric technology, CIOs and CTOs no longer need to worry about disruptive transformations in their big data initiatives.

About Juniper Networks


Juniper Networks is in the business of network innovation. From devices to data centers, from consumers to cloud providers, Juniper Networks delivers the software, silicon and systems that transform the experience and economics of networking. The company serves customers and partners worldwide. Additional information can be found at www.juniper.net.

Corporate and Sales Headquarters
Juniper Networks, Inc.
1194 North Mathilda Avenue
Sunnyvale, CA 94089 USA
Phone: 888.JUNIPER (888.586.4737) or 408.745.2000
Fax: 408.745.2100
www.juniper.net

APAC Headquarters
Juniper Networks (Hong Kong)
26/F, Cityplaza One
1111 King's Road
Taikoo Shing, Hong Kong
Phone: 852.2332.3636
Fax: 852.2574.7803

EMEA Headquarters
Juniper Networks Ireland
Airside Business Park
Swords, County Dublin, Ireland
Phone: 35.31.8903.600
EMEA Sales: 00800.4586.4737
Fax: 35.31.8903.601

To purchase Juniper Networks solutions, please contact your Juniper Networks representative at 1-866-298-6428 or authorized reseller.

Copyright 2012 Juniper Networks, Inc. All rights reserved. Juniper Networks, the Juniper Networks logo, Junos, NetScreen, and ScreenOS are registered trademarks of Juniper Networks, Inc. in the United States and other countries. All other trademarks, service marks, registered marks, or registered service marks are the property of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.

2000483-001-EN

June 2012

Printed on recycled paper
