QFabric family of products. As a single logical switch, the QFabric System supports over 6,000 10GbE ports and provides consistent, extremely low latency of less than 5 microseconds, even under load (see footnote 3). This paper provides an overview of big data use cases, discusses the network's role in Hadoop clusters, and describes how QFabric technology supports a high-performance and scalable big data infrastructure.
1. For further information on big data use cases, see the McKinsey Global Institute article, "Big data: The next frontier for innovation, competition, and productivity," at www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
2. For further information concerning the IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, visit www.idc.com/getdoc.jsp?containerId=233485.
3. For further information concerning the QFabric System, visit www.juniper.net/us/en/products-services/switching/qfx-series/qfabric-system/#literature.
Introduction
Big data refers to a volume of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can deliver new business insights, open new markets, and create competitive advantage. Compared to structured data, which historically has been stored in data warehouses and analyzed with SQL-based tools, big data has three major attributes: variety, volume, and velocity.
First, big data extends beyond structured data to include semi-structured and unstructured data of all varieties: event data from active machine log files, text from social networks, streams from financial data services, click streams from customers accessing web applications, activity data from machine-to-machine exchanges, and even audio and video files.
Second, big data comes in large data sets, because its predictive and analytic power relies on a sufficient number of data points, covering everything from traffic patterns to purchasing behavior to real-time inventory management. Organizations are awash with data, easily amassing hundreds of terabytes or even petabytes of information.
Third, organizations can maximize the business value of their data by analyzing it as it streams in. The rise of the security event manager (SEM) industry is built on gathering, analyzing, and proactively responding in real time to event data from active machine log files, providing unique and powerful business intelligence.
Driven by a combination of technology innovation, ubiquitous social networking, and pervasive mobile devices, the rise
of big data has created an inflection point for organizations as they look for innovative ways to do business effectively
and economically.
Big Data Use Cases in Healthcare
According to a recent study (see footnote 4), using big data to address spiraling healthcare costs and to make intelligent decisions by analyzing information across departments and providers will greatly improve business efficiency and, in the end, improve patient outcomes. Beyond the clear economic benefit, big data is compelling in healthcare because of the sheer volume of raw data generated and the ability to rapidly analyze that data to enhance knowledge and improve the quality of patient care.
Improved Diagnosis and Treatment. Leveraging big data can greatly aid in diagnosing and treating illnesses, improving
patient outcomes, and in many cases saving lives. Considering that up to 25 percent of all diagnoses are not supported
by any data analytics, and that 100,000 Americans die as a direct result of medical errors every year, improving patient
outcomes is a key focus for the industry.
Disease Trend Analysis. The global nature of society has had a profound impact on the spread of disease. Regional diseases can quickly develop into global pandemics, which is why the industry is leveraging big data to offer much needed disease trend analysis. The world's governments had genuine concerns over the avian influenza pandemic of 2009, and while the actual confirmed deaths (approximately 15,000) were fewer than feared, other diseases such as West Nile virus, mad cow disease, and tuberculosis are not only a present-day concern but, unfortunately, a serious concern for the future as well. Big data can be used to predict and reduce future pandemics.
Predictive Analysis. How many lives would have been saved if the industry had the ability to more quickly see the
correlation between Vioxx and heart attacks? In congressional testimony regarding Vioxx, Dr. David Graham stated
that conservatively 100,000 people have had heart attacks as a result of using Vioxx, leading to between 30,000 and
40,000 deaths.
The way big data can positively impact healthcare is clear. In fact, the McKinsey Global Institute report (refer to
footnote 1) opined that the healthcare sector could create more than $300 billion in value every year by implementing
big data analytics. A significant amount of data is available and ready for analysis, so the time is right to make sure
that big data delivers this predicted value.
4. For further information, see "How Big Data Can Mend Our Broken Healthcare System," Ewing Marion Kauffman Foundation (April 2012), at www.smartplanet.com/blog/business-brains/how-big-data-can-mend-our-broken-healthcare-system-study/23728.
Apache Hadoop: The Big Player
Apache Hadoop is a widely deployed platform for managing and processing big data. Many companies such as IBM,
MapR Technologies, and Cloudera provide commercially licensed Hadoop software stacks with improved performance
and/or value-added features. Hadoop includes the following two key functions:
Hadoop Distributed File System (HDFS) is a large-scale distributed file system that consists of hundreds or thousands of servers, each storing part of the file system's data. HDFS not only provides high aggregate data bandwidth to handle large data sets, but it also tolerates the frequent hardware failures inherent in a large-scale deployment.
MapReduce is a distributed analytics framework designed for big data. It maps and processes input records on each server, collects and shuffles intermediate results through the network, and reduces them to a final result. The duration of each task (the execution time between the start and completion of a MapReduce job) can vary from minutes to hours depending on data complexity and the size of the Hadoop cluster.
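To make the three phases concrete, here is a minimal, single-process sketch of the MapReduce pattern in Python: a toy word count, not Hadoop's actual Java API. The function names are illustrative; in a real cluster, the map tasks run on many Data Nodes and the shuffle phase moves intermediate results across the network.

```python
from collections import defaultdict

# Toy word count illustrating the MapReduce pattern in a single process.

def map_phase(records):
    """Emit (word, 1) pairs for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Group intermediate values by key (done over the network in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce each key's values to a final result."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data big cluster", "data node data"]
print(reduce_phase(shuffle_phase(map_phase(records))))
# {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

The shuffle step is the network-intensive phase: every intermediate key may need to travel from the server that mapped it to the server that reduces it, which is why cluster-wide bandwidth matters so much for job completion time.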
In a typical Hadoop cluster, as shown in Figure 1, each server may take any combination of the following four roles:
The Client collects data from internal data sources within the data center or from external cloud-based data sources, submits MapReduce tasks to the Job Tracker, and delivers the analytics results to business applications.
The Job Tracker schedules the MapReduce tasks and ultimately produces analytics results.
The Name Node manages HDFS metadata, keeping track of the location of each file's data blocks throughout the cluster (a toy model of this block map follows this list). The default data block size is 64 MB, but it can be set to 512 MB or 1 GB depending on network performance.
The Data Node holds HDFS data blocks on its local drives and also executes tasks assigned by the Job Tracker. With hundreds or thousands of Data Nodes, the Hadoop cluster acts as a unified storage and analytics pool.
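As a rough illustration of the metadata the Name Node maintains, here is a minimal Python sketch, our own simplification rather than Hadoop's implementation; the node names, file size, and block-map structure are hypothetical.

```python
# Toy model of Name Node metadata: file -> blocks -> Data Node locations.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default block size

def split_into_blocks(file_name, file_size, replicas=3):
    """Return the block map a Name Node would track for one file."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    block_map = {}
    for i in range(num_blocks):
        block_id = f"{file_name}_blk_{i}"
        # Per block, record the Data Nodes holding each replica
        # (the real placement policy is discussed later in this paper).
        block_map[block_id] = [f"datanode-{(i + r) % 8}" for r in range(replicas)]
    return block_map

# A 200 MB file needs ceil(200/64) = 4 blocks, each with 3 replica locations.
for block, nodes in split_into_blocks("weblog", 200 * 1024 * 1024).items():
    print(block, "->", nodes)
```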
Hadoop has been tested and deployed by some of the world's largest data centers. For example, Yahoo! Inc. launched the Yahoo! Search Webmap in 2008, a Hadoop production application that ran on a Linux cluster of more than 10,000 cores with over 5 petabytes of raw disk space. The analytics results were used in every Yahoo! Web search query (see footnote 5).
Figure 1: Big data and Hadoop cluster
[Figure 1 depicts web, finance, multimedia, video log, and application data sources, including financial market data and social network data, flowing into a Hadoop cluster of Clients, a Name Node, a Job Tracker, and Data Nodes; the big data process flow runs from data collection through storage and analytics to delivery of results, with Hadoop network traffic among the nodes.]
5. For further information, see "Yahoo! Launches World's Largest Hadoop Production Application," at http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahoo-worlds-largest-production-hadoop/.
The Network's Role in Hadoop
Hadoop can run on most data center networks. However, legacy network architectures were not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale that big data demands. In a legacy multitier network design, the three main factors that adversely impact the performance and operation of a mid- to large-size Hadoop cluster are network complexity, degraded bandwidth, and inconsistent latency. Fortunately, organizations can achieve better results and reduce risk by running Hadoop on a network designed for a world where server-to-server and server-to-storage traffic outweighs client-to-server traffic.
Figure 2: The network's role in Hadoop
Data Reliability
A Hadoop cluster is built on a fleet of low-cost, high-capacity disk drives with network-based data replication and fault tolerance, simply because today's RAID technology cannot deliver the required data storage scalability cost-effectively. With a replication factor of three, HDFS places one replica on a Data Node in one rack and the other two copies on two different Data Nodes in other racks, thereby preventing data loss in the event of a rack failure (see B1, B1, and B1 in Figure 2). This placement policy deliberately accepts suboptimal placement because the bandwidth of inter-rack communication is significantly lower than intra-rack server-to-server bandwidth.
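A minimal Python sketch of a rack-aware placement policy of this kind follows; it is our simplified reading of the rule described above, not HDFS's exact algorithm, and the rack and node names are hypothetical. The first replica lands in the writer's rack, and the remaining two land on distinct Data Nodes outside that rack.

```python
import random

# Simplified rack-aware replica placement: first replica in the local
# rack, the remaining replicas on distinct nodes in other racks, so a
# whole-rack failure cannot destroy every copy of a block.

def place_replicas(racks, local_rack, replicas=3):
    """racks: dict mapping rack name -> list of Data Node names."""
    placement = [(local_rack, random.choice(racks[local_rack]))]
    remote_nodes = [(rack, node) for rack, nodes in racks.items()
                    if rack != local_rack for node in nodes]
    placement += random.sample(remote_nodes, replicas - 1)
    return placement

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas(racks, "rack1"))
# e.g. [('rack1', 'dn2'), ('rack3', 'dn5'), ('rack2', 'dn4')]
```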
HDFS relies on the network to maintain data reliability in the event of failures, whether of a disk drive, a server, or a network device, or a combination of these. When a disk drive fails, the re-replication needed to restore data reliability can take hours. For example, it takes approximately 66 minutes to transfer 3 TB of data on a 1GbE network, without considering network latency, and 4 hours and 24 minutes when a server with four 3 TB disk drives fails. If a subset of Data Nodes loses connectivity with the Name Node, HDFS may become unreliable because the network does not have sufficient bandwidth to re-replicate large amounts of data.
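The arithmetic behind these figures is straightforward; in the back-of-envelope sketch below, the effective aggregate bandwidth is a parameter. Note our assumption in interpreting the numbers: a single 1GbE link alone moves 3 TB in roughly 400 minutes, so the 66-minute figure above implies HDFS re-replicating in parallel from several source nodes at an effective aggregate rate of about 6 Gbps.

```python
def replication_minutes(data_tb, effective_gbps):
    """Minutes to move data_tb terabytes at an effective aggregate rate."""
    bits = data_tb * 1e12 * 8            # decimal TB -> bits
    return bits / (effective_gbps * 1e9) / 60

print(round(replication_minutes(3, 1)))           # one 1GbE link: ~400 minutes
print(round(replication_minutes(3, 6)))           # ~6 Gbps aggregate: ~67 minutes
print(round(replication_minutes(12, 6) / 60, 1))  # four 3 TB drives: ~4.4 hours
```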
[Figure 2 shows a multitier network: core and distribution tiers above top-of-rack (TOR) switches TOR 1 through TOR 10, each serving a rack of Hadoop servers (a mix of Clients, Name Nodes, Job Trackers, and Data Nodes). Hadoop data replication places the three B1 replicas in different racks, and numbered network hops 1 through 5 mark the devices a replication flow must traverse.]
Performance
In a multitier network, performance characteristics such as latency and bandwidth depend on design considerations and on the devices deployed. As a result, inter-rack server-to-server network performance varies widely from one network to another of similar size. Although the latency of a typical 10GbE top-of-rack (TOR) switch is around 1 microsecond, the latency of the intermediate switches at the distribution and core tiers is significantly higher.
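To illustrate why multitier latency is both higher and less consistent, the short sketch below sums per-hop latencies along the five-device inter-rack path from Figure 2. The 1 microsecond TOR figure comes from the text above; the distribution and core values are hypothetical placeholders chosen only to show the effect.

```python
# Per-device latencies in microseconds. The 1 us TOR figure is from the
# text; the distribution and core values are illustrative assumptions.
latency_us = {"tor": 1.0, "distribution": 5.0, "core": 8.0}

intra_rack_path = ["tor"]
inter_rack_path = ["tor", "distribution", "core", "distribution", "tor"]

def path_latency(path):
    return sum(latency_us[hop] for hop in path)

print(path_latency(intra_rack_path))   # 1.0 us within a rack
print(path_latency(inter_rack_path))   # 20.0 us across racks (illustrative)
```

Because different server pairs traverse different numbers of devices, task completion times in the shuffle phase become unpredictable, which is the "inconsistent latency" problem named earlier.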
Also, a multitier network does not provide efficient inter-rack server-to-server bandwidth for a Hadoop cluster, because of the compounded oversubscription introduced by switches at the distribution and core tiers. For example, when each Data Node is configured with a 20 Gbps connection, the intra-rack server-to-server bandwidth is 20 Gbps. The average inter-rack server-to-server bandwidth between Rack 1 and Rack 2 (Figure 2) drops to 8 Gbps when the TOR switch operates at an oversubscription of 2.5:1, meaning the total server-facing bandwidth is 2.5 times the uplink bandwidth. Adding an oversubscription of 4:1 at the distribution switch makes the compounded oversubscription 10:1, a common scenario in most three-tier deployments. As a result, the average inter-rack server-to-server bandwidth for replicating the B1 blocks can be as low as 2 Gbps.
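The compounding works by simple multiplication, as the following sketch using the figures from this example shows:

```python
# Effective inter-rack bandwidth under compounded oversubscription,
# using the 20 Gbps per-node figure from the example above.
node_gbps = 20.0
tiers = {"tor": 2.5, "distribution": 4.0}   # oversubscription ratios

compounded = 1.0
for tier, ratio in tiers.items():
    compounded *= ratio                      # 2.5 * 4.0 = 10.0

print(f"compounded oversubscription: {compounded}:1")                    # 10.0:1
print(f"effective inter-rack bandwidth: {node_gbps / compounded} Gbps")  # 2.0 Gbps
```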
Network Operation
A multitier network significantly increases network management complexity. In a multitier network such as the one shown in Figure 2, the TOR switch interconnects the servers within each rack, and meeting the demands of an increasing number of servers requires many intermediate switches arranged in a hierarchical tree topology. Since each switch represents a management endpoint, network redesign is often required for performance assurance, high availability, and capacity planning as the Hadoop cluster grows. Network management also becomes increasingly complex because of the large number of endpoints and the number of devices involved in server-to-server connectivity and performance troubleshooting. For example, provisioning or troubleshooting the network between two B1 replicas may involve up to five devices, as shown in Figure 2.
In addition to storing and processing big data, the Hadoop cluster needs to collect data. Most unstructured data, such
as event data from active machine log files, can be staged in file systems on an array of servers. However, structured
data, such as purchasing transactions and real-time inventory tracking, commonly resides on a Fibre Channel (FC)-
based disk array. To rapidly collect the combination of structured and unstructured data in a real-time fashion, the
client nodes (shown in Figure 2) need a converged network to support direct, optimal access to data through the
storage area network (SAN). However, a multitier network may not support such a rich set of storage protocols for
rapid data collection.
In an ideal network, the replicated copies of data can be placed in distinct racks, and the large-scale network can be simplified and managed as easily as a single physical switch.
QFabric System Support for Big Data
The Juniper Networks QFX3500 Switch can be deployed as a high-performance, ultra-low latency switch for a small Hadoop cluster. The QFX3500 is a versatile, compact, high-density 10 Gbps platform in a 1 U form factor that runs the same Juniper Networks Junos