QFabric family of products. As a single logical switch, the QFabric System supports over 6,000 10GbE ports and provides consistent, extremely low latency of less than 5 microseconds, even under load (see footnote 3). This paper provides an overview of big data use cases, discusses the network's role in Hadoop clusters, and describes how QFabric technology supports a high-performance and scalable big data infrastructure.
1. For further information on big data use cases, see the McKinsey Global Institute article, "Big data: The next frontier for innovation, competition, and productivity," at www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
2. For further information concerning the IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, visit www.idc.com/getdoc.jsp?containerId=233485.
3. For further information concerning the QFabric System, visit www.juniper.net/us/en/products-services/switching/qfx-series/qfabric-system/#literature.
Introduction
Big data refers to a volume of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can deliver new business insights, open new markets, and create competitive advantage. Compared to structured data, which historically has been stored in data warehouses and analyzed with SQL-based tools, big data has three major attributes: variety, volume, and velocity.
First, big data extends beyond structured data to include semi-structured and unstructured data of all varieties: event data from active machine log files, text from social networks, streams from financial data services, click streams from customers accessing web applications, activity data from machine-to-machine exchanges, and even audio and video files.
Second, big data comes in large data sets, because its predictive and analytic power relies on a sufficient number of data points, covering everything from traffic patterns to purchasing behavior to real-time inventory management. Organizations are awash with data, easily amassing hundreds of terabytes or even petabytes of information.
Third, organizations can maximize the business value of their data by analyzing it as it streams in. The rise of the security event manager (SEM) industry is built on gathering, analyzing, and proactively responding in real time to event data from active machine log files, providing unique and powerful business intelligence.
Driven by a combination of technology innovation, ubiquitous social networking, and pervasive mobile devices, the rise
of big data has created an inflection point for organizations as they look for innovative ways to do business effectively
and economically.
Big Data Use Cases in Healthcare
According to a recent study (see footnote 4), using big data to address spiraling healthcare costs and to make intelligent decisions by analyzing information across departments and providers will greatly improve business efficiency and, in the end, improve patient outcomes. Beyond the clear economic benefit, big data is compelling in healthcare because of the sheer volume of raw data generated and the ability to rapidly analyze that data to enhance knowledge and improve the quality of patient care.
Improved Diagnosis and Treatment. Leveraging big data can greatly aid in diagnosing and treating illnesses, improving
patient outcomes, and in many cases saving lives. Considering that up to 25 percent of all diagnoses are not supported
by any data analytics, and that 100,000 Americans die as a direct result of medical errors every year, improving patient
outcomes is a key focus for the industry.
Disease Trend Analysis. The global nature of society has had a profound impact on the spread of disease. Regional diseases can quickly develop into global pandemics, which is why the industry is leveraging big data to offer much needed disease trend analysis. The world's governments had genuine concerns over the avian influenza pandemic of 2009, and while the actual confirmed deaths (approximately 15,000) were fewer than feared, other diseases such as West Nile virus, mad cow disease, and tuberculosis are not only a present-day concern but, unfortunately, a serious concern for the future as well. Big data can be used to predict and reduce future pandemics.
Predictive Analysis. How many lives would have been saved if the industry had the ability to more quickly see the
correlation between Vioxx and heart attacks? In congressional testimony regarding Vioxx, Dr. David Graham stated
that conservatively 100,000 people have had heart attacks as a result of using Vioxx, leading to between 30,000 and
40,000 deaths.
The way big data can positively impact healthcare is clear. In fact, the McKinsey Global Institute report (refer to
footnote 1) opined that the healthcare sector could create more than $300 billion in value every year by implementing
big data analytics. A significant amount of data is available and ready for analysis, so the time is right to make sure
that big data delivers this predicted value.
4. For further information, see "How Big Data Can Mend Our Broken Healthcare System," Ewing Marion Kauffman Foundation (April 2012), at www.smartplanet.com/blog/business-brains/how-big-data-can-mend-our-broken-healthcare-system-study/23728.
Apache Hadoop: The Big Player
Apache Hadoop is a widely deployed platform for managing and processing big data. Many companies such as IBM,
MapR Technologies, and Cloudera provide commercially licensed Hadoop software stacks with improved performance
and/or value-added features. Hadoop includes the following two key functions:
Hadoop Distributed File System (HDFS) is a large-scale distributed file system that consists of hundreds or thousands of servers, each storing part of the file system's data. HDFS not only provides high aggregate data bandwidth to handle large data sets, but it also tolerates the frequent hardware failures inherent in a large-scale deployment.
MapReduce is a distributed analytics framework designed for big data. It maps and processes input records on each server, collects and shuffles intermediate results through the network, and reduces them to a final result. The duration of each task (the execution time between the start and completion of a MapReduce job) can vary from minutes to hours depending on data complexity and the size of the Hadoop cluster.
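To make the three phases concrete, here is a minimal, single-process sketch of the MapReduce pattern in Python: a toy word count, not Hadoop's actual Java API. The function names are illustrative; in a real cluster, the map tasks run on many Data Nodes and the shuffle phase moves intermediate results across the network.

```python
from collections import defaultdict

# Toy word count illustrating the MapReduce pattern in a single process.

def map_phase(records):
    """Emit (word, 1) pairs for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Group intermediate values by key (done over the network in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce each key's values to a final result."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data big cluster", "data node data"]
print(reduce_phase(shuffle_phase(map_phase(records))))
# {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

The shuffle step is the network-intensive phase: every intermediate key may need to travel from the server that mapped it to the server that reduces it, which is why cluster-wide bandwidth matters so much for job completion time.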
In a typical Hadoop cluster, as shown in Figure 1, each server may take any combination of the following four roles:
The Client collects data from internal data sources within the data center or from external cloud-based data sources, submits MapReduce tasks to the Job Tracker, and delivers the analytics results to business applications.
The Job Tracker schedules the MapReduce tasks and ultimately produces analytics results.
The Name Node manages HDFS metadata, keeping track of the location of each file's data blocks throughout the cluster (a toy model of this block map follows this list). The default data block size is 64 MB, but it can be set to 512 MB or 1 GB depending on network performance.
The Data Node holds HDFS data blocks on its local drives and also executes tasks assigned by the Job Tracker. With hundreds or thousands of Data Nodes, the Hadoop cluster acts as a unified storage and analytics pool.
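As a rough illustration of the metadata the Name Node maintains, here is a minimal Python sketch, our own simplification rather than Hadoop's implementation; the node names, file size, and block-map structure are hypothetical.

```python
# Toy model of Name Node metadata: file -> blocks -> Data Node locations.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default block size

def split_into_blocks(file_name, file_size, replicas=3):
    """Return the block map a Name Node would track for one file."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    block_map = {}
    for i in range(num_blocks):
        block_id = f"{file_name}_blk_{i}"
        # Per block, record the Data Nodes holding each replica
        # (the real placement policy is discussed later in this paper).
        block_map[block_id] = [f"datanode-{(i + r) % 8}" for r in range(replicas)]
    return block_map

# A 200 MB file needs ceil(200/64) = 4 blocks, each with 3 replica locations.
for block, nodes in split_into_blocks("weblog", 200 * 1024 * 1024).items():
    print(block, "->", nodes)
```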
Hadoop has been tested and deployed by some of the world's largest data centers. For example, Yahoo! Inc. launched the Yahoo! Search Webmap in 2008, a Hadoop production application that ran on a Linux cluster of more than 10,000 cores with over 5 petabytes of raw disk space. The analytics results were used in every Yahoo! Web search query (see footnote 5).
Figure 1: Big data and Hadoop cluster
[Figure 1 depicts web, finance, multimedia, video log, and application data sources, including financial market data and social network data, flowing into a Hadoop cluster of Clients, a Name Node, a Job Tracker, and Data Nodes; the big data process flow runs from data collection through storage and analytics to delivery of results, with Hadoop network traffic among the nodes.]
5. For further information, see "Yahoo! Launches World's Largest Hadoop Production Application," at http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahoo-worlds-largest-production-hadoop/.
The Network's Role in Hadoop
Hadoop can run on most data center networks. However, legacy network architectures were not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale that big data demands. In a legacy multitier network design, the three main factors that adversely impact the performance and operation of a mid- to large-size Hadoop cluster are network complexity, degraded bandwidth, and inconsistent latency. Fortunately, organizations can achieve better results and reduce risk by running Hadoop on a network designed for a world where server-to-server and server-to-storage traffic outweighs client-to-server traffic.
Figure 2: The network's role in Hadoop
Data Reliability
A Hadoop cluster is built on a fleet of low-cost, high-capacity disk drives with network-based data replication and fault tolerance, simply because today's RAID technology cannot deliver the required data storage scalability cost-effectively. With a replication factor of three, HDFS places one replica on a Data Node in one rack and the other two copies on two different Data Nodes in other racks, thereby preventing data loss in the event of a rack failure (see B1, B1, and B1 in Figure 2). This placement policy deliberately accepts suboptimal placement because the bandwidth of inter-rack communication is significantly lower than intra-rack server-to-server bandwidth.
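A minimal Python sketch of a rack-aware placement policy of this kind follows; it is our simplified reading of the rule described above, not HDFS's exact algorithm, and the rack and node names are hypothetical. The first replica lands in the writer's rack, and the remaining two land on distinct Data Nodes outside that rack.

```python
import random

# Simplified rack-aware replica placement: first replica in the local
# rack, the remaining replicas on distinct nodes in other racks, so a
# whole-rack failure cannot destroy every copy of a block.

def place_replicas(racks, local_rack, replicas=3):
    """racks: dict mapping rack name -> list of Data Node names."""
    placement = [(local_rack, random.choice(racks[local_rack]))]
    remote_nodes = [(rack, node) for rack, nodes in racks.items()
                    if rack != local_rack for node in nodes]
    placement += random.sample(remote_nodes, replicas - 1)
    return placement

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
print(place_replicas(racks, "rack1"))
# e.g. [('rack1', 'dn2'), ('rack3', 'dn5'), ('rack2', 'dn4')]
```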
HDFS relies on the network to maintain data reliability in the event of failures, whether of a disk drive, a server, or a network device, or a combination of these. When a disk drive fails, the re-replication needed to restore data reliability can take hours. For example, it takes approximately 66 minutes to transfer 3 TB of data on a 1GbE network, without considering network latency, and 4 hours and 24 minutes when a server with four 3 TB disk drives fails. If a subset of Data Nodes loses connectivity with the Name Node, HDFS may become unreliable because the network does not have sufficient bandwidth to re-replicate large amounts of data.
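The arithmetic behind these figures is straightforward; in the back-of-envelope sketch below, the effective aggregate bandwidth is a parameter. Note our assumption in interpreting the numbers: a single 1GbE link alone moves 3 TB in roughly 400 minutes, so the 66-minute figure above implies HDFS re-replicating in parallel from several source nodes at an effective aggregate rate of about 6 Gbps.

```python
def replication_minutes(data_tb, effective_gbps):
    """Minutes to move data_tb terabytes at an effective aggregate rate."""
    bits = data_tb * 1e12 * 8            # decimal TB -> bits
    return bits / (effective_gbps * 1e9) / 60

print(round(replication_minutes(3, 1)))           # one 1GbE link: ~400 minutes
print(round(replication_minutes(3, 6)))           # ~6 Gbps aggregate: ~67 minutes
print(round(replication_minutes(12, 6) / 60, 1))  # four 3 TB drives: ~4.4 hours
```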
[Figure 2 shows a multitier network: core and distribution tiers above top-of-rack (TOR) switches TOR 1 through TOR 10, each serving a rack of Hadoop servers (a mix of Clients, Name Nodes, Job Trackers, and Data Nodes). Hadoop data replication places the three B1 replicas in different racks, and numbered network hops 1 through 5 mark the devices a replication flow must traverse.]
Performance
In a multitier network, performance characteristics such as latency and bandwidth depend on design considerations and on the devices deployed. As a result, inter-rack server-to-server network performance varies widely from one network to another of similar size. Although the latency of a typical 10GbE top-of-rack (TOR) switch is around 1 microsecond, the latency of the intermediate switches at the distribution and core tiers is significantly higher.
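To illustrate why multitier latency is both higher and less consistent, the short sketch below sums per-hop latencies along the five-device inter-rack path from Figure 2. The 1 microsecond TOR figure comes from the text above; the distribution and core values are hypothetical placeholders chosen only to show the effect.

```python
# Per-device latencies in microseconds. The 1 us TOR figure is from the
# text; the distribution and core values are illustrative assumptions.
latency_us = {"tor": 1.0, "distribution": 5.0, "core": 8.0}

intra_rack_path = ["tor"]
inter_rack_path = ["tor", "distribution", "core", "distribution", "tor"]

def path_latency(path):
    return sum(latency_us[hop] for hop in path)

print(path_latency(intra_rack_path))   # 1.0 us within a rack
print(path_latency(inter_rack_path))   # 20.0 us across racks (illustrative)
```

Because different server pairs traverse different numbers of devices, task completion times in the shuffle phase become unpredictable, which is the "inconsistent latency" problem named earlier.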
Also, a multitier network does not provide efficient inter-rack server-to-server bandwidth for a Hadoop cluster, because of the compounded oversubscription introduced by switches at the distribution and core tiers. For example, when each Data Node is configured with a 20 Gbps connection, the intra-rack server-to-server bandwidth is 20 Gbps. The average inter-rack server-to-server bandwidth between Rack 1 and Rack 2 (Figure 2) drops to 8 Gbps when the TOR switch operates at an oversubscription of 2.5:1, meaning the total server-facing bandwidth is 2.5 times the uplink bandwidth. Adding an oversubscription of 4:1 at the distribution switch makes the compounded oversubscription 10:1, a common scenario in most three-tier deployments. As a result, the average inter-rack server-to-server bandwidth for replicating the B1 blocks can be as low as 2 Gbps.
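The compounding works by simple multiplication, as the following sketch using the figures from this example shows:

```python
# Effective inter-rack bandwidth under compounded oversubscription,
# using the 20 Gbps per-node figure from the example above.
node_gbps = 20.0
tiers = {"tor": 2.5, "distribution": 4.0}   # oversubscription ratios

compounded = 1.0
for tier, ratio in tiers.items():
    compounded *= ratio                      # 2.5 * 4.0 = 10.0

print(f"compounded oversubscription: {compounded}:1")                    # 10.0:1
print(f"effective inter-rack bandwidth: {node_gbps / compounded} Gbps")  # 2.0 Gbps
```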
Network Operation
A multitier network significantly increases network management complexity. In a multitier network such as the one shown in Figure 2, the TOR switch interconnects the servers within each rack, and meeting the demands of an increasing number of servers requires many intermediate switches arranged in a hierarchical tree topology. Since each switch represents a management endpoint, network redesign is often required for performance assurance, high availability, and capacity planning as the Hadoop cluster grows. Network management also becomes increasingly complex because of the large number of endpoints and the number of devices involved in server-to-server connectivity and performance troubleshooting. For example, provisioning or troubleshooting the network between two B1 replicas may involve up to five devices, as shown in Figure 2.
In addition to storing and processing big data, the Hadoop cluster needs to collect data. Most unstructured data, such
as event data from active machine log files, can be staged in file systems on an array of servers. However, structured
data, such as purchasing transactions and real-time inventory tracking, commonly resides on a Fibre Channel (FC)-
based disk array. To rapidly collect the combination of structured and unstructured data in a real-time fashion, the
client nodes (shown in Figure 2) need a converged network to support direct, optimal access to data through the
storage area network (SAN). However, a multitier network may not support such a rich set of storage protocols for
rapid data collection.
In an ideal network, the replicated copies of data can be placed in distinct racks, and the large-scale network can be simplified and managed as easily as a single physical switch.
QFabric System Support for Big Data
The Juniper Networks QFX3500 Switch can be deployed as a high-performance, ultra-low latency switch for a small Hadoop cluster. The QFX3500 is a versatile, compact, high-density 10 Gbps platform in a 1 U form factor that runs the same Juniper Networks Junos