Table of Contents
Executive Summary
Introduction
Big Data Use Cases in Healthcare
Apache Hadoop: The Big Player
The Network's Role in Hadoop
  Data Reliability
  Performance
  Network Operation
QFabric System Support for Big Data
Big Data Infrastructure Strategy
  Operational Simplicity at Scale
  Data Reliability at Scale
  Performance at Scale
Conclusion
About Juniper Networks
List of Figures
Figure 1: Big data and Hadoop cluster
Figure 2: The network's role in Hadoop
Figure 3: Comparison of a chassis switch and a QFabric system
Figure 4: An optimized mid-size Hadoop cluster with QFabric system
Executive Summary
In today's increasingly complex and volatile business environment, organizations need to constantly adopt innovative technologies to compete. Big data refers to a collection of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can provide new business insights, open new markets, and create new competitive advantage in many industries. According to the McKinsey Global Institute (MGI) [1], for example, retailers can realize a potential 60 percent increase in operating margins by fully harnessing big data. In the healthcare industry, big data can reduce costs and enhance patient outcomes in diagnosis and treatment by driving efficiency, transparency, and quality. IDC expects the big data technology and services market to grow from $3.2 billion in 2010 to $16.9 billion in 2015, making it one of the fastest growing areas in the overall information and communication technology (ICT) market [2].
Big data has recently become an issue for organizations due to the dramatic increase in data creation and data gathering, driven by a number of technological innovations. The rise of mobile users has increased the aggregation of user statistics in the enterprise and, if properly synthesized and analyzed, these same statistics can provide highly relevant and competitive business intelligence. The increasing use of sensors for everything from traffic patterns and purchasing behaviors to real-time inventory management is another good example of the significant increase in large data sets. Much of this data, gathered in real time, can provide unique and powerful intelligence, especially if it can be analyzed and acted upon quickly.
In contrast to structured data, which has historically been stored in data warehouses and analyzed with Structured Query Language (SQL) analysis tools, big data requires a flat, horizontally scalable database, often with unique query tools that work in real time (as opposed to time-delineated snapshots). IT organizations must invest in new technologies and architectures to best leverage and gain advantage from the power of these new massive real-time data streams. In short, the big data phenomenon raises a challenging question for CIOs and CTOs: What is the best big data infrastructure strategy?
Fortunately, as big data pilots launch and business cases solidify, there are a number of changes occurring in network architectures that can enhance and help integrate big data processing and insights. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, the underlying network foundation of big data projects should be considered in a new light. Network architectures can either enhance or inhibit the ability to easily launch, grow, and integrate big data initiatives from pilot projects to large-scale production.
Consider Apache Hadoop, the de facto big data platform. To manage and process data in a server cluster that must scale to thousands of servers, the performance and manageability of the network are critical. In fact, most data center infrastructures, especially those based on multitier networking, face operational and performance challenges in storing and analyzing big data in a mid-size or large Hadoop cluster, which can interconnect thousands of servers. Organizations need a network solution that overcomes these issues and enables them to gain the considerable business intelligence benefits of big data analytics.
Organizations can achieve simplified management, improved performance, and optimized data reliability by using the Juniper Networks QFabric family of products. As a single logical switch, the QFabric system supports over 6,000 10GbE ports and provides a consistent, extremely low latency of less than 5 microseconds even under load [3]. This paper provides an overview of big data use cases, discusses the network's role in Hadoop clusters, and describes how QFabric technology supports a high-performance and scalable big data infrastructure.
[1] For further information on big data use cases, see the McKinsey Global Institute article "Big data: The next frontier for innovation, competition, and productivity" at www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
[2] For further information concerning the IDC Worldwide Big Data Technology and Services 2012-2015 Forecast, visit www.idc.com/getdoc.jsp?containerId=233485.
[3] For further information concerning the QFabric system, visit www.juniper.net/us/en/products-services/switching/qfx-series/qfabric-system/#literature.
Introduction
Big data refers to a volume of data that is beyond the ability of typical database software tools to collect, process, and deliver. When analyzed properly, big data can deliver new business insights, open new markets, and create competitive advantage. Compared to structured data, which historically has been stored in data warehouses and analyzed with SQL analysis tools, big data has three major attributes: variety, volume, and velocity.
First, big data extends beyond structured data and includes semi-structured or unstructured data of all varieties. This can include event data from active machine log files, text from social networking, data streams from financial data services, click streams from customers accessing web applications, activity data from machine-to-machine interchange, and even audio and video files.
Second, big data comes in large data sets because its predictive and analytic capabilities rely on sufficient data points, covering everything from traffic patterns and purchasing behaviors to real-time inventory management. Organizations are awash with data, easily amassing hundreds of terabytes and even petabytes of information.
Third, organizations can maximize their data's business value by analyzing streaming data. The rise of the security event manager (SEM) industry, for example, is built on gathering, analyzing, and proactively responding to event data from active machine log files in real time, providing unique and powerful business intelligence.
Driven by a combination of technology innovation, ubiquitous social networking, and pervasive mobile devices, the rise of big data has created an inflection point for organizations as they look for innovative ways to do business effectively and economically.
[4] For further information, see the article "How Big Data Can Mend Our Broken Healthcare System," Ewing Marion Kauffman Foundation (April 2012), at www.smartplanet.com/blog/business-brains/how-big-data-can-mend-our-broken-healthcare-system-study/23728.
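Figure 1: Big data and Hadoop cluster (diagram: the big data process flow from data collection of sources such as finance, video, and web log data through to delivery of results, together with Hadoop network traffic among the client, Job Tracker, and Data Nodes)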
[5] For further information, see "Yahoo! Launches World's Largest Hadoop Production Application" at http://developer.yahoo.com/blogs/hadoop/posts/2008/02/yahooworlds-largest-production-hadoop/.
Figure 2: The network's role in Hadoop (diagram: a multitier network with core, distribution, and TOR tiers connecting Racks 1 to 3 and 8 to 10, which host the clients, Name Nodes, Job Trackers, and Data Nodes; block B1 is stored on one Data Node in Rack 1 and on two Data Nodes in Rack 8)
Data Reliability
A Hadoop cluster is built on a fleet of low-cost, high-capacity disk drives with a network-based data replication and fault tolerance system, simply because today's RAID technology does not provide the required cost-effective storage scalability. With a replication setting of three copies, the Hadoop Distributed File System (HDFS) places one replica on a Data Node in one rack and the other two copies on two different Data Nodes in another rack, thereby preventing data loss in the event of a rack failure (see the three copies of B1 in Figure 2). This suboptimal data placement policy stems from the significantly reduced bandwidth of inter-rack communication compared with intra-rack server-to-server bandwidth.
HDFS relies on the network to maintain data reliability in the event of failures, which can be a disk drive, server, or network device failure, or a combination of these. When a disk drive fails, the resulting data replication event can take hours to restore data reliability. For example, it takes approximately 66 minutes to transfer 3 TB of data on a 1GbE network without considering network latency. When a server that employs four 3 TB disk drives fails, the transfer takes 4 hours and 24 minutes. When a subset of Data Nodes loses connectivity with the Name Node, HDFS may become unreliable because the network does not have sufficient bandwidth to re-replicate large amounts of data.
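The re-replication times quoted above can be sanity-checked with simple arithmetic. The following Python sketch computes the ideal line-rate transfer time for a given data size and, in the other direction, the sustained throughput implied by a quoted duration. The function names and the decimal convention (1 TB = 10^12 bytes) are assumptions made for illustration, and protocol overhead, disk throughput, and latency are ignored. Under these assumptions, both quoted figures (3 TB in 66 minutes and 12 TB in 4 hours 24 minutes) correspond to the same sustained rate of roughly 6 Gbps, so they appear to assume a faster effective rate than a single fully utilized 1 Gbps link would provide.

```python
TB = 10**12  # bytes in a decimal terabyte (assumed convention)


def transfer_minutes(size_bytes: float, link_gbps: float) -> float:
    """Ideal minutes to move size_bytes over a link running at link_gbps."""
    seconds = (size_bytes * 8) / (link_gbps * 1e9)
    return seconds / 60


def implied_gbps(size_bytes: float, minutes: float) -> float:
    """Sustained throughput in Gbps implied by moving size_bytes in the given minutes."""
    return (size_bytes * 8) / (minutes * 60) / 1e9


if __name__ == "__main__":
    # Ideal transfer time for one 3 TB drive at two illustrative line rates.
    for gbps in (1, 10):
        print(f"3 TB at {gbps} Gbps: {transfer_minutes(3 * TB, gbps):.0f} min")
    # Throughput implied by the durations quoted in the text.
    print(f"3 TB in 66 min implies {implied_gbps(3 * TB, 66):.2f} Gbps")
    print(f"12 TB in 264 min implies {implied_gbps(12 * TB, 264):.2f} Gbps")
```

Whatever the exact link speed, the relationship is linear: a server with four drives holds four times the data of one drive and takes four times as long to re-replicate at the same rate.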
Performance
In a multitier network, performance characteristics such as latency and bandwidth depend on design considerations and devices. As a result, inter-rack server-to-server network performance varies widely from one network to another of similar size. Although the latency of a typical 10GbE top-of-rack (TOR) switch is around 1 microsecond, the latency of intermediate switches at the distribution and core tiers is significantly higher than that of a TOR switch. Also, a multitier network does not provide efficient inter-rack server-to-server bandwidth for a Hadoop cluster because of the compounded oversubscription introduced by switches at the distribution and core tiers.
For example, when each Data Node is configured with a 20 Gbps connection, the intra-rack server-to-server network bandwidth is 20 Gbps. The average inter-rack server-to-server network bandwidth between Rack 1 and Rack 2 (Figure 2) is 8 Gbps when the TOR switch operates at an oversubscription of 2.5:1, meaning that total server communication bandwidth is 2.5 times the inter-rack network bandwidth. With an additional oversubscription of 4:1 on a distribution switch, the compounded oversubscription becomes 10:1, a common deployment scenario in most three-tier networks. As a result, the average inter-rack server-to-server bandwidth for replicating the copies of B1 can be as low as 2 Gbps.
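The oversubscription arithmetic in this example can be expressed as a short calculation. The Python sketch below is illustrative only; the function names are hypothetical, and it assumes that per-tier oversubscription ratios simply multiply and that bandwidth is shared evenly among servers.

```python
from typing import Iterable


def compounded_oversubscription(ratios: Iterable[float]) -> float:
    """Multiply per-tier oversubscription ratios (2.5 means 2.5:1)."""
    product = 1.0
    for ratio in ratios:
        if ratio < 1:
            raise ValueError("an oversubscription ratio cannot be below 1:1")
        product *= ratio
    return product


def average_inter_rack_gbps(server_gbps: float, ratios: Iterable[float]) -> float:
    """Average per-server inter-rack bandwidth after all tiers are applied."""
    return server_gbps / compounded_oversubscription(ratios)


if __name__ == "__main__":
    # TOR tier only, then TOR plus distribution tier, for a 20 Gbps server.
    print(average_inter_rack_gbps(20, [2.5]))
    print(average_inter_rack_gbps(20, [2.5, 4]))
```

Each additional oversubscribed tier divides the bandwidth available to a server again, which is why removing intermediate tiers has such a direct effect on replication traffic between racks.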
Network Operation
A multitier network significantly increases network management complexity. In a multitier network such as the one shown in Figure 2, the TOR switch interconnects servers within the rack. To meet the demand of an increasing number of servers, many intermediate switches are required to build a tree topology of hierarchical switches. Since each switch represents a management endpoint, network redesign is often required for performance assurance, high availability, and capacity planning as the size of the Hadoop cluster grows. Network management also becomes increasingly complex because of the large number of endpoints and the increased number of devices involved in server-to-server connectivity and performance troubleshooting. For example, provisioning or troubleshooting the network path between two copies of B1 may involve up to five devices, as shown in Figure 2.
In addition to storing and processing big data, the Hadoop cluster needs to collect data. Most unstructured data, such as event data from active machine log files, can be staged in file systems on an array of servers. However, structured data, such as purchasing transactions and real-time inventory tracking, commonly resides on a Fibre Channel (FC)-based disk array. To rapidly collect the combination of structured and unstructured data in real time, the client nodes (shown in Figure 2) need a converged network that supports direct, optimal access to data through the storage area network (SAN). However, a multitier network may not support such a rich set of storage protocols for rapid data collection.
In an ideal network, the replicated copies of data can be placed in different, unique racks, and the large-scale network can be managed as simply as a single physical switch.
Figure 3: Comparison of a chassis switch and a QFabric system (diagram: a chassis switch with a fabric and I/O modules beside a distributed switch in which QFabric Nodes act as the I/O modules)
The QFabric system has the unique ability to support an entire data center, up to 6,144 10GbE ports, with a single converged Ethernet switch. As shown in Figure 3, similar to a standalone modular switch chassis that has three main functional components (line cards, switch fabric, and routing engines), the QFabric system is composed of three separate components in a distributed architecture [6]:
- QFabric Node: Line card component of a QFabric system, which acts as the entry and exit point into the fabric. Up to 128 QFabric Nodes can be interconnected in a single QFabric system.
- QFabric Interconnect: High-speed transport device for interconnecting QFabric Nodes. A QFabric system supports up to 4 QFabric Interconnects.
- QFabric Director: Device controller and services manager that delivers a common window for managing all components as a single device.
The QFabric system is environmentally conscious, allowing enterprises to optimize every facet of the data center network while consuming less power, requiring less cooling, and producing a fraction of the carbon footprint of multitier data center networks. To achieve performance and economies of scale in Hadoop, QFabric technology, with its simplified operation and consistent low latency, is an ideal solution for building a big data network infrastructure that meets different organizations' needs.
[6] For further information, refer to "The QFabric Architecture: Implementing a Flat Data Center Network" at www.juniper.net/us/en/local/pdf/whitepapers/2000443-en.pdf.
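Table 1
Small Hadoop
  Servers [8]: 20
  Network devices: 1 standalone QFX3500
  Deployment footprint: Up to 1 rack
  Hops between servers: 1
  Average inter-rack server communication bandwidth: NA
  Intra-rack server-to-server communication bandwidth: 20 Gbps
  Maximum server-to-server latency: 1 microsecond [9]
Mid-size Hadoop (2.4 petabytes [7], 2,400 cores)
  Servers [8]: 200
  Network devices: Large QFabric system with 10 QFabric Nodes, 2 QFabric Interconnects, and 2 QFabric Directors
  Deployment footprint: 10 racks (average)
  Hops between servers: 1
  Average inter-rack server communication bandwidth: 8 Gbps
  Intra-rack server-to-server communication bandwidth: 20 Gbps
  Maximum server-to-server latency: 5 microseconds
Large Hadoop
  Servers [8]: 2,000
  Network devices: Large QFabric system with 100 QFabric Nodes, 4 QFabric Interconnects, and 2 QFabric Directors
  Deployment footprint: 100 racks and more
  Hops between servers: 1
  Average inter-rack server communication bandwidth: 8 Gbps
  Intra-rack server-to-server communication bandwidth: 20 Gbps
  Maximum server-to-server latency: 5 microseconds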
[7] 1 petabyte = 1,000 terabytes.
[8] Each server is a 2 RU server with two six-core CPUs, 288 GB RAM, four 3 TB hard drives, and a dual 10 Gbps NIC.
[9] 1 microsecond = 1/1000 millisecond.
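The mid-size figures of 2.4 petabytes and 2,400 cores follow directly from the per-server specification in the footnotes. The Python sketch below reproduces that sizing arithmetic; the class and function names are hypothetical, the storage figure is raw capacity before replication, and the default of 20 servers per rack is taken from the cluster description elsewhere in this paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Per-server hardware as described in the footnote (illustrative model)."""
    cpus: int = 2
    cores_per_cpu: int = 6
    drives: int = 4
    drive_tb: int = 3


def size_cluster(servers: int,
                 spec: ServerSpec = ServerSpec(),
                 servers_per_rack: int = 20) -> dict:
    """Raw storage, core count, and rack count for a cluster of identical servers."""
    raw_tb = servers * spec.drives * spec.drive_tb
    return {
        "raw_storage_tb": raw_tb,
        "raw_storage_pb": raw_tb / 1000,  # 1 petabyte = 1,000 terabytes
        "cores": servers * spec.cpus * spec.cores_per_cpu,
        "racks": math.ceil(servers / servers_per_rack),
    }


if __name__ == "__main__":
    for servers in (20, 200, 2000):
        print(servers, size_cluster(servers))
```

With one QFabric Node per rack, the rack count for 200 and 2,000 servers also matches the 10 and 100 QFabric Nodes listed for the mid-size and large clusters.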
Figure 4: An optimized mid-size Hadoop cluster with QFabric system (diagram: QFabric Nodes 1 to 3 and 8 to 10 connect Racks 1 to 3 and 8 to 10 through the QFabric Interconnect; block B1 is stored on Data Nodes in Rack 1, Rack 8, and Rack 9)
Performance at Scale
The linear performance scalability of Hadoop allows organizations to predict and plan their infrastructure: by doubling the size of a cluster, organizations can process twice the amount of data in a given time or halve the execution time for a given amount of data. The QFabric architecture supports such performance scalability with consistent, extremely low any-to-any latency and efficient inter-rack bandwidth. By eliminating the intermediate switches, the QFabric system operates at an oversubscription of 2.5:1. In a mid-size or large Hadoop cluster with 20 servers per rack, each employing two 10 Gbps NICs (shown in Table 1), this means 400 Gbps of bandwidth within the rack and 160 Gbps of inter-rack bandwidth. The intra-rack server communication bandwidth is 20 Gbps, while the average inter-rack server-to-server communication bandwidth is 8 Gbps.
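The rack-level totals above can be derived from the server count and NIC configuration. The following Python sketch shows the aggregate view; the function name and dictionary keys are hypothetical, and it assumes that every NIC runs at line rate and that inter-rack bandwidth is shared evenly across the servers in a rack.

```python
def rack_bandwidth(servers_per_rack: int,
                   nics_per_server: int,
                   nic_gbps: float,
                   oversubscription: float) -> dict:
    """Aggregate and per-server bandwidth for one rack behind a single fabric node."""
    per_server = nics_per_server * nic_gbps          # intra-rack, per server
    intra_rack_total = servers_per_rack * per_server  # all server-facing bandwidth
    inter_rack_total = intra_rack_total / oversubscription
    return {
        "per_server_intra_gbps": per_server,
        "intra_rack_total_gbps": intra_rack_total,
        "inter_rack_total_gbps": inter_rack_total,
        "per_server_inter_gbps": inter_rack_total / servers_per_rack,
    }


if __name__ == "__main__":
    # 20 servers per rack, two 10 Gbps NICs each, 2.5:1 oversubscription.
    print(rack_bandwidth(20, 2, 10, 2.5))
```

Because there is only one oversubscribed stage, the per-server inter-rack figure does not degrade further as racks are added to the fabric.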
Conclusion
Hadoop can run on most data center networks. However, legacy network architectures are not designed to handle modern distributed application architectures, nor can they deliver the reliability and performance at scale demanded by big data. Just as big data applications represent a new way of collecting, analyzing, and taking action on business data, the underlying network foundation of big data projects should be considered in a new light, with the Juniper Networks QFabric system as that foundation. Network architectures can either enhance or inhibit the ability to easily initiate, grow, and integrate big data initiatives from pilot to large-scale production. Thus, organizations should further consider the following key questions:
- If a pilot is successful, how big will the cluster become?
- What is the easiest way to add compute capacity without adding complexity and cost to running a cluster at scale?
- Over the lifetime of the cluster, which Hadoop or other applications will be running on the cluster?
- How do we extend data outputs or inputs to legacy or other applications?
With the QFabric system, organizations can easily build small, mid-size, and large Hadoop clusters. Compared to the multitiered data center network approach, the QFabric system helps businesses simplify network operation, improve Hadoop performance, and optimize Hadoop data reliability, as shown in Table 2.
Table 2: Hadoop Benefits Comparing Multitier Network Approach and QFabric System
Operation at scale
  Multitiered network: Complexity grows as the size of the cluster grows
  QFabric system: Simplified
Reliability at scale
  Multitiered network: Current data placement policy is constrained by the limited inter-rack bandwidth
  QFabric system: Optimized
Performance at scale
  Multitiered network: Suboptimal performance due to ad hoc design
  QFabric system: Optimized
With the evolving trend in big data combined with continued growth in data creation, big data analytics demands an elastic data center infrastructure to effectively collect big data, process it, and deliver actionable information in real time. The Juniper Networks QFabric system is a data center solution that offers a high-performance, scalable big data infrastructure with simplified management. With QFabric technology, CIOs and CTOs no longer need to worry about disruptive transformations in their big data initiatives.
Corporate and Sales Headquarters: Juniper Networks, Inc., 1194 North Mathilda Avenue, Sunnyvale, CA 94089 USA. Phone: 888.JUNIPER (888.586.4737) or 408.745.2000. Fax: 408.745.2100. www.juniper.net
APAC Headquarters: Juniper Networks (Hong Kong), 26/F, Cityplaza One, 1111 King's Road, Taikoo Shing, Hong Kong. Phone: 852.2332.3636. Fax: 852.2574.7803
EMEA Headquarters: Juniper Networks Ireland, Airside Business Park, Swords, County Dublin, Ireland. Phone: 35.31.8903.600. EMEA Sales: 00800.4586.4737. Fax: 35.31.8903.601
To purchase Juniper Networks solutions, please contact your Juniper Networks representative at 1-866-298-6428 or an authorized reseller.
Copyright 2012 Juniper Networks, Inc. All rights reserved. Juniper Networks, the Juniper Networks logo, Junos, NetScreen, and ScreenOS are registered trademarks of Juniper Networks, Inc. in the United States and other countries. All other trademarks, service marks, registered marks, or registered service marks are the property of their respective owners. Juniper Networks assumes no responsibility for any inaccuracies in this document. Juniper Networks reserves the right to change, modify, transfer, or otherwise revise this publication without notice.
2000483-001-EN
June 2012