
Welcome to Big Data Overview.

Click the Notes tab to view text that corresponds to the audio recording. Click the Supporting
Materials tab to download a PDF version of this eLearning.
Copyright 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 EMC Corporation. All Rights
Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without
notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED AS IS. EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.
EMC2, EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC Snap, EMC SourceOne,
EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor, ApplicationXtender, ArchiveXtender, Atmos, Authentica,
Authentic Problems, Automated Resource Manager, AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra,
Celerra Replicator, Centera, CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology,
Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz, DatabaseXtender, Direct
Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum, eInput, E-Lab, EmailXaminer, EmailXtender,
Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE, FormWare, Geosynchrony, Global File Virtualization, Graphic Visualization,
Greenplum, HighRoad, HomeBase, InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, ISIS, Max Retriever, MediaStor,
MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale, PixTools, Powerlink, PowerPath, PowerSnap, QuickScan, Rainfinity,
RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor, SAN Copy, SAN Manager, Smarts, SnapImage,
SnapSure, SnapView, SRDF, StorageScope, SupportMate, SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX,
TimeFinder, UltraFlex, UltraPoint, UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual
Provisioning, VisualSAN, VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo, and where
information lives, are registered trademarks or trademarks of EMC Corporation in the United States and other countries.

All other trademarks used herein are the property of their respective owners.
Copyright 2014 EMC Corporation. All rights reserved. Published in the USA.

Revision Date: March 2014


Revision Number: MR-1WN-BDOVIEW

Big Data Overview

This course provides an overview of Big Data and Big Data analytics. It starts with a definition of big data and describes the unique circumstances for analyzing that data. It defines the role of the data scientist, describes the tools that a data scientist might use to process that data, and concludes with a summary of the products that EMC provides to support customers' Big Data analytics needs.


This slide lists the lessons covered in this module. Please take a moment to review them.


This lesson includes a definition of Big Data and a description of Big Data characteristics and
considerations. Additionally, you will learn how unstructured data is fueling Big Data analytics,
and how this has influenced the analyst perspective on data repositories.


Think about what Big Data is for a moment. Is there a size threshold over which data becomes
Big Data? Is it the number of rows or records? Is it the number of columns or variables? How
much does the complexity of its structure influence the designation as Big Data? How new are
the analytical techniques?


There are multiple characteristics of big data, but four stand out as defining characteristics.
You can think of these characteristics as 3V+C.
The huge volume of data: for instance, tools that can manage billions of rows and millions of columns;
The complexity of data types and structures, with an increasing variety of unstructured data: eighty to ninety percent of the data in existence is unstructured, part of the "Digital Shadow" or "Data Exhaust"; and
The speed, or velocity, of new data creation.
Additionally, the data, due to its size or level of structure, cannot be efficiently analyzed using only traditional databases or methods.
There are many examples of emerging big data opportunities and solutions. They include such
things as Netflix suggesting your next movie rental, the dynamic monitoring of embedded
sensors in bridges to detect real-time stresses and longer-term erosion, and retailers analyzing
digital video streams to optimize product and display layouts and promotional spaces on a
store-by-store basis.
These kinds of big data problems require new tools and technologies to store, manage, and
realize the business benefit. Big data necessitates new architectures that are supported by
new tools, processes, and procedures that enable organizations to create, manipulate, and
manage these very large data sets and the storage environments that house them.


Big data can come in multiple forms: everything from highly structured financial data, to text
files, to multi-media files and genetic mappings. The high volume of the data is a consistent
characteristic of big data. As we will see in the next slide, most of the big data is unstructured
or semi-structured in nature, which requires different techniques and tools to process and
analyze.
Big data is a relative term. For some organizations, terabytes of data may be unmanageable, while other organizations may find that petabytes of data are overwhelming. If you can't process your data with your existing capabilities, then you have a Big Data problem.


The most prominent feature of big data is its structure, or rather its lack of it: eighty to ninety percent of future data growth will come from non-structured data types.

The four types of data are distinct, but they can be mixed together at times. For instance, you may have a classic relational database management system (RDBMS) storing call logs for a software support call center. In this case, you have typical structured data. Additionally, you will likely have unstructured or semi-structured data, such as free-form call log information taken from an email ticket or an actual phone call description of a technical problem and its solution.

Another possibility would be voice logs or audio transcripts of the actual call that might be associated with the structured data. Until recently, most analysts could analyze only the most common and highly structured data in this call log RDBMS, since mining the textual information is very labor intensive and could not be easily automated.


People tend to both love and hate spreadsheets. With their introduction, business users were
able to create simple logic on data structured in rows and columns and create their own analyses of business problems. Spreadsheets are easy to share, and end users have control over the logic involved. However, the proliferation of spreadsheets ("data islands") caused organizations to struggle with many versions of the truth, meaning it was impossible to determine whether you had the right version of a spreadsheet, with the most current data and logic in it.
As data needs grew, companies began offering more scalable data warehousing solutions. These technologies enabled companies to manage the data centrally, with the benefits of security, failover, and a single repository where users could rely on getting an official source of data for financial reporting or other mission-critical tasks.
Another implication of this phase is that Enterprise Data Warehouse (EDW) rules restrict analysts from building data sets, which can cause shadow systems to emerge within organizations. These shadow systems contain critical data for constructing analytic data sets and are managed locally by power users.

Analytic sandboxes enable high-performance computing using in-database processing. This approach creates relationships to multiple data sources within an organization, saves the analyst the time of creating these data feeds on an individual basis, and provides faster turnaround for developing and executing new analytic models.


This lesson covers business drivers for analytics, current analytical architecture, business
intelligence versus data science, and drivers of Big Data and new Big Data ecosystems.


Today, organizations contend with several business problems where they have an opportunity
to leverage advanced analytics to create competitive advantage. Rather than doing standard
reporting on these areas, organizations can apply advanced analytical techniques to optimize
processes and derive more value from these typical tasks.

The first three examples listed in this slide are not new problems; companies have been trying to reduce customer churn, increase sales, and cross-sell customers for many years. What's new is the opportunity to fuse advanced analytical techniques with big data to produce more impactful analyses for these old problems.


Business Intelligence (BI) focuses on using a consistent set of metrics to measure past
performance and inform business planning. This includes creating Key Performance Indicators
(KPIs) that reflect the most essential metrics to measure your business.
Predictive Analytics and Data Mining (data science) refer to a combination of analytical and machine learning techniques used for drawing inferences and insight from data. These methods include approaches such as regression analysis, Association Rules (also known as Market Basket Analysis), optimization techniques, and simulations, for example, Monte Carlo simulation to model scenario outcomes. These are more robust techniques for answering higher-order questions and deriving greater value for an organization.
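
To make the contrast with standard reporting concrete, here is a minimal sketch of one of the techniques named above: a Monte Carlo simulation of a scenario outcome. The revenue model, the distributions, and all parameter values are illustrative assumptions invented for this sketch, not part of the course material.

    import random

    def simulate_quarterly_revenue(n_trials=100_000):
        # Assumed toy model: unit demand is normally distributed and unit
        # price is uniform; both distributions are invented for illustration.
        outcomes = []
        for _ in range(n_trials):
            units = max(random.gauss(10_000, 1_500), 0)
            price = random.uniform(9.0, 11.0)
            outcomes.append(units * price)
        outcomes.sort()
        mean = sum(outcomes) / n_trials
        p05 = outcomes[int(0.05 * n_trials)]
        p95 = outcomes[int(0.95 * n_trials)]
        return mean, p05, p95

    mean, p05, p95 = simulate_quarterly_revenue()
    print(f"expected revenue ~ {mean:,.0f}")
    print(f"90% of simulated scenarios fall between {p05:,.0f} and {p95:,.0f}")

Rather than a single point estimate, the simulation yields a distribution of outcomes, which is what lets an analyst answer the "higher-order" questions described above.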


This graphic portrays a typical data warehouse and some of the challenges that it presents. For
source data to be loaded into the Enterprise Data Warehouse (EDW), data must be well
understood, structured, and normalized with the appropriate data type definitions.
As a result of this level of control on the EDW, shadow systems emerge in the form of
departmental warehouses and local data marts that business users create to accommodate
their need for flexible analysis. These local data marts do not have the same constraints for
security and structure as the EDW does, and allow users across the enterprise to do some
level of analysis.
Once in the data warehouse, data is fed to enterprise applications for business intelligence
and reporting purposes. These are high priority operational processes getting critical data
feeds from the EDW.
At the end of this work flow, analysts get data provisioned for their downstream analytics.
Since users cannot run custom or intensive analytics on production databases, analysts create
data extracts from the EDW to analyze.
Lastly, because of the rigorous validation and data structuring process, data is slow to move into the EDW and the schema is slow to change.
EDWs generally limit the ability of analysts to iterate on the data in a separate environment
from the production environment where they can conduct in-depth analytics, or perform
analysis on unstructured data.


Today's typical data architectures were designed for storing mission-critical data, supporting
enterprise applications, and enabling enterprise level reporting. These functions are still
critical for organizations, although these architectures inhibit data exploration and more
sophisticated analysis.


Key/Value databases are simply those datastores where data is stored and retrieved by a unique key, usually treated as an uninterpreted string of bytes. Associated with each key is a value; again, the values contained in this datastore may be in any format, and may differ for each <key/value> pair.
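
As a minimal illustration of the key/value model, here is a toy in-memory datastore in Python. The class and method names are invented for this sketch and do not correspond to any particular NoSQL product.

    class ToyKeyValueStore:
        # Toy key/value datastore: keys are opaque strings of bytes, and
        # the store never interprets or validates the values it holds.

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value      # value format is entirely up to the caller

        def get(self, key):
            return self._data.get(key)   # retrieval is only ever by exact key

    store = ToyKeyValueStore()
    store.put(b"user:42", b'{"name": "Ada"}')       # this value happens to be JSON
    store.put(b"img:7", bytes([0xFF, 0xD8, 0xFF]))  # this one is raw image bytes
    print(store.get(b"user:42"))

Note that the two values have completely different formats; only the application, never the store, knows how to interpret them.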

NoSQL is a label that has been applied to databases that do not necessarily use SQL, or Structured Query Language, to access the data. It can also mean "not only SQL," which implies that other forms of queries can be used against the data.
Everyone and everything is leaving a digital footprint. This graphic provides a perspective on
sources of big data generated by new applications and the scale and growth rate of the data.
These applications provide opportunities for new analytics and driving value for organizations.
The data come from multiple sources. The Big Data trend is generating an enormous amount
of information that requires advanced analytics and new market players to take advantage of
that information.


Organizations and data collectors are realizing that the data they can collect from you is very valuable, and a new economy is rising around the collection of data, because data is streaming off all of these computing and network devices.
Consider this example. Item 1 portrays "Data Devices" and the "Sensor Net" that is collecting data from multiple locations and is continuously generating new data about this data. For each gigabyte of new data you create, a petabyte of data is created about that data.
Data Collectors (the blue circles) include entities that are collecting data from the device and
users. This can range from your cable TV provider tracking the shows you watch to retail stores
tracking the path you take pushing a shopping cart in their store with a radio-frequency
identification (RFID) chip so they can gauge which products get the most foot traffic.
Data Aggregators (the dark grey circles) make sense of the data collected from the various entities of the "Sensor Net" or the "Internet of Things." These organizations compile data
from the devices and usage patterns and choose to transform and package these data as
products to sell to list brokers, who may want marketing lists of people who may be good
targets for specific ads.
At the outer edges of this web of the ecosystem are Data Users and Buyers. These groups
directly benefit from the information collected and aggregated by others in the data value
chain.


Big data projects carry with them several considerations that you need to keep in mind to
ensure this approach fits with what you are trying to achieve. The analytic techniques used in this context need to be iterative and flexible (analysis flexibility), due to the high volume and complexity of the data. These conditions give rise to complex analytical projects (such as predicting customer churn rates) that can be performed with some latency, depending on the speed of decision making needed, or that can be operationalized, using a combination of advanced analytical methods, big data, and machine learning algorithms, to provide real-time or near-real-time analysis (which requires high throughput), such as recommendation engines that look at your recent web history and purchasing behavior.

Additionally, to be successful you will need a different approach to the data architecture than is seen in today's typical EDWs. Analysts need to partner with IT and DBAs to get the data they need within an analytic sandbox, which contains raw data, aggregated data, and data with multiple kinds of structure.


Big Data analytics have been used across multiple industries, including health care (reducing the cost of care), public services (preventing pandemics), life sciences (genomic mapping), IT infrastructure (unstructured data analysis), and online services (social media for professionals).
Let's consider two of these in more detail.

In the case of health care, Dr. Jeffrey Brenner at Rutgers University generated his own crime
maps from medical billing records from three hospitals. He was motivated to do so after
observing poor police response and problems with medical care associated with the shooting
of a Rutgers student. By utilizing data collection and visualization, Dr. Brenner determined that
the city hospitals and ERs were providing expensive but low quality care.
The story of LinkedIn reflects an opportunity to create a social media space for professionals. LinkedIn now collects and analyzes data from over 100 million users; the site adds one million users per week. Via this analysis, LinkedIn is able to offer valuable services, such as LinkedIn Skills, InMaps, job recommendations, and recruiting.
Since LinkedIn's founder is convinced that data analytics is the wave of the future, LinkedIn has gone so far as to establish a diverse group of data scientists.


The loan process has been honed to a science over the past several decades. Unfortunately, today's realities require that lenders take more care to make better decisions with fewer resources than they've had in the past.
The typical loan process uses a set of data providing a base for pre-approval and underwriting approval, including income data, employment history, credit history, and appraisal data.
This model works, but it's not perfect. Using Big Data, we can not only dramatically improve the quality of the loan underwriting process, but also streamline the process to yield results in less time. A secondary benefit is that the approval process can be shortened from an average of 3 to 4 weeks to 2 to 3 weeks, a savings of over 30%. In some situations, this benefit can cut the time almost in half, if more data sources are available for analysis.

Loan approvals are just the tip of the iceberg; far more transformative changes can be enabled
with Big Data in the financial services arena. A change in life event could trigger an alert to
contact the customer. Travel insurance could be offered if credit card transaction data or a
Tweet indicates an upcoming international vacation. The possibilities are endless when you
use insight to drive action with Big Data.


This lesson covers the key roles of the new data ecosystem and provides a profile of a data
scientist.


The new data ecosystem, driven by the arrival of big data, will require three types of roles to provide services.
The ecosystem requires deep analytical talent. The data scientist should be technically knowledgeable, have strong analytical skills, and possess a combination of skills to handle raw data, unstructured data, and complex analytical techniques at massive scale. Existing professions include data scientists, statisticians, economists, and mathematicians.
The new ecosystem also requires data-knowledgeable professionals, such as financial analysts, market research analysts, life scientists, operations managers, and business and functional managers.

Finally, the ecosystem will need technology and data enablers, such as computer
programmers, database administrators, and computer system analysts.


Typical analytical projects need data scientists, data engineers, data analysts, BI analysts, and line of business (LOB) users. Consider that the data scientist role now combines several skill sets that were separate roles in the past. Rather than having one person for the consultative aspects of the discovery phase of a project, a different person to deal with the end user in a line of business, and another person with technical and quantitative expertise, the data scientist combines all of these aspects to provide continuity throughout the analytical process.


What are the competency and behavioral characteristics of a data scientist? First are
quantitative skills, such as mathematics or statistics. Second is technical aptitude, such as
software engineering, machine learning, and programming skills. Third, the data scientist must
be skeptical. It is important that data scientists examine their work critically rather than in a
one-sided way. Fourth, the data scientist is curious and creative. Data scientists must be
passionate about data and finding creative ways to solve problems and portray information.

Lastly, the data scientist must be communicative and collaborative. Data scientists must be
able to articulate the business value in a clear way and work collaboratively with project
sponsors and key stakeholders.


This lesson covers tools for the data scientist: the SMAQ Stack, Hadoop and HDFS,
MapReduce, and the Hadoop Ecosystem.


In a blog post at O'Reilly Radar, Ed Dumbill outlined the notion of a processing stack for big data, consisting of Storage, Map/Reduce technologies, and Query technologies: the SMAQ Stack. As expected, storage is a foundational aspect of this endeavor, and is characterized by distribution and by unstructured content.
At the intermediate layer, MapReduce technologies enable the distribution of computation across many servers ("send the computation to the data") and support a batch-oriented processing model of data retrieval and computation, as opposed to the record-set orientation of most SQL-based databases. Finally, at the top of the stack are the Query functions.


In late 2011, Information Week magazine's lead article was "Hadoopla: Why Hadoop is the toast of the Big Data era." Hadoop has certainly captured the imagination of the industry. However, Hadoop can mean different things to different people. For some it represents a parallel programming paradigm and massive unstructured data storage using commodity hardware (which doesn't mean cheap hardware).
For others, Hadoop refers to the Hadoop Distributed File System (HDFS) that implements the unstructured data storage, or to the set of Java classes by which a Java programmer can access HDFS data or write Java code that provides the map and reduce functions.


Let's look a little deeper at the Hadoop Distributed File System (HDFS). The NameNode and the DataNode are part of the HDFS implementation. Apache Hadoop has one NameNode and multiple DataNodes. The NameNode service in Hadoop acts as a regulator/resolver between a client and the various DataNode servers: it manages the name space by determining which DataNode contains the data requested by the client and redirecting the client to that particular DataNode.
DataNodes in HDFS are (oddly enough) where the data is actually stored. The data itself is replicated across racks, which means that a failure in one rack will not halt data access, though possibly at the expense of a slower response. Since HDFS isn't intended for near real-time access, this is acceptable in the majority of cases.
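
To make this division of labor concrete, here is a toy Python sketch of the NameNode's role described above. The block numbering, the rack-aware placement check, and all names are simplifications invented for this illustration; real HDFS behavior is far richer.

    from collections import defaultdict

    class ToyNameNode:
        # Toy model of HDFS metadata: the NameNode records, for each block
        # of each file, which DataNodes hold a replica; clients then read
        # the actual bytes directly from one of those DataNodes.

        def __init__(self):
            self.block_locations = defaultdict(list)

        def register_block(self, path, block_no, datanodes):
            # Invented placement check: replicas must span more than one
            # rack so a single rack failure cannot make the block unreachable.
            racks = {dn.split("/")[0] for dn in datanodes}
            if len(racks) < 2:
                raise ValueError("replicas must span at least two racks")
            self.block_locations[(path, block_no)] = list(datanodes)

        def locate(self, path, block_no):
            # Resolve a client request to the DataNodes holding the block.
            return self.block_locations[(path, block_no)]

    namenode = ToyNameNode()
    namenode.register_block("/logs/web.log", 0, ["rack1/dn3", "rack1/dn5", "rack2/dn1"])
    print(namenode.locate("/logs/web.log", 0))   # the client reads from any of these

The key point the sketch captures is that the NameNode holds only metadata; the data itself never flows through it.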


The MapReduce function within Hadoop depends on two different node types: the JobTracker and the TaskTracker. One JobTracker node exists for each MapReduce implementation. JobTracker nodes are responsible for distributing the Mapper and Reducer functions to available TaskTrackers and monitoring the results, while TaskTracker nodes actually run the jobs and communicate results back to the JobTracker. The communication between nodes is often through files and directories in HDFS, so internode network communication is minimized.
In this example (Step 1), we have a very large data set containing log files, sensor data, et cetera. HDFS stores replicas of that data, represented here by the blue, yellow, and beige icons, across DataNodes.
In Step 2, the client defines and executes a map job and a reduce job on a particular data set and sends them both to the JobTracker, where in Step 3 the jobs are in turn distributed to the TaskTracker nodes. The TaskTracker runs the mapper, and the mapper produces output that is stored in the HDFS file system. Lastly, in Step 4, the reduce job runs across the mapped data in order to produce the result.


The idea of MapReduce isn't new. What is new in Google's MapReduce is the parallel processing of data as well as computation. Here, a set of worker tasks each work on a subset of data, where each subset is usually physically and logically distinct from the others. In the world of databases, this separation is often called sharding: a technique whereby data is partitioned such that I/O operations on one partition can proceed without concern for the others.
MapReduce also borrows some elements of functional programming. All data elements in Map/Reduce are immutable. At base, Map/Reduce programs transform lists of input data elements into lists of output data elements, and do so twice: once for the Map phase and once for the Reduce phase.
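
As a minimal illustration of that functional style, the sketch below transforms an input list twice, once for a map phase and once for a reduce phase, without mutating any element. The sample data and helper names are arbitrary choices for this example.

    from functools import reduce

    lines = ["beach sun beach", "sun sand"]    # immutable input elements

    # Map phase: each input element is transformed into (key, value) pairs;
    # nothing in `lines` is modified, only new pairs are produced.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Reduce phase: fold the list of pairs into a new output value, again
    # building fresh dictionaries rather than mutating the accumulator.
    counts = reduce(
        lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
        mapped,
        {},
    )
    print(counts)    # {'beach': 2, 'sun': 2, 'sand': 1}

Because no element is ever modified in place, the map work can safely be split across many workers, which is exactly what Hadoop exploits.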


The traditional example of a MapReduce problem is that of counting words in a very, very large body of data: millions of documents over hundreds of machines. Let's assume that we wish to know how many times the word "beach" appears in this body of data.
Mapper functions do one thing and only one thing: process the data provided and count each time the word "beach" appears. The Reducer functions collect all the input from the Mappers and aggregate the results to provide a single answer: the word "beach" appears five times.
Obviously, this is a toy example. Consider, however, that this same technique can be used to create a list of words across multiple documents, or to create a word list of one document that can be compared with that of another.
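
Here is a minimal sketch of that word count in plain Python, with the shuffle/sort step between the two phases made explicit. It mimics the Hadoop mapper/reducer contract but is a standalone illustration, not actual Hadoop API code; the sample documents are invented so that "beach" appears five times.

    from itertools import groupby

    def mapper(document):
        # Emit a (word, 1) pair for every word in one input split.
        for word in document.split():
            yield (word.strip(".,!?").lower(), 1)

    def reducer(word, counts):
        # Aggregate every value seen for one key into a single result.
        return (word, sum(counts))

    documents = ["Beach reading on the beach", "The beach was empty",
                 "No beach in sight", "Beach!"]

    # Map phase: run the mapper over every split (Hadoop would do this in
    # parallel across TaskTrackers).
    pairs = [pair for doc in documents for pair in mapper(doc)]

    # Shuffle/sort phase: order the pairs by key, ascending.
    pairs.sort(key=lambda kv: kv[0])

    # Reduce phase: one reducer call per distinct key.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(reducer(word, (value for _, value in group)))

Among the printed results is ('beach', 5), the single aggregated answer described above.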


Let's go back and revisit the Map/Reduce data flow in more detail.
The Map/Reduce data flow is conceptually simple. Input data is provided to a Map/Split function that assigns a number of worker tasks to process the input data in parallel. The data must be suitably distributed such that all the data is covered but only one worker task works on a particular set of data. The mapped data is then presented to the Reduce (Summarize) function, sorted in ascending key order.
There may well be multiple entries with the same key; in this case, the Reduce will aggregate the values in some way and output the final value for the associated key.


The Hadoop system has been used as a base for multiple projects. Among them are the Pig and Hive query languages and the HBase database system. In all instances, code can be written in Java to implement the Map/Reduce functions.
Pig is a dataflow language that breaks processing into separate steps. These intermediate steps usually translate into Map/Reduce tasks that actually extract the data. People with knowledge of scripting languages like Python, Ruby, or Perl will find this transition to be less painful. User-defined processing can be implemented as UDFs written in Java.
Hive is an SQL-like language that allows the user to access data stored in HDFS as if it were stored in tables. People with prior knowledge of SQL and RDBMSs will find this approach to be more suitable.
HBase is a datastore that mimics the functionality provided by Google's BigTable. The original description of BigTable says it best: "BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers."


This lesson covers concepts in practice for EMC and Big Data: Isilon Scale-Out NAS, Atmos, Greenplum DB, and MAD Analytics.


This is what EMC is doing with the Isilon family. The EMC Isilon S Series, the next generation of the Isilon platform, was introduced in March 2011 to demonstrate record scalability and remarkable simplicity simultaneously. The Isilon S200 has linear throughput scalability, with a two-times increase in throughput to over 80 gigabytes per second. Similarly, there's a dramatic increase in transaction scalability: again, a linear increase in the IOPS of the platform to over 1.2 million SPECsfs-like operations per second.


The EMC Isilon NL Series, compared to competitors, has record file system capacity: 15 petabytes.
For fun, we thought we would draw the competitors to scale. They just seem to disappear!


The EMC Atmos platform provides an alternative approach to managing Big Data in the Cloud.


Consider the case of storage islands. In these implementations, storage was running on disparate systems, requiring manual administration, supporting a single tenant on each of many systems, and requiring IT-provisioned storage.
Atmos replaces this approach with one that provides a single storage pool: a single system that spans locations, with automated policies, self-service access, and an architecture that supports many tenants across a single system.


Atmos excels at efficiently storing and managing distributed Big Data. Atmos installations scale out seamlessly and automate data placement, protection, and services. Atmos provides easy access across networks and platforms, as well as metering and self-service across tenants.


Atmos can be deployed in two ways. For those who have existing VMware-certified storage environments, the Atmos Virtual Edition software-only deployment allows you to transform existing IT investments into a cloud. Or you can choose an integrated Atmos software/hardware solution. This option offers all the power of Atmos software on purpose-built, low-cost, high-density hardware.


Traditional analytics focused on structured data only.
For structured data, EMC offers the Greenplum database for Big Data analytics.
For semi-structured data, EMC provides Greenplum Hadoop.


Consider the EMC Greenplum Data Computing Appliance, or DCA. The DCA is a massively parallel architecture with five times the data loading capability of the nearest competitor. This platform has stunning scalability.


The EMC Greenplum database is the industry's most scalable analytic database. It features a shared-nothing architecture, in stark contrast to Oracle and DB2. Operations are extremely simple: you just load data, and Greenplum's automated parallelization and tuning provide the rest; no partitioning is required. If you need to scale, simply add nodes; you get storage, performance, and load bandwidth entirely in software. The Greenplum DCA fully leverages the industry-standard x86 platform.


The combination of the Greenplum database (GPDB) and Hadoop provides the best of both worlds. Both GPDB and Hadoop move analytics to the data, a key requirement for Big Data analytics. GPDB with Hadoop delivers a powerful solution for the analytics of structured, semi-structured, and unstructured data.
A customer can perform complex, high-speed, interactive analytics using GPDB, as well as stream data directly from Hadoop into GPDB to incorporate unstructured or semi-structured data in the analysis shown here.
Hadoop also can be used to transform unstructured and semi-structured data into a structured format that can then be fed into GPDB for high-speed, interactive querying. In addition, Greenplum Chorus can support analytic productivity and tool integration that encourages collaboration.


Consider this brief overview of the MAD approach to analytics. This is one way organizations are responding to the need of analysts and data scientists to have more control over data, establishing an analytic sandbox in which to perform more sophisticated types of analysis in a flexible manner.
The key areas of the MAD approach are Magnetic, Agile, and Deep. Magnetic, in the sense that traditional EDW approaches "repel" new data sources, discouraging their incorporation until the sources are carefully cleansed and integrated. A data warehouse can keep pace today only by being "magnetic": attracting all the data sources that crop up within an organization, regardless of data quality niceties.
Agile, in the sense that data warehousing orthodoxy is based on long-range, careful design and planning. A modern warehouse must instead allow analysts to easily ingest, digest, produce, and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution.

Deep, in that modern data analyses involve increasingly sophisticated statistical methods that
go well beyond the rollups and drilldowns of traditional BI. Moreover, analysts often need to
see both the forest and the trees in running these algorithms. The modern data warehouse
should serve both as a deep data repository and as a sophisticated algorithmic runtime
engine.


Consider these four elements of the EMC Big Data Stack. Against each one of them, we see the key attributes for building a next-generation Big Data stack, a stack dramatically different from today's data management solutions and data warehouses. A Big Data stack must be able to operate at multi-petabyte scale, handle structured and unstructured data, operate in real time, and be increasingly collaborative across the enterprise.


Would you like to know more about Big Data and Data Science? Consider signing up for the
Data Science and Big Data Analytics course.


Please take a moment to complete these "Check Your Knowledge" questions.


This course covered the definitions of big data and big data analytics, and described the role of the data scientist and the tools that she uses to derive business value from big data.
EMC supports the Big Data stack with several products at multiple layers, such as Atmos and Isilon at the storage level, Greenplum DB with in-database analytics at the analysis level, and Documentum xCP and Greenplum Chorus at the collaboration level.
This concludes the training. Proceed to the course assessment on the next slide.

