Information Management Skills Development

Big Data
Module ID

10500

Length

1 hour

For questions about this presentation contact askdata@ca.ibm.com

12/13/16
2015 IBM Corporation

Disclaimer
Copyright IBM Corporation 2015. All rights reserved.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES
ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE
INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED "AS IS" WITHOUT WARRANTY OF
ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT
PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM
SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE
RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS
PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR
REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND
CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR
SOFTWARE.
IBM, the IBM logo, ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked
on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S.
registered or common law trademarks owned by IBM at the time this information was published. Such trademarks
may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available
on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.


Module Information
After completing this module, you should be able to:
Understand what big data is
Understand some use cases
Understand what's included in IBM's Big Data platform
Understand the Open Data Platform initiative
Understand IBM's Open Platform with Apache Hadoop
Understand IBM BigInsights v4 components and packaging structure


What is BIG DATA?


All kinds of data
Large volumes
Valuable insight, but difficult to extract
May be extremely time sensitive
Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data

"Big Data technologies describe a new generation of technologies and architectures, designed
to economically extract value from very large volumes of a wide variety of data, by enabling
high velocity capture, discovery and/or analysis."
Source: Matt Eastwood, IDC

Information is at the Center of a New Wave of Opportunity
2.5 million items per minute
5 TB per flight
300,000 tweets per minute
> 1 PB per day (gas turbines)
200 million emails per minute
220,000 photos per minute
Velocity, Variety, Volume
80% of the world's data is unstructured
2012: 2.8 zettabytes
2020: 40 zettabytes

... and Organizations Need Deeper Insights
1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
1 in 2 business leaders say they don't have access to the information they need to do their jobs
83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to enhance competitiveness
60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
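
The two volume figures on this slide (2.8 zettabytes in 2012, 40 zettabytes in 2020) imply a yearly growth rate. The sketch below works it out, assuming smooth compound growth between the two years; that assumption, and the function name `compound_annual_growth`, are illustrative and not part of the original material.

```python
def compound_annual_growth(start, end, years):
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


# Figures taken from the slide; smooth compounding is an assumption.
rate = compound_annual_growth(start=2.8, end=40.0, years=2020 - 2012)
print(f"Implied growth: {rate:.1%} per year")  # about 39.4% per year
```

At that rate, the total volume roughly doubles every two years or so.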


Big Data means Big Opportunities


Extract insight from a high volume, variety and velocity of data in a
timely and cost-effective manner

Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes

What we hear from customers . . . .


Lots of potentially valuable data is dormant or discarded due to size/performance considerations
Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, ...)
Not clear what should be analyzed (exploratory, iterative)
Information distributed across multiple systems and/or Internet
Some information has a short useful lifespan
Volumes can be extremely high


Merging the Traditional and Big Data approaches


Traditional Approach: Structured & Repeatable Analysis
Business Users: Determine what question to ask
IT: Structures the data to answer that question
Examples: Monthly sales reports, Profitability analysis, Customer surveys

Big Data Approach: Iterative & Exploratory Analysis
IT: Delivers a platform to enable creative discovery
Business: Explores what questions could be asked
Examples: Brand sentiment, Product strategy, Maximum asset utilization


The 5 Key Big Data Use Cases

Big Data Exploration
Find, visualize, understand all big data to improve decision making

Enhanced 360 Degree View of the Customer
Extend existing customer views by incorporating additional internal and external data sources

Operations Analysis
Analyze a variety of machine data for improved business results

Security/Intelligence Extension
Lower risk, detect fraud and monitor cyber security in real time

Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency

Also Spatial Analytics, statistics, Text analytics, Machine Learning, Audio/Video/Image Analysis and more.

Big Data scenarios span many industries


Multi-channel customer sentiment and experience analysis
Detect life-threatening conditions at hospitals in time to intervene
Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
Make risk decisions based on real-time transactional data
Identify criminals and threats from disparate video, audio, and data feeds

Constant Contact: Transforming Email Marketing Campaign Effectiveness with IBM Big Data Capabilities

Solution components
IBM BigInsights, IBM PureData for Analytics powered by Netezza technology, Cognos BI

Need
Analyze 35 billion annual emails to guide customers on best dates & times to send emails for maximum response

Benefits
40 times improvement in analysis performance
15-25% performance increase in customer email campaigns
Analysis time reduced from hours to seconds


Vestas optimizes capital investments based on 2.5 Petabytes of information

Need
Model the weather to optimize placement of turbines, maximizing power generation and longevity

Benefits
Reduce time required to identify placement of turbine from weeks to hours
Reduces IT footprint and costs, and decreases energy consumption by 40%, while increasing computational power
Incorporate 2.5 PB of structured and semi-structured information flows. Data volume expected to grow to 6 PB


Teikoku Databank: Cuts time needed to process billions of textual data items from several days to 30 minutes

Accelerates processing of textual data, speeding delivery of information to clients
Analyzes 4.75 times more data, enhancing corporate credit offering to customers
Differentiates the company, providing a significant competitive advantage

The transformation: By effectively combining proprietary data with big data from the Internet, the company maximizes utilization
of the available data sets to deliver more detailed information to customers.

"With IBM BigInsights, it has become possible to process billions of items of textual data in 30 minutes."
Mr. Satoshi Kitajima, an MBA Statistician in the SPECIA Team of the Business Analytics Division of the Market and Business
Intelligence Department, Teikoku Databank

Solution components
Software: IBM BigInsights


Brocade: Delivers end-to-end big data solutions to accelerate analysis and improve customer satisfaction

Unparalleled insight to help identify and address customer needs in near real time
Increases satisfaction and lowers customer churn for organizations across industries
Improves service levels across the network, and enables optimization of IT resources

The transformation: Continuous analysis of huge volumes of in-motion data enables business insights at unprecedented
speeds.

"We can deliver a cost-effective, scalable and high-performance big data platform that helps
organizations uncover new business opportunities."
Mike Harrison, Vice President, Brocade

Solution components
Software: IBM InfoSphere Streams, IBM BigInsights for Apache Hadoop
Hardware: IBM System x


YaData Solutions: Unearths insights about big data

93% reduction in customized marketing cycle time, from two weeks to 24 hours for an entertainment company
Provides insights to customers across all industries, helping companies make faster and better decisions
Reduces customer churn by identifying insights that can potentially save customers millions of dollars

The transformation: Builds a Semantic Pattern Accelerator platform for big data, potentially saving customers millions of
dollars

"We take the best of everything IBM offers, place it in a big data platform, integrate tools using a semantic
model and use parametric relationships that leverage IBM analytic engines in order to create discovery."
Sean O'Brien, Principal, YaData Solutions Inc.

Solution components
Software: IBM BigInsights, IBM Cloud Computing, IBM Mobile Enterprise, IBM Expert Integrated Systems, IBM Social Business
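
The 93% figure quoted for YaData can be checked from the two durations given on the slide (two weeks down to 24 hours). This is a small arithmetic sketch, assuming "two weeks" means 14 calendar days of 24 hours each; the variable names are illustrative.

```python
HOURS_PER_DAY = 24

before_hours = 14 * HOURS_PER_DAY  # assumption: "two weeks" = 14 calendar days
after_hours = 24                   # cycle time after the change, from the slide

reduction = 1 - after_hours / before_hours
print(f"Cycle time reduction: {reduction:.1%}")  # 92.9%, which rounds to 93%
```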

IBM's approach


IBM's Big Data strategy

Integrate and Manage the full variety, velocity and volume of Big Data

Apply advanced analytics to information in its native form

Visualize all available data for ad-hoc analysis

Development environment for building new analytic applications

Support workload optimization and scheduling

Provide for security and governance

Integrate with enterprise software


IBM Big Data Platform and Application Framework


Gather, Extract and Explore data using best of breed visualization
Cost-effectively Analyze Petabytes of structured and unstructured information
Govern data quality and Manage information lifecycle
Speed time to value with analytic and application accelerators
Analyze streaming data and large data bursts for real-time insights
Deliver deep insight with advanced in-database analytics and operational analytics

Analytic Applications
BI / Reporting, Exploration / Visualization, Industry App, Predictive Analytics, Content Analytics, ...

IBM Big Data Platform
Visualization & Discovery, Application Development, Systems Management
Accelerators
Hadoop System, Stream Computing, Data Warehouse
Information Integration & Governance


Hadoop and the Enterprise


Ingestion and Real-time Analytic Zone: Streams
Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase, Col Stores, Documents in variety of formats
Warehousing Zone: Enterprise Warehouse, Data Marts
Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery
Metadata and Governance Zone: ETL, MDM, Data Governance
Connectors


IBM BigInsights for Apache Hadoop


Analytical platform for persistent Big Data
100% open source core with add-on IBM technologies for Data Analysts, Data Scientists, and Enterprise Administrators
On premise installation or cloud offerings
Distinguishing characteristics
Built-in analytics: Enhances business knowledge
Enterprise software integration: Complements and extends existing capabilities
Production-ready platform: Speeds time-to-value

IBM advantage
Combination of software, hardware, services and


Overview of IBM BigInsights 4.0


IBM BigInsights Analyst
Industry Standard SQL (Big SQL)
Spreadsheet-style tool (Big Sheets)

IBM BigInsights Data Scientist
Text Analytics
Machine Learning on Big R
Big R (R support)
Big SQL
Big Sheets

IBM BigInsights Enterprise Management
POSIX Distributed Filesystem
Multi-workload, Multi-tenant Scheduling
...

Free Quick Start Edition (non production):
IBM Open Platform
BigInsights Analyst, Data Scientist features
Community Support

IBM Open Platform with Apache Hadoop*
(HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig,
Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider)
*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution.
IBM will include the Open Data Platform common kernel once available.


Open Data Platform Overview


Develop an industry standard Big Data Management Platform with Hadoop
What is ODP?
Open-source, non-profit entity
A focused, committed investment in evolving the current state of the platform
Delivering a foundation certified, packaged, and tested Reference Distribution
Simplifies upstream and downstream qualification efforts: "test once, use everywhere"
The entire industry is enabled to create big data offerings using this reference implementation
Apache: Complementary to the great work happening today
Apache creates source artifacts inside projects; the foundation will create a reference
implementation of the fully integrated platform


Open Data Platform Initiative


Why is IBM involved?
Strong history of leadership in open source & standards
Supports our commitment to open source currency in all future releases
Accelerates our innovation within Hadoop & surrounding applications

Open Data Platform (ODP) and Apache Software Foundation (ASF)
ODP supports the ASF mission
ASF provides a governance model around individual projects without looking at the ecosystem
ODP aims to provide a vendor-led consistent packaging model for core Apache components as an ecosystem


ODP Impact in IBM Open Platform with Apache Hadoop


BigInsights Version 4.0 release will not feature ODP
ODP will initially feature bare minimum components
HDFS, YARN, MapReduce, Ambari
Will expand over time
Expect incorporation in BigInsights by Q2 2015
Goals for BigInsights on ODP
Better compatibility and less testing against ecosystem software (e.g., SAS)
Enable value-adds to run on other ODP-Certified Hadoop distros

Standard Apache Hadoop Open Source Components
HDFS, MapReduce, Spark, Hive, HCatalog, Pig, YARN, Ambari, HBase, Flume, Sqoop, Solr/Lucene
ODP (future)


Three Things to Know about IBM BigInsights 4.0


1. More flexible packaging!
IBM InfoSphere BigInsights is now IBM BigInsights for Apache Hadoop
Data Science positioning resonating with market
Open Source components are available independently from IBM's value-added capabilities

2. Features Data Scientists and Analysts will LOVE!
Machine Learning using Big R for Data Scientists
Big SQL enhancements for analysts & developers
Current Open Source Apache packages

3. IBM is committed to open source!
IBM named platinum sponsor of Open Data Platform initiative
Goal is to drive both open source and standards to accelerate innovation

Enabling Roles with BigInsights 4.0 Modules

Persona: Business Analyst
Need: Discover data for analysis; visualize data for action; reduce learning curve by leveraging existing skills (SQL, Spreadsheets)
IBM Value: Complete and Fast. Big SQL runs 100% of Hadoop-DS queries, with 3.6x faster query time than Impala (audited Hadoop-DS benchmark)

Persona: Data Scientist
Need: Identify patterns, trends, insights with machine learning algorithms; apply statistical models to large scale data
IBM Value: Customer Insight. Large financial services company analyzed 4 billion tweets and identified 110 million client profiles that matched with at least 90 percent precision

Persona: Administrator
Need: Manage workloads and schedule jobs to ensure performance; secure environment to reduce risk
IBM Value: Performance. 4x improvement in running MapReduce jobs (STAC report)
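
The Business Analyst need listed above, reusing existing SQL skills, comes down to writing ordinary SQL aggregations against cluster data. The sketch below only illustrates that kind of query. It uses Python's built-in `sqlite3` as a stand-in engine and does not call Big SQL; the `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database, NOT Big SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "2015-01", 120.0),
        ("East", "2015-02", 80.0),
        ("West", "2015-01", 250.0),
    ],
)

# A standard SQL aggregation of the kind an analyst already knows how to write.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('West', 250.0), ('East', 200.0)]
```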


New Capabilities in IBM BigInsights 4.0


1. BigInsights Data Scientist Module: Accelerate data science teams with advanced analytics to extract valuable insights from Hadoop
Big R: Statistical analysis & distributed frames using entire Hadoop cluster
Machine Learning: Machine Learning algorithms in R optimized for Hadoop
Text Analytics: Text extraction via business web tooling

2. BigInsights Analyst Module: Leverage existing skills to find and visualize data across all sources including Hadoop
Big SQL: HBase support, High Availability
Big Sheets: Geospatial support & Big SQL Integration

3. BigInsights Enterprise Management Module: Ensure scalability, performance and security of Hadoop clusters
Multi-tenant scheduling
Multi-instance support with data isolation

4. IBM Open Platform with Apache Hadoop: Free (with optional paid support) product use version of 100% Open Source Apache Hadoop distribution
Apache Spark: High-performance and flexible big data processing framework
Apache Ambari: Hadoop cluster administration GUI
Open Source: Currency updates, including Hadoop 2.6

About the IBM Open Platform with Apache Hadoop 4.0


All existing components from v3.0 have been updated to the latest versions
Native support for rolling upgrades for Hadoop services
Completed or in process application work restarts at point of restart, not beginning

Support for long-running applications within YARN for enhanced reliability & security
Heterogeneous storage in HDFS for in-memory, SSD in addition to HDD
Optimize applications based on data access speed required

New Open Source Additions
Spark 1.2.1: In-memory distributed compute engine
Dramatic performance increases over MapReduce
Emerging capabilities for streaming, SQL, machine learning & graph processing
Simplifies developer experience, leveraging Java, Python & Scala languages
Key capability for advanced Hadoop & Data Scientist users
Ambari 1.7: Operational framework for provisioning, managing & monitoring Apache Hadoop clusters
Installation of individual components increases speed to value
IBM Open Platform with Apache Hadoop
Hadoop (HDFS/MapReduce/YARN*), Ambari*, Avro, Flume, HBase, Hive, Knox, Open JDK, Oozie, Pig, Parquet, Sqoop, Snappy, Solr, Slider, Spark, Zookeeper

About the IBM Open Platform with Apache Hadoop


Decouple Apache Hadoop from IBM value-add technologies
Separate, distinct module for Hadoop foundation
100% open source code
Commitment to currency: days, not months
Flexible platform for processing large volumes of data
Includes Apache Hadoop and many popular open source projects in the Hadoop Ecosystem
Supports wide variety of data sources
Supports variety of popular APIs (industry-standard SQL, MapReduce, etc.)
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
CPU + disks = node
Nodes can be combined into clusters
New nodes can be added as needed without changing data formats
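
The node and cluster model above is what the MapReduce API builds on: each node processes its own share of the data, and the partial results are then grouped and combined. The sketch below imitates that flow in plain Python on a single machine. It is a conceptual illustration only and does not use Hadoop; the function names and the sample partitions are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Emit (word, 1) for every word in one node's share of the data."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the grouped values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(partitions):
    """Run map on every partition, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_phase(p) for p in partitions)
    return reduce_phase(shuffle(mapped))


# Hypothetical data: each inner list stands for the lines held on one node.
partitions = [
    ["big data big insights"],
    ["open data platform", "big sql"],
]
counts = word_count(partitions)
print(counts["big"], counts["data"])  # 3 2
```

Adding another inner list to `partitions` models adding a node: the map, shuffle, and reduce functions stay the same.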


Open Source Currency


Timely Updates as new open source versions released
Component levels as of March 2015:

Install only components you need / want


Ambari approach expected to align with future direction of the Open Data Platform Initiative
(industry consortium dedicated to development of a common Hadoop core)


Additional open source components


Optional projects with non-Apache licensing agreements
Use as desired; complements IBM Open Platform with Apache Hadoop


IBM BigInsights for Apache Hadoop Offering Suite

IBM BigInsights v4

Apache Hadoop Stack: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider
Big SQL: 100% ANSI compliant, high-performance, secure SQL engine
Big Sheets: spreadsheet-like interface for discovery & visualization
Big R: advanced statistical & data mining
Machine Learning with Big R: machine learning algorithms apply to Hadoop data set
Advanced Text Analytics: visual tooling to annotate automated text extraction
Enterprise Mgmt: Enhanced cluster & resource mgmt & POSIX-compliant file systems
Governance Catalog

Offerings: IBM Open Platform with Apache Hadoop; Elite Support for IBM Open Platform with Apache Hadoop; BigInsights Quick Start Edition; BigInsights Analyst Module; BigInsights Data Scientist Module; BigInsights Enterprise Management Module; BigInsights for Apache Hadoop

* Paid support for IBM Open Platform with Apache Hadoop required for BigInsights modules

Pricing & Licensing


Products: BigInsights Quick Start; IBM Open Platform with Apache Hadoop; Elite Support for IBM Open Platform with Apache Hadoop; BigInsights Analyst Module; BigInsights Data Scientist Module; BigInsights Enterprise Management Module; BigInsights for Apache Hadoop
Pricing Terms: Free; Free; Yearly Subscription Only; Perpetual or Monthly License
Support provided: Community; Community; IBM 24x7 support
Usage License: Nonproduction, five node cap; Production Usage
Pricing Model: Free; Free; Node based pricing
Access via: ibm.com/hadoop; Passport Advantage

Community support via Hadoop Dev
Over 100,000 visitors since inception
Modeled after StackOverflow, most popular developer Q&A site on web


Summary
In this Module you learned about:
Big Data
IBM Open Platform with Apache Hadoop
IBM BigInsights 4.0 ( IBM BigInsights for Apache Hadoop )
Open Data Platform (ODP) initiative


The next steps


The Next Steps


Complete the online quiz for this module
Log onto IBPOLP, go to My Learning page, and select the In Progress tab.
Find the module and select the quiz
Provide feedback on the module
Log onto IBPOLP, go to My Learning page
Find the module and select the Leave Feedback button to leave your comments


The Next Steps


The next set of modules to consider:
10501 Hadoop
10502 YARN/MapReduce2
Additional Reading Material
IBM Open Platform with Apache Hadoop 4.0 Documentation
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.welcome.doc/doc/welcome.html
InfoSphere BigInsights for Hadoop Community
https://developer.ibm.com/hadoop/
Big Data University
www.bigdatauniversity.com


Questions?
askdata@ca.ibm.com
