
Big Data Approaches

Making Sense of Big Data

Ian Crosland
Jan 2016
Accelerate Big Data ROI

Even firms that are investing in Big Data are still struggling to get the most from it.

Make Big Data Accessible
Deliver Big Data In Context
Keep Big Data Relevant

Qlik's platform drives higher ROI by delivering Big Data in context with
other data to ensure that Big Data stays relevant.
Partner for Success

A strong culture of Partnership

Broad range of technology partnerships

Dedicated staff ensures continued focus

Continuous evaluation of new market entrants

Partnership areas: Hadoop, accelerators, databases, domain, and more
Hadoop

Apache Hadoop v2 stack (top to bottom):

Applications: Hive (SQL), Pig (ETL), Mahout (ML), Giraph (Graph); also Hive 1.2.0, Tez, Cascading, etc.
Compute: MapReduce and other compute engines (Tez, Spark, etc.); HBase
Resource management: YARN (Cluster Resource Manager)
Storage: HDFS2
Stinger Initiative

Hive on Tez
YARN integration
Distributed execution framework
Eliminates extra map reads
Dataflow model on a DAG of nodes

Query optimisations
Vectorised query execution
Filter at the storage layer vs. the SQL engine
SQL cost-based optimiser

ORCFile format
Higher compression
Columnar
Ideal for frequent fact filters

145 developers, 44 companies
Connect via ODBC (see the sketch below)

Source: http://hortonworks.com/labs/stinger/
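As a rough illustration of the "connect via ODBC" point above, here is a minimal Python sketch that queries Hive through an ODBC DSN and stages a table in ORC format; the DSN name, table names and columns are hypothetical.

```python
# Minimal sketch: query Hive over ODBC and stage a table in ORC format.
# Assumes a Hive ODBC driver is installed and a DSN named "HiveDSN" exists
# (the DSN name, table names and columns are hypothetical).
import pyodbc

conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cur = conn.cursor()

# Store a copy of a raw table as ORC to benefit from the columnar layout
# and higher compression described on the Stinger slide.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_orc
    STORED AS ORC
    AS SELECT * FROM sales_raw
""")

# Filters on frequently used fact columns can now be applied at the storage layer.
cur.execute("SELECT region, SUM(amount) FROM sales_orc GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```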
Impala

Parquet file format
Driven by Twitter use cases
Columnar data storage
Limits I/O to the data needed
Space saving

Metastore
Can be the same DB as the Hive metastore, e.g. MySQL
Query optimiser can use table/column stats

Impala
Can use HBase/HDFS with several file formats, e.g. RCFile/Parquet
SQL cost-based optimisations
Authentication via AD/Kerberos
YARN integration
In-memory caching
Connect via ODBC (see the sketch below)

Impala Roadmap
Additional SQL support
S3 integration
Nested data

Source: http://blog.cloudera.com/blog/2014/08/whats-next-for-impala-focus-on-advanced-sql-functionality/
Other sources: http://blog.cloudera.com/blog/2015/02/how-to-do-real-time-big-data-discovery-using-cloudera-enterprise-and-qlik-sense/
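To make the bullets above concrete, here is a minimal sketch (not from the deck) of querying Impala over ODBC, keeping data in Parquet and refreshing statistics for the cost-based optimiser; the DSN, table and column names are hypothetical.

```python
# Minimal sketch: query Impala over ODBC and keep data in Parquet.
# Assumes an Impala ODBC driver and a DSN named "ImpalaDSN"; the table and
# column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=ImpalaDSN", autocommit=True)
cur = conn.cursor()

# Parquet's columnar layout limits I/O to the columns a query touches.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_parquet
    STORED AS PARQUET
    AS SELECT * FROM events_raw
""")

# Refresh table/column statistics so the cost-based optimiser can use them.
cur.execute("COMPUTE STATS events_parquet")

cur.execute("SELECT event_type, COUNT(*) FROM events_parquet GROUP BY event_type")
print(cur.fetchall())
conn.close()
```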
Apache Drill

Dynamic schema discovery
Does not require a schema/type spec to start query execution
Leverages self-describing data formats, e.g. Parquet, Avro, JSON and NoSQL DBs
Flexible data model built for complex/semi-structured data
Connect via ODBC (see the sketch below)

Performance
Distributed execution engine for query processing
Columnar execution avoids disk access for columns not in the query
Vectorisation allows the CPU to operate on record batches
Optimistic and pipelined query execution

Architecture: ODBC/JDBC clients connect to the Drill SQL query layer and execution engine, which reads from files (JSON), HBase and Hive (Metastore, SerDes)

Data sources
Drill creates a virtual view in JSON
Nested data support
MapR-FS, HDFS, HBase
JSON, MongoDB/NoSQL

Hive UDFs
Can use the Hive Metastore
Reuse Hive UDFs

Source: https://www.mapr.com/products/apache-drill
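A minimal sketch of Drill's dynamic schema discovery, assuming a Drill ODBC DSN named "DrillDSN" and a hypothetical tweets.json file: the query runs directly against the raw JSON with no schema declared up front.

```python
# Minimal sketch: query a raw JSON file directly through Drill over ODBC,
# with no schema defined in advance. Assumes a Drill ODBC DSN named
# "DrillDSN"; the file path and field names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=DrillDSN", autocommit=True)
cur = conn.cursor()

# Drill discovers the schema at query time from the self-describing file.
cur.execute("""
    SELECT t.lang, COUNT(*) AS n
    FROM dfs.`/data/tweets.json` AS t
    GROUP BY t.lang
    ORDER BY n DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
conn.close()
```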
Spark

Resilient Distributed Datasets
In memory; MapReduce does not lend itself to interactive/ad-hoc queries
Logical collection of data partitioned across machines
Can reference external datasets

SparkSQL
DataFrame: a distributed collection of data in named columns
Supported on RDDs, Parquet, JSON, Hive and JDBC sources
YARN integration
Connect via ODBC (see the sketch below)

Market
Spark is distributed with ALL Hadoop distros
Not just Big Data use cases

Components on the Spark Core Engine: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph computation), SparkR (R on Spark, alpha/pre-alpha)

Source: https://databricks.com/spark/about
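The SparkSQL bullets above can be made concrete with a short PySpark sketch; the file paths and fields are hypothetical, and PySpark 2.x or later is assumed.

```python
# Minimal sketch: build a SparkSQL DataFrame from a JSON source and
# register it for SQL queries. Assumes PySpark 2.x+; the file paths and
# field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweets-demo").getOrCreate()

# DataFrames are distributed collections of data in named columns,
# readable from Parquet, JSON, Hive, JDBC and other sources.
tweets = spark.read.json("/data/tweets.json")
tweets.printSchema()

tweets.createOrReplaceTempView("tweets")
spark.sql("""
    SELECT lang, COUNT(*) AS n
    FROM tweets
    GROUP BY lang
    ORDER BY n DESC
""").show(10)

# Persist a columnar copy for faster repeated analytics.
tweets.write.mode("overwrite").parquet("/data/tweets_parquet")

spark.stop()
```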
Deeper Dive

Qlik Big Data Methodologies

Different data volumes and complexities are best met using different methods.

Different methods ensure an optimized experience for the user in every situation
Methods can be combined to meet different use cases
Methods vary in deployment complexity

Methods, by data volume and app complexity: In-Memory, Segmentation, Direct Discovery, Chaining, On Demand App Generation

Data volume: size (rows), dimensions (columns), cardinality (uniqueness)
App complexity: computational complexity such as set analysis, object density
On Demand App Generation

A shopping-cart approach to analytics
Dimension selection to generate filtered analytics
On-demand data slices
Converting Big Data to small-data analytics
Driven by business users, governed by IT
Sample QlikView Process

Selection App
Dimensions in list boxes
Conditional show applied to a button to limit the amount of data
Action to invoke the analysis document with parameters

Index Technique
ASPX page invoked with the selection criteria/user name
EDX parameters passed as a separated, pipe-terminated list

Analysis App
Analysis app produced by an EDX task
Data slice limited by populating a WHERE clause with the parameters (see the sketch below)
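A minimal Python sketch of the indexing step above, showing how governed selections might be turned into a pipe-terminated EDX parameter list and the WHERE clause that limits the data slice; the ASPX page name, parameter format and field names are hypothetical, and the actual QMS/EDX invocation is not shown.

```python
# Minimal sketch of the indexing step: turn governed user selections into
# a pipe-terminated EDX parameter list and the WHERE clause that limits
# the analysis app's data slice. Names and formats are hypothetical.
from urllib.parse import urlencode

selections = {"Region": ["EMEA", "APAC"], "Year": ["2015"]}
user_name = "jdoe"

# Pipe-terminated parameter list passed to the EDX task.
edx_params = "|".join(
    f"{field}={','.join(values)}" for field, values in selections.items()
) + "|"

# Query string for the (hypothetical) ASPX page that triggers the EDX task.
aspx_url = "http://qvserver/StartAnalysis.aspx?" + urlencode(
    {"user": user_name, "params": edx_params}
)

# WHERE clause used by the analysis app's load to limit the data slice.
where_clause = " AND ".join(
    "{} IN ({})".format(field, ", ".join(f"'{v}'" for v in values))
    for field, values in selections.items()
)

print(aspx_url)
print("WHERE " + where_clause)
```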
Sample QlikView Server Process

1. Selection App with dimensional data (aggregated data and dimensions) is deployed to the access point
2. Business user makes multiple, personalized data selections from across many data sources
3. Governed selections drive the indexing of the analysis app: user selection criteria drive creation of the analysis app
4. EDX process indexes the analysis app with the latest data from the source(s); data from multiple data sources is indexed into a template app based on the user selections
5. Associative analysis app is available to the business user: a highly interactive user experience within a purpose-built analysis app
Case Study - Telco

1. Selection App is populated with dimensional data on a schedule
2. User selects dimensional criteria; once the governed limit is reached, an ASPX page is invoked
3. The QMS API and EDX index the analysis app with the most recent data from the Teradata database, containing only the data slice relevant to the user
4. Analysis app is deployed to the access point with user security

Components: Access Point, ASPX page on IIS, Selection App, QMS API, QlikView Server, QlikView Management Service, EDX, Publisher, Analysis App
Qlik Sense and Elastic Tweets Example

1. Tweets are populated into an Elastic DB via Logstash
2. User searches for Tweets stored in the Elastic DB from a custom web page (Index.html); see the search sketch below
3. A NodeJS container using the Sense Proxy, Engine and Repository APIs indexes the analysis app with the Tweets returned by the search against the Elastic DB; the data slice contains only data relevant to the user
4. Analysis app is updated and published to a shared stream

Components: web page, NodeJS container (Proxy, Engine and Repository APIs), Qlik Sense Server, REST connector, Analysis App
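The search step can be illustrated with a short sketch against the standard Elasticsearch search API. The deck uses a NodeJS container; Python is used here only for illustration, and the host, index name and field are hypothetical.

```python
# Minimal sketch of the search step: pull Tweets matching a user's search
# terms out of the Elastic DB that Logstash populated. The host, index
# name and field are hypothetical.
import requests

search_terms = "big data"

resp = requests.get(
    "http://localhost:9200/tweets/_search",
    json={
        "size": 1000,
        "query": {"match": {"text": search_terms}},
    },
    timeout=10,
)
resp.raise_for_status()

hits = resp.json()["hits"]["hits"]
# This slice of Tweets is what gets loaded into the analysis app
# (e.g. via the Qlik REST connector or the Engine API).
for hit in hits[:5]:
    print(hit["_source"].get("text"))
```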
Architecture

Authenticate (External Authentication via the Qlik Sense Proxy API, port 4243)
Gets the client certificate
Generates an XrfKey
Sends a request for a ticket (see the sketch below)
Authenticates to Sense with the ticket

Generate and Publish App (Qlik Sense Repository API, port 4242, and Qlik Sense Engine API)
Accesses the index template app
Copies the template and renames it with the filter added to the app name
Replaces the $search_terms$ placeholder in the script with the filter
Generates the app with data and publishes it to the Everyone stream
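A minimal sketch of the authenticate flow above, requesting a session ticket from the Qlik Sense Proxy API with the client certificate and an XrfKey; the host, certificate paths and user details are hypothetical, and endpoint details may vary by Sense version.

```python
# Minimal sketch of the authenticate step: use the exported client
# certificate and an XrfKey to request a session ticket from the Qlik
# Sense Proxy API (port 4243). Host, certificate paths and user details
# are hypothetical; endpoints may differ by Sense version.
import random
import string
import requests

host = "sense.example.com"
xrfkey = "".join(random.choices(string.ascii_letters + string.digits, k=16))

resp = requests.post(
    f"https://{host}:4243/qps/ticket?xrfkey={xrfkey}",
    headers={"X-Qlik-Xrfkey": xrfkey, "Content-Type": "application/json"},
    json={"UserDirectory": "EXAMPLE", "UserId": "jdoe", "Attributes": []},
    cert=("client.pem", "client_key.pem"),   # exported Sense client certificate
    verify="root.pem",                       # Sense root certificate
)
resp.raise_for_status()

ticket = resp.json()["Ticket"]
# The ticket is then appended (qlikTicket=<ticket>) to the first request
# to the hub or an app URL to establish the user's session.
print("Got ticket:", ticket)
```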
Thank you
