
Big Data Approaches

Making Sense of Big Data

Ian Crosland
Jan 2016
Accelerate Big Data ROI

Even firms that are investing in Big Data are still struggling to get the most from it.

Make Big Data Accessible
Deliver Big Data In Context
Keep Big Data Relevant

Qlik's platform drives higher ROI by delivering Big Data in context with
other data to ensure that Big Data stays relevant.
Partner for Success

A strong culture of Partnership

Broad range of technology partnerships

Dedicated staff ensures continued focus

Continuous evaluation of new market entrants

Partnership areas: Hadoop, accelerators, databases, domain, and more
Hadoop

Apache Hadoop v2 stack (top to bottom):

Applications: Hive (SQL), Pig (ETL), Mahout (ML), Giraph (Graph); also Hive 1.2.0, Tez, Cascading, etc.
Compute: MapReduce and other compute engines (Tez, Spark, etc.); HBase
Resource management: YARN (Cluster Resource Manager)
Storage: HDFS2
Stinger Initiative

Hive on Tez
YARN integration
Distributed execution framework
Eliminates extra map reads
Dataflow model on a DAG of nodes

Query optimisations
Vectorised query execution
Filter at the storage layer vs. the SQL engine
SQL cost-based optimiser

ORCFile format
Higher compression
Columnar
Ideal for frequent fact filters

145 developers, 44 companies
Connect via ODBC (see the sketch below)

Source: http://hortonworks.com/labs/stinger/
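As a rough illustration of the "connect via ODBC" point above, here is a minimal Python sketch that queries Hive through an ODBC DSN and stages a table in ORC format; the DSN name, table names and columns are hypothetical.

```python
# Minimal sketch: query Hive over ODBC and stage a table in ORC format.
# Assumes a Hive ODBC driver is installed and a DSN named "HiveDSN" exists
# (the DSN name, table names and columns are hypothetical).
import pyodbc

conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cur = conn.cursor()

# Store a copy of a raw table as ORC to benefit from the columnar layout
# and higher compression described on the Stinger slide.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_orc
    STORED AS ORC
    AS SELECT * FROM sales_raw
""")

# Filters on frequently used fact columns can now be applied at the storage layer.
cur.execute("SELECT region, SUM(amount) FROM sales_orc GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)

conn.close()
```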
Impala

Parquet file format
Driven by Twitter use cases
Columnar data storage
Limits I/O to the data needed
Space saving

Metastore
Can be the same DB as the Hive metastore, e.g. MySQL
Query optimiser can use table/column stats

Impala
Can use HBase/HDFS with several file formats, e.g. RCFile/Parquet
SQL cost-based optimisations
Authentication via AD/Kerberos
YARN integration
In-memory caching
Connect via ODBC (see the sketch below)

Impala Roadmap
Additional SQL support
S3 integration
Nested data

Source: http://blog.cloudera.com/blog/2014/08/whats-next-for-impala-focus-on-advanced-sql-functionality/
Other sources: http://blog.cloudera.com/blog/2015/02/how-to-do-real-time-big-data-discovery-using-cloudera-enterprise-and-qlik-sense/
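To make the bullets above concrete, here is a minimal sketch (not from the deck) of querying Impala over ODBC, keeping data in Parquet and refreshing statistics for the cost-based optimiser; the DSN, table and column names are hypothetical.

```python
# Minimal sketch: query Impala over ODBC and keep data in Parquet.
# Assumes an Impala ODBC driver and a DSN named "ImpalaDSN"; the table and
# column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=ImpalaDSN", autocommit=True)
cur = conn.cursor()

# Parquet's columnar layout limits I/O to the columns a query touches.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_parquet
    STORED AS PARQUET
    AS SELECT * FROM events_raw
""")

# Refresh table/column statistics so the cost-based optimiser can use them.
cur.execute("COMPUTE STATS events_parquet")

cur.execute("SELECT event_type, COUNT(*) FROM events_parquet GROUP BY event_type")
print(cur.fetchall())
conn.close()
```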
Apache Drill

Dynamic schema discovery
Does not require a schema/type spec to start query execution
Leverages self-describing data formats, e.g. Parquet, Avro, JSON and NoSQL DBs
Flexible data model built for complex/semi-structured data
Connect via ODBC (see the sketch below)

Performance
Distributed execution engine for query processing
Columnar execution avoids disk access for columns not in the query
Vectorisation allows the CPU to operate on record batches
Optimistic and pipelined query execution

Architecture: ODBC/JDBC clients connect to the Drill SQL query layer and execution engine, which reads from files (JSON), HBase and Hive (Metastore, SerDes)

Data sources
Drill creates a virtual view in JSON
Nested data support
MapR-FS, HDFS, HBase
JSON, MongoDB/NoSQL

Hive UDFs
Can use the Hive Metastore
Reuse Hive UDFs

Source: https://www.mapr.com/products/apache-drill
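A minimal sketch of Drill's dynamic schema discovery, assuming a Drill ODBC DSN named "DrillDSN" and a hypothetical tweets.json file: the query runs directly against the raw JSON with no schema declared up front.

```python
# Minimal sketch: query a raw JSON file directly through Drill over ODBC,
# with no schema defined in advance. Assumes a Drill ODBC DSN named
# "DrillDSN"; the file path and field names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=DrillDSN", autocommit=True)
cur = conn.cursor()

# Drill discovers the schema at query time from the self-describing file.
cur.execute("""
    SELECT t.lang, COUNT(*) AS n
    FROM dfs.`/data/tweets.json` AS t
    GROUP BY t.lang
    ORDER BY n DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
conn.close()
```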
Spark

Resilient Distributed Datasets
In memory; MapReduce does not lend itself to interactive/ad-hoc queries
Logical collection of data partitioned across machines
Can reference external datasets

SparkSQL
DataFrame: a distributed collection of data in named columns
Supported on RDDs, Parquet, JSON, Hive and JDBC sources
YARN integration
Connect via ODBC (see the sketch below)

Market
Spark is distributed with ALL Hadoop distros
Not just Big Data use cases

Components on the Spark Core Engine: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph computation), SparkR (R on Spark, alpha/pre-alpha)

Source: https://databricks.com/spark/about
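The SparkSQL bullets above can be made concrete with a short PySpark sketch; the file paths and fields are hypothetical, and PySpark 2.x or later is assumed.

```python
# Minimal sketch: build a SparkSQL DataFrame from a JSON source and
# register it for SQL queries. Assumes PySpark 2.x+; the file paths and
# field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweets-demo").getOrCreate()

# DataFrames are distributed collections of data in named columns,
# readable from Parquet, JSON, Hive, JDBC and other sources.
tweets = spark.read.json("/data/tweets.json")
tweets.printSchema()

tweets.createOrReplaceTempView("tweets")
spark.sql("""
    SELECT lang, COUNT(*) AS n
    FROM tweets
    GROUP BY lang
    ORDER BY n DESC
""").show(10)

# Persist a columnar copy for faster repeated analytics.
tweets.write.mode("overwrite").parquet("/data/tweets_parquet")

spark.stop()
```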
Deeper Dive

Qlik Big Data Methodologies

Different data volumes and complexities are best met using different methods.

Different methods ensure an optimized experience for the user in every situation
Methods can be combined to meet different use cases
Methods vary in deployment complexity

Methods, by data volume and app complexity: In-Memory, Segmentation, Direct Discovery, Chaining, On Demand App Generation

Data volume: size (rows), dimensions (columns), cardinality (uniqueness)
App complexity: computational complexity such as set analysis, object density
On Demand App Generation

A shopping-cart approach to analytics
Dimension selection to generate filtered analytics
On-demand data slices
Converting Big Data to small-data analytics
Driven by business users, governed by IT
Sample QlikView Process

Selection App
Dimensions in list boxes
Conditional show applied to a button to limit the amount of data
Action to invoke the analysis document with parameters

Index Technique
ASPX page invoked with the selection criteria/user name
EDX parameters passed as a separated, pipe-terminated list

Analysis App
Analysis app produced by an EDX task
Data slice limited by populating a WHERE clause with the parameters (see the sketch below)
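A minimal Python sketch of the indexing step above, showing how governed selections might be turned into a pipe-terminated EDX parameter list and the WHERE clause that limits the data slice; the ASPX page name, parameter format and field names are hypothetical, and the actual QMS/EDX invocation is not shown.

```python
# Minimal sketch of the indexing step: turn governed user selections into
# a pipe-terminated EDX parameter list and the WHERE clause that limits
# the analysis app's data slice. Names and formats are hypothetical.
from urllib.parse import urlencode

selections = {"Region": ["EMEA", "APAC"], "Year": ["2015"]}
user_name = "jdoe"

# Pipe-terminated parameter list passed to the EDX task.
edx_params = "|".join(
    f"{field}={','.join(values)}" for field, values in selections.items()
) + "|"

# Query string for the (hypothetical) ASPX page that triggers the EDX task.
aspx_url = "http://qvserver/StartAnalysis.aspx?" + urlencode(
    {"user": user_name, "params": edx_params}
)

# WHERE clause used by the analysis app's load to limit the data slice.
where_clause = " AND ".join(
    "{} IN ({})".format(field, ", ".join(f"'{v}'" for v in values))
    for field, values in selections.items()
)

print(aspx_url)
print("WHERE " + where_clause)
```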
Sample QlikView Server Process

1. Selection App with dimensional data (aggregated data and dimensions) is deployed to the access point
2. Business user makes multiple, personalized data selections from across many data sources
3. Governed selections drive the indexing of the analysis app: user selection criteria drive creation of the analysis app
4. EDX process indexes the analysis app with the latest data from the source(s); data from multiple data sources is indexed into a template app based on the user selections
5. Associative analysis app is available to the business user: a highly interactive user experience within a purpose-built analysis app
Case Study - Telco

1. Selection App is populated with dimensional data on a schedule
2. User selects dimensional criteria; once the governed limit is reached, an ASPX page is invoked
3. The QMS API and EDX index the analysis app with the most recent data from the Teradata database, containing only the data slice relevant to the user
4. Analysis app is deployed to the access point with user security

Components: Access Point, ASPX page on IIS, Selection App, QMS API, QlikView Server, QlikView Management Service, EDX, Publisher, Analysis App
Qlik Sense and Elastic Tweets Example

1. Tweets are populated into an Elastic DB via Logstash
2. User searches for Tweets stored in the Elastic DB from a custom web page (Index.html); see the search sketch below
3. A NodeJS container using the Sense Proxy, Engine and Repository APIs indexes the analysis app with the Tweets returned by the search against the Elastic DB; the data slice contains only data relevant to the user
4. Analysis app is updated and published to a shared stream

Components: web page, NodeJS container (Proxy, Engine and Repository APIs), Qlik Sense Server, REST connector, Analysis App
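The search step can be illustrated with a short sketch against the standard Elasticsearch search API. The deck uses a NodeJS container; Python is used here only for illustration, and the host, index name and field are hypothetical.

```python
# Minimal sketch of the search step: pull Tweets matching a user's search
# terms out of the Elastic DB that Logstash populated. The host, index
# name and field are hypothetical.
import requests

search_terms = "big data"

resp = requests.get(
    "http://localhost:9200/tweets/_search",
    json={
        "size": 1000,
        "query": {"match": {"text": search_terms}},
    },
    timeout=10,
)
resp.raise_for_status()

hits = resp.json()["hits"]["hits"]
# This slice of Tweets is what gets loaded into the analysis app
# (e.g. via the Qlik REST connector or the Engine API).
for hit in hits[:5]:
    print(hit["_source"].get("text"))
```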
Architecture

Authenticate (External Authentication via the Qlik Sense Proxy API, port 4243)
Gets the client certificate
Generates an XrfKey
Sends a request for a ticket (see the sketch below)
Authenticates to Sense with the ticket

Generate and Publish App (Qlik Sense Repository API, port 4242, and Qlik Sense Engine API)
Accesses the index template app
Copies the template and renames it with the filter added to the app name
Replaces the $search_terms$ placeholder in the script with the filter
Generates the app with data and publishes it to the Everyone stream
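A minimal sketch of the authenticate flow above, requesting a session ticket from the Qlik Sense Proxy API with the client certificate and an XrfKey; the host, certificate paths and user details are hypothetical, and endpoint details may vary by Sense version.

```python
# Minimal sketch of the authenticate step: use the exported client
# certificate and an XrfKey to request a session ticket from the Qlik
# Sense Proxy API (port 4243). Host, certificate paths and user details
# are hypothetical; endpoints may differ by Sense version.
import random
import string
import requests

host = "sense.example.com"
xrfkey = "".join(random.choices(string.ascii_letters + string.digits, k=16))

resp = requests.post(
    f"https://{host}:4243/qps/ticket?xrfkey={xrfkey}",
    headers={"X-Qlik-Xrfkey": xrfkey, "Content-Type": "application/json"},
    json={"UserDirectory": "EXAMPLE", "UserId": "jdoe", "Attributes": []},
    cert=("client.pem", "client_key.pem"),   # exported Sense client certificate
    verify="root.pem",                       # Sense root certificate
)
resp.raise_for_status()

ticket = resp.json()["Ticket"]
# The ticket is then appended (qlikTicket=<ticket>) to the first request
# to the hub or an app URL to establish the user's session.
print("Got ticket:", ticket)
```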
Thank you
