Data Platform Airlift: Sparking Your Knowledge with Azure Spark

Sparking your Knowledge with Azure Spark
Data Platform Airlift

October 21 \\ Microsoft Lisbon Experience
Industry validation
"Microsoft's comprehensive hybrid story, which spans applications and platforms as well as infrastructure, is highly attractive to many companies, drawing them towards the cloud in general."
Lydia Leong, Gartner
Microsoft Leads Everywhere…

• Public Cloud IaaS (May 2015)
• Cloud Storage (June 2015)
• Enterprise App PaaS (Jan 2014)
• x86 Server Virtualization (July 2015)
• Operational DBMS Systems (Oct 2014)
Huge infrastructure scale is the enabler
24 regions worldwide, 19 online: huge capacity around the world, growing every year

[Map of Azure regions; the legend marks each region as Operational or Announced/Not Operational]

North Central US (Illinois)
Central US (Iowa)
South Central US (Texas)
East US (Virginia)
East US 2 (Virginia)
West US (California)
US Gov (Iowa)
US Gov (Virginia)
Canada Central (Toronto)
Canada East (Quebec City)
Brazil South (Sao Paulo)
North Europe (Ireland)
West Europe (Netherlands)
China North (Beijing) *
China South (Shanghai) *
Japan East (Saitama)
Japan West (Osaka)
India Central (Pune)
India South (Chennai)
India West (Mumbai)
East Asia (Hong Kong)
SE Asia (Singapore)
Australia East (New South Wales)
Australia South East (Victoria)
* Operated by 21Vianet

• 100+ datacenters
• Top 3 networks in the world
• 2x AWS, 6x Google DC Regions
• G Series: largest VM in the world, 32 cores, 448 GB RAM, SSD…
Spark with Azure

[Diagram: Azure platform overview, grouped into Platform Services, Security & Management, Hybrid Operations and Infrastructure Services]

Services shown: Cloud Services, Service Fabric, Batch, RemoteApp, Web Apps, Mobile Apps, Logic Apps, API Apps, API Management, Notification Hubs, Visual Studio, Azure SDK, Team Project, Application Insights, Portal, Active Directory, Azure AD Connect Health, AD Privileged Identity Management, Multi-Factor Authentication, Backup, Storage Queues, BizTalk Services, Hybrid Connections, Service Bus, HDInsight, Machine Learning, SQL Database, SQL Data Warehouse, Redis Cache, Search, Data Factory, Event Hubs, Stream Analytics, Mobile Engagement, DocumentDB, Tables, Media Services, Content Delivery Network (CDN), Automation, Operational Insights, Key Vault, Import/Export, Store / Marketplace, Site Recovery, StorSimple, VM Image Gallery & VM Depot
Apache Spark
Overview
Apache Spark – A Unified Framework
A unified, open source, parallel data processing framework for Big Data analytics
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
• Spark Core Engine
• Cluster managers: YARN, Mesos, Standalone Scheduler
Spark - Benefits

Performance
• Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
• Can be used for batch and real-time data processing.

Developer Productivity
• Easy-to-use APIs for processing large datasets.
• Includes 100+ operators for transforming data.

Unified Engine
• Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
• A single application can combine all types of processing.

Ecosystem
• Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
• Runs on top of the Apache YARN resource manager.
Spark – Use cases

Use case | Description | Users
Data Integration and ETL | Cleansing and combining data from diverse sources | Palantir: Data analytics platform
Interactive analytics | Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards | Goldman Sachs: Analytics platform; Huawei: Query platform in the telecom sector
High performance batch computation | Run complex algorithms against large scale data | Novartis: Genomic Research; MyFitnessPal: Process food data
Machine Learning | Predict outcomes to make decisions based on input data | Alibaba: Marketplace Analysis; Spotify: Music Recommendation
Real-time stream processing | Capturing and processing data continuously with low latency and high reliability | Netflix: Recommendation Engine; British Gas: Connected Homes
Spark is fast
Spark is the current (2014) Sort Benchmark winner.
3x faster than 2013 winner (Hadoop).
tinyurl.com/spark-sort
… especially for iterative applications
[Bar chart: Logistic Regression, Hadoop vs. Spark 0.9; axis from 0 to 140]
Logistic regression on a 100-node cluster with 100 GB of data
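To make "iterative" concrete, here is a minimal sketch of logistic regression trained by batch gradient descent in plain Python. It is not the benchmark code behind the chart; the synthetic data, the function names and the learning rate are all assumptions made for illustration. The point to notice is that every iteration scans the whole dataset again, which is the access pattern where keeping data in memory between passes is expected to pay off.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))

def train(points, iterations=50, learning_rate=1.0):
    """Batch gradient descent; returns (weights, number of full data passes)."""
    weights = [0.0, 0.0]
    passes = 0
    for _ in range(iterations):
        gradient = [0.0, 0.0]
        for features, label in points:  # one full scan of the dataset
            error = label - predict(weights, features)
            for i, x in enumerate(features):
                gradient[i] += error * x
        passes += 1
        weights = [w + learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights, passes

def accuracy(weights, points):
    correct = sum(1 for features, label in points
                  if (predict(weights, features) >= 0.5) == (label == 1))
    return correct / len(points)

# Hypothetical synthetic data: label is 1 when x0 + x1 > 0.
random.seed(0)
points = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    points.append((x, 1 if x[0] + x[1] > 0 else 0))

weights, passes = train(points)
print("full passes over the data:", passes)
print("training accuracy:", accuracy(weights, points))
```

With 50 iterations the dataset is read 50 times. A framework that reloads its input from disk on each pass pays that cost every time, whereas an engine that caches the dataset in memory pays it once.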
Creating a Spark Cluster on Azure
Creating an HDInsight Spark Cluster
HDInsight Spark Dashboard
HDInsight Spark Resource Manager
The Resource Manager enables you to control the number of cores and amount of memory allocated to Spark cluster components and notebooks.
Increasing the resources allocated to the Thrift Server can potentially improve performance with BI tools.
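The slides do not show which settings the Resource Manager changes behind the scenes. As a rough illustration of what "cores and memory" mean at the Spark level, the sketch below renders a few standard Spark configuration properties in the `key value` layout used by `spark-defaults.conf`. The property names are standard Spark settings; the values and the helper function are hypothetical and are not taken from the HDInsight dashboard.

```python
# Hypothetical values for illustration only; the keys are standard Spark properties.
settings = {
    "spark.executor.memory": "4g",  # memory per executor
    "spark.executor.cores": "2",    # cores per executor
    "spark.cores.max": "8",         # total cores for the application (standalone mode)
}

def to_spark_defaults(settings):
    """Render settings as whitespace-separated 'key value' lines, sorted by key."""
    return "\n".join(f"{key} {value}" for key, value in sorted(settings.items()))

print(to_spark_defaults(settings))
```

Raising the memory or core values for one component leaves less for the others on the same cluster, which is why the dashboard presents them together.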
Resizing an HDInsight Spark Cluster
Developing with Notebooks
Developing Spark Apps with Notebooks
Interactive Queries with SparkSQL
Spark SQL Overview
Integration with BI Reporting Tools
Machine Learning with Spark MLlib
What is MLlib?
Type | Algorithms
Movie Recommendation – Dataset
• The dataset also includes movie metadata and user profiles (not needed for recommendation).
User Ratings Data (u.data)

User Id | Movie Id | User's Rating | Timestamp
196 | 242 | 3 | 881732314
198 | 302 | 3 | 883894932
22 | 377 | 1 | 883443433
145 | 51 | 2 | 886570342
187 | 356 | 4 | 885634452
166 | 63 | 5 | 886554545
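Before the ratings can feed a recommender, each line has to be turned into a typed record. The following is a minimal sketch in plain Python, assuming the fields in u.data are whitespace-separated in the order shown on the slide (user id, movie id, rating, timestamp). The `Rating` tuple and `parse_rating` function are illustrative names, not part of any library.

```python
from collections import namedtuple

# Illustrative record type; field order follows the slide.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

# The six sample rows shown on the slide.
SAMPLE = """196 242 3 881732314
198 302 3 883894932
22 377 1 883443433
145 51 2 886570342
187 356 4 885634452
166 63 5 886554545"""

def parse_rating(line):
    """Split one whitespace-separated line into a typed Rating record."""
    user_id, movie_id, rating, timestamp = line.split()
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

ratings = [parse_rating(line) for line in SAMPLE.splitlines() if line.strip()]
average = sum(r.rating for r in ratings) / len(ratings)

print(len(ratings), "ratings parsed")
print("average rating:", average)
```

Only the first three fields matter for collaborative filtering; the timestamp is parsed here for completeness and can be dropped.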
Go Try Yourself
Azure Site Recovery: Protect VMware and Physical Servers in Public Preview
Azure Backup Generally Available
Azure API Management Premium simplifies high availability and massive scale for APIs
ExpressRoute for Office 365
Azure Active Directory Dynamic Membership for Groups
Automatic Password Change for Social Media Shared Accounts
Compute-Intensive A10 and A11 Virtual Machine Instances
Remote Desktop app for Windows Phone support for Gateway and Remote Resources
Informatica Cloud Agent availability in Linux and Windows Virtual Machines
Azure DocumentDB Hadoop Connector
Azure HDInsight support for more VM sizes
Enterprise-Grade Array-Based Replication and Disaster Recovery
Try SQL Server 2016 CTP2
http://aka.ms/trysql2016
Free Azure Trial
http://aka.ms/tryazure
Use Power BI for Free
http://powerbi.microsoft.com
