
Sparking your Knowledge with Azure Spark

Data Platform Airlift

21 October \\ Microsoft Lisbon Experience
Industry validation

"Microsoft’s comprehensive hybrid story, which spans applications and platforms as well as infrastructure, is highly attractive to many companies, drawing them towards the cloud in general."
LYDIA LEONG, GARTNER

Microsoft Leads Everywhere…


Public Cloud IaaS (May 2015)
Cloud Storage (June 2015)
Enterprise App PaaS (Jan 2014)
X86 Server Virt (July 2015)
Operational DBMS Systems (Oct 2014)
Huge infrastructure scale is the enabler
24 Regions Worldwide, 19 ONLINE…huge capacity around the world…growing every year

[Map: Azure regions and their locations]
North Central US: Illinois
Central US: Iowa
South Central US: Texas
East US: Virginia
East US 2: Virginia
West US: California
US Gov: Iowa
US Gov: Virginia
Canada Central: Toronto
Canada East: Quebec City
Brazil South: Sao Paulo
North Europe: Ireland
West Europe: Netherlands
China North *: Beijing
China South *: Shanghai
Japan East: Saitama
Japan West: Osaka
India Central: Pune
India South: Chennai
India West: Mumbai
East Asia: Hong Kong
SE Asia: Singapore
Australia East: New South Wales
Australia South East: Victoria

Map legend: Operational; Announced/Not Operational
* Operated by 21Vianet

• 100+ datacenters
• Top 3 networks in the world
• 2x AWS, 6x Google DC Regions
• G Series – Largest VM in World, 32 cores, 448GB RAM, SSD…
Spark with Azure

[Diagram: Azure service map, with HDInsight called out separately]
Diagram sections: Platform Services; Security & Management; Hybrid Operations; Infrastructure Services
Services shown: Cloud Services, Service Fabric, Batch, RemoteApp, Web Apps, Mobile Apps, Logic Apps, API Apps, API Management, Notification Hubs, Mobile Engagement, Visual Studio, Azure SDK, Team Project, Application Insights, Storage Queues, BizTalk Services, Hybrid Connections, Service Bus, HDInsight, Machine Learning, Data Factory, Event Hubs, Stream Analytics, SQL Database, SQL Data Warehouse, Redis Cache, Search, DocumentDB, Tables, Media Services, Content Delivery Network (CDN), Portal, Active Directory, Multi-Factor Authentication, Automation, Key Vault, Store / Marketplace, VM Image Gallery & VM Depot, Azure AD Connect Health, AD Privileged Identity Management, Backup, Operational Insights, Import/Export, Site Recovery, StorSimple
Apache Spark Overview
Apache Spark – A Unified Framework
A unified, open source, parallel data processing framework for Big Data Analytics

Spark SQL: Interactive Queries
Spark Streaming: Stream processing
Spark MLlib: Machine Learning
GraphX: Graph Computation

Spark Core Engine

YARN, Mesos, Standalone Scheduler
Spark - Benefits

Performance
Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
Can be used for batch and real-time data processing.

Developer Productivity
Easy-to-use APIs for processing large datasets.
Includes 100+ operators for transforming data.

Unified Engine
Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
A single application can combine all types of processing.

Ecosystem
Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
Runs on top of the Apache YARN resource manager.
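To give a feel for what "operators for transforming data" means, here is a minimal sketch in plain Python (not the Spark API) that chains the same kinds of steps Spark's RDD operators `flatMap`, `filter`, `map` and `reduceByKey` perform. The input lines and variable names are made up for illustration; in Spark each step would run in parallel across the cluster rather than in a local list.

```python
from collections import defaultdict

# Hypothetical input: a few lines of text standing in for a large dataset.
lines = ["spark is fast", "spark is unified", "hadoop is batch"]

# flatMap-style step: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# filter + map-style steps: drop a stop word, then emit (word, 1) pairs.
pairs = [(word, 1) for word in words if word != "is"]

# reduceByKey-style step: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

word_counts = dict(counts)
```

The point of the sketch is the shape of the pipeline: small, composable transformations followed by an aggregation.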
Spark – Use cases

Use case: Data Integration and ETL
Description: Cleansing and combining data from diverse sources
Users: Palantir: Data analytics platform

Use case: Interactive analytics
Description: Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards.
Users: Goldman Sachs: Analytics platform; Huawei: Query platform in the telecom sector.

Use case: High performance batch computation
Description: Run complex algorithms against large scale data
Users: Novartis: Genomic Research; MyFitnessPal: Process food data

Use case: Machine Learning
Description: Predict outcomes to make decisions based on input data
Users: Alibaba: Marketplace Analysis; Spotify: Music Recommendation

Use case: Real-time stream processing
Description: Capturing and processing data continuously with low latency and high reliability
Users: Netflix: Recommendation Engine; British Gas: Connected Homes
Spark is fast
Spark is the current (2014) Sort Benchmark winner.
3x faster than 2013 winner (Hadoop).

tinyurl.com/spark-sort
… especially for iterative applications
[Bar chart: Logistic Regression, Hadoop vs. Spark 0.9; vertical axis from 0 to 140]
Logistic regression on a 100-node cluster with 100 GB of data
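Why iterative applications benefit can be seen from the structure of the algorithm itself. The following is a minimal plain-Python sketch of logistic regression trained by batch gradient descent (not Spark code, and not the benchmark's implementation); the toy data set, function names and hyperparameters are assumptions made for illustration. Every iteration makes a full pass over the same points, so a system that keeps the data in memory between passes avoids re-reading it from disk each time.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent; each iteration scans all of `points` again."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        for x, y in points:  # full pass over the same data every iteration
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / len(points)
        b -= learning_rate * grad_b / len(points)
    return w, b

# Hypothetical one-dimensional data: negative x has label 0, positive x label 1.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)]
w, b = train_logistic_regression(data)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in data]
```

With 200 iterations the loop reads the data set 200 times, which is the access pattern the chart is about.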
Creating a Spark Cluster on Azure
Creating an HDInsight Spark Cluster
HDInsight Spark Dashboard

HDInsight Spark Resource Manager
The Resource Manager enables you to control the number of cores and amount of memory allocated to Spark cluster components and notebooks.

Increasing the resources allocated to the Thrift Server can potentially improve the performance with BI Tools.
Resizing an HDInsight Spark Cluster

Developing with Notebooks
Developing Spark Apps with Notebooks

Interactive Queries with SparkSQL
Spark SQL Overview

Integration with BI Reporting Tools

Machine Learning with Spark MLlib
What is MLlib?

Type Algorithms

Movie Recommendation – Dataset

• It also includes movie metadata and user profiles (not needed for recommendation).

User Ratings Data (u.data)

User Id   Movie Id   User’s Rating   Timestamp
196       242        3               881732314
198       302        3               883894932
22        377        1               883443433
145       51         2               886570342
187       356        4               885634452
166       63         5               886554545
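As a rough illustration of how these rows could be prepared for a recommender, the sketch below parses the six sample lines shown on the slide into typed records using plain Python. The names `Rating` and `parse_rating` are hypothetical, and splitting on whitespace is an assumption about the file layout; the sketch also assumes that only the user, movie and rating fields are passed on, with the timestamp dropped.

```python
from collections import namedtuple

# Hypothetical record type for one line of u.data.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

# The six sample rows from the slide.
SAMPLE = """196 242 3 881732314
198 302 3 883894932
22 377 1 883443433
145 51 2 886570342
187 356 4 885634452
166 63 5 886554545"""

def parse_rating(line):
    """Split one whitespace-separated line into a typed Rating record."""
    user_id, movie_id, rating, timestamp = line.split()
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

ratings = [parse_rating(line) for line in SAMPLE.splitlines() if line.strip()]

# Keep only (user, movie, rating) triples for the recommender.
triples = [(r.user_id, r.movie_id, r.rating) for r in ratings]

average_rating = sum(r.rating for r in ratings) / len(ratings)
```

In a Spark notebook the same parsing function could be applied to each line of the file with a map-style operator instead of a list comprehension.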
Go Try Yourself
• Azure Site Recovery: Protect VMware and Physical Servers in Public Preview
• Azure Backup Generally Available
• Azure API Management Premium simplifies high availability and massive scale for APIs
• ExpressRoute for Office 365
• Azure Active Directory Dynamic Membership For Groups
• Automatic Password Change for Social Media Shared Accounts
• Compute-Intensive A10 and A11 Virtual Machine Instances
• Remote Desktop app for Windows Phone support for Gateway and Remote Resources
• Informatica Cloud Agent availability in Linux and Windows Virtual Machines
• Azure DocumentDB Hadoop Connector
• Azure HDInsight support for more VM sizes
• Enterprise-Grade Array-Based Replication and Disaster Recovery
Try SQL Server 2016 CTP2
http://aka.ms/trysql2016

Free Azure Trial
http://aka.ms/tryazure

Use Power BI for Free
http://powerbi.microsoft.com
