Knowledge with
Azure Spark
[Figure: map of Azure regions, 100+ datacenters. Regions shown: North Central US (Illinois), Central US (Iowa), South Central US (Texas), East US (Virginia), East US 2 (Virginia), West US (California), US Gov (Iowa), US Gov (Virginia), Canada Central (Toronto), Canada East (Quebec City), Brazil South (Sao Paulo), North Europe (Ireland), West Europe (Netherlands), India Central (Pune), India South (Chennai), India West (Mumbai), East Asia (Hong Kong), SE Asia (Singapore), Japan East (Saitama), Japan West (Osaka), China North * (Beijing), China South * (Shanghai), Australia East (New South Wales), Australia South East (Victoria).]
HDInsight
[Figure: Azure services overview, layered on Infrastructure Services. Services shown: Multi-Factor Authentication, Backup, Storage Queues, BizTalk Services, HDInsight, Machine Learning, SQL Database, SQL Data Warehouse, Automation, Operational Insights, Hybrid Connections, Service Bus, Redis Cache, Key Vault, Data Factory, Event Hubs, Search, Import/Export, Store / Marketplace, Site Recovery, Stream Analytics, Mobile Engagement, DocumentDB, Tables, Media Services, Content Delivery Network (CDN), StorSimple, VM Image Gallery & VM Depot.]
Apache Spark
Overview
Apache Spark – A Unified Framework
A unified, open-source, parallel data processing framework for Big Data analytics
[Diagram: Spark cluster managers: Standalone Scheduler, YARN, Mesos]
Spark - Benefits
Performance
Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
Can be used for batch and real-time data processing.
Developer Productivity
Easy-to-use APIs for processing large datasets.
Includes 100+ operators for transforming data.
Unified Engine
Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
A single application can combine all types of processing.
Ecosystem
Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
Runs on top of the Apache YARN resource manager.
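To illustrate the "easy-to-use APIs" and chained operators mentioned above, here is a minimal PySpark-style sketch of a word count. It is an illustration under assumptions, not code from this deck: `tokenize`, `top_words`, and the `path` argument are hypothetical names, and `spark` is assumed to be an already created SparkSession (for example, the one a notebook provides).

```python
def tokenize(line):
    """Split a line of text into lowercase words (plain Python helper)."""
    return line.lower().split()


def top_words(spark, path, n=10):
    """Return the n most frequent words in a text file.

    Assumptions: `spark` is an existing SparkSession and `path` is a
    hypothetical input location readable by the cluster.
    """
    lines = spark.sparkContext.textFile(path)
    counts = (
        lines.flatMap(tokenize)                 # one record per word
        .map(lambda word: (word, 1))            # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)        # sum the counts per word
    )
    # Sort by descending count and bring only the top n back to the driver.
    return counts.takeOrdered(n, key=lambda pair: -pair[1])
```

Each step is a single operator call, and nothing runs on the cluster until `takeOrdered` asks for a result.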
Spark – Use cases
Use case: Interactive analytics
Description: Gain insight from massive data sets in ad hoc investigations or regularly planned dashboards.
Users: Goldman Sachs: Analytics platform. Huawei: Query platform in the telecom sector.
tinyurl.com/spark-sort
… especially for iterative applications
[Chart: Hadoop vs. Spark 0.9 on logistic regression; y-axis from 0 to 140, bar values not recoverable]
Logistic regression on a 100-node cluster with 100 GB of data
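Why iterative applications benefit can be seen from the shape of the algorithm itself. The sketch below is plain Python, not Spark code and not the benchmark behind the chart: it trains logistic regression by gradient descent, and every iteration scans the same `points` again. In Spark, a dataset reused this way would typically be kept in memory (for example with `cache()`), so repeated passes avoid re-reading from disk. The function names and default parameters are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, iterations=100, learning_rate=0.5):
    """Batch gradient descent; `points` is a list of (features, label), label in {0, 1}."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    for _ in range(iterations):
        gradient = [0.0] * dim
        # Every iteration makes a full pass over the same data set.
        for features, label in points:
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [
            w - learning_rate * g / len(points)
            for w, g in zip(weights, gradient)
        ]
    return weights


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


# Tiny example: the first feature is a constant bias term.
points = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
weights = train_logistic_regression(points)
```

With 100 iterations the data is read 100 times, which is where keeping it in memory pays off.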
Creating
Spark Cluster
on Azure
Creating an HDInsight Spark Cluster
HDInsight Spark Dashboard
HDInsight Spark Resource Manager
The Resource Manager enables you to control the number of cores and amount of memory allocated to Spark cluster components and notebooks.
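The same kind of core and memory allocation can also be requested from application code through standard Spark configuration properties. The sketch below is an assumption about how one might do this, not a description of what the HDInsight Resource Manager page does internally. `executor_settings` and `build_session` are hypothetical helper names; the property keys (`spark.executor.cores`, `spark.executor.memory`, `spark.executor.instances`) are standard Spark settings.

```python
def executor_settings(cores, memory_gb, instances):
    """Build a dict of standard Spark properties for executor sizing."""
    return {
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
        "spark.executor.instances": str(instances),
    }


def build_session(app_name, settings):
    """Create a SparkSession with the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported here so the helper above works without pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


settings = executor_settings(cores=2, memory_gb=4, instances=3)
```

Whether a cluster honors these values depends on the resources its manager has available.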
Resizing an HDInsight Spark Cluster
Developing with
Notebooks
Developing Spark Apps with Notebooks
Interactive
Queries with
SparkSQL
Spark SQL Overview
Integration with BI Reporting Tools
Machine
Learning with
Spark MLlib
What is MLlib?
Type Algorithms
Movie Recommendation – Dataset
• It also includes movie metadata and user profiles (not needed for recommendation).
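Since only the ratings are needed, a recommender can be trained from (user, movie, rating) triples alone. The following is a hedged sketch using the ALS estimator from Spark's DataFrame-based ML library, which may differ from the API shown in the original demo. The `::`-separated line format, the column names, the hyperparameter values, and the names `parse_rating`, `recommend_movies`, and `ratings_path` are all assumptions made for illustration.

```python
def parse_rating(line, sep="::"):
    """Parse one ratings line into (user_id, movie_id, rating).

    Assumes lines look like 'user::movie::rating::timestamp'; any fields
    after the third are ignored.
    """
    user_id, movie_id, rating = line.strip().split(sep)[:3]
    return int(user_id), int(movie_id), float(rating)


def recommend_movies(spark, ratings_path, top_n=5):
    """Train an ALS model and return top_n recommendations per user.

    Assumptions: `spark` is an existing SparkSession and `ratings_path`
    is a hypothetical location of the ratings file.
    """
    from pyspark.ml.recommendation import ALS  # requires pyspark

    rows = spark.sparkContext.textFile(ratings_path).map(parse_rating)
    ratings = spark.createDataFrame(rows, ["userId", "movieId", "rating"])
    als = ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        rank=10,                    # number of latent factors (illustrative value)
        maxIter=10,
        regParam=0.1,
        coldStartStrategy="drop",   # skip users or items unseen in training
    )
    model = als.fit(ratings)
    return model.recommendForAllUsers(top_n)
```

Movie metadata would only be joined in afterwards, for example to show titles instead of numeric IDs.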
Free Azure
Trial
http://aka.ms/tryazure
Use Power BI for Free
http://powerbi.microsoft.com