About This Specialization

Drive better business decisions with an overview of how big data is organized, analyzed, and interpreted. Apply your insights
to real-world problems and questions.
Do you need to understand big data and how it will impact your business? This Specialization is for you. You will gain an
understanding of what insights big data can provide through hands-on experience with the tools and systems used by big
data scientists and engineers. Previous programming experience is not required! You will be guided through the basics of
using Hadoop with MapReduce, Spark, Pig and Hive. By following along with provided code, you will experience how one can
perform predictive modeling and leverage graph analytics to model problems. This specialization will prepare you to ask the
right questions about data, communicate effectively with data scientists, and do basic exploration of large, complex datasets.
In the final Capstone Project, developed in partnership with data software company Splunk, you’ll apply the skills you
learned to do basic analyses of big data.

6 courses · Projects · Certificates
Follow the suggested order or choose your own.

Commitment 6 weeks of study, 5-8 hours/week

Subtitles English, Persian

About the Course


Interested in increasing your knowledge of the Big Data landscape? This course is for those new to data science and
interested in understanding why the Big Data Era has come to be. It is for those who want to become conversant with the
terminology and the core concepts behind big data problems, applications, and systems. It is for those who want to start
thinking about how Big Data might be useful in their business or career. It provides an introduction to one of the most
common frameworks, Hadoop, that has made big data analysis easier and more accessible -- increasing the potential for
data to transform our world!

At the end of this course, you will be able to:

* Describe the Big Data landscape, including examples of real-world big data problems and the three key sources of
Big Data: people, organizations, and sensors.

* Explain the V’s of Big Data (volume, velocity, variety, veracity, valence, and value) and why each impacts data collection,
monitoring, storage, analysis and reporting.
* Get value out of Big Data by using a 5-step process to structure your analysis.

* Identify what are and what are not big data problems and be able to recast big data problems as data science questions.

* Provide an explanation of the architectural components and programming models used for scalable big data analysis.

* Summarize the features and value of core Hadoop stack components including the YARN resource and job management
system, the HDFS file system and the MapReduce programming model.

* Install and run a program using Hadoop!

This course is for those new to data science. No prior programming experience is needed, although the ability to install
applications and utilize a virtual machine is necessary to complete the hands-on assignments.

Hardware Requirements:
(A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your
hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking
Properties; (Mac): Open Overview by clicking on the Apple menu and clicking “About This Mac.” Most computers with 8 GB
RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection
because you will be downloading files up to 4 GB in size.

Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded
and installed free of charge. Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+;
VirtualBox 5+.
WEEK 1
Welcome
Welcome to the Big Data Specialization! We're excited for you to get to know us and we're looking forward to learning
about you!

Video · Welcome to the Big Data Specialization
Reading · By the end of this course you will be able to...
Reading · Optional: Watch this fun video about the San Diego Supercomputer Center!
Video · Tell us about yourself and learn about your classmates
Other · Let's Discuss: Why are you taking this class?

Big Data: Why and Where


Data, it's been around (even digitally) for a while. What makes data "big" and where does this big data come from?

Video · What launched the Big Data era?
Video · Applications: What makes big data valuable
Other · Let's Discuss: What application area interests you?
Video · Example: Saving lives with Big Data
Video · Example: Using Big Data to Help Patients
Video · A Sentiment Analysis Success Story: Meltwater helping Danone
Reading · Did you know?: 25 facts about big data
Reading · Slides: What Launched the Big Data Era?
Reading · Slides: Applications: What Makes Big Data Valuable?
Reading · Slides: Saving Lives With Big Data
Reading · Slides: Using Big Data to Help Patients
Reading · Extra Resources

Video · Getting Started: Where Does Big Data Come From?
Video · Machine-Generated Data: It's Everywhere and There's a Lot!
Video · Big Data Generated By People: The Unstructured Challenge
Video · Big Data Generated By People: How Is It Being Used?
Video · Organization-Generated Data: Structured but often siloed
Video · Organization-Generated Data: Benefits Come From Combining With Other Data Types
Video · The Key: Integrating Diverse Data
Other · Let's discuss: Who are you providing data to?
Reading · Slides: Machine-Generated Data: It's Everywhere and There's a Lot!
Reading · Slides: Machine-Generated Data: Advantages
Reading · Slides: Big Data Generated By People: The Unstructured Challenge
Reading · Slides: Big Data Generated By People: How is it Being Used?
Reading · Slides: Organization-Generated Big Data: Structured But Often Siloed
Reading · Slides: Organization-Generated Big Data: Benefits
Reading · Slides: The Key - Integrating Diverse Data
Quiz · Why Big Data and Where Did it Come From?
WEEK 2
Characteristics of Big Data and Dimensions of Scalability
You may have heard of the "Big Vs". We'll give examples and descriptions of the five most commonly discussed. We also want to
propose a sixth V, value, and we'll ask you to practice writing Big Data questions that target it.

Video · Getting Started: Characteristics Of Big Data
Video · Characteristics of Big Data - Volume
Reading · What does astronomical scale mean?
Video · Characteristics of Big Data - Variety
Video · Characteristics of Big Data - Velocity
Video · Characteristics of Big Data - Veracity
Video · Characteristics of Big Data - Valence
Video · The Sixth V: Value
Reading · A Small Definition of Big Data
Quiz · for the V's of Big Data
Other · Practice: Writing Big Data questions
Other · Let's Discuss: Improving the Flamingo Game

Reading · Slides: Getting Started - Characteristics of Big Data
Reading · Slides: Characteristics of Big Data - Volume
Reading · Slides: Characteristics of Big Data - Variety
Reading · Slides: Characteristics of Big Data - Velocity
Reading · Slides: Characteristics of Big Data - Veracity
Reading · Slides: Characteristics of Big Data - Value
Reading · Slides: Characteristics of Big Data - Valence


Data Science: Getting Value out of Big Data
We love science and we love computing, don't get us wrong. But the reality is we care about Big Data because it can bring
value to our companies, our lives, and the world. In this module we'll introduce a 5-step process for approaching data
science problems.

Video · Data Science: Getting Value out of Big Data
Video · Building a Big Data Strategy
Video · How does big data science happen?: Five Components of Data Science
Reading · Five P's of Data Science
Other · Let's Discuss: Thinking more deeply about the Ps
Video · Asking the Right Questions
Video · Steps in the Data Science Process
Video · Step 1: Acquiring Data
Video · Step 2-A: Exploring Data
Video · Step 2-B: Pre-Processing Data
Video · Step 3: Analyzing Data
Video · Step 4: Communicating Results
Video · Step 5: Turning Insights into Action
Other · Let's Discuss: Building a Team
Quiz · Data Science 101

Reading · Slides: Getting Value Out of Big Data
Reading · Slides: Building a Big Data Strategy
Reading · Slides: The Five P's of Data Science
Reading · Slides: Asking the Right Questions
Reading · Slides: Steps in the Data Science Process
Reading · Slides: Step 1 - Acquiring Data
Reading · Slides: Step 2B-Preprocessing Data
Reading · Slides: Step 3-Data Analysis
Reading · Slides: Step 4-Communicating Results
Reading · Slides: Step 5-Turning Insights Into Action


Week 3
Foundations for Big Data Systems and Programming
Big Data requires new programming frameworks and systems. For this course, we don't require programming knowledge or
experience -- but we do want to give you a grounding in some of the key concepts.

Video · Getting Started: Why worry about foundations?
Video · What is a Distributed File System?
Video · Scalable Computing over the Internet
Video · Programming Models for Big Data

Reading · Slides: Getting Started - Why Worry About Foundations?
Reading · Slides: What is a Distributed File System?
Reading · Slides: Scalable Computing Over the Internet
Reading · Slides: Programming Models for Big Data

Quiz · Foundations for Big Data

Systems: Getting Started with Hadoop


Let's look at some details of Hadoop and MapReduce. Then we'll go "hands on" and actually perform a simple MapReduce
task in the Cloudera VM. Pay attention, because we'll guide you through "learning by doing" by diagramming a MapReduce task
as a Peer Review.
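
To make the idea of a "simple MapReduce task" concrete before the hands-on session, here is a minimal sketch of the word-count pattern in plain Python. It is not the course's Hadoop code and does not run on a cluster; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; in Hadoop each line could live on a different node.
    sample = ["my apple is red", "my rose is red"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps would run in parallel on different machines and the framework would handle the shuffle; the sketch only shows how the three stages pass data to one another.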

Video · Hadoop: Why, Where and Who?
Video · The Hadoop Ecosystem: Welcome to the zoo!
Video · The Hadoop Distributed File System: A Storage System for Big Data
Video · YARN: A Resource Manager for Hadoop
Video · MapReduce: Simple Programming for Big Results
Reading · MapReduce in the Pasta Sauce Example
Video · When to Reconsider Hadoop?
Video · Cloud Computing: An Important Big Data Enabler

Video · Cloud Service Models: An Exploration of Choices
Video · Value From Hadoop and Pre-built Hadoop Images
Reading · Slides for Getting Started With Hadoop
Quiz · Intro to Hadoop
Peer Review · Understand by Doing: MapReduce
Reading · Downloading and Installing the Cloudera VM Instructions (Mac)
Reading · Downloading and Installing the Cloudera VM Instructions (Windows)
Reading · FAQ

Big Data Modeling and Management Systems


Upcoming Session: Dec 4

Commitment 6 weeks of study, 5-8 hours/week

Subtitles English

About the Course


Once you’ve identified a big data issue to analyze, how do you collect, store and organize your data using Big Data
solutions? In this course, you will experience various data genres and management tools appropriate for each. You will be
able to describe the reasons behind the evolving plethora of new big data platforms from the perspective of big data
management systems and analytical tools. Through guided hands-on tutorials, you will become familiar with techniques
using real-time and semi-structured data examples. Systems and tools discussed include: AsterixDB, HP Vertica, Impala,
Neo4j, Redis, SparkSQL. This course provides techniques to extract value from existing untapped data sources and to discover
new data sources.

At the end of this course, you will be able to:


* Recognize different data elements in your own work and in everyday life problems
* Explain why your team needs to design a Big Data Infrastructure Plan and Information System Design
* Identify the frequent data operations required for various types of data
* Select a data model to suit the characteristics of your data
* Apply techniques to handle streaming data
* Differentiate between a traditional Database Management System and a Big Data Management System
* Appreciate why there are so many data management systems
* Design a big data information system for an online game company

This course is for those new to data science. Completion of Intro to Big Data is recommended. No prior programming
experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the
hands-on assignments. Refer to the specialization technical requirements for complete hardware and software
specifications.

Hardware Requirements:
(A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your
hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking
Properties; (Mac): Open Overview by clicking on the Apple menu and clicking “About This Mac.” Most computers with 8 GB
RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection
because you will be downloading files up to 4 GB in size.

Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded
and installed free of charge (except for data charges from your internet provider). Software requirements include:
Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+; VirtualBox 5+.
Week 1
Introduction to Big Data Modeling and Management
Welcome to this course on big data modeling and management. Modeling and managing data is a central focus of all big
data projects. In these lessons we introduce you to the concepts behind big data modeling and management and set the
stage for the remainder of the course.

Video · Welcome to Big Data Modeling and Management
Video · Why is this a New Course in the Big Data Specialization?
Other · Getting to know you: Tell us about yourself and why you are taking this course
Video · Summary of Introduction to Big Data (Part 1)
Video · Summary of Introduction to Big Data (Part 2)
Video · Summary of Introduction to Big Data (Part 3)
Reading · Slides: Summary of Introduction to Big Data
Video · Big Data Management "Must-Ask Questions"
Video · Data Ingestion
Video · Data Storage
Video · Data Quality
Video · Data Operations
Video · Data Scalability and Security
Reading · Slides: Big Data Management
Other · Let's discuss: What area of big data management interests you most?
Reading · Reading on Storage Systems

Video · Energy Data Management Challenges at ConEd
Reading · Slides: Energy Data Management Challenges at ConEd
Video · Gaming Industry Data Management: Q&A with Apmetrix CTO Mark Caldwell
Video · Flight Data Management at FlightStats: A Lecture by CTO Chad Berkley
Reading · Slides: Flight Data Management at FlightStats
Other · Let's discuss: What are the design criteria in the big data applications you have heard?
Reading · Downloading and Installing the Cloudera VM Instructions (Windows)
Reading · Downloading and Installing the Cloudera VM Instructions (Mac)
Reading · Instructions for Downloading Hands On Datasets


Week 2
Big Data Modeling
Modeling big data depends on many factors including data structure, which operations may be performed on the data, and
what constraints are placed on the models. In these lessons you will learn the details about big data modeling and you will
gain the practical skills you will need for modeling your own big data projects.
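
As a rough illustration of the three factors named above (structure, operations, constraints), the sketch below loads a tiny relational table from CSV text and a nested, semistructured record from JSON text using only the Python standard library. The sample data, column names, and function names are all hypothetical and are not taken from the course datasets.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this sketch.
CSV_TEXT = "player_id,name,score\n1,Ada,120\n2,Lin,95\n"
JSON_TEXT = (
    '{"player_id": 1, "name": "Ada", '
    '"sessions": [{"level": 1, "score": 50}, {"level": 2, "score": 70}]}'
)


def load_relational(text):
    """Structure: every row has the same named columns (a relation)."""
    return list(csv.DictReader(io.StringIO(text)))


def select_high_scores(rows, minimum):
    """Operation: a selection (filter) followed by a projection onto 'name'."""
    return [row["name"] for row in rows if int(row["score"]) >= minimum]


def ids_are_unique(rows):
    """Constraint: player_id must identify exactly one row."""
    ids = [row["player_id"] for row in rows]
    return len(ids) == len(set(ids))


def total_session_score(document_text):
    """Semistructured data: navigate a nested list that may be absent."""
    document = json.loads(document_text)
    return sum(session["score"] for session in document.get("sessions", []))


if __name__ == "__main__":
    rows = load_relational(CSV_TEXT)
    print(select_high_scores(rows, 100))
    print(ids_are_unique(rows))
    print(total_session_score(JSON_TEXT))
```

The CSV rows share one fixed set of columns, while the JSON record nests a variable-length list inside it; that difference in structure is what drives the choice of operations and constraints for each model.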

Video · Introduction to Data Models
Video · Data Model Structures
Video · Data Model Operations
Video · Data Model Constraints
Reading · Slides: What Is A Data Model?
Other · Let's discuss: Modeling data in your daily life
Reading · Introduction to CSV Data
Video · Introduction to CSV Data
Video · What is a Relational Data Model?
Reading · Slides: What Is A Relational Data Model?
Video · What is a Semistructured Data Model?
Reading · Slides: What is a Semistructured Data Model?
Other · Let's discuss: Utilization of XML or JSON on the Internet
Other · Let's discuss: What area of big data management interests you most?
Reading · Reading on Storage Systems

Reading · Exploring the Relational Data Model of Comma Separated Values (CSV)
Video · Exploring the Relational Data Model of CSV Files
Reading · Exploring the Semistructured Data Model of JSON data
Video · Exploring the Semistructured Data Model of JSON data
Reading · Exploring the Array Data Model of an Image
Video · Exploring the Array Data Model of an Image
Reading · Exploring Sensor Data
Video · Exploring Sensor Data
Quiz · Practical Quiz for Week 2 Hands-On Lectures


Week 3
Big Data Modeling (Part 2)
These lessons continue to shed light on big data modeling with specific approaches including vector space models, graph
data models, and more.

Video · Vector Space Model
Reading · Slides: Vector Space Model
Video · Graph Data Model
Reading · Slides: Graph Data Model
Video · Other Data Models
Reading · Slides: Other Data Models
Quiz · Data Models Quiz

Reading · Exploring Vector Data Models with Lucene
Video · Exploring the Lucene Search Engine's Vector Data Model
Reading · Exploring Graph Data Models with Gephi
Video · Exploring Graph Data Models with Gephi


Week 4
Working With Data Models
Data models deal with many different types of data formats. Streaming data is becoming ubiquitous, and working with
streaming data requires a different approach from working with static data. In these lessons you will gain practical
hands-on experience working with different forms of streaming data including weather data and Twitter feeds.
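
One way to see why streaming data calls for a different approach from static data is to compute a statistic incrementally, one reading at a time, instead of loading the whole dataset first. The following is a minimal sketch under that assumption; the generator sensor_stream is a hypothetical stand-in for a live weather feed, and the temperature values are invented.

```python
def sensor_stream(readings):
    """Stand-in for a live feed: hands over one reading at a time."""
    for value in readings:
        yield value


def running_mean(stream):
    """Update the mean after each reading without storing past readings."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: shift the mean toward the new value.
        mean += (value - mean) / count
        yield mean


if __name__ == "__main__":
    # Hypothetical temperature readings in degrees Celsius.
    temperatures = [20.0, 22.0, 24.0]
    for current_mean in running_mean(sensor_stream(temperatures)):
        print(current_mean)
```

With static data the mean could be computed in a single pass over a stored file; with a stream the result has to be available after every arrival and memory use has to stay constant, which is what the incremental update provides.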

Video · Data Model vs. Data Format
Reading · Slides: Data Model vs. Data Format
Video · What is a Data Stream? (Optional)
Reading · Slides: What is a Data Stream? (Optional)
Video · Why is Streaming Data different? (Optional)
Reading · Slides: Why is Streaming Data Different?
Video · Understanding Data Lakes
Reading · Slides: Understanding Data Lakes
Quiz · Data Formats and Streaming Data Quiz
Other · Let's discuss: Streaming data applications

Reading · Exploring Streaming Sensor Data
Video · Exploring Streaming Sensor Data
Reading · Instructions for Creating a Twitter App
Reading · Exploring Streaming Twitter Data
Video · Exploring Streaming Twitter Data


Week 5
Big Data Management: The "M" in DBMS
Managing big data requires a different approach to database management systems because of the wide variation in data
structure which does not lend itself to traditional DBMSs. There are many applications available to help with big data
management. In these lessons we introduce you to some of these applications and provide insight into how and when they
might be appropriate for your own big data management challenges.

Video · DBMS-based and non-DBMS-based Approaches to Big Data

Reading · Slides: DBMS-based and non-DBMS-based Approaches to Big Data

Video · From DBMS to BDMS

Video · Redis: An Enhanced Key-Value Store

Video · Aerospike: a New Generation KV Store

Video · Semistructured Data – AsterixDB

Video · Solr: Managing Text

Video · Relational Data – Vertica

Reading · Slides: From DBMS to BDMS

Quiz · BDMS Quiz


Week 6
Designing a Big Data Management System for an Online Game
In these lessons we give you the opportunity to learn about big data modeling and management using a fictitious online
game called "Catch the Pink Flamingo".

Reading · A Game by Eglence Inc.: Catch The Pink Flamingo

Other · Let's discuss: Analytical tasks to make Catch the Pink Flamingo better

Peer Review · Designing a Data Model for 'Catch the Pink Flamingo'

Other · Let's discuss: Using the data model for Catch the Pink Flamingo

Subtitles English

About the Course


At the end of the course, you will be able to:

* Retrieve data from example database and big data management systems
* Describe the connections between data management operations and the big data processing patterns needed to utilize
them in large-scale analytical applications
* Identify when a big data problem needs data integration
* Execute simple big data integration and processing on Hadoop and Spark platforms

This course is for those new to data science. Completion of Intro to Big Data is recommended. No prior programming
experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the
hands-on assignments. Refer to the specialization technical requirements for complete hardware and software
specifications.

Hardware Requirements:
(A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your
hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking
Properties; (Mac): Open Overview by clicking on the Apple menu and clicking “About This Mac.” Most computers with 8 GB
RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection
because you will be downloading files up to 4 GB in size.

Software Requirements:
This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded
and installed free of charge (except for data charges from your internet provider). Software requirements include:
Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+ or CentOS 6+; VirtualBox 5+.
Week 1
Welcome to Big Data Integration and Processing
Welcome to the third course in the Big Data Specialization. This week you will be introduced to basic concepts in big data
integration and processing. You will be guided through installing the Cloudera VM, downloading the data sets to be used for
this course, and learning how to run the Jupyter server.

Video · What is in this Course?
Video · Summary of Big Data Modeling and Management
Video · Why is Big Data Processing Different?
Other · Getting to know you: Tell us about yourself and why you are taking this course.
Reading · Slides: Summary & Why Is Big Data Processing Different
Other · Let's discuss: Modeling data in your daily life

Reading · Downloading and Installing the Cloudera VM Instructions (Windows)
Reading · Downloading and Installing the Cloudera VM Instructions (Mac)
Reading · Software Installation Frequently Asked Questions (FAQ)
Reading · Instructions for Starting Jupyter

Retrieving Big Data (Part 1)


This module covers the various aspects of data retrieval and relational querying. You will also be introduced to the Postgres
database.

Video · What is Data Retrieval? Part 1
Video · What is Data Retrieval? Part 2
Video · Querying Two Relations
Video · Subqueries
Reading · Slides: What is Data Retrieval?
Other · Let's discuss: Modeling data in your daily life

Reading · Querying Relational Data with Postgres
Video · Querying Relational Data with Postgres


Week 2
Retrieving Big Data (Part 2)
This module covers the various aspects of data retrieval for NoSQL data, as well as data aggregation and working with data
frames. You will be introduced to MongoDB and Aerospike, and you will learn how to use Pandas to retrieve data from them.

Video · Querying JSON Data with MongoDB
Video · Aggregation Functions
Other · Let's Discuss: MongoDB
Video · Querying Aerospike
Quiz · Retrieving Big Data Quiz
Reading · Slides: Querying Data Part 2

Reading · Querying Documents in MongoDB
Video · Querying Documents in MongoDB
Video · Exploring Pandas DataFrames
Quiz · Postgres, MongoDB, and Pandas


Week 3
Big Data Integration
In this module you will be introduced to data integration tools including Splunk and Datameer, and you will gain some
practical insight into how information integration processes are carried out.

Video · Overview of Information Integration
Video · A Data Integration Scenario
Video · Integration for Multichannel Customer Analytics
Quiz · Information Integration - Quiz
Other · Let's Discuss: Big Data Integration
Reading · Slides: Information Integration
Video · Big Data Management and Processing Using Splunk and Datameer
Video · Why Splunk?
Video · Connected Cars with Ford's OpenXC and Splunk
Video · Big Data Management and Processing using Datameer

Reading · Downloading Splunk Enterprise
Video · Installing Splunk Enterprise on Windows
Video · Installing Splunk Enterprise on Linux
Reading · Exploring Splunk Queries
Video · Exploring Splunk Queries
Reading · Optional: Instructions for Splunk Pivot Tutorial
Video · Optional: Creating Pivot Reports in Splunk
Quiz · Hands-On With Splunk


Week 4
Processing Big Data
This module introduces learners to big data pipelines and workflows as well as processing and analysis of big data using
Apache Spark.

Video · Big Data Processing Pipelines
Video · Some High-Level Processing Operations in Big Data Pipelines
Video · Aggregation Operations in Big Data Pipelines
Video · Typical Analytical Operations in Big Data Pipelines
Other · Let's Discuss: Big Data Pipelines in Your World
Reading · Big Data Processing Pipelines Slides
Video · Overview of Big Data Processing Systems
Reading · Big Data Workflow Management
Video · The Integration and Processing Layer
Video · Introduction to Apache Spark
Video · Getting Started with Spark
Quiz · Pipeline and Tools
Other · Let's Discuss: Big Data Processing Systems
Reading · Slides for Big Data Processing Tools and Systems

Reading · WordCount in Spark
Video · WordCount in Spark
Quiz · WordCount in Spark
Other · Let's Discuss: Word Count


Week 5
Big Data Analytics using Spark
In this module, you will go deeper into big data processing by learning the inner workings of the Spark Core. You will be
introduced to two key tools in the Spark toolkit: Spark MLlib and GraphX.

Video · Spark Core: Programming In Spark using RDDs in Pipelines
Video · Spark Core: Transformations
Video · Spark Core: Actions
Reading · Slides for Module 5 Lesson 1
Video · Spark SQL
Video · Spark Streaming
Video · Spark MLlib
Video · Spark GraphX
Quiz · More on Spark
Reading · Slides for Module 5 Lesson 2

Reading · Exploring SparkSQL and Spark DataFrames
Video · Exploring SparkSQL and Spark DataFrames
Reading · Instructions for Configuring VirtualBox for Spark Streaming
Reading · Analyzing Sensor Data with Spark Streaming
Video · Analyzing Sensor Data with Spark Streaming
Quiz · SparkSQL and Spark Streaming


Week 6
Learn By Doing: Putting MongoDB and Spark to Work
In this module you will get some practical hands-on experience applying what you learned about Spark and MongoDB to
analyze Twitter data.

Reading · Let's Analyze Soccer Tweets!
Reading · Expressing Analytical Questions as MongoDB Queries
Quiz · Check Your Query Results
Reading · Exporting Data from MongoDB to a CSV File

Reading · Analyzing Tweets About Countries
Quiz · Check Your Analysis Results

Machine Learning With Big Data

Current Session: Dec 4

Commitment 5 weeks of study, 3-5 hours/week

Subtitles English

About the Course


Want to make sense of the volumes of data you have collected? Need to incorporate data-driven decisions into your
process? This course provides an overview of machine learning techniques to explore, analyze, and leverage data. You will
be introduced to tools and algorithms you can use to create machine learning models that learn from data, and to scale
those models up to big data problems.

At the end of the course, you will be able to:


• Design an approach to leverage data using the steps in the machine learning process.
• Apply machine learning techniques to explore and prepare data for modeling.
• Identify the type of machine learning problem in order to apply the appropriate set of techniques.
• Construct models that learn from data using widely available open source tools.
• Analyze big data problems using scalable machine learning algorithms on Spark.
WEEK 1
Welcome

Video · Welcome to Machine Learning With Big Data
Video · Summary of Big Data Integration and Processing
Other · Getting to Know You: Tell us about yourself and why you are taking this course.
Other · Discussion Forum for Course Content Issues

Introduction to Machine Learning with Big Data

Video · Machine Learning Overview
Video · Categories Of Machine Learning Techniques
Reading · Slides: Machine Learning Overview and Applications
Other · Machine Learning in Everyday Life
Video · Machine Learning Process
Video · Goals and Activities in the Machine Learning Process
Video · CRISP-DM
Video · Scaling Up Machine Learning Algorithms
Video · Tools Used in this Course
Quiz · Machine Learning Overview

Reading · Downloading, Installing and Using KNIME
Reading · Downloading and Installing the Cloudera VM Instructions (Windows)
Reading · Downloading and Installing the Cloudera VM Instructions (Mac)
Reading · Instructions for Downloading Hands On Datasets
Reading · Instructions for Starting Jupyter
Reading · PDFs of Readings for Week 1 Hands-On


WEEK 2
Data Exploration

Video · Data Terminology
Video · Data Exploration
Video · Data Exploration through Summary Statistics
Video · Data Exploration through Plots
Reading · Slides: Data Exploration Overview and Terminology
Quiz · Data Exploration

Reading · Description of Daily Weather Dataset
Reading · Exploring Data with KNIME Plots
Video · Exploring Data with KNIME Plots
Reading · Data Exploration in Spark
Video · Data Exploration in Spark
Quiz · Data Exploration in KNIME and Spark Quiz
Reading · PDFs of Activities for Data Exploration Hands-On Readings

Data Preparation

Video · Data Preparation
Video · Data Quality
Other · Quality Issues with Real Data
Video · Addressing Data Quality Issues
Video · Feature Selection
Video · Feature Transformation
Video · Dimensionality Reduction
Other · Domain Knowledge in Data Preparation

Quiz · Data Preparation
Reading · Slides: Data Preparation for Machine Learning
Reading · Handling Missing Values in KNIME
Video · Handling Missing Values in KNIME
Reading · Handling Missing Values in Spark
Video · Handling Missing Values in Spark
Quiz · Handling Missing Values in KNIME and Spark Quiz
Reading · PDFs for Data Preparation Hands-On Readings
WEEK 3
Classification

Video · Classification Reading · Classification using Decision Tree in KNIME

Video · Building and Applying a Classification Model Video · Classification using Decision Tree in KNIME

Reading · Slides: What is Classification?
Reading · Instructions for Changing the Number of Cloudera VM CPUs

Video · Classification Algorithms Reading · Classification in Spark

Video · k-Nearest Neighbors Video · Classification in Spark

Video · Decision Trees Other · Why Exclude Relative Humidity?

Quiz · Classification in KNIME and Spark Quiz

Reading · PDFs for Classification Hands-On Readings

Video · Naïve Bayes

Reading · Slides: Classification Algorithms

Quiz · Classification
WEEK 4
Evaluation of Machine Learning Models

Video · Generalization and Overfitting Reading · Evaluation of Decision Tree in KNIME

Video · Overfitting in Decision Trees Video · Evaluation of Decision Tree in KNIME

Video · Using a Validation Set Reading · Completed KNIME Workflows

Reading · Slides: Overfitting: What is it and how would you prevent it?
Video · Metrics to Evaluate Model Performance
Video · Confusion Matrix
Other · Model Interpretability vs. Accuracy
Reading · Slides: Model evaluation metrics and methods
Quiz · Model Evaluation
Reading · Evaluation of Decision Tree in Spark
Video · Evaluation of Decision Tree in Spark
Reading · Comparing Classification Results for KNIME and Spark
Quiz · Model Evaluation in KNIME and Spark Quiz
Reading · PDFs for Evaluation of Machine Learning Models Hands-On Readings
WEEK 5
Regression, Cluster Analysis, and Association Analysis

Video · Regression Overview Reading · Description of Minute Weather Dataset

Video · Linear Regression Reading · Cluster Analysis in Spark

Reading · Slides: Regression Video · Cluster Analysis in Spark

Video · Cluster Analysis Quiz · Cluster Analysis in Spark Quiz

Reading · PDFs of Cluster Analysis in Spark Hands-On Readings

Video · k-Means Clustering

Reading · Slides: Cluster Analysis

Other · Clustering Applications

Video · Association Analysis

Video · Association Analysis in Detail

Reading · Slides: Association Analysis

Other · Applications of Association Analysis

Video · Machine Learning With Big Data - Final Remarks

Quiz · Regression, Cluster Analysis, & Association Analysis

Graph Analytics for Big Data

Current Session: Dec 4

Commitment 5 weeks of study, 3-5 hours/week

Subtitles English

Want to understand your data network structure and how it changes under different conditions? Curious to know how to
identify closely interacting clusters within a graph? Have you heard of the fast-growing area of graph analytics and want to
learn more? This course gives you a broad overview of the field of graph analytics so you can learn new ways to model,
store, retrieve and analyze graph-structured data.

After completing this course, you will be able to model a problem into a graph database and perform analytical tasks over
the graph in a scalable manner. Better yet, you will be able to apply these techniques to understand the significance of
your data sets for your own projects.
WEEK 1
Welcome to Graph Analytics
Meet your instructor, Amarnath Gupta, and learn about the course objectives.

Video · Welcome to Graph Analytics for Big Data


WEEK 2
Introduction to Graphs
Welcome! This week we will get a first exposure to graphs and their use in everyday life. By the end of the module you will
be able to create a graph applying core mathematical properties of graphs, and identify the kinds of analysis questions one
might be able to ask of such a graph. We hope that you will be inspired as to how graphical representations might enable
you to answer new Big Data problems.

Reading · What to learn in this module
Video · What is a Graph?
Video · Why Graphs?
Reading · Download Slides for this Module
Other · Let's Discuss: What else do you interact with that can be represented as a graph?
Quiz · Introduction to Graphs
Peer Review · Graphs in Everyday Life
Other · Optional: What's the most interesting graph you reviewed?

Video · Why Graphs? Example 1: Social Networking

Video · Why Graphs? Example 2: Biological Networks

Video · Why Graphs? Example 3: Human Information Network Analytics

Video · Why Graphs? Example 4: Smart Cities

Video · The Purpose of Analytics

Video · What are the impact of Big Data's V's on Graphs?


WEEK 3
Graph Analytics

Reading · What to learn in this module Video · Community Analytics and Local Properties

Video · Focusing On Graph Analytics Techniques


Other · Let's Discuss: What kind of community analytics question would you like to ask?
Reading · If this module takes a little longer... that's OK!
Video · Global Property: Modularity

Reading · Download All Slides for Module 3


Video · Centrality Analytics

Video · Path Analytics Quiz · Connectivity, Community, and Centrality Analytics

Video · The Basic Path Analytics Question: What is the Best Path?
Video · Applying Dijkstra's Algorithm
Video · Inclusion and Exclusion Constraints
Other · Let's Discuss: Where do you see path problems in your life?
Video · Optional Lecture 1: Bi-directional Dijkstra Algorithm
Video · Optional Lecture 2: Goal-directed Dijkstra Algorithm
Video · Optional Lecture 3: Power Law Graphs
Video · Optional Lecture 4: Measuring Graph Evolution

Quiz · Graph Analytics Applications


Video · Optional Lecture 5: Eigenvector Centrality

Video · Optional Lecture 6: Key Player Problems


Video · Connectivity Analytics

Video · Disconnecting a Graph

Video · Connectedness: Indegree and Outdegree


WEEK 4
Graph Analytics Techniques
Welcome to the 4th module in the Graph Analytics course. Last week, we got a glimpse of a number of graph properties
and why they are important. This week we will use those properties for analyzing graphs using a free and powerful graph
analytics tool called Neo4j. We will demonstrate how to use Cypher, the query language of Neo4j, to perform a wide range
of analyses on a variety of graph networks.

Video · Welcome to Graph Analytics Techniques
Reading · About the Supplementary Resources
Reading · Downloading, Installing, and Running Neo4j - Supplementary Resources
Video · Hands-On: Downloading, Installing, and Running Neo4j
Reading · Getting Started With Neo4j - Supplementary Resources
Video · Hands-On: Getting Started With Neo4j
Reading · Adding to and Modifying a Graph - Supplementary Resources
Video · Hands-On: Modifying a Graph With Neo4j
Reading · Download datasets used in this Graph Analytics with Neo4j
Reading · Importing Data Into Neo4j - Supplementary Resources
Video · Hands-On: Importing Data Into Neo4j
Reading · Basic Queries in Neo4j With Cypher - Supplementary Resources
Video · Hands-On: Basic Queries in Neo4j With Cypher - Part 1
Video · Hands-On: Basic Queries in Neo4j With Cypher - Part 2
Reading · Path Analytics in Neo4j With Cypher - Supplementary Resources
Video · Hands-On: Path Analytics in Neo4j Using Cypher - Part 1
Video · Hands-On: Path Analytics in Neo4j Using Cypher - Part 2
Reading · Connectivity Analytics in Neo4j with Cypher - Supplementary Resources
Quiz · Quiz: Graph Analytics With Neo4j
Reading · Assignment: Practicing Graph Analytics in Neo4j With Cypher
Quiz · Assessment Questions on 'Practicing Graph Analytics in Neo4j With Cypher'
Reading · FAQ
Reading · Download All Neo4j Supplementary Resources (PDFs)
WEEK 5
Computing Platforms for Graph Analytics
In the last two modules we have learned about graph analytics and graph data management. This week we will study how
they come together. There are programming models and software frameworks created specifically for graph analytics. In
this module we'll give an introductory tour of these models and frameworks. You will learn to implement what you learned
in Week 2 and build on it using GraphX and Giraph.

Video · Introduction: Large Scale Graph Processing
Video · A Parallel Programming Model for Graphs
Video · Pregel: The System That Changed Graph Processing
Video · Giraph and GraphX
Video · Beyond Single Vertex Computation
Video · Introduction to GraphX: Hands-On Demonstrations
Reading · Datasets and Libraries for Example of Analytics Hands On
Reading · Download all of the readings for this section as a PDF
Video · Hands On: Building a Graph
Reading · Hands On: Building a Graph Reading
Video · Hands On: Building a Degree Histogram
Reading · Hands On: Building a Degree Histogram Reading
Video · Hands On: Plot the Degree Histogram
Reading · Hands On: Plot the Degree Histogram Reading
Video · Hands On: Network Connectedness and Clustering Components
Reading · Hands On: Network Connectedness and Clustering Components Reading
Video · Hands On: Joining Graph Datasets
Reading · Hands On: Joining Graph Datasets Reading
Quiz · Using GraphX
Reading · Assignment: Practicing Graph Analytics in Neo4j With Cypher
Quiz · Assessment Questions on 'Practicing Graph Analytics in Neo4j With Cypher'
Reading · Download All Neo4j Supplementary Resources (PDFs)

Big Data - Capstone Project


 English

About the Capstone Project


Welcome to the Capstone Project for Big Data! In this culminating project, you will build a big data ecosystem using tools
and methods from the earlier courses in this specialization. You will analyze a data set simulating big data generated from
a large number of users who are playing our imaginary game "Catch the Pink Flamingo". During the five-week Capstone
Project, you will walk through the typical big data science steps for acquiring, exploring, preparing, analyzing, and reporting.
In the first two weeks, we will introduce you to the data set and guide you through some exploratory analysis using tools
such as Splunk and Open Office. Then we will move into more challenging big data problems requiring the more advanced
tools you have learned, including KNIME, Spark's MLlib and Gephi. Finally, during the fifth and final week, we will show you
how to bring it all together to create engaging and compelling reports and slide presentations. As a result of our
collaboration with Splunk, a software company focused on analyzing machine-generated big data, learners with the top
projects will be eligible to present to Splunk and meet Splunk recruiters and engineering leadership.
WEEK 1
Simulating Big Data for an Online Game
This week we provide an overview of the Eglence, Inc. Pink Flamingo game, including the data the company has access to
about the game and its users, and what we might be interested in finding out from that data.



WEEK 2
Data Classification with KNIME
This week we do some data classification using KNIME.

WEEK 3
Clustering with Spark
This week we do some clustering with Spark.

Reading · Informing business strategies based on client base
Other · Is there only “one way” to cluster a client base?
Other · How many clusters?
Other · What kind of criteria might provide actionable information for Eglence Inc.?
Reading · Practice with PySpark MLlib Clustering
Peer Review · Recommending Actions from Clustering Analysis
WEEK 4
Graph Analytics of Simulated Chat Data With Neo4j
This week we apply what we learned from the 'Graph Analytics for Big Data' course to simulated chat data from Catch the
Pink Flamingo using Neo4j. We analyze player chat behavior to find ways of improving the game.

Reading · Understanding the Simulated Chat Data Generated by the Scripts
Other · Is there only “one way” to cluster a client base?
Reading · Graph Analytics of Catch the Pink Flamingo Chat Data Using Neo4j
Peer Review · Graph Analytics With Chat Data Using Neo4j
WEEK 5
Reporting and Presenting Your Work

Video · Week 5: Bringing It All Together

Reading · Final project preparation


WEEK 6
Final Submission

Peer Review · Final Project Practice Peer Review · Optional 3-minute video: Splunk opportunity

Video · Congratulations! Some Final Words... Reading · Part 2: Help us connect your video to your LinkedIn profile
