
Advance Big Data Science using Python-R-Hadoop-Spark
A comprehensive, job-oriented training program crafted by experts

Disclaimer: This material is protected under copyright by AnalytixLabs, 2011-2016. Unauthorized use and/or duplication of this material or any part of it, including data, in any form without explicit written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.
About AnalytixLabs
AnalytixLabs is a capability building and training solutions firm led by McKinsey, IIM, ISB and IIT alumni with deep industry experience and a flair for coaching. We are focused on helping our clients develop skills in basic and advanced analytics so that they emerge as industry-ready professionals and enhance their career opportunities. AnalytixLabs has also been featured among the top institutes by prestigious publications like Analytics India Magazine and Higher Education Review since 2013.

Bottom line: Job-oriented training
- Lucrative job prospects in a high-growth domain
- Support for relevant certifications and diplomas
- Career counseling and planning
- Value for money with high return on investment

Faculty: Seasoned analytics professionals
- Together we have 30+ years of experience with prestigious firms like McKinsey, KPMG, Deloitte and AOL
- Regular sessions by industry experts

Content: World-class course structure
- Surpasses industry requirements
- High-quality course material and real-life case studies

Approach
- 80-20 focus on practical & theory
- Caters to standard certifications
- Personal attention and individual counselling
- Industry best practices
Global Data Science and Big Data skill gap

McKinsey Global Institute estimates a shortage of nearly 1.7 million big data professionals by 2018. This includes a shortage of 140,000 to 190,000 workers with deep technical and analytical expertise, and a shortage of 1.5 million managers and analysts equipped to work with and use big data outputs.

Candidates trained by us are working in leading companies across industries.
Program Objective

The Advance Big Data Science using Python-R-Hadoop-Spark program aims to provide its students with an international, wide-spectrum qualification for job-readiness and seamless absorption into Big Data job roles.

The program will expose students and professionals to the role of Big Data Analysts who have:

- The ability to translate a business problem into an analytics problem
- An understanding of storage, retrieval and mining of data
- Outcome-oriented, industry-specific expertise in critical data analytics and data management skills
- Hands-on practical skills in exploratory analysis using Hadoop, and prescriptive and predictive analysis using R, Spark and Python
- The ability to apply analytics in various domains, like e-commerce, retail, telecom, BFSI etc.
- The skills to leverage analytics to drive smart business decisions

Crafted by a team of experts, the program maintains a balance between theoretical concepts and practical applications.

Advance Big Data Science is a comprehensive program with the following modules, weekly assignments and case studies:

Module 1: Python (30 hours + practice exercises)
Basic data handling, data manipulation, descriptive analytics and visualization

Module 2: Hadoop & its components (15 hours + practice)
Introduction to the ecosystem with a focus on core components: Pig, Hive, MapReduce, HDFS, Impala, HBase and other Apache projects

Module 3: Spark (45 hours + practice)
Working on Spark and connecting to Hive and Python for advanced analytics and machine learning

Module 4: R (video based, 12 hours)
Data manipulation, descriptive analytics and visualization using R
Advance Big Data Science using Python-R-Hadoop-Spark (1/3)
Total Duration: 90 hours + Practice
Introduction to Data Science
- What is Data Science?
- Data Science vs. Analytics vs. Data Warehousing, OLAP, MIS Reporting
- Relevance in industry and need of the hour
- Types of problems and objectives in various industries
- How leading companies are harnessing the power of Data Science
- Different phases of a typical Analytics/Data Science project

Python: Introduction & Essentials
- Overview of Python - starting Python
- Introduction to Python editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython etc.)
- Custom environment settings
- Concept of packages/libraries - important packages (NumPy, SciPy, scikit-learn, Pandas, Matplotlib etc.)
- Installing & loading packages & namespaces
- Data types & data objects/structures (tuples, lists, dictionaries)
- List and dictionary comprehensions
- Variable & value labels, date & time values
- Basic operations - mathematical, string, date
- Reading and writing data
- Simple plotting
- Control flow
- Debugging
- Code profiling

Python: Accessing/Importing and Exporting Data
- Importing data from various sources (CSV, TXT, Excel, Access etc.)
- Database input (connecting to databases)
- Viewing data objects - subsetting, methods
- Exporting data to various formats

Python: Data Manipulation & Cleansing
- Cleansing data with Python
- Data manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting etc.)
- Data manipulation tools (operators, functions, packages, control structures, loops, arrays etc.)
- Python built-in functions (text, numeric, date, utility functions)
- Python user-defined functions
- Stripping out extraneous information
- Normalizing data
- Formatting data
- Important Python packages for data manipulation (Pandas, NumPy etc.)

Python: Data Analysis & Visualization
- Introduction to exploratory data analysis
- Descriptive statistics, frequency tables and summarization
- Univariate analysis (distribution of data & graphical analysis)
- Bivariate analysis (cross tabs, distributions & relationships, graphical analysis)
- Creating graphs (bar/pie/line chart, histogram, boxplot, scatter, density etc.)
- Important packages for exploratory analysis (NumPy arrays, Matplotlib, Pandas, scipy.stats etc.)

Python: Basic Statistics
- Basic statistics - measures of central tendency and variance
- Building blocks - probability distributions - normal distribution - Central Limit Theorem
- Inferential statistics - sampling - concept of hypothesis testing
- Statistical methods - Z/t-tests (one sample, independent, paired), ANOVA, correlation and chi-square

Python: Polyglot Programming
- Making Python talk to other languages and database systems
- How R and Python play with each other, and why it's essential to know both

Hadoop: Introduction to Hadoop & Ecosystem
- Introduction to Hadoop
- Hadoopable problems - uses of Big Data analytics in various industries like telecom, e-commerce, finance and insurance etc.
- Problems with traditional large-scale systems & existing data analytics architecture
- Key technology foundations required for Big Data
- Comparison of traditional data management systems with Big Data management systems
- Key framework requirements for Big Data analytics
- Apache projects in the Hadoop ecosystem
- Hadoop ecosystem & Hadoop 2.x core components
- Relevance of real-time data
- How to use Big Data and real-time data as a business planning tool
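To give a flavour of the Python topics listed above, here is a minimal sketch that strings a few of them together: importing a CSV file, basic cleansing and manipulation with Pandas, descriptive statistics, a one-sample t-test and a simple plot. The file name sales.csv and the columns region and revenue are hypothetical placeholders, not course material.

    import numpy as np
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt

    # Importing data from a flat file (hypothetical file and columns)
    df = pd.read_csv("sales.csv")

    # Cleansing and manipulation: duplicates, filtering, a derived variable
    df = df.drop_duplicates()
    df = df[df["revenue"] > 0]
    df["log_revenue"] = np.log(df["revenue"])

    # Descriptive statistics, a frequency table and a bivariate summary
    print(df["revenue"].describe())
    print(df["region"].value_counts())
    print(df.groupby("region")["revenue"].mean())

    # Inferential statistics: one-sample t-test against a hypothesised mean of 100
    t_stat, p_value = stats.ttest_1samp(df["revenue"], popmean=100)
    print(t_stat, p_value)

    # Simple plotting: histogram of revenue
    df["revenue"].plot(kind="hist")
    plt.show()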
Advance Big Data Science using Python-R-Hadoop-Spark (2/3)
Total Duration: 90 hours + Practice
Hadoop Core Components: HDFS
- HDFS overview & data storage in HDFS
- Getting data into Hadoop from the local machine (data loading techniques) and vice versa

Hadoop Core Components: MapReduce (YARN)
- MapReduce overview (traditional way vs. MapReduce way)
- Concept of Mapper & Reducer
- Understanding the MapReduce program skeleton
- Running a MapReduce job from the command line

Hadoop Data Analysis Tools: Hadoop-Pig
- Introduction to Pig - MapReduce vs. Pig, Pig use cases
- Pig Latin program & execution
- Pig Latin: relational operators, file loaders, GROUP operator, COGROUP operator, joins and COGROUP, UNION, diagnostic operators, Pig UDFs
- Using Pig to automate the design and implementation of MapReduce applications
- Data analysis using Pig

Hadoop Data Analysis Tools: Hadoop-Hive
- Introduction to Hive - Hive vs. Pig - Hive use cases
- Hive data storage principles
- File formats and record formats supported by the Hive environment
- Performing operations with data in Hive
- HiveQL: joining tables, dynamic partitioning, custom Map/Reduce scripts
- Hive scripts, Hive UDFs

Hadoop Data Analysis Tools: Impala
- Introduction to Impala & its architecture
- How Impala executes queries and its importance
- Hive vs. Pig vs. Impala
- Extending Impala with user-defined functions
- Improving Impala performance

Hadoop Data Analysis Tools: HBase (NoSQL Database)
- Introduction to NoSQL databases, types, and HBase
- HBase vs. RDBMS, HBase components, HBase architecture
- HBase cluster deployment

Hadoop: Introduction to Other Apache Projects
- Introduction to ZooKeeper/Oozie/Sqoop/Flume

Spark: Introduction
- Introduction to Apache Spark
- Streaming data vs. in-memory data
- MapReduce vs. Spark
- Modes of Spark
- Spark installation demo
- Overview of Spark on a cluster
- Spark standalone cluster

Spark: Spark in Practice
- Invoking the Spark shell
- Creating the Spark context
- Loading a file in the shell
- Performing some basic operations on files in the Spark shell
- Building a Spark project with sbt
- Running a Spark project with sbt
- Caching overview
- Distributed persistence
- Spark Streaming overview (example: streaming word count)

Spark: Spark Meets Hive
- Analyze Hive and Spark SQL architecture
- Analyze Spark SQL
- Context in Spark SQL
- Implement a sample example for Spark SQL
- Integrating Hive and Spark SQL
- Support for JSON and Parquet file formats
- Implement data visualization in Spark
- Loading of data
- Hive queries through Spark
- Performance tuning tips in Spark
- Shared variables: broadcast variables & accumulators

Data Science using Spark & Python
- Hadoop - Python integration
- Spark - Python integration (PySpark)

Spark-Python: Machine Learning - Predictive Modeling Basics
- Introduction to machine learning & predictive modeling
- Types of business problems - mapping of techniques
- Major classes of learning algorithms - supervised vs. unsupervised learning
- Different phases of predictive modeling (data pre-processing, sampling, model building, validation)
- Overfitting (bias-variance trade-off) & performance metrics
- Types of validation (bootstrapping, K-fold validation etc.)
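As an illustration of the "Spark in Practice" and "Spark Meets Hive" topics above, the sketch below creates a Spark session, runs a word count on a text file and issues a Hive query through Spark SQL. The file path and the orders table are hypothetical placeholders, and it assumes a Spark installation configured with Hive support.

    from pyspark.sql import SparkSession

    # Create a Spark session with Hive support (assumes Spark is configured for Hive)
    spark = (SparkSession.builder
             .appName("course-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Loading a file and performing a basic operation: word count
    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")  # hypothetical path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))

    # A Hive query run through Spark SQL, with the result cached for reuse
    orders = spark.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
    orders.cache()
    orders.show()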
Advance Big Data Science using Python-R-Hadoop-Spark (3/3)
Total Duration: 90 hours + Practice
Spark-Python: Machine Learning in Practice
- Linear regression
- Logistic regression
- Segmentation - cluster analysis (K-means)
- Decision trees (CHAID/CART/C5.0)
- Artificial neural networks (ANN)
- Support vector machines (SVM)
- Ensemble learning (random forest, bagging & boosting)
- Other techniques (KNN, Naive Bayes, LDA/QDA etc.)
- Important packages for machine learning (scikit-learn, scipy.stats etc.)

R: Introduction - Data Importing/Exporting (Video Based)
- Introduction to R/RStudio - GUI
- Concept of packages - useful packages (base & other packages) in R
- Data structures & data types (vectors, matrices, factors, data frames, and lists)
- Importing data from various sources
- Database input (connecting to databases)
- Exporting data to various formats
- Viewing data (viewing partial data and full data)
- Variable & value labels, date values

R: Data Manipulation (Video Based)
- Data manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting etc.)
- Data manipulation tools (operators, functions, packages, control structures, loops, arrays etc.)
- R built-in functions (text, numeric, date, utility)
- R user-defined functions
- R packages for data manipulation (base, dplyr, plyr, reshape, car, sqldf etc.)

R: Data Analysis & Visualization (Video Based)
- Introduction to exploratory data analysis
- Descriptive statistics, frequency tables and summarization
- Univariate analysis (distribution of data & graphical analysis)
- Bivariate analysis (cross tabs, distributions & relationships, graphical analysis)
- Creating graphs (bar/pie/line chart, histogram, boxplot, scatter, density etc.)
- R packages for exploratory data analysis (dplyr, plyr, gmodels, car, vcd, Hmisc, psych, doBy etc.)
- R packages for graphical analysis (base, ggplot, lattice etc.)

Integration of Hadoop with R
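To illustrate the "Spark-Python: Machine Learning in Practice" topics on this page, here is a minimal sketch of one technique from the list, logistic regression, using the pyspark.ml API: feature assembly, a train/test split, model fitting and evaluation. The file customers.csv and the columns age, income and label are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

    # Hypothetical input: a CSV with numeric predictors and a 0/1 label column
    df = spark.read.csv("customers.csv", header=True, inferSchema=True)

    # Pre-processing: pack the raw columns into a single feature vector
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
    data = assembler.transform(df).select("features", "label")

    # Sampling: split into training and validation sets
    train, test = data.randomSplit([0.7, 0.3], seed=42)

    # Model building: fit a logistic regression
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

    # Validation: area under the ROC curve as the performance metric
    evaluator = BinaryClassificationEvaluator(labelCol="label")
    auc = evaluator.evaluate(model.transform(test))
    print("Test AUC:", auc)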
Course completion and career assistance

Course completion & certification criteria

You shall be awarded an AnalytixLabs certificate only after the submission and evaluation of mandatory course project work. These will be provided as part of the training.

There is no pass/fail for these assignments and projects. Our objective is to ensure that trainees get strong hands-on experience so that they are well prepared for job interviews, as well as for performance in their jobs.

In case the assignments and projects are not up to the mark, trainees are welcome to take help and support to improve them.

While a weekly schedule is shared with trainees for regular assignments, candidates get 3 months post course completion to submit their final assignment and projects.

What is included in career assistance?

After successful course completion, candidates can seek assistance from AnalytixLabs for profile building. A team of seasoned professionals will help you based on your overall education background and work experience. This will be followed by interview preparation along with mock interviews (if required).

Job referrals are based on the requirements we get from various organizations, HR consultants and a large pool of AnalytixLabs ex-students working in various companies.

No one can truthfully provide a job guarantee, particularly for good quality job profiles in Analytics. However, most of our students do get multiple interview calls and good career options based on the skills they learn during training. For this, there will be continuous support from our side for as long as required.
Time and investment

Fully interactive online training: 90 hours live training + 12 hours video-based training + practice (~80 hours), INR 40,000 + 15% ST / $1,500 (foreign nationals) including taxes

Self-paced video training: 100 hours + practice, INR 35,000 + 15% ST / $1,200 (foreign nationals) including taxes

Timing: 6 hours of live training per weekend (Saturday & Sunday, 3 hours each) + practice

Training mode: Fully interactive live online class
(In addition to the above, you will also get access to the recordings for future reference and self-study)

Components: Learning Management System access for courseware such as class recordings and study material, plus industry-relevant project work

Certification: Participants will be awarded a certificate on successful completion of the stipulated requirements, including an evaluation
We provide trainings in both fully interactive live online and classroom* modes

Fully interactive live online class with personal attention
- Access to quality training and 24x7 practice sessions, available from the comfort of your place
- Saves commuting time and resources in today's chaotic world
- Ensures the best use of time and resources
- Delivered lectures are recorded and can be replayed by individuals as per their needs
- Studies prove that online education beats the conventional classroom
- One of the strongest global trends in education, both in developing and developed countries

*Classroom only available at Gurgaon center


Contact Us

Visit us on: http://www.analytixlabs.in/

For course registration, please visit: http://www.analytixlabs.co.in/course-registration/

For more information, please contact us: http://www.analytixlabs.co.in/contact-us/


Or email: info@analytixlabs.co.in
Call us, we would love to speak with you: (+91) 9555219007

Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/
