This is the reason I thought of writing this article. It gives you a guided path to start
learning big data and will help you land a job in the big data industry. The biggest
challenge we face is identifying the right role for our interests and skill sets.
To tackle this problem, I have explained each big data role in detail, considering the
different job roles of engineers and computer science graduates.
I have tried to answer all the questions you have or will encounter while learning big
data. To help you choose a path according to your interest, I have added a tree map
that will help you identify the right path.
Table of Contents
1. How to get started?
2. What roles are up for grabs in the big data industry?
3. What is your profile, and where do you fit in?
4. Mapping roles to Big Data profiles
5. How to be a big data engineer?
What is the big data jargon?
Systems and architecture you need to know
Learn to design solutions and technologies
6. Big Data Learning Path
7. Resources
1. How to get started?
A question that comes up when starting out with big data is: do I learn Hadoop,
distributed computing, Kafka, NoSQL or Spark?
Well, I always have one answer: it depends on what you actually want to do.
So, let's approach this problem in a methodical way. We are going to go through this
learning path step by step.
Big data engineering revolves around the design, deployment, acquisition and
maintenance (storage) of large amounts of data. The systems that big data engineers
design and deploy make relevant data available to various consumer-facing and
internal applications.
Big data analytics, in contrast, revolves around using the large amounts of data from
the systems designed by big data engineers. It involves analyzing trends and patterns
and developing various classification, prediction and forecasting systems.
Thus, in brief, big data analytics involves advanced computations on the data, whereas
big data engineering involves designing and deploying the systems and setups on top
of which those computations are performed.
3. What is your profile, and where do you fit in?
Identify which profile is suitable for you, so that you can analyze where you may fit in
the industry.
Educational Background
(This includes interests and doesn't necessarily point towards your college education.)
1. Computer Science
2. Mathematics
Industry Experience
1. Fresher
2. Data Scientist
3. Computer Engineer (work in Data related projects)
Thus, by using the above categories you can define your profile as follows:
E.g. 1: I am a computer science grad with fairly solid math skills and no experience.
You have an interest in computer science or mathematics, but with no prior
experience you will be considered a Fresher.
Your interest is in computer science and you are fit for the role of a Computer Engineer
(data-related projects).
You have an interest in mathematics and are fit for the role of a Data Scientist.
So, go ahead and define your profile.
(The profiles we define here are essential in finding your learning path in the big data
industry).
5. How to be a big data engineer?
Let us first define what a big data engineer needs to know and learn to be considered
for a position in the industry. The first and foremost step is to identify your needs.
You can't just start studying big data without identifying them; otherwise, you would
just be shooting in the dark.
In order to define your needs, you must know the common big data jargon. So let's
find out what big data actually means.
5.1 The Big Data jargon
Source Throughput: Defines at what rate data can be updated and transformed into
the system. (Types: H/M/L)
(This is my personal solution. You may come up with a more elegant one; if you do,
please share it below.)
seamlessly integrate data from various sources to make it available all the time, but it
must also be designed in a way to make the analysis of the data and utilization of data
for developing applications easy, fast and always available (Intelligent dashboard in
this case).
Now that we know what our end goals are, let us try to formulate our requirements in
more formal terms.
Completeness: Incomplete
Precision: Exact
As multiple data sources are being integrated, it is important to note that different data
will enter the system at different rates. For example, the weblogs will be available in a
continuous stream with a high level of granularity.
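One way to keep such requirements in a formal shape is to write them down as data. The following is only a minimal sketch: the class and field names are hypothetical, the second source is invented for illustration, and marking weblogs as high throughput is my assumption based on their being a continuous stream.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The H/M/L scale used for source throughput."""
    HIGH = "H"
    MEDIUM = "M"
    LOW = "L"


@dataclass(frozen=True)
class SourceRequirement:
    """One data source and the requirements recorded for it (hypothetical fields)."""
    name: str
    throughput: Level
    completeness: str  # e.g. "Incomplete"
    precision: str     # e.g. "Exact"


# Illustrative values only. "nightly_export" is an invented source, and the
# throughput levels are assumptions, not figures from the article.
sources = [
    SourceRequirement("weblogs", Level.HIGH, "Incomplete", "Exact"),
    SourceRequirement("nightly_export", Level.LOW, "Incomplete", "Exact"),
]

# Sources that arrive quickly are candidates for stream processing;
# the rest can be handled in batches.
streaming = [s.name for s in sources if s.throughput is Level.HIGH]
batch = [s.name for s in sources if s.throughput is not Level.HIGH]
print(streaming, batch)
```

Writing the requirements this way makes it easy to see which sources enter the system at which rate before you pick any technology.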
Based on the above analysis of our requirements for the system we can recommend
the following big data setup.
6. Big Data Learning Path
Now you have an understanding of the big data industry and the different roles in it.
As we know, the big data domain is littered with technologies, so it is crucial that
you learn the technologies that are relevant to and aligned with your big data job role.
This is a bit different from conventional domains like data science and machine
learning, where you start at something and endeavor to complete everything in the field.
Below you will find a tree which you should traverse in order to find your own path.
Even though some of the technologies in the tree are marked as the data scientist's
forte, it is always good to know all the technologies down to the leaf nodes if you
embark on a path. The tree is derived from the lambda architecture paradigm.
With the help of this tree map, you can select the path as per your interest and goals.
One of the essential skills that any engineer who wants to deploy applications must
have is Bash scripting. You must be very comfortable with Linux and Bash scripting.
This is an essential requirement for working with big data.
At the core, most big data technologies are written in Java or Scala. But don't worry:
if you do not want to code in these languages, you can choose Python or R, because
most big data technologies now support Python and R extensively.
Thus, you can start with any of the above-mentioned languages. I would recommend
choosing either Python or Java.
No one is going to take you seriously if you haven't worked with big data on the cloud.
Try practicing with small datasets on AWS, SoftLayer or any other cloud provider. Most
of them have a free tier so that students can practice. You can skip this step for the
time being if you like, but be sure to work on the cloud before you go for any interview.
Next, you need to learn about distributed file systems. The most popular DFS out
there is the Hadoop Distributed File System (HDFS). At this stage you can also study a
NoSQL database that you find relevant to your domain. The diagram below helps you in
selecting a NoSQL database to learn based on the domain you are interested in.
The path up to this point covers the mandatory basics which every big data engineer
must know.
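To make the distributed file system step a little more concrete, here is a toy sketch of the core idea behind HDFS: a file is cut into fixed-size blocks and each block is stored on several nodes. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2 and later. The function names, node names and the round-robin placement are my own simplifications; real HDFS placement is rack-aware.

```python
from itertools import cycle, islice

BLOCK_SIZE_MB = 128   # HDFS default block size (Hadoop 2 and later)
REPLICATION = 3       # HDFS default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block a file of the given size is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplification: real HDFS takes racks and node load into account.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = list(islice(cycle(nodes), start, start + replication))
    return placement


blocks = split_into_blocks(300)  # a 300 MB file
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in layout.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on three nodes, losing a single node does not lose any data, which is the main reason a DFS sits at the base of the path.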
Now is the point at which you decide whether you would like to work with data streams
or dormant large volumes of data. This is the choice between two of the four Vs that
are used to define big data (Volume, Velocity, Variety and Veracity).
So let's say you have decided to work with data streams to develop real-time or
near-real-time analysis systems. Then you should take the Kafka path. Otherwise, take
the MapReduce path. And thus you follow the path that you create. Do note that in the
MapReduce path you do not need to learn both Pig and Hive. Studying only one of them
is enough.
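If you have not seen the MapReduce model before, the classic word count is a quick way to get a feel for it. This is a single-machine sketch in plain Python, not Hadoop code; the function names are mine, and on a real cluster the map and reduce steps would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data streams"])
print(counts)
```

Tools like Hive and Pig exist largely so that you can express jobs of this kind without writing the map and reduce functions by hand.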
Did the last step (#7) baffle you? Well, truth be told, no application has only stream
processing or only slow-velocity, delayed processing of data. Thus, you technically
need to be a master at executing the complete lambda architecture.
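The lambda architecture combines both kinds of processing: a batch layer that recomputes views from all the raw data, a speed layer that covers whatever arrived since the last batch run, and a serving layer that merges the two at query time. Below is a toy sketch of that idea as an event counter. The class and method names are hypothetical, and a real system would use something like HDFS and MapReduce for the batch layer and Kafka with a stream processor for the speed layer.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master_log = []          # append-only raw data
        self.batch_view = Counter()   # recomputed from the full log
        self.speed_view = Counter()   # incremental, recent events only

    def ingest(self, key):
        """New data goes to both the master log and the speed layer."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all raw data, then clear
        the speed view, which the batch view now covers."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch and real-time views."""
        return self.batch_view[key] + self.speed_view[key]


lam = LambdaCounter()
for page in ["home", "home", "cart"]:
    lam.ingest(page)
lam.run_batch()      # batch view now covers the first three events
lam.ingest("home")   # arrives after the batch run, held by the speed layer
print(lam.query("home"), lam.query("cart"))
```

The query stays correct whether or not the batch job has caught up, which is the property that lets one system serve both streams and large dormant volumes.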
Also, note that this is not the only way you can learn big data technologies. You can
create your own path as you go along. But this is a path which can be used by
anybody.
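The steps described so far can be summed up in a few lines of code. Treat this as a rough sketch only: the common steps follow the order in which I introduced them, but the exact contents of each branch are an assumption pieced together from the text, and the names are made up for illustration.

```python
# Steps every path shares, in the order they were introduced above.
COMMON_STEPS = [
    "Bash scripting",
    "Python or Java",
    "Cloud",
    "HDFS",
    "NoSQL database",
]

# Branches after the common steps. Assumption: these lists are a
# simplification, not the full tree.
BRANCHES = {
    "streams": ["Apache ZooKeeper", "Apache Kafka", "Apache Spark Streaming"],
    "batch": ["MapReduce", "SQL", "Hive or Pig"],
}


def learning_path(works_with_streams):
    """Return the ordered list of topics for the chosen kind of data."""
    branch = "streams" if works_with_streams else "batch"
    return COMMON_STEPS + BRANCHES[branch]


for step in learning_path(works_with_streams=True):
    print(step)
```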
If you want to enter the big data analytics world you could follow the same path, but
don't try to perfect everything.
For a data scientist capable of working with big data, you need to add a couple of
machine learning pipelines to the tree and concentrate on those pipelines more than
on the tree itself. But we can discuss ML pipelines later.
Add a NoSQL database of your choice to the above tree, based on the type of data you
are working with.
As you can see, there are loads of NoSQL databases to choose from, so the choice
always depends on the type of data that you will be working with.
To give a definitive answer on which type of NoSQL database to use, you need to take
into account your system requirements, such as latency, availability, resilience and
accuracy, and of course the type of data that you are dealing with.
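As a starting point, the type of data can be mapped to one of the four common NoSQL families. The lookup below is a hedged sketch: the families and example databases are well known, but the keys and the function name are hypothetical, and a real decision also has to weigh the latency, availability and resilience requirements mentioned above.

```python
NOSQL_FAMILIES = {
    # simple lookups by a single key
    "key_value": {"family": "key-value store", "examples": ["Redis", "Riak"]},
    # nested, JSON-like records
    "document": {"family": "document store", "examples": ["MongoDB", "CouchDB"]},
    # very large, sparse tables
    "wide_column": {
        "family": "wide-column store",
        "examples": ["Apache Cassandra", "Apache HBase"],
    },
    # highly connected entities
    "graph": {"family": "graph database", "examples": ["Neo4j"]},
}


def suggest_family(data_shape):
    """Return the NoSQL family usually associated with a data shape."""
    try:
        return NOSQL_FAMILIES[data_shape]
    except KeyError:
        known = ", ".join(sorted(NOSQL_FAMILIES))
        raise ValueError(f"unknown data shape {data_shape!r}; expected one of: {known}")


choice = suggest_family("document")
print(choice["family"], choice["examples"])
```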
7. Resources
1. Bash Scripting
2. Python
3. Java
4. Cloud
5. HDFS
6. Apache ZooKeeper
7. Apache Kafka
8. SQL
9. Hive
10. Pig
12. Amazon Kinesis
14. Apache Spark Streaming
End Notes
I hope you enjoyed reading this article. With the help of this learning path, you will be
able to embark upon your journey in the big data industry. I have covered most of the
major concepts which you will require to land a job.
If you have any doubts or questions, feel free to post them below.
Author
saurabh.jaju2
Saurabh is a data scientist and software engineer skilled at analyzing a variety of
datasets and developing smart applications. He is currently pursuing a Master's
degree in Information and Data Science from the University of California, Berkeley and
is passionate about developing data science based smart resource management
systems.
9 COMMENTS
Syed Ishrathullah says:
March 24, 2017 at 9:28 AM
saurabh.jaju2 says:
March 25, 2017 at 10:48 AM
Hi Syed,
Glad you liked the article. I am no expert in IT security, so I cannot be certain about
all applications of big data in IT security. But weblog and network traffic analysis
(my favorite!) is a very interesting field (this involves the Kafka and stream analysis
track). There are also smart firewalls, and pattern analysis to develop smart
anti-malware software. Microsoft organized a big data analysis competition for malware
classification on Kaggle; that was an interesting application.
Regards,
Saurabh
Parag k says:
March 25, 2017 at 3:50 AM
Thanks for details .. most helpful document with reference link
venkatesh says:
March 25, 2017 at 4:29 AM
saurabh.jaju2 says:
March 25, 2017 at 10:31 AM
Hi Prof Taran,
I don't know a lot about civil engineering, so I cannot comment on the complete
field. But here are a couple of interesting projects that I know about:
1. Using big data analytics for efficiently managing and provisioning heavy
equipment and other essentials while working on large construction projects. (A
friend of mine is working on it.)
2. Using big data analytics for freshwater management. (I am working on this.)
But essentially any supply chain management or resource allocation and
management problem can be an important problem solvable using big data
analytics.
Regards,
Saurabh
anup@AV says:
March 28, 2017 at 1:18 PM
Syed says:
March 28, 2017 at 5:43 PM
Hi Saurabh,
A couple of points.
1) You haven't mentioned SAS. Any reason why, as it holds good sway in big data
analytics?
2) I also wanted your view on the timelines needed to pursue these courses. For
instance, someone like me who is an Info Sec Manager wanting to get into Big
Data will try following your article (as I loved it) and go for
Regards
Syed
vaishnavi says:
March 31, 2017 at 4:26 AM
Hi,
I am an IT engineer who graduated in 2016 and also completed a two-month Hadoop
course at an institute. Now I am thinking of a postgraduate program in big data and
analytics. Should I go for it, or should I wait and get some experience?