1) Search Google for "Download Oracle VM VirtualBox"

https://www.virtualbox.org/wiki/Downloads
2) Go to Google and search for "Hadoop different distributions"
3) Go to Google and search for "Different companies using Hadoop"
4) https://uidai.gov.in/beta/
5) Setting up a new cluster
a) ifconfig (check the IP address)
b) su - (switch to the root user)
c) vi /etc/hosts (add the IP address and hostname of every machine in the cluster)
d) vi /etc/sysconfig/selinux (change the SELINUX property from enforcing to disabled)
e) java -version (check which Java version is installed)
f) ls /usr/lib/jvm (Java is installed here)
g) ls /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.79.x86_64/ (check whether there is a program like jps)
h) yum install java-1.7.0-openjdk-devel (Java needs to be installed on every machine)
i) yum install openssh-server (needed to set up passwordless access between machines)
j) ls -all /home/pulak/Downloads (right now nothing is there; however, we need the Hadoop tar files and a few other files)
k) Go to a web browser and search for "Hadoop Download Releases". Download the tar files for Hadoop versions 1 and 2
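Steps c) and d) can also be scripted instead of edited by hand in vi. The following is a minimal sketch, assuming bash on a CentOS/RHEL-style machine; the function names hosts_entries and disable_selinux, the host names master and slave1, and the 192.168.56.x addresses are hypothetical placeholders and do not come from these notes. Both helpers take their input as arguments, so they can be tried on a scratch file before anything under /etc is touched.

```shell
#!/usr/bin/env bash
# Sketch only: the function names, host names and IP addresses below are
# hypothetical placeholders, not values taken from these notes.

# Print one /etc/hosts-style line (IP, tab, hostname) per "ip=hostname" argument.
hosts_entries() {
    local pair
    for pair in "$@"; do
        printf '%s\t%s\n' "${pair%%=*}" "${pair#*=}"
    done
}

# Rewrite SELINUX=enforcing as SELINUX=disabled in the file given as $1.
# Comment lines are untouched because the pattern is anchored at line start.
disable_selinux() {
    local tmp
    tmp=$(mktemp) || return 1
    # Write back with cat (not mv) so a symlinked config file stays a symlink.
    sed 's/^SELINUX=enforcing/SELINUX=disabled/' "$1" > "$tmp" && cat "$tmp" > "$1"
    local status=$?
    rm -f "$tmp"
    return "$status"
}

# Demo on a scratch file, so nothing under /etc is modified.
scratch=$(mktemp)
printf '# SELINUX= can be enforcing, permissive or disabled\nSELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$scratch"
disable_selinux "$scratch"
grep '^SELINUX=' "$scratch"        # prints: SELINUX=disabled
hosts_entries 192.168.56.101=master 192.168.56.102=slave1
rm -f "$scratch"
```

To apply this for real, one would run as root something like `hosts_entries ... >> /etc/hosts` and `disable_selinux /etc/sysconfig/selinux` on each machine. The SELinux change only takes effect after a reboot.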

6) Big Data Hype


90% of the world's data was generated in the last few years.
The amount of data produced from the beginning of time until 2003 was 5 billion gigabytes.
The same amount was created every two days in 2011, and every ten minutes in 2013.
7) Big Data comes from:
Big data involves the data produced by different devices and applications.
i) Social Media sites: Social media such as Facebook and Twitter hold information and views posted by millions of people across the globe.
ii) Sensor networks
iii) Digital Images or Videos
iv) Cellphone GPS signals
v) Purchase Transaction records
vi) Web Logs
vii) Medical Records
viii) Archives
ix) Military Surveillance
x) e-Commerce
xi) Complex Scientific Research
xii) Server Logs
xiii) Call Detail Records (CDR)
xiv) Black Box Data: The black box is a component of helicopters, airplanes, and jets. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
xv) Stock Exchange Data: Stock exchange data holds information about the buy and sell decisions that customers make on shares of different companies.
xvi) Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
xvii) Transport Data: Transport data includes the model, capacity, distance and availability of a vehicle.
xviii) Search Engine Data: Search engines retrieve large amounts of data from different databases.
8) What is Big Data:
Big Data is a collection of large datasets that cannot be processed using traditional computing techniques.
It is not a single technique or tool; rather, it involves many areas of business and technology.
Volume: 15 TB of Facebook posts, 400 billion annual medical records
Velocity: processing 2 million records at the share market, evaluating the results of millions of students who applied for competitive exams
Variety: structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models
Veracity (Uncertainty): Twitter posts with hashtags, abbreviations, typos and colloquial speech
9) Why is it important to harness Big Data?
a) There is a transition from the old saying "Customer is king" to "Data is king"!
b) Healthcare, banking, the public sector, pharmaceuticals, and IT all need to look beyond the concrete data stored in their databases and study the intangible data that comes from sensors, images, weblogs, etc.
c) What sets smart organizations apart from others is their ability to scan data effectively in order to allocate resources properly, increase productivity and inspire innovation!
d) Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising mediums.
e) Using information from social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.
f) Using data about the previous medical history of patients, hospitals are providing better and quicker service.
g) Precise analysis
h) Better Recommendation Engines
i) Trends
j) Hidden insight from data
k) Access to 100% of data

10) Deficits of RDBMS
=====================
a) Not all data is online
b) The rest of the data is archived
c) No access to 100% of the data
d) All the data comes from one place (if the workload increases, then what? Give the example of the elephant and the wooden logs)
e) 1 big machine vs 100 commodity machines: which is more expensive?
11) Why Big Data Analysis is critical:
=================================
a) Factors of production
b) Unveils useful and crucial information that affects the decision-making process
c) Customer Segmentation, focus on more profitable and loyal customers
d) Focus on next line of Products/services
e) Can provide tough competition to competitors
12) Why Hadoop
a) Flexible: handles structured (20%) and unstructured (80%) data; schema-less; works with Java and other programming languages
b) Scalable
c) Fault tolerant
d) Processing of big data (dam analogy): an unlimited flow of data, handled economically
e) Robust ecosystem that covers a broad spectrum of needs
f) Real-time access to data
g) Reduction in cost per TB of storage by using commodity machines
h) Getting ready for the cloud: among the most required apps for the cloud
i) Integral part of many MNCs

13) Case Studies:


a) Fraud detection (real-time analysis): credit card, telecom, online advertising (pay per click)
b) Sentiment analysis
c) Log/weblog analysis
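As a single-machine illustration of the log analysis case, the sketch below counts requests per HTTP status code in a web server log. It assumes the common access-log layout in which the status code is the ninth whitespace-separated field; the function name count_status and the sample log lines are made up for illustration. On a Hadoop cluster the same group-and-count idea would be spread across many machines rather than run through awk on one.

```shell
#!/usr/bin/env bash
# Sketch only: count_status and the sample log lines are made up for illustration.

# Count requests per HTTP status code, most frequent first.
# Assumes the status code is the 9th whitespace-separated field.
count_status() {
    awk '{ count[$9]++ } END { for (s in count) print count[s], s }' "$1" |
        sort -k1,1nr -k2,2
}

# Demo with three sample requests written to a scratch file.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [01/Jan/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2016:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128
10.0.0.1 - - [01/Jan/2016:10:00:02 +0000] "GET /about.html HTTP/1.1" 200 256
EOF
count_status "$log"
# prints:
# 2 200
# 1 404
rm -f "$log"
```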
14) Companies using Hadoop
a) Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest cluster being 4,000 servers
b) Facebook - Managing profiles, posts, comments, images and videos of more than 1.3 billion active users
c) LinkedIn - Managing more than 1 billion recommendations every week
d) Big ad agencies - Keeping track of millions of ad clicks and how users are responding to the ads
e) Spadac.com - Analysing huge volumes of scientific data
15) Hadoop growth (& what are opportunities)
========================================
a) Revenue growth of 31.7% per year (based on IDC research), six times the growth rate of the overall ICT sector
b) $23.8 billion market in 2016, $32.4 billion in 2017
16) Which companies are looking for Hadoop Professionals:
=====================================================
Google, EMC Corporation, Yahoo, Apple, HortonWorks, Oracle, Amazon, Cloudera, IBM, Cisco, Microsoft and many more
17) Hadoop Job Roles:
====================
1) Hadoop Architect:
a) Administers Unix/Linux environments
b) Designs the Hadoop architecture, including cluster node configuration, namenode/datanode, connectivity, etc.
c) Organizes, administers, manages and governs Hadoop on large clusters
d) Documents Hadoop-based production environments involving petabytes of data
2) Hadoop Developer
3) Data Scientist/Business Analyst
a) Generates, evaluates, spreads and integrates the knowledge gathered and stored in Hadoop environments
b) Writes code, designs intelligent analytic models, works with databases, gets involved in very complex SQL, and so on
c) Responsible for spotting the most crucial issues and working on them
d) Analyzes data from various sources instead of relying on a single source
4) Hadoop Administrator
a) Involved in administering Hadoop and its database systems
b) Troubleshoots and resolves issues
c) Maintains large clusters and should have strong scripting skills