
chmod 777 sample.sh
./sample.sh

check in Hive

Maven to connect Spark with Cloudera

for moving the code:

Jenkins

Beyond Compare

pom.xml
-----------------------------------------------
For 5+ years of experience - Hadoop 2 years.
Tools :- HDFS, YARN
Data warehouse :- Hive
ETL :- Pig, HQL, Spark with Scala
Automation :- Shell scripting / bash
Scheduler :- Oozie
Ingestion tools :- Sqoop, Flume

Don't add :- NoSQL, MapReduce


-------------------------------------------------

Compare with production code if it is an existing project.


How many types of data sources, types of ETL,

recent task:-

Update and delete we never do in Hive in real time.

Data consistency,
---------------------------------------------------------------
Challenging :-
Hive :- struct datatype where a column starts with a reserved character like (_).
Date type in source and Hive: use to_date, cast.
Staging - txt file & datastore :- ORC file format (see the sketch below).
Why not use ORC in staging?
Difference between ORC & Parquet & text.
Schema evolution.
ORC file format automatically splits large files into 256 MB stripes (indexing).
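
A minimal Scala/Spark sketch of the staging flow above, with hypothetical table and column names: a text staging table, an ORC datastore table, and the to_date/cast conversion on the way in; the backtick handling for a column starting with a reserved character is noted in a comment.

import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; assumes a Hive metastore is reachable.
val spark = SparkSession.builder()
  .appName("staging-to-orc-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Staging: plain text file, everything lands as STRING.
spark.sql("""
  CREATE TABLE IF NOT EXISTS stg_orders (
    order_id STRING,
    order_dt STRING,   -- arrives as text, e.g. '2019-03-01'
    amount   STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
""")

// Datastore: ORC with proper types.
spark.sql("""
  CREATE TABLE IF NOT EXISTS dw_orders (
    order_id STRING,
    order_dt DATE,
    amount   DECIMAL(10,2)
  )
  STORED AS ORC
""")

// Convert the date and cast the amount while loading.
// If a column or struct field starts with a reserved character such as '_',
// it has to be quoted with backticks, e.g. SELECT `_ingest_ts` FROM ...
spark.sql("""
  INSERT OVERWRITE TABLE dw_orders
  SELECT order_id,
         TO_DATE(order_dt),
         CAST(amount AS DECIMAL(10,2))
  FROM stg_orders
""")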

--------------------------------------------------------------------------------
Input split :- IMP question. It generates splits and sends them to the mapper.
Speculative execution.
Data locality - name node.
Services in Hadoop.
HDFS:
Master :- namenode
Master 2 :- secondary namenode
Slave :- datanode + node manager
Master 3 :- resource manager
Load balancer
Types of schedulers (YARN & MapReduce) :-
Built-in schedulers & queues which the framework uses (see the sketch below).
DAG - directed acyclic graph (data structure).
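
A small sketch (the values are assumptions, not from the notes) of where speculative execution and the YARN scheduler queue get switched on for a Spark job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("scheduler-sketch")
  .config("spark.speculation", "true")   // relaunch slow tasks on other nodes
  .config("spark.yarn.queue", "etl")     // hypothetical capacity-scheduler queue
  .getOrCreate()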
------------------------------------------------
Compression techniques.
HA (high availability) node.
----------------------------------------
Resource manager :- we can check the cluster size; we can't check the volume because
we don't have access to the name node.
How many clusters :- 15% of production - 10-15 nodes, QA 60% :- , prod :- 200
Replication factor :- 3
Per node :- 2 TB, 200 TB storage
Data should be :- 2-3 GB daily; for details check module 8.

-----------------------------------------------------------------------------------
Spark, Scala, shell script, Hive, Hadoop, Big Data, Oozie, Sqoop, Flume keywords in
Naukri and Monster.

Hive :- Hive SerDe for JSON/XML, file formats, how to implement a UDF in Hive, data
types in Hive, types of tables, can we create run-time partitions (see the sketch below),
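
A minimal sketch, with a hypothetical table name and HDFS path, of the SerDe and table-type points: an external table read through the JSON SerDe (external vs managed being the usual "types of tables" answer). The SerDe class needs the hive-hcatalog-core jar on the classpath.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-serde-sketch")
  .enableHiveSupport()
  .getOrCreate()

// EXTERNAL keeps the data in place on HDFS; dropping the table drops only metadata.
// The JSON SerDe maps top-level JSON keys onto the declared columns.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks (
    user_id STRING,
    url     STRING,
    ts      STRING
  )
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  STORED AS TEXTFILE
  LOCATION '/data/raw_clicks'
""")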

SCD type 2 implementation, alter an existing table to ORC file format, how to add
partitioning on year/month/day in Hive,
Connect Hive with Spark with HiveContext and SparkSession (see the sketches after this list).
Can we perform DML operations on Hive using Spark?
How to set the number of executors and total memory while running a Hive job in Spark.
Difference between broadcast variable and accumulator.
Persistence: cache and persist an RDD.
Unix commands,
Partitioning and repartitioning in Spark,
Split RDD,
Dataset and DataFrame difference.
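
A minimal Scala sketch, with hypothetical table names, tying a few of the points above together: a SparkSession with Hive support (HiveContext is the pre-2.0 equivalent), executor count and memory settings, and a run-time (dynamic) partitioned insert on year/month/day.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-from-spark-sketch")
  .enableHiveSupport()                          // SparkSession route; HiveContext is the older way
  .config("spark.executor.instances", "10")     // number of executors (assumed values)
  .config("spark.executor.memory", "4g")        // memory per executor
  .config("spark.executor.cores", "2")
  .getOrCreate()

// Run-time (dynamic) partitioning: year/month/day come from the data itself.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
  CREATE TABLE IF NOT EXISTS dw_events (
    event_id STRING,
    amount   DOUBLE
  )
  PARTITIONED BY (year INT, month INT, day INT)
  STORED AS ORC
""")

// Hypothetical staging table stg_events(event_id, amount, event_ts).
spark.sql("""
  INSERT INTO TABLE dw_events PARTITION (year, month, day)
  SELECT event_id, amount,
         YEAR(event_ts), MONTH(event_ts), DAY(event_ts)
  FROM stg_events
""")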
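
And a small self-contained sketch of broadcast variable vs accumulator, persist/cache, and repartition vs coalesce, using made-up data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("broadcast-accumulator-sketch").getOrCreate()
val sc = spark.sparkContext

// Broadcast: read-only lookup data shipped once to every executor.
val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

// Accumulator: executors only add to it; the driver reads the total back.
val badRecords = sc.longAccumulator("badRecords")

val codes = sc.parallelize(Seq("IN", "US", "XX"))
val resolved = codes.map { code =>
  countryNames.value.get(code) match {
    case Some(name) => name
    case None       => badRecords.add(1); "UNKNOWN"
  }
}

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
resolved.persist(StorageLevel.MEMORY_AND_DISK)
println(resolved.count())                       // action runs the job and fills the accumulator
println(s"bad records: ${badRecords.value}")

// repartition shuffles to the requested count; coalesce only narrows partitions.
val wide   = resolved.repartition(8)
val narrow = wide.coalesce(2)
println(narrow.getNumPartitions)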
-----------------------------------------------------------------------------------
Hue interface online:
demo.gethue.com - demo , demo
-----------------------------------------------------------------------------------
Pig scripts, Pig-to-Spark script conversion (see the sketch after this block).
OLTP, OLAP
NoSQL theorem (CAP).
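
A minimal sketch of a Pig-to-Spark conversion, with a hypothetical file and fields: the Pig script is in the comments and the Spark/Scala equivalent follows.

// -- Pig
// orders = LOAD '/data/orders.csv' USING PigStorage(',') AS (id:chararray, country:chararray, amount:double);
// big    = FILTER orders BY amount > 100.0;
// byCtry = GROUP big BY country;
// totals = FOREACH byCtry GENERATE group, SUM(big.amount);
// DUMP totals;

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("pig-to-spark-sketch").getOrCreate()
import spark.implicits._

val orders = spark.read
  .option("header", "false")
  .schema("id STRING, country STRING, amount DOUBLE")
  .csv("/data/orders.csv")                      // hypothetical path

val totals = orders
  .filter($"amount" > 100.0)                    // FILTER ... BY
  .groupBy($"country")                          // GROUP ... BY
  .agg(sum($"amount").as("total_amount"))       // FOREACH ... GENERATE SUM

totals.show()                                   // DUMP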

NoSQL by Martin Fowler.
