sh
./sample.sh
check in hive
jenkins
beyond compare
pam.xml
-----------------------------------------------
Profile:- 5+ years of experience, 2 years in Hadoop.
Tools:- HDFS, YARN
Data warehouse:- Hive
ETL:- Pig, HQL, Spark with Scala
Automation:- shell scripting / Bash
Scheduler:- Oozie
Ingestion tools:- Sqoop, Flume
Recent task:-
data consistency
---------------------------------------------------------------
Challenging:-
Hive:- struct data type where a column name starts with a reserved character such as underscore (_).
Date types differ between source and Hive; use to_date, cast.
Staging:- text file; data store:- ORC file format.
Why not use ORC in staging?
Difference between ORC, Parquet & text.
Schema evolution.
ORC writes large files as stripes (256 MB noted here; the default depends on the Hive/ORC version), each with its own index.
--------------------------------------------------------------------------------
Input split:- important question. The input format generates the splits and each split is sent to a mapper.
Speculative execution.
Data locality - NameNode.
Services in Hadoop.
HDFS:
Master:- NameNode
Master 2:- Secondary NameNode
Slave:- DataNode + NodeManager (YARN)
Master 3:- ResourceManager (YARN)
Load balancer
Types of schedulers (YARN & MapReduce):-
Built-in schedulers & queues, and which framework uses which.
DAG - directed acyclic graph (data structure).
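A minimal sketch for the input split item above, in Python for illustration only. It mimics how a FileInputFormat-style splitter cuts one file into byte ranges, one range per mapper. The function name, the 128 MB split size and the 1.1 slop factor are assumptions made for this sketch; they are not values taken from these notes.

```python
MB = 1024 * 1024

def compute_splits(file_size, split_size=128 * MB, slop=1.1):
    """Return a list of (offset, length) byte ranges, one per mapper.

    split_size and slop are illustrative assumptions.
    """
    splits = []
    offset = 0
    remaining = file_size
    # Cut full-size splits while the remainder is clearly larger than one split.
    while remaining / split_size > slop:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    # Whatever is left (possibly slightly over split_size) becomes the last split.
    if remaining > 0:
        splits.append((offset, remaining))
    return splits

if __name__ == "__main__":
    # A hypothetical 300 MB file.
    for offset, length in compute_splits(300 * MB):
        print(offset // MB, length // MB)
```

Because of the slop factor, a file only slightly larger than one split stays in a single split instead of producing a tiny trailing one.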
------------------------------------------------
compression technique.
HA node.
----------------------------------------
ResourceManager:- we can check the cluster size; we can't check the data volume because we don't have access to the NameNode.
How many clusters:- 15% of production:- 10-15 nodes; QA:- 60%; prod:- 200
Replication factor:- 3
Per node:- 2 TB; 200 TB storage
Data should be:- 2-3 GB daily. For details, check module 8.
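A rough worked calculation for the figures above, in Python for illustration only. It assumes the 200 TB figure is raw disk capacity and that every block is stored 3 times; the notes do not say whether 200 TB is raw or usable, so treat the result as a sketch. Function names are hypothetical.

```python
def usable_capacity_tb(raw_tb, replication_factor):
    """Logical data that fits when every block is stored replication_factor times."""
    return raw_tb / replication_factor

def raw_daily_footprint_gb(logical_daily_gb, replication_factor):
    """Raw disk consumed per day for a given logical daily ingest."""
    return logical_daily_gb * replication_factor

if __name__ == "__main__":
    rf = 3  # replication factor from the notes
    # Assumption: 200 TB is raw capacity.
    print(round(usable_capacity_tb(200, rf), 1), "TB usable")
    # 2-3 GB of new data per day, from the notes.
    print(raw_daily_footprint_gb(2, rf), "to", raw_daily_footprint_gb(3, rf), "GB raw per day")
```

This ignores space reserved for the operating system, temporary job output and compression, all of which change the real numbers.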
----------------------------------------------------------------------------------------
Keywords for Naukri and Monster:- Spark, Scala, shell script, Hive, Hadoop, Big Data, Oozie, Sqoop, Flume.
Hive:- Hive SerDe for JSON and XML, file formats, how to implement a UDF in Hive, data types in Hive, types of tables, can we create partitions at run time,
SCD type 2 implementation, altering an existing table to ORC file format, how to add partitioning on year, month, day in Hive.
Connect Hive with Spark using HiveContext and SparkSession.
Can we perform DML operations on Hive using Spark?
How to set the number of executors and total memory while running a Hive job in Spark.
Difference between broadcast variable and accumulator.
Persistence:- cache vs. persist for RDDs.
Unix commands.
Partitioning and repartitioning in Spark.
Split RDD.
Difference between Dataset and DataFrame.
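A minimal sketch of the SCD type 2 item from the Hive list above, in plain Python for illustration only (in practice this would be Hive or Spark logic). The row layout (key, attrs, start_date, end_date, is_current), the function name and the 9999-12-31 open-end date are assumptions made for this sketch.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel for "still current"

def apply_scd2(dimension, incoming, load_date):
    """Return a new dimension with SCD type 2 history applied.

    dimension: list of dict rows (key, attrs, start_date, end_date, is_current)
    incoming:  dict mapping key -> latest attrs from the source
    """
    result = [dict(row) for row in dimension]  # copy so the input is untouched
    current = {row["key"]: row for row in result if row["is_current"]}
    for key, attrs in incoming.items():
        old = current.get(key)
        if old is not None and old["attrs"] == attrs:
            continue  # unchanged: keep the current row as it is
        if old is not None:
            # Changed: close the old version instead of overwriting it.
            old["end_date"] = load_date
            old["is_current"] = False
        # New key or changed attributes: open a new current version.
        result.append({
            "key": key,
            "attrs": attrs,
            "start_date": load_date,
            "end_date": OPEN_END,
            "is_current": True,
        })
    return result

if __name__ == "__main__":
    dim = [{"key": 1, "attrs": {"city": "Pune"}, "start_date": date(2020, 1, 1),
            "end_date": OPEN_END, "is_current": True}]
    for row in apply_scd2(dim, {1: {"city": "Mumbai"}, 2: {"city": "Delhi"}}, date(2021, 6, 1)):
        print(row)
```

The point of type 2 is that the old row is kept and closed, so history can be queried by date range.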
----------------------------------------------------------------------------------------
Hue interface online demo:-
demo.gethue.com (login:- demo / demo)
----------------------------------------------------------------------------------------
Pig scripts, Pig to Spark script conversion.
OLTP, OLAP
NoSQL theorem (CAP theorem)