pig is a scripting language (Pig Latin scripts) -- much less coding than raw MapReduce. A pig relation is temporary storage.
hive is a warehouse. pig relations are temporary, like the command prompt: if we write anything and remove it (or close the session), we lose the data. To persist a specific relation in pig we have the store command (a small sketch of this follows the topic list below).
Flow: client requirement -> write pig script -> goes to MapReduce -> processing on the Hadoop system. Simple, fast and rich; multi-stage solutions come easily.
Topics:
1) relations (tables)
2) from hdfs to relations
3) from lfs to relations
4) joins
5) complex data types
6) foreach, filter, shell scripting
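A minimal sketch of what "temporary" means in practice; the load line and paths are the running example used later in these notes:

grunt> emp = load '/user/cloudera/venkat/emp.txt' using PigStorage(',') as (empid:int, ename:chararray, esal:double, dno:int);
-- the relation emp lives only in this grunt session
grunt> store emp into '/user/cloudera/venkat/emp_saved' using PigStorage(',');   -- persists it to hdfs
grunt> quit;   -- without the store, emp would simply be lost here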
1) in hive we create the table first and then load from hdfs or lfs, but in pig we decide up front which data we are using, hdfs or lfs: if it's hdfs, start the shell with pig; if it's lfs, start it with pig -x local, and after that we process commands. In a single line we can create the relation and load it from hdfs or lfs. dump is used for retrieving data. Instead of string in hive, in pig we use chararray. Delimiters are given through PigStorage. For storing output, the store command is used.
commands:
1) give the pig command; it opens the grunt shell, which means hdfs mode
2) $ hadoop fs -put desktop/emp.txt venkat    -- the emp table file goes into the venkat directory
table creation:
3) grunt> emp = load '/user/cloudera/venkat/emp.txt' using PigStorage(',') as (empid:int, ename:chararray, esal:double, dno:int);
4) grunt> describe emp;    -- the relation is created and loaded; this shows its schema
5) grunt> dump emp;        -- goes to the job tracker and runs a MapReduce job
6) successfully stored into the emp relation (a worked sample follows below)
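For concreteness, a sketch of the whole round trip; the emp.txt rows here are invented sample data, not from the notes:

-- assumed contents of emp.txt:
-- 101,venkat,50000.0,10
-- 102,ram,40000.0,20
grunt> dump emp;
(101,venkat,50000.0,10)
(102,ram,40000.0,20)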
2) taking lfs (local file system) input on the shell:
commands:
1) grunt> emp = load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as (empid:int, ename:chararray, esal:double, dno:int);
2) grunt> dump emp;    -- automatically gets an error, because dump goes to hdfs and we gave a local path
So for a local path, take a new shell in local mode:
$ pig -x local
3) grunt> emp = load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as (empid:int, ename:chararray, esal:double, dno:int);
4) grunt> dump emp;
In short: a local path fails in hdfs mode, and an hdfs path fails in local mode.
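In local mode, store writes to the local file system too; a small sketch (the output path is just an example):

grunt> store emp into '/home/cloudera/desktop/emp_out' using PigStorage(',');
-- creates a local directory emp_out with part files, same layout as on hdfs
$ cat /home/cloudera/desktop/emp_out/part*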
3) joins
1) grunt> dept = load '/home/cloudera/desktop/dept.txt' using PigStorage(',') as (dno:int, dname:chararray, dloc:chararray);
2) grunt> dump dept;    -- the dept relation is loaded; now we do the join between dept and emp
join variants (behaviour sketched below):
1) grunt> empdept = join emp by dno, dept by dno;                  -- inner join
2) grunt> empdept = join emp by dno left outer, dept by dno;       -- left outer
3) grunt> empdept = join emp by dno right outer, dept by dno;      -- right outer
4) grunt> empdept = join emp by dno full outer, dept by dno;       -- full outer
The full outer o/p is stored in hdfs:
grunt> store empdept into '/user/cloudera/empdept/';
If we want to see the data:
$ hadoop fs -ls empdept
$ hadoop fs -cat empdept/part-r-00000
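A sketch of what the variants return, with invented sample rows (emp has dno 10 and 30, dept has dno 10 and 20):

-- emp:  (101,venkat,50000.0,10)  (102,ram,40000.0,30)
-- dept: (10,sales,hyd)  (20,hr,pune)
-- inner join:  only dno 10 matches -> (101,venkat,50000.0,10,10,sales,hyd)
-- left outer:  also keeps the dno 30 emp row, with nulls on the dept side
-- right outer: also keeps the dno 20 dept row, with nulls on the emp side
-- full outer:  keeps both unmatched rows, nulls on whichever side is missing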
--------------------------------------------------------------------------------------------------
common operations in pig
1) foreach: to get only a few columns we use
grunt> empdept_for = foreach empdept generate $0, $1;    -- $0 and $1 are positional indexes (first and second columns of empdept)
>dump empdept_for;
2) filter:
grunt> empdept_fil = filter empdept by empid == 101;
grunt> dump empdept_fil;    -- only the filtered rows will come back
3) sort:
grunt> empdept_sort = order empdept by esal;    -- gives ascending order by default
4) limit:
grunt> empdept_lim = limit empdept_sort 3;
grunt> dump empdept_lim;    -- gives at most 3 rows (the chained sketch below puts these together)
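These operators chain naturally; a small pipeline sketch (the salary threshold is invented; desc is Pig's standard option for descending order):

grunt> high_paid = filter empdept by esal > 30000.0;
grunt> by_sal = order high_paid by esal desc;     -- desc flips the sort to descending
grunt> top3 = limit by_sal 3;
grunt> names = foreach top3 generate ename, esal; -- keep only name and salary
grunt> dump names;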
complex data types in pig: take the sample data complex.txt
(1,2,3) (venkat,2)
(4,5,6) (ram,2)
(5,6,7) (sita,3)
There are 3 complex data types.
1) tuple: an ordered set of fields; in a relation, (1,2,3) is one tuple and (venkat,2) is another.
EX)
$ hadoop fs -ls ram
$ hadoop fs -put desktop/complex.txt ram    -- loaded into the ram directory in hdfs, so use the hdfs file system
$ pig
grunt> A = load '/user/cloudera/ram/complex.txt' using PigStorage(' ') as (t1:(t1a:int,t1b:int,t1c:int), t2:(t2a:chararray,t2b:int));
grunt> dump A;    -- automatically loaded into A
We want only the second tuple's first column, so use the foreach command:
grunt> A_for = foreach A generate t2.t2a;
grunt> dump A_for;
Then it automatically gives the second tuple's first column:
(venkat)
(ram)
(sita)
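Nested fields can also be pulled out together; a sketch on the same relation A (the projection list is just an example):

grunt> B = foreach A generate t1.t1a, t1.t1c, t2.t2b;    -- mix fields from both tuples
grunt> dump B;
(1,3,2)
(4,6,2)
(5,7,3)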
2) bag: a bag looks like
{(1,2,3)}
{(4,5,6)}
{(4,5,6)}
Save this as bag.txt and store it in hdfs first, i.e.
1) $ hadoop fs -put desktop/bag.txt ram    -- stored into the ram directory in hdfs
2) $ pig
3) grunt> A_bag = load '/user/cloudera/ram/bag.txt' as (B:{T:(t1a:int,t1b:int,t1c:int)});
4) grunt> dump A_bag;
o/p:
({(1,2,3)})
({(4,5,6)})
({(4,5,6)})
The data is stored into A_bag. Here I want the 1st column, so use:
grunt> A_for = foreach A_bag generate B.t1a;
grunt> dump A_for;
o/p (the first field of each tuple, still wrapped in its bag):
({(1)})
({(4)})
({(4)})
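More than one column can be projected from a bag with the same dot syntax; a small sketch on the same A_bag:

grunt> A_two = foreach A_bag generate B.(t1a, t1c);    -- keep the first and third fields of each tuple
grunt> dump A_two;
({(1,3)})
({(4,6)})
({(4,6)})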
3) map: a map is key and value, i.e. (key, value). The sample data is stored in hdfs:
1) $ hadoop fs -put desktop/sample.txt ram
2) $ pig
3) grunt> A_map = load '/user/cloudera/ram/sample.txt' AS (M:map[]);
4) grunt> dump A_map;
o/p (here the delimiter between key and value is #):
([open#apache])
etc.
5) using foreach to look up a key:
6) grunt> A_mapfor = foreach A_map generate M#'open';
7) grunt> dump A_mapfor;
(apache)
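If a row carries several key#value pairs, the same lookup works per key; a sketch with an invented two-entry row:

-- suppose sample.txt also had the row: [open#apache,name#pig]
grunt> A_name = foreach A_map generate M#'name';    -- a missing key on a row gives null
grunt> dump A_name;
()      -- the [open#apache] row has no 'name' key, so null prints empty
(pig)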
shell scripting in pig (Datameer is a GUI alternative on Hadoop):
1) emp = load '/user/cloudera/ram/emp.txt' using PigStorage(',') as (empid:int, ename:chararray, esal:double, eno:int);
2) dept = load '/user/cloudera/ram/dept.txt' using PigStorage(',') as (deptid:int, dname:chararray, dno:int);
3) empdept = join emp by eno, dept by dno;
4) dump empdept;
5) store empdept into '/user/cloudera/ram/empdept/';
These statements are stored in one text file named txt.pig.
6) $ pig txt.pig    -- runs the script and we get the result
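Such a script can also take parameters; -param is a standard Pig option (the paths below are placeholders):

-- inside txt.pig, refer to parameters instead of hard-coded paths:
emp = load '$input' using PigStorage(',') as (empid:int, ename:chararray, esal:double, eno:int);
store emp into '$output';
-- run it with:
$ pig -param input=/user/cloudera/ram/emp.txt -param output=/user/cloudera/ram/emp_out txt.pig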
If we see it in the hdfs file system, then:
1) $ hadoop fs -ls ram/empdept
2) $ hadoop fs -cat ram/empdept/part-r-00000