You are on page 1of 9

pig

pig is a scripting language..less coding part.lattin scripting language..pig is a temporaru storage..


hive is warehouse..for example..in cmd prmopt..if we write anything and we rempove it..then
automatically..we can loose the data..
to store some specifc table in pig..we have store command
client requrement -write pig script-goto mapreduce-processing on haddop system
-simple,fast and rich
multiple stages to come solutions
topics
1)relations(tables)
2)from hdfs to relations
3)from lfs to realtions
4) joins
5)complex data type
6)foreach,filter,shell scripting

1)
in hive we create table first after load in hdfs or lfs
but in pig we decide which data we are using either hdfs or lfs..
if its hdfs use.....pig
if its lfs use pig -x local after that we process
commands..
in single line we can create the relation and we can load into hdfsor lfs
dump is used for retriving data
instead of string i n hive...in pig we are using chararray
delimiters..
store in o/p command is used store..
commands
1) give the pig command then it will give grunt shell means using hdfs
2)hadoop fs -put desktop/emp.txt venkat
emp table is there
table creations:
3)grunt> emp= load '/user/cloudera/venkat/emp.txt' using PigStorage(',') as
(empid:int,ename:chararry,esal:double,dno:int);

4) grunt>describe emp; table is created and load
5) >dump
going into job tracker
6)suceessfully stored into emp table



2))))) take lfs in on the shell

commands
1)grunt> emp= load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as
(empid:int,ename:chararry,esal:double,dno:int);
2) >dump emp
automatically get error because it will goto hdfs and we gave local path
so mention local path
take new shell
>$ pig - x local
2)grunt> emp= load '/home/cloudera/desktop/emp.txt' using PigStorage(',') as
(empid:int,ename:chararry,esal:double,dno:int);
3)dump emp
4) if its in local failed in hdfs..



#3) joins
2)grunt> dept= load '/home/cloudera/desktop/dept.txt' using PigStorage(',') as
(dno:int,dname:chararry,dloc:charArray);
3) >dump
store in to dept table is completed
now we have to do join between dept and emp
join
1) > empdept=join emp by dno,dept by dno;
2) 1) > empdept=join emp by dno leftouter ,dept by dno;
1) > empdept=join emp by dno right outer,dept by dno;
1) > empdept=join emp by dno full outer ,dept by dno;
this full outer i/f is stored in hdfs..
so
>store empdept into '/user/cloudera/empdept/:'
data is stored in hdfs
if we want to see the data
then
>$haddop fs -ls empdept
>$ hadoop fs -cat empdept/part-r-00000

--------------------------------------------------------------------------------------------------
common operations in pig..
!) foreach :- to get only few columns we use
>empdept_for=foreach empdept generate $0,$1; // $0 dno,$1 dname indexes

>dump empdept_for;


2) filter:-
>empdept_fil=filter empdept by empid==101;
>empdept_fil
filtered result will get
sort:-
3) empdept_sort=order empdept by esal;
give assending order
limit:-
>empdept_lim=limit empdept_sort 3;
> dump empdept_lim;
give max 3 values







complex data types in pig:
take the sample data complex.txt
(1,2,3)( venkat,2)
(4,5,6,)(ram,2)
(5,6,7)( sita, 3)

3 types of c dts
1)tuple --- relation...(1,2,3)-1tuple
(venkat,2) another tuple like this
EX)
>$ hadoop fs -ls ram
>$ hadoop fs -put desktop/complex.txt ram
loaded in ram directiory in hdfs
use hdfs file system....
>pig
grunt>A= load '/user/cloudera/ram/complex.txt' using PigStorage(' ')as
(t1:(t1a:int,t1b:int,t1c:int),t2:( t2a:chararray,t2b:int));
>dump A
so automatically loaded into A
we want only second tuple 1st column
so use for each command
> A_for= foreach A generate t2.t2a;
>dump A_for;
then automacally it will give result 2nd tuple 2nd column (venkat)
(Ram)
(sita)



bag :
bag is look like {(1,2,3,)}
{(4,5,6)}
{(4,5,6)}
save as bag.txt
store in hdfs first i.e
1) >hadoop fs -put 'dektop/bag.txt ' ram
stored into ram file in hdfs
2) >pig
3)grunt> A_bag= load '/user/clodera/ram/bag.txt' as ( B:{
T:(t1a:int,t1b:int,t1c:int)});
4) grunt>dump
o/p
({(1,2,3,)})
({(4,5,6)})
({(4,5,6)})

stored into A_bag
here i want 1st column so use like
grunt>A_for= foreach A_bag generate B.t1a;
>dump A_for
o/p
({(1)})
({(1)})
({(1)})



map:
map is key and value
i.e(key,value)
sample data is stored in hdfs
1)>hadoop fs -put '/desktop/sample.txt' ram
2)>pig
3)grunt>A_map=load '/user/clodera/ram/sample.txt' AS (M:map[]);
4)>dump A_map

o/p
here we have delimiter is #
([open#apache])
etc
5)using for each
6) grunt> A_mapfor= foreach A_map generate M# 'open';
7)grunt>dump;



shell script in pig:
datameer....gui in hadoop
1)emp =load ' /user/clodera/ram/emp.txt' using PigStorage(',') as
(empid:int,ename:chararray,esal:double,eno:int);
2) dept=load '/user/clodera/ram/dept/txt' as PigStorage (',') as
(deptid:int,dname:chararray,dno:int);
3) empdept=join emp by eno=dept by dno;
4) dump empdept
5) store empdept into '/user/clodera/ram/empdept/';
these 4 stmts re stored in one txt file and give the name to txt.pig
6) >pig txt.pig
then we get result

if we see in hdfs file systeem
then
1) >hadoop fs -ls ram/empdept
2) >hadoop fs -cat ram/empdept