Pig ships with a set of built-in functions that you can use in a Pig script without writing any extra code. When a requirement is not covered by the built-in functions, you can write your own custom function, called a UDF (user-defined function).
Step 1 :-
Open Eclipse and create a Java class named Ucfirst (file Ucfirst.java).
Step 2 :-
Right-click the project -> Build Path -> Configure Build Path -> Libraries -> Add External JARs ->
select the JAR files from the Hadoop and Pig lib folders, plus the other JAR files in the Hadoop folder -> click OK.
Step 3 :-
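Your Pig Java program should now compile in Eclipse without errors. The basic pattern of a Pig UDF is a class that extends EvalFunc<T>, where T is the Java type of the value the function returns (here, public class Ucfirst extends EvalFunc<String>), and whose exec method returns that value.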
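package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class Ucfirst extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Nothing to convert when the tuple is missing or empty.
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            char ch = str.toUpperCase().charAt(0);
            String str1 = String.valueOf(ch);
            return str1;
        } catch (Exception e) {
            // Wrap any failure (for example a null or empty field) in an IOException.
            throw WrappedIOException.wrap("Caught exception processing input row", e);
        }
    }
}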
Step 4 :-
The type parameter String is the return type of the UDF. Each row of the input text file is passed to exec as a Tuple. The function first checks whether the tuple is empty (size zero) and, if it is, returns null.
Step 5 :-
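The try block casts the first field of the tuple to a String, converts it to upper case, takes the first character, and returns that character as a String:
try {
String str = (String) input.get(0);
char ch = str.toUpperCase().charAt(0);
String str1 = String.valueOf(ch);
return str1;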
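The string handling in Step 5 can be tried without Pig on the classpath. The following is a minimal sketch, not the UDF itself: the class name UcfirstSketch and the method firstUpper are hypothetical names introduced here, and the explicit null/empty guard and the use of Locale.ROOT are assumptions added for predictable behavior.

```java
import java.util.Locale;

// Standalone sketch of the Step 5 logic; UcfirstSketch and firstUpper are
// hypothetical names, not part of Pig or of the original UDF.
final class UcfirstSketch {

    // Returns the upper-cased first character of str as a String,
    // or null when there is no first character to return.
    static String firstUpper(String str) {
        if (str == null || str.isEmpty()) {
            return null; // assumption: mirror the UDF's "return null" for missing input
        }
        // Locale.ROOT is an assumption; it keeps the result independent of the JVM locale.
        char ch = str.toUpperCase(Locale.ROOT).charAt(0);
        return String.valueOf(ch);
    }

    public static void main(String[] args) {
        System.out.println(firstUpper("mike")); // prints M
        System.out.println(firstUpper(""));     // prints null
    }
}
```

Guarding against an empty string up front avoids relying on the catch block to handle the StringIndexOutOfBoundsException that charAt(0) would otherwise throw.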
Step 6 :-
Export the class as a JAR file (the Pig script below registers it as ucfirst.jar).
Step 1 :-
Step 2 :-
Register the JAR in your Pig script: REGISTER jarname;
Step 3 :-
REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;
In the above script, myudfs is the package name and Ucfirst is the class name.
Output
(M)
(S)
(R)
(R)
Example 2
Problem Statement:
Let's write a simple Java UDF that takes a tuple of two DataBags as input and checks whether
the second DataBag (set) is a subset of the first DataBag (set).
For example, assume you are given a tuple of two DataBags. Each DataBag contains
tuples that each hold a number.
Input:
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True
Because we are extending the abstract class EvalFunc, we must implement the exec function. In this
function we write the logic that decides whether one set is a subset of the other. We also override
the outputSchema function to declare the output schema (a boolean: true or false).
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;
/**
* <p>
* input:
* <br>setA : {(10),(4),(21),(9),(50)}</br>
* <br>setB : {(9),(4),(50)}</br>
* <br></br>
* output:
* <br>true</br>
* </p>
*/
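// The class name is not given in the source; IsSubset is a placeholder.
public class IsSubset extends EvalFunc<Boolean> {

    @Override
    public Schema outputSchema(Schema input) {
        // Expect exactly two input fields, both of them bags.
        if(input.size()!=2){
            throw new IllegalArgumentException("input should contain two elements");
        }
        List<FieldSchema> fields = input.getFields();
        for(FieldSchema f : fields){
            if(f.type != DataType.BAG){
                throw new IllegalArgumentException("input fields should be bags");
            }
        }
        // The output is a single boolean field.
        return new Schema(new FieldSchema("isSubset", DataType.BOOLEAN));
    }

    // Copy the tuples of a bag into a set so membership can be tested.
    private Set<Tuple> populateSet(DataBag bag){
        Set<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iter = bag.iterator();
        while(iter.hasNext()){
            set.add(iter.next());
        }
        return set;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        Set<Tuple> setA = populateSet((DataBag) input.get(0));
        Set<Tuple> setB = populateSet((DataBag) input.get(1));
        // setB is a subset of setA when setA contains every element of setB.
        return setA.containsAll(setB);
    }
}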
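The subset check at the heart of this UDF does not depend on Pig and can be tried with plain java.util collections. The sketch below is illustrative only: SubsetSketch and isSubset are hypothetical names, and plain Integer values stand in for the single-field tuples held in each DataBag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch of the subset test; SubsetSketch and isSubset are
// hypothetical names, and Integers stand in for Pig tuples.
final class SubsetSketch {

    // True when every element of candidate also occurs in container.
    static boolean isSubset(List<Integer> container, List<Integer> candidate) {
        Set<Integer> set = new HashSet<>(container);
        return set.containsAll(candidate);
    }

    public static void main(String[] args) {
        // Databag1 and Databag2 from the problem statement.
        List<Integer> bag1 = Arrays.asList(10, 4, 21, 9, 50);
        List<Integer> bag2 = Arrays.asList(9, 4, 50);
        System.out.println(isSubset(bag1, bag2)); // prints true

        // 50 is not in the first list, so this is not a subset.
        System.out.println(isSubset(Arrays.asList(1, 2, 3, 4, 5), Arrays.asList(4, 3, 50))); // prints false
    }
}
```

Copying the first collection into a HashSet makes each membership test a constant-time lookup on average, which is why the UDF populates a set before comparing.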
Let's test our UDF to check whether a given set is a subset of another set.
PIG UDF
register '/home/hadoop/udf.jar';
...
dump datset;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)})
--({(50),(78),(45),(7),(4)},{(7),(45),(50)})
--({(1),(2),(3),(4),(5)},{(4),(3),(50)})
...
--({(50),(78),(45),(7),(4)},{(7),(45),(50)},false)
--({(1),(2),(3),(4),(5)},{(4),(3),(50)},false)