
Pig UDF

Pig ships with a number of built-in functions that can be used in a Pig script without writing any extra code. When a requirement is not covered by the built-in functions, the user can write a custom function, called a UDF (User Defined Function).
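For example, a built-in function such as UPPER can be called straight from a script (the file and field names here are only illustrative):

A = LOAD 'sample.txt' AS (name:chararray);
B = FOREACH A GENERATE UPPER(name);
DUMP B;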

Steps to create Pig UDF

Step 1 :-

Open Eclipse and create a Java class, for example Ucfirst.java.

Step 2 :-

Add the required jar files to the project:

Right-click the project —> Build Path —> Configure Build Path —> Libraries —> Add External JARs —>
select the jar files from the Hadoop and Pig lib folders (plus any other jars in the Hadoop folder) —> Click OK.
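If you prefer the command line to Eclipse, compiling against the Pig jar is usually enough for this example (the jar path below is a placeholder for your installation):

javac -cp /path/to/pig/pig.jar myudfs/Ucfirst.java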

Step 3 :-

The Pig Java program should now compile in Eclipse without errors. The basic pattern of a Pig UDF is

public class Ucfirst extends EvalFunc<T>

where T is the Java type the function returns.

package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class Ucfirst extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input.size() == 0)
            return null;
        try {
            // Take the first field of the tuple and return its first letter in upper case.
            String str = (String) input.get(0);
            char ch = str.toUpperCase().charAt(0);
            String str1 = String.valueOf(ch);
            return str1;
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

Step 4 :-

public String exec(Tuple input) throws IOException {


if (input.size() == 0)
return null;

The return type of the class is String. Each line of the text file is passed to exec() as a Tuple. The function first checks whether the input tuple is empty; if it is, it returns null.
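To see how a single line maps onto a Tuple, the UDF can be exercised directly from Java using Pig's TupleFactory (a minimal sketch; the class name UcfirstTest and the value "mark" are illustrative, and the Pig jar must be on the classpath):

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import myudfs.Ucfirst;

public class UcfirstTest {
    public static void main(String[] args) throws Exception {
        // A one-field tuple stands in for one line of the input file.
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "mark");
        System.out.println(new Ucfirst().exec(t)); // prints: M
    }
}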

Step 5 :-

Try/catch block: the processing logic goes in the try block.

try {
String str = (String) input.get(0);
char ch = str.toUpperCase().charAt(0);
String str1 = String.valueOf(ch);
return str1;
Step 6 :-

The catch block is only for exception handling.
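Note that WrappedIOException is deprecated in newer Pig releases; an equivalent catch block (a sketch using only standard Java) simply wraps the cause in a plain IOException:

} catch (Exception e) {
    // Surface the failure to Pig as an IOException, preserving the original cause.
    throw new IOException("Caught exception processing input row", e);
}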

How to execute this code as a Pig UDF?

Step 1 :-

Right-click the program —> Export —> create the JAR file.
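Equivalently, the jar can be built from the command line (assuming the compiled class files sit under a myudfs/ directory):

jar -cf ucfirst.jar myudfs/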

Step 2 :-

Register the jar in the Pig script with REGISTER jarname;

Step 3 :-

Write The Pig Script

REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;

In the above script, myudfs is the package name and Ucfirst is the class name.
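Optionally, Pig's DEFINE statement can shorten the fully qualified name (purely a convenience):

DEFINE Ucfirst myudfs.Ucfirst();
B = FOREACH A GENERATE Ucfirst(logid);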

pig -x local ucfirst.pig

Output

(M)
(S)
(R)
(R)
Example 2 (User Defined Function)


A Pig Java UDF extends the functionality of EvalFunc. This abstract class has an abstract method exec, which the user implements in a concrete class with the desired functionality.

Problem Statement:

Let's write a simple Java UDF that takes a Tuple of two DataBags as input and checks whether the second databag (set) is a subset of the first.
For example, assume you are given a tuple of two databags, where each databag contains numeric elements (tuples).

Input:
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True

The function should return true, as Databag2 is a subset of Databag1.

From an implementation point of view:

Since we are extending the abstract class EvalFunc, we implement the exec function, which contains the logic for deciding whether the given set is a subset of the other. We also override outputSchema to specify the output schema (boolean: true or false).

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

/**
 * Find whether the given SetB is a subset of SetA.
 * <p>
 * input:
 * setA : {(10),(4),(21),(9),(50)}
 * setB : {(9),(4),(50)}
 *
 * output:
 * true
 * </p>
 */
public class IsSubSet extends EvalFunc<Boolean> {

    @Override
    public Schema outputSchema(Schema input) {
        // The UDF expects exactly two bag-typed fields as input.
        if (input.size() != 2) {
            throw new IllegalArgumentException("input should contain two elements!");
        }
        List<FieldSchema> fields = input.getFields();
        for (FieldSchema f : fields) {
            if (f.type != DataType.BAG) {
                throw new IllegalArgumentException("input fields should be bags!");
            }
        }
        return new Schema(new FieldSchema("isSubset", DataType.BOOLEAN));
    }

    // Copy the tuples of a bag into a HashSet so we can use containsAll().
    private Set<Tuple> populateSet(DataBag dataBag) {
        HashSet<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iter = dataBag.iterator();
        while (iter.hasNext()) {
            set.add(iter.next());
        }
        return set;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        Set<Tuple> setA = populateSet((DataBag) input.get(0));
        Set<Tuple> setB = populateSet((DataBag) input.get(1));
        return setA.containsAll(setB) ? Boolean.TRUE : Boolean.FALSE;
    }
}

Let's test our UDF to find whether a given set is a subset of another set.

Pig UDF test script:

-- Register jar which contains UDF.
register '/home/hadoop/udf.jar';

-- Define function for use.
define isSubset IsSubSet();

-- lets assume we have a dataset as follows:
dump dataset;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)})
--({(50),(78),(45),(7),(4)},{(7),(45),(50)})
--({(1),(2),(3),(4),(5)},{(4),(3),(50)})

-- lets check the subset function
result = foreach dataset generate $0, $1, isSubset($0, $1);

dump result;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)},true)
--({(50),(78),(45),(7),(4)},{(7),(45),(50)},false)
--({(1),(2),(3),(4),(5)},{(4),(3),(50)},false)
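For a quick standalone check outside Pig, the same UDF can also be driven directly from Java using Pig's TupleFactory and BagFactory (a sketch; the test class name and sample values are illustrative, and IsSubSet is assumed to be on the classpath):

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class IsSubSetTest {
    public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        BagFactory bf = BagFactory.getInstance();

        // Build setA = {(10),(4),(21),(9),(50)} and setB = {(9),(4),(50)}.
        DataBag setA = bf.newDefaultBag();
        for (int n : new int[] {10, 4, 21, 9, 50}) {
            setA.add(tf.newTuple((Object) Integer.valueOf(n)));
        }
        DataBag setB = bf.newDefaultBag();
        for (int n : new int[] {9, 4, 50}) {
            setB.add(tf.newTuple((Object) Integer.valueOf(n)));
        }

        // The UDF expects a two-field tuple: (bagA, bagB).
        Tuple input = tf.newTuple(2);
        input.set(0, setA);
        input.set(1, setB);

        System.out.println(new IsSubSet().exec(input)); // prints: true
    }
}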
