
Pig UDF

Pig ships with a number of built-in functions that can be used in a Pig script without writing any extra code. When a requirement is not covered by the built-in functions, the user can write a custom function, called a UDF (User Defined Function).
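For example, a built-in function such as UPPER can be called straight from a script (the file and field names here are only illustrative):

A = LOAD 'sample.txt' AS (name:chararray);
B = FOREACH A GENERATE UPPER(name);
DUMP B;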

Steps to create Pig UDF

Step 1 :-

Open Eclipse and create a Java class, for example Ucfirst.java.

Step 2 :-

Add the required jar files to the project:

Right-click the project —> Build Path —> Configure Build Path —> Libraries —> Add External JARs —>
select the jar files from the Hadoop and Pig lib folders (plus any other jars in the Hadoop folder) —> Click OK.
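If you prefer the command line to Eclipse, compiling against the Pig jar is usually enough for this example (the jar path below is a placeholder for your installation):

javac -cp /path/to/pig/pig.jar myudfs/Ucfirst.java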

Step 3 :-

The Pig Java program should now compile in Eclipse without errors. The basic pattern of a Pig UDF is

public class Ucfirst extends EvalFunc<T>

where T is the Java type the function returns.

package myudfs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class Ucfirst extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input.size() == 0)
            return null;
        try {
            // Take the first field of the tuple and return its first letter in upper case.
            String str = (String) input.get(0);
            char ch = str.toUpperCase().charAt(0);
            String str1 = String.valueOf(ch);
            return str1;
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

Step 4 :-

public String exec(Tuple input) throws IOException {


if (input.size() == 0)
return null;

The return type of the class is String. Each line of the text file is passed to exec() as a Tuple. The function first checks whether the input tuple is empty; if it is, it returns null.
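To see how a single line maps onto a Tuple, the UDF can be exercised directly from Java using Pig's TupleFactory (a minimal sketch; the class name UcfirstTest and the value "mark" are illustrative, and the Pig jar must be on the classpath):

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import myudfs.Ucfirst;

public class UcfirstTest {
    public static void main(String[] args) throws Exception {
        // A one-field tuple stands in for one line of the input file.
        Tuple t = TupleFactory.getInstance().newTuple(1);
        t.set(0, "mark");
        System.out.println(new Ucfirst().exec(t)); // prints: M
    }
}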

Step 5 :-

Try/catch block: the processing logic goes in the try block.

try {
String str = (String) input.get(0);
char ch = str.toUpperCase().charAt(0);
String str1 = String.valueOf(ch);
return str1;
Step 6 :-

The catch block is only for exception handling.
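Note that WrappedIOException is deprecated in newer Pig releases; an equivalent catch block (a sketch using only standard Java) simply wraps the cause in a plain IOException:

} catch (Exception e) {
    // Surface the failure to Pig as an IOException, preserving the original cause.
    throw new IOException("Caught exception processing input row", e);
}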

How to execute this code as a Pig UDF?

Step 1 :-

Right-click the program —> Export —> create the JAR file.
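Equivalently, the jar can be built from the command line (assuming the compiled class files sit under a myudfs/ directory):

jar -cf ucfirst.jar myudfs/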

Step 2 :-

Register the jar in the Pig script with REGISTER jarname;

Step 3 :-

Write The Pig Script

REGISTER ucfirst.jar;
A = LOAD 'sample.txt' AS (logid:chararray);
B = FOREACH A GENERATE myudfs.Ucfirst(logid);
DUMP B;

In the above script, myudfs is the package name and Ucfirst is the class name.
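Optionally, Pig's DEFINE statement can shorten the fully qualified name (purely a convenience):

DEFINE Ucfirst myudfs.Ucfirst();
B = FOREACH A GENERATE Ucfirst(logid);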

pig -x local ucfirst.pig

Output

(M)
(S)
(R)
(R)
Example 2 (User Defined Function)


A Pig Java UDF extends the functionality of EvalFunc. This abstract class has an abstract method exec, which the user implements in a concrete class with the desired functionality.

Problem Statement:

Let's write a simple Java UDF that takes a Tuple of two DataBags as input and checks whether the second databag (set) is a subset of the first.
For example, assume you are given a tuple of two databags, where each databag contains numeric elements (tuples).

Input:
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True

The function should return true, as Databag2 is a subset of Databag1.

From an implementation point of view:

Since we are extending the abstract class EvalFunc, we implement the exec function, which contains the logic for deciding whether the given set is a subset of the other. We also override outputSchema to specify the output schema (boolean: true or false).

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

/**
 * Find whether the given SetB is a subset of SetA.
 * <p>
 * input:
 * setA : {(10),(4),(21),(9),(50)}
 * setB : {(9),(4),(50)}
 *
 * output:
 * true
 * </p>
 */
public class IsSubSet extends EvalFunc<Boolean> {

    @Override
    public Schema outputSchema(Schema input) {
        // The UDF expects exactly two bag-typed fields as input.
        if (input.size() != 2) {
            throw new IllegalArgumentException("input should contain two elements!");
        }
        List<FieldSchema> fields = input.getFields();
        for (FieldSchema f : fields) {
            if (f.type != DataType.BAG) {
                throw new IllegalArgumentException("input fields should be bags!");
            }
        }
        return new Schema(new FieldSchema("isSubset", DataType.BOOLEAN));
    }

    // Copy the tuples of a bag into a HashSet so we can use containsAll().
    private Set<Tuple> populateSet(DataBag dataBag) {
        HashSet<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iter = dataBag.iterator();
        while (iter.hasNext()) {
            set.add(iter.next());
        }
        return set;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        Set<Tuple> setA = populateSet((DataBag) input.get(0));
        Set<Tuple> setB = populateSet((DataBag) input.get(1));
        return setA.containsAll(setB) ? Boolean.TRUE : Boolean.FALSE;
    }
}

Let's test our UDF to find whether a given set is a subset of another set.

Pig UDF test script:

-- Register jar which contains UDF.
register '/home/hadoop/udf.jar';

-- Define function for use.
define isSubset IsSubSet();

-- lets assume we have a dataset as follows:
dump dataset;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)})
--({(50),(78),(45),(7),(4)},{(7),(45),(50)})
--({(1),(2),(3),(4),(5)},{(4),(3),(50)})

-- lets check the subset function
result = foreach dataset generate $0, $1, isSubset($0, $1);

dump result;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)},true)
--({(50),(78),(45),(7),(4)},{(7),(45),(50)},false)
--({(1),(2),(3),(4),(5)},{(4),(3),(50)},false)
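For a quick standalone check outside Pig, the same UDF can also be driven directly from Java using Pig's TupleFactory and BagFactory (a sketch; the test class name and sample values are illustrative, and IsSubSet is assumed to be on the classpath):

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class IsSubSetTest {
    public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        BagFactory bf = BagFactory.getInstance();

        // Build setA = {(10),(4),(21),(9),(50)} and setB = {(9),(4),(50)}.
        DataBag setA = bf.newDefaultBag();
        for (int n : new int[] {10, 4, 21, 9, 50}) {
            setA.add(tf.newTuple((Object) Integer.valueOf(n)));
        }
        DataBag setB = bf.newDefaultBag();
        for (int n : new int[] {9, 4, 50}) {
            setB.add(tf.newTuple((Object) Integer.valueOf(n)));
        }

        // The UDF expects a two-field tuple: (bagA, bagB).
        Tuple input = tf.newTuple(2);
        input.set(0, setA);
        input.set(1, setB);

        System.out.println(new IsSubSet().exec(input)); // prints: true
    }
}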
