Apache Pig User Defined Function (UDF)

May 26, 2021 Apache Pig

In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). With these UDFs, we can define our own functions and use them. UDFs can be written in six programming languages: Java, Jython, Python, JavaScript, Ruby, and Groovy.

For writing UDFs, there is comprehensive support in Java and limited support in all other languages. With Java, you can write UDFs that cover every part of processing, such as data loading/storage, column transformation, and aggregation. Because Apache Pig is written in Java, UDFs written in Java run more efficiently than those written in other languages.

In Apache Pig, we also have a Java repository for UDFs called Piggybank. With Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.

Types of UDFs in Java

When writing UDFs using Java, we can create and use the following three types of functions:

Filter functions - Filter functions are used as conditions in FILTER statements. These functions accept a Pig value as input and return a Boolean value.
Eval functions - Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
Algebraic functions - Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
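
To illustrate how an aggregate can run as a full MapReduce operation: Pig's Algebraic interface asks a function to supply three stages (initial, intermediate, and final), so that partial results can be combined before the last step. The following is a plain-Java sketch of that idea for a simple count. It does not use any Pig classes, and the class and method names (StagedCount, initial, intermediate, finalStage) are hypothetical, chosen only to show how the stages compose.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java, not Pig's Algebraic interface.
final class StagedCount {

    // Initial stage: turn each input record into a partial count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate stage: combine the partial counts from one chunk of the bag.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: combine the intermediate results into the overall count.
    static long finalStage(List<Long> intermediates) {
        return intermediate(intermediates);
    }

    public static void main(String[] args) {
        // Two chunks of one bag, as two separate tasks might see them.
        long chunkA = intermediate(Arrays.asList(initial("Robin"), initial("BOB")));
        long chunkB = intermediate(Arrays.asList(initial("Maya")));
        System.out.println(finalStage(Arrays.asList(chunkA, chunkB)));
    }
}
```

Because each stage only needs partial results from the previous one, the count can be computed without ever holding the whole bag in one place.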

Writing UDFs Using Java

To write a UDF using Java, we have to include the jar file pig-0.15.0.jar. In this section, we discuss how to write a sample UDF using Eclipse. Before you continue, make sure you have Eclipse and Maven installed on your system.

Follow the steps given below to write a UDF function:

Open Eclipse and create a new project, such as myproject.
Convert the newly created project into a Maven project.
Copy the following content into pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop-core jar files.

<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	
   <modelVersion>4.0.0</modelVersion> 
   <groupId>Pig_Udf</groupId> 
   <artifactId>Pig_Udf</artifactId> 
   <version>0.0.1-SNAPSHOT</version>
	
   <build>    
      <sourceDirectory>src</sourceDirectory>    
      <plugins>      
         <plugin>        
            <artifactId>maven-compiler-plugin</artifactId>        
            <version>3.3</version>        
            <configuration>          
               <source>1.7</source>          
               <target>1.7</target>        
            </configuration>      
         </plugin>    
      </plugins>  
   </build>
	
   <dependencies> 
	
      <dependency>            
         <groupId>org.apache.pig</groupId>            
         <artifactId>pig</artifactId>            
         <version>0.15.0</version>     
      </dependency> 
		
      <dependency>        
         <groupId>org.apache.hadoop</groupId>            
         <artifactId>hadoop-core</artifactId>            
         <version>0.20.2</version>     
      </dependency> 
      
   </dependencies>  
	
</project>

Save the file and refresh the project. In the Maven Dependencies section, you can find the downloaded jar files.
Create a new class file named Sample_Eval and copy the following content into it.

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple; 
 
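public class Sample_Eval extends EvalFunc<String> {

   // Converts the first field of the input tuple to uppercase.
   @Override
   public String exec(Tuple input) throws IOException {
      // Return null for a missing tuple, an empty tuple, or a null field.
      if (input == null || input.size() == 0 || input.get(0) == null) {
         return null;
      }
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}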

When writing UDFs, you must extend the EvalFunc class and provide an implementation of the exec() function. The code required by the UDF goes inside this function. In the example above, the code converts the contents of the given column to uppercase.
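
The string logic inside exec() can also be kept in a plain static helper, so that it can be tried out without Pig on the classpath. The following is a minimal sketch under that assumption: the class name UpperCaseLogic and the method toUpper are hypothetical and not part of Pig. The helper returns null for a null field and uses Locale.ROOT so that the result does not depend on the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical helper, not part of Pig: it holds only the string logic.
final class UpperCaseLogic {

    // Returns the upper-cased text of a field, or null when the field is null.
    static String toUpper(Object field) {
        if (field == null) {
            return null;
        }
        // Locale.ROOT keeps the result independent of the JVM's default locale.
        return field.toString().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toUpper("Robin"));
        System.out.println(toUpper(null));
    }
}
```

An exec() implementation could delegate to such a helper after checking that the tuple itself is non-null and non-empty.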

After compiling the class and confirming that there are no errors, right-click the Sample_Eval.java file. A menu appears. Select "Export".

Clicking "Export" opens a new window. Click "JAR file".

Click the "Next" button to continue. Another window appears, where you need to enter the path in the local file system where the jar file should be stored.

Finally, click the "Finish" button. A jar file named sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.

Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below:

Step 1: Register the Jar file

After writing the UDF (in Java), we have to register the jar file that contains it using the REGISTER operator. Registering the jar file tells Apache Pig where the UDF is located.

Syntax

The syntax of the REGISTER operator is given below.

REGISTER path;

Example

Let us register the sample_udf.jar file created earlier. Start Apache Pig in local mode and register the jar file sample_udf.jar, as shown below.

$ cd $PIG_HOME/bin
$ ./pig -x local

REGISTER '/$PIG_HOME/sample_udf.jar'

Note: This assumes that the jar file is located at the path /$PIG_HOME/sample_udf.jar

Step 2: Define the alias

After you register the UDF, you can use the Define operator to define an alias for it.

Syntax

The syntax of the Define operator is given below.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example

Define the alias sample_eval for the Sample_Eval function, as shown below. The class name is case-sensitive, so it must match the Java class exactly.

DEFINE sample_eval Sample_Eval();

Step 3: Use UDF

After you define the alias, you can use the UDF in the same way as a built-in function. Suppose there is a file named emp_data in the HDFS /Pig_Data/ directory with the following content.

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London 
011,Stacy,25,Bhuwaneshwar 
012,Kelly,22,Chennai

And suppose we've loaded this file into Pig, as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now use the UDF sample_eval to convert the employees' names to uppercase.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case, as shown below.

grunt> Dump Upper_case;
  
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
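
As a cross-check on the output above, the per-record work done by loading with PigStorage(',') and then applying the UDF to the name field can be mimicked in plain Java. This is only a sketch for reasoning about the result, not how Pig executes the script: the class UpperCaseWalkthrough and its method are hypothetical, no Pig classes are used, and only the first three records of the sample file are included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the Pig script: plain Java, no Pig classes.
final class UpperCaseWalkthrough {

    // Splits each comma-separated record and upper-cases its second field (the name).
    static List<String> upperCaseNames(List<String> records) {
        List<String> result = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split(",");
            result.add(fields[1].toUpperCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "001,Robin,22,newyork",
            "002,BOB,23,Kolkata",
            "003,Maya,23,Tokyo");
        // Prints [ROBIN, BOB, MAYA]
        System.out.println(upperCaseNames(sample));
    }
}
```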