Apache Pig - User Defined Functions
What are User Defined Functions in Apache Pig?
- In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs).
- With UDFs, you can define your own functions and use them in Pig scripts.
Supported languages:
- UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
- Complete support for writing UDFs is provided in Java; the remaining languages have limited support.
- Using Java, we can write UDFs that cover all parts of the processing, such as data load/store, column transformation, and aggregation.
- Because Apache Pig itself is written in Java, UDFs written in Java run more efficiently than those written in the other languages.
- Apache Pig also has a Java repository for UDFs named Piggybank. Through Piggybank, you can access Java UDFs written by other users and contribute your own.
Types of UDFs in Java
When writing UDFs in Java, you can create and use the following three types of functions:
- Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.
- Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
- Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
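The three shapes can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names below (UdfShapes, isNonEmpty, toUpper, initial, intermediate, finalCount) are hypothetical stand-ins and are not the Pig API. In a real project these roles are played by Pig's FilterFunc and EvalFunc classes and its Algebraic interface.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-ins for the three UDF shapes; none of this is the Pig API.
final class UdfShapes {
    private UdfShapes() {}

    // Filter shape: takes a value and returns a Boolean usable as a condition.
    static Boolean isNonEmpty(String value) {
        return value != null && !value.isEmpty();
    }

    // Eval shape: takes a value and returns a transformed value.
    static String toUpper(String value) {
        return value == null ? null : value.toUpperCase();
    }

    // Algebraic shape, using a count as the example.
    // Initial stage: turn one input value into a partial count.
    static long initial(String value) {
        return value == null ? 0L : 1L;
    }

    // Intermediate stage: merge partial counts into a new partial count.
    static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merge the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermediate(partials);
    }
}

class UdfShapesDemo {
    public static void main(String[] args) {
        System.out.println(UdfShapes.isNonEmpty("pig"));
        System.out.println(UdfShapes.toUpper("pig"));
        // Two groups of values are counted separately, then combined.
        long left = UdfShapes.intermediate(
                Arrays.asList(UdfShapes.initial("a"), UdfShapes.initial("b")));
        long right = UdfShapes.intermediate(
                Arrays.asList(UdfShapes.initial("c")));
        System.out.println(UdfShapes.finalCount(Arrays.asList(left, right)));
    }
}
```

Splitting the count into initial, intermediate and final stages is what lets partial results be computed independently and merged later, which is the idea behind running an algebraic function as a full MapReduce operation.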
Writing UDFs using Java:
- To write a UDF using Java, we have to add the jar file Pig-0.15.0.jar to the project. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.
Follow the steps given below to write a UDF:
Step 1
- Open Eclipse and create a new project (say myproject).
Step 2
- Convert the newly created project into a Maven project.
Step 3
- Edit the pom.xml file so that it declares the Maven dependencies for the Apache Pig and Hadoop-core jar files.
- For Pig, the dependency is the org.apache.pig:pig artifact, in version 0.15.0 to match the jar file named above.
Step 4
- Save the file and refresh the project. The downloaded jar files then appear in the Maven Dependencies section.
Step 5
- Create a new class file named Sample_Eval.
A UDF of this kind extends the EvalFunc class and implements its exec() function, which holds the code of the UDF.
- In this example, exec() converts the contents of the specified column to uppercase.
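A minimal sketch of such a class is given here, assuming the column to convert is the first field of the input tuple. So that the sketch compiles without Pig on the classpath, it declares small stand-ins for Pig's Tuple and EvalFunc types, reduced to the members used; in a real project you would delete those two stand-ins and import org.apache.pig.EvalFunc and org.apache.pig.data.Tuple instead. The ListTuple helper, the SampleEvalDemo class and the sample value "robin" are hypothetical and exist only to make the sketch runnable.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Stand-in for org.apache.pig.data.Tuple, reduced to the two methods used below.
interface Tuple {
    int size();
    Object get(int fieldNum) throws IOException;
}

// Stand-in for org.apache.pig.EvalFunc<T>, reduced to the exec() method.
abstract class EvalFunc<T> {
    public abstract T exec(Tuple input) throws IOException;
}

// The UDF itself: returns the first field of the input tuple in uppercase.
class Sample_Eval extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null; // nothing to convert
        }
        Object field = input.get(0);
        return field == null ? null : field.toString().toUpperCase();
    }
}

// Hypothetical helper so the sketch runs without Pig: a tuple backed by a list.
class ListTuple implements Tuple {
    private final List<Object> fields;

    ListTuple(Object... fields) {
        this.fields = Arrays.asList(fields);
    }

    @Override
    public int size() {
        return fields.size();
    }

    @Override
    public Object get(int fieldNum) {
        return fields.get(fieldNum);
    }
}

class SampleEvalDemo {
    public static void main(String[] args) throws IOException {
        // Apply the UDF to a one-field tuple and print the converted value.
        System.out.println(new Sample_Eval().exec(new ListTuple("robin")));
    }
}
```

Returning null for an empty or null input, rather than throwing, is a design choice of this sketch: it lets one bad record pass through as a null value instead of failing the whole job.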
- After compiling the class without errors, right-click the Sample_Eval.java file and select Export from the menu that appears.
- Clicking Export opens a new window. In it, select JAR file.
- Click the Next> button. In the window that follows, enter the path in the local file system where the jar file should be stored.
- Finally, click the Finish button. A jar file named sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.
Using the UDF:
- After writing the UDF and generating the jar file, follow the steps given below.
Step 1:
Registering the Jar file
- After writing the UDF (in Java), you have to register the jar file that contains the UDF using the Register operator.
- Registering the jar file tells Apache Pig where to find the UDF.
Syntax:
The syntax of the Register operator is: REGISTER path;
Example:
- As an example, let us register the sample_udf.jar created previously in this chapter.
- Start Apache Pig in local mode (pig -x local) and register the jar file: REGISTER '/$PIG_HOME/sample_udf.jar'
Note: this assumes the jar file is at the path /$PIG_HOME/sample_udf.jar
Step 2:
Defining Alias
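- After registering the UDF, you can define an alias for it using the Define operator.
Syntax:
For a UDF, the syntax of the Define operator is: DEFINE alias function;
Example:
Define the alias sample_eval for the Sample_Eval class: DEFINE sample_eval Sample_Eval(); (if the class is declared in a package, use its fully qualified name).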
Step 3:
Using the UDF
- Once the alias is defined, you can use the UDF in the same way as the built-in functions. Assume there is a file named wikitechy_emp_data in the HDFS /Pig_Data/ directory that holds employee records, including the employees' names.
Make sure you have loaded this file into Pig using the LOAD operator.
- Convert the names of the employees to uppercase by calling the UDF sample_eval in a FOREACH-GENERATE statement, and store the result in the relation Upper_case.
Verification:
- Verify the contents of the relation Upper_case using the Dump operator: Dump Upper_case;
More functions: DataFu Pig
- Stats: variance, quantiles, median, etc.
- Bags: concat, append, prepend, etc.
- Sampling
- PageRank
- Session estimation