
Apache Pig installation


May 26, 2021 Apache Pig




This chapter describes how to download, install, and set up Apache Pig on your system.

Prerequisite

Hadoop and Java must be installed on the system before you can run Apache Pig. Before installing Apache Pig, install both by following the steps at this link: //www.w3cschool.cn/hadoop/hadoop_enviornment_setup.htm
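
Before moving on, it can help to confirm that both prerequisites are reachable from the shell. The sketch below is only an illustration: require_cmd is a hypothetical helper name, and it assumes the java and hadoop launchers are on your PATH once they are installed.

```shell
# require_cmd NAME: report whether a command is available on PATH.
# (Hypothetical helper; not part of Pig, Hadoop, or Java.)
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Usage (run manually once Hadoop and Java are installed):
# require_cmd java && require_cmd hadoop
```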

Download Apache Pig

First, download the latest version of Apache Pig from the project website: https://pig.apache.org/

Step 1

Open the home page of the Apache Pig website. Under the News section, click the release page link.


Step 2

Clicking the link takes you to the Apache Pig Releases page. Under the Download section of this page, click the download link, which takes you to a page listing a set of mirrors.


Step 3

Select and click either of the listed mirrors.


Step 4

Either mirror takes you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version.


Step 5

The version folder contains the source and binary distributions of Apache Pig. Download the source and binary tar files for Apache Pig 0.16.0: pig-0.16.0-src.tar.gz and pig-0.16.0.tar.gz.
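
If you prefer the command line to the browser, the file names above can be turned into download URLs. This is a sketch under one loud assumption: that the release is hosted under https://archive.apache.org/dist/pig, which is not stated in this tutorial. pig_url is a hypothetical helper, and the wget line is left commented out so that nothing is downloaded unless you opt in.

```shell
PIG_VERSION="0.16.0"
# Assumption: old releases are kept under this base path on the Apache archive.
PIG_BASE_URL="https://archive.apache.org/dist/pig"

# pig_url SUFFIX: print the tarball URL.
# SUFFIX is "" for the binary tarball or "-src" for the source tarball.
pig_url() {
  printf '%s/pig-%s/pig-%s%s.tar.gz\n' "$PIG_BASE_URL" "$PIG_VERSION" "$PIG_VERSION" "$1"
}

# Uncomment to download into ~/Downloads (requires wget and network access):
# wget -P "$HOME/Downloads" "$(pig_url "")" "$(pig_url "-src")"
```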


Install Apache Pig

After downloading the Apache Pig software, follow these steps to install it in a Linux environment.

Step 1

Create a directory named Pig in the same directory where Hadoop, Java, and other software are installed. (In our tutorial, we created the Pig directory in the home directory of a user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar file, as shown below.

$ cd Downloads/
$ tar zxvf pig-0.16.0-src.tar.gz
$ tar zxvf pig-0.16.0.tar.gz

Step 3

Move the contents of the extracted pig-0.16.0-src directory to the Pig directory you created earlier, as shown below.

$ mv pig-0.16.0-src/* /home/Hadoop/Pig/
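
The three installation steps can also be folded into one function that fails early on a missing archive. This is a minimal sketch, not an official installer: install_pig is a hypothetical name, and it assumes the tarball unpacks into a single top-level directory such as pig-0.16.0.

```shell
# install_pig ARCHIVE TARGET_DIR
# Extract a Pig tarball and move the contents of its top-level directory
# into TARGET_DIR. (Hypothetical helper for illustration.)
install_pig() {
  archive=$1
  target=$2
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$target" || return 1
  workdir=$(mktemp -d) || return 1
  # Extract into a scratch directory so a failed extraction leaves TARGET_DIR untouched.
  tar -xzf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }
  # Assumption: the archive holds exactly one top-level directory.
  mv "$workdir"/*/* "$target"/ || { rm -rf "$workdir"; return 1; }
  rm -rf "$workdir"
}

# Usage (paths taken from this tutorial):
# install_pig "$HOME/Downloads/pig-0.16.0.tar.gz" /home/Hadoop/Pig
```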

Configure Apache Pig

After installing Apache Pig, we have to configure it. To do so, we need to edit two files: .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables:

  • PIG_HOME: set this to the Apache Pig installation folder.

  • PATH: append Pig's bin folder to this environment variable.

  • PIG_CLASSPATH: set this to the Hadoop configuration folder, that is, the directory under the Hadoop installation that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files.

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
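
Re-running a setup script should not pile up duplicate export lines in .bashrc. One way to avoid that, sketched here with a hypothetical helper named append_once, is to add a line only when the file does not already contain it.

```shell
# append_once FILE LINE: append LINE to FILE unless an identical line exists.
# (Hypothetical helper for illustration.)
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -x matches whole lines, -F treats the text literally, -q keeps grep silent.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep $PATH and $HADOOP_HOME unexpanded in the file):
# append_once "$HOME/.bashrc" 'export PIG_HOME=/home/Hadoop/Pig'
# append_once "$HOME/.bashrc" 'export PATH=$PATH:/home/Hadoop/Pig/bin'
# append_once "$HOME/.bashrc" 'export PIG_CLASSPATH=$HADOOP_HOME/conf'
```

After editing .bashrc, open a new terminal or run `. ~/.bashrc` so the current shell picks up the variables.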

pig.properties file

Pig's conf folder contains a file called pig.properties, in which you can set various parameters. To list the supported properties, run the following command.

$ pig -h properties

The following properties are supported:

    Logging:
        verbose=true|false; default is false. This property is the same as -v switch
        brief=true|false; default is false. This property is the same as -b switch
        debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
        aggregate.warning=true|false; default is true.
            If true, prints count of warnings of each type rather than logging each warning.

    Performance tuning:
        pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
            Note that this memory is shared across all large bags used by the application.
        pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
            Specifies the fraction of heap available for the reducer to perform the join.
        pig.exec.nocombiner=true|false; default is false.
            Only disable combiner as a temporary workaround for problems.
        opt.multiquery=true|false; multiquery is on by default.
            Only disable multiquery as a temporary workaround for problems.
        opt.fetch=true|false; fetch is on by default.
            Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
        pig.tmpfilecompression=true|false; compression is off by default.
            Determines whether output of intermediate jobs is compressed.
        pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
            Used in conjunction with pig.tmpfilecompression. Defines compression type.
        pig.noSplitCombination=true|false. Split combination is on by default.
            Determines if multiple small files are combined into a single map.
        pig.exec.mapPartAgg=true|false. Default is false.
            Determines if partial aggregation is done within map phase, before records are sent to combiner.
        pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
            If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

    Miscellaneous:
        exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
        pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
        udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
        stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
        pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
            Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
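
Any of the properties listed above can be written to pig.properties as a key=value line. The following sketch shows one way to update such a file from the shell. set_prop is a hypothetical helper, not a Pig command, and it assumes values contain no "|" or "&" characters, since those are special in the sed replacement it uses.

```shell
# set_prop FILE KEY VALUE: replace an existing "KEY=..." line or append one.
# (Hypothetical helper; values must not contain "|" or "&".)
set_prop() {
  file=$1
  key=$2
  value=$3
  touch "$file"
  # Escape dots so a key like stop.on.failure is matched literally.
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  if grep -q "^${key_re}=" "$file"; then
    tmp=$(mktemp) || return 1
    # Rewrite through a temporary file, then copy back to keep file permissions.
    sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Usage (property names and values come from the listing above):
# set_prop /home/Hadoop/Pig/conf/pig.properties stop.on.failure true
# set_prop /home/Hadoop/Pig/conf/pig.properties pig.datetime.default.tz +08:00
```

Commented-out lines in the file are left untouched; the helper only matches lines that begin with the key.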

Verify the installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, it prints the installed version of Apache Pig, as shown below.

$ pig -version
 
Apache Pig version 0.16.0 (r1682971)  
compiled Jun 01 2015, 11:44:35
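
For scripted checks, the version number can be pulled out of that output. This is a sketch only: pig_version is a hypothetical helper, and it assumes the first line of output has the "Apache Pig version x.y.z" shape shown above.

```shell
# pig_version: print the x.y.z version reported by "pig -version",
# or fail if pig is not on PATH. (Hypothetical helper for illustration.)
pig_version() {
  command -v pig >/dev/null 2>&1 || { echo "pig not found on PATH" >&2; return 1; }
  # Keep only the numeric token that follows the word "version".
  pig -version 2>&1 | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
# pig_version || echo "check PIG_HOME and PATH in .bashrc"
```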