May 26, 2021 Apache Pig
This chapter describes how to download, install, and set up Apache Pig on your system.
Hadoop and Java must be installed on the system before you can run Apache Pig. Therefore, before installing Apache Pig, follow the steps provided in the link below to install Hadoop and Java: //www.w3cschool.cn/hadoop/hadoop_enviornment_setup.htm
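Before going further, it can help to confirm that both prerequisites are actually reachable from the shell. The following is a minimal sketch, assuming that Java and Hadoop expose the commands java and hadoop on the PATH; check_cmd is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Report whether a prerequisite command is on the PATH.
# check_cmd is a hypothetical helper, not part of Pig or Hadoop.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Assumption: the installs provide commands named "java" and "hadoop".
check_cmd java || echo "Install Java before continuing."
check_cmd hadoop || echo "Install Hadoop before continuing."
```

If either line reports "missing", complete the Hadoop and Java setup first and open a new terminal so the updated PATH takes effect.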
First, download the latest version of Apache Pig: https://pig.apache.org/
Open the home page of the Apache Pig website. Under the News section, click the release page link.
When you click this link, you are redirected to the Apache Pig Releases page. Under the Download section of this page, click the link; you are then redirected to a page with a set of mirrors.
Select and click any one of these mirrors.
The mirror takes you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version.
This folder contains the source and binary files of the Apache Pig distribution. Download the source and binary tar files for Apache Pig 0.16: pig-0.16.0-src.tar.gz and pig-0.16.0.tar.gz.
After downloading the Apache Pig software, follow these steps to install it in a Linux environment.
Create a directory named Pig in the same directory where Hadoop, Java, and other software are installed. (In our tutorial, we created the Pig directory in the home directory of a user named Hadoop.)
$ mkdir Pig
Extract the downloaded tar file, as shown below.
$ cd Downloads/
$ tar zxvf pig-0.16.0-src.tar.gz
$ tar zxvf pig-0.16.0.tar.gz
Move the contents of the extracted pig-0.16.0-src directory to the Pig directory you created earlier, as shown below.
$ mv pig-0.16.0-src/* /home/Hadoop/Pig/
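The extract-and-move pattern can be rehearsed safely before touching the real download. The sketch below builds a dummy archive in a scratch directory and applies the same tar and mv steps to it; the name pig-demo and every path in it are placeholders, not real Pig artifacts. The point it illustrates is that mv operates on the directory produced by tar, not on the .tar.gz file itself.

```shell
#!/bin/sh
set -e
# Rehearse the extract-and-move steps on a dummy archive.
# "pig-demo" and all paths here are placeholders, not real Pig artifacts.
work=$(mktemp -d)
cd "$work"

# Build a tiny stand-in for a release tarball.
mkdir -p pig-demo/bin pig-demo/conf
echo "placeholder" > pig-demo/bin/pig
tar zcf pig-demo.tar.gz pig-demo
rm -r pig-demo

# Same pattern as the tutorial: extract, then move the extracted contents.
mkdir Pig
tar zxf pig-demo.tar.gz
mv pig-demo/* Pig/

# List what ended up in the target directory.
ls Pig
```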
After installing Apache Pig, we have to configure it. To do so, we need to edit two files: .bashrc and pig.properties.
In the .bashrc file, set the following variables:
PIG_HOME to the installation folder of Apache Pig
PATH to include the bin folder of the Pig installation
PIG_CLASSPATH to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files)
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
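Shell assignments must be written without spaces around the equals sign, and the existing PATH has to be referenced as $PATH so that it is extended rather than replaced. The following sketch sets the three variables and then prints what the shell will actually use. The /home/Hadoop/Pig location comes from this tutorial; the /usr/local/hadoop fallback for HADOOP_HOME is only an assumed placeholder for the case where Hadoop setup has not already defined it.

```shell
#!/bin/sh
# Install location used in this tutorial; adjust to your own layout.
PIG_HOME=/home/Hadoop/Pig
# HADOOP_HOME is normally set during Hadoop setup.
# The fallback path below is an assumed placeholder.
HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}

export PIG_HOME
export PATH="$PATH:$PIG_HOME/bin"
export PIG_CLASSPATH="$HADOOP_HOME/conf"

# Confirm the values the shell will actually use.
echo "PIG_HOME=$PIG_HOME"
echo "PIG_CLASSPATH=$PIG_CLASSPATH"
case ":$PATH:" in
    *":$PIG_HOME/bin:"*) echo "PATH contains $PIG_HOME/bin" ;;
    *) echo "PATH is missing $PIG_HOME/bin" ;;
esac
```

Lines placed in .bashrc only take effect in new shells; to apply them in the current one, run source ~/.bashrc.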
In Pig's conf folder, we have a file called pig.properties. In this file, you can set various parameters. To list the supported properties, run the following command.
$ pig -h properties
The following properties are supported:
    Logging:
        verbose = true|false; default is false. This property is the same as -v switch
        brief=true|false; default is false. This property is the same as -b switch
        debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
        aggregate.warning = true|false; default is true. If true, prints count of warnings
            of each type rather than logging each warning.
    Performance tuning:
        pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
            Note that this memory is shared across all large bags used by the application.
        pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
            Specifies the fraction of heap available for the reducer to perform the join.
        pig.exec.nocombiner = true|false; default is false.
            Only disable combiner as a temporary workaround for problems.
        opt.multiquery = true|false; multiquery is on by default.
            Only disable multiquery as a temporary workaround for problems.
        opt.fetch=true|false; fetch is on by default.
            Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
        pig.tmpfilecompression = true|false; compression is off by default.
            Determines whether output of intermediate jobs is compressed.
        pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
            Used in conjunction with pig.tmpfilecompression. Defines compression type.
        pig.noSplitCombination = true|false. Split combination is on by default.
            Determines if multiple small files are combined into a single map.
        pig.exec.mapPartAgg = true|false. Default is false.
            Determines if partial aggregation is done within map phase, before records are sent to combiner.
        pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
            If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.
    Miscellaneous:
        exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
        pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
        udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
        stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
        pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
            Determines the timezone used to handle datetime datatype and UDFs.
    Additionally, any Hadoop property can be specified.
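As an illustration of how such settings might look in practice, the sketch below appends a few of the properties listed above to a pig.properties file. The property names and values are taken from that listing; CONF_DIR is a hypothetical variable introduced here, and it defaults to a scratch directory so that running the sketch does not touch a real installation. On an actual setup it would point at Pig's conf folder.

```shell
#!/bin/sh
# Append a few of the listed properties to a pig.properties file.
# CONF_DIR is a placeholder; on a real install it would be Pig's conf folder.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat >> "$CONF_DIR/pig.properties" <<'EOF'
# Run in local mode instead of the default mapreduce (same as -x local).
exectype=local
# Terminate on the first error.
stop.on.failure=true
# Compress the output of intermediate jobs with gzip.
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip
EOF

# Show the resulting file.
cat "$CONF_DIR/pig.properties"
```

The script appends rather than overwrites, so any settings already present in the file are kept.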
Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will see the version of Apache Pig, as shown below.
$ pig -version
Apache Pig version 0.16.0 (r1682971) compiled Jun 01 2015, 11:44:35
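For scripted checks, the version number can be pulled out of that output line. The following is a small sketch, assuming the output keeps the "Apache Pig version X.Y.Z" shape shown above; pig_version is a hypothetical helper name that reads text on standard input.

```shell
#!/bin/sh
# Extract the version number from a line shaped like the output above.
# pig_version is a hypothetical helper; it reads text on standard input.
pig_version() {
    sed -n 's/^Apache Pig version \([0-9][0-9.]*\).*/\1/p'
}

# Example using the sample line from this tutorial:
echo "Apache Pig version 0.16.0 (r1682971) compiled Jun 01 2015, 11:44:35" | pig_version

# On a machine with Pig installed, one might instead run:
#   pig -version | pig_version
```

The helper prints nothing when the input does not match the expected shape, which makes an empty result a simple signal that the installation check failed.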