May 26, 2021 Apache Pig
In general, Apache Pig works on top of Hadoop. It is an analysis tool for analyzing large data sets that exist in the Hadoop File System. To analyze data using Apache Pig, we must first load the data into Apache Pig. This chapter describes how to load data from HDFS into Apache Pig.
In MapReduce mode, Pig reads (loads) data from HDFS and saves the results back to HDFS. So let us start with HDFS and create the following sample data in HDFS.
Student ID | First Name | Last Name | Phone Number | City |
---|---|---|---|---|
001 | Rajiv | Reddy | 9848022337 | Hyderabad |
002 | siddarth | Battacharya | 9848022338 | Kolkata |
003 | Rajesh | Khanna | 9848022339 | Delhi |
004 | Preethi | Agarwal | 9848022330 | Pune |
005 | Trupthi | Mohanthy | 9848022336 | Bhuwaneshwar |
006 | Archana | Mishra | 9848022335 | Chennai |
The above dataset contains the personal details of six students: id, first name, last name, phone number, and city.
First, verify the installation with the Hadoop version command, as shown below.
$ hadoop version
If Hadoop is installed on your system and the PATH variable is set, you will get output similar to the following:
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
Go to Hadoop's sbin directory and start YARN and the Hadoop DFS (distributed file system), as shown below.
$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
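As an optional sanity check (not part of the original walkthrough), the jps command lists the running Java daemons; after a successful start you should see the HDFS and YARN processes.

$ jps
# expected entries (process IDs will differ): NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager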
In Hadoop DFS, you can use the mkdir command to create a directory. Create a new directory named pig_data in HDFS, as shown below.
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
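To confirm that the directory was created (an optional check), you can list the HDFS root:

$ hdfs dfs -ls hdfs://localhost:9000/
# the listing should contain an entry for /pig_data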
In Pig's input file, each record is kept on a single line. The fields of a record are separated by a delimiter (in our example, we use ","). In the local file system, create an input file student_data.txt containing the data shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
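One way to create this file from the shell (an optional aside; the local directory /home/Hadoop/Pig/Pig_Data/ matches the path used in the put command below) is:

$ mkdir -p /home/Hadoop/Pig/Pig_Data
$ vi /home/Hadoop/Pig/Pig_Data/student_data.txt   # paste the six records shown above and save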
Now use the put command to move the file from the local file system to HDFS, as shown below. (You can also use the copyFromLocal command; a sketch of it follows the put command.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/
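The copyFromLocal alternative mentioned above would look like the following (same source and destination paths assumed):

$ hdfs dfs -copyFromLocal /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/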
Use the cat command to verify that the file has been moved into HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt
Now you can see the contents of the file, as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
You can use Pig Latin's LOAD operator to load data from the file system (HDFS/Local) into Apache Pig.
The load statement consists of two parts separated by the "=" operator. On the left, we specify the name of the relation in which we want to store the data, and on the right, we define how the data is stored. The syntax of the LOAD operator is given below.
Relation_name = LOAD 'Input file path' USING function as schema;
Description:
relation_name - We have to mention the relation in which we want to store the data.
Input file path - We have to mention the path of the file where the data is stored (in MapReduce mode, this is an HDFS path).
function - We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader); a short TextLoader sketch follows the schema syntax below.
schema - We have to define the schema of the data. The required schema can be defined as follows:
(column1 : data type, column2 : data type, column3 : data type);
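As an illustration of an alternative load function (a sketch, not part of the original text), TextLoader reads each line of the input as a single chararray:

grunt> lines = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING TextLoader() AS (line:chararray);
-- each tuple in 'lines' now holds one full line of the file, e.g. '001,Rajiv,Reddy,9848022337,Hyderabad'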
Note: We can also load the data without specifying a schema. In that case, the columns are addressed positionally as $0, $1, $2, and so on.
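For example (a sketch; the relation name student_ns is hypothetical), loading the same file without a schema and then projecting fields by position could look like this:

grunt> student_ns = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student_ns GENERATE $1, $2;   -- first name and last name, referenced by position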
As an example, let us use the LOAD command to load the data in student_data.txt into Pig, in a relation named student.
First, open a Linux terminal. Then start the Pig Grunt shell in MapReduce mode, as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell, as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000

grunt>
Now, execute the following Pig Latin statement in the Grunt shell to load the data from student_data.txt into Pig.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
The following is a description of the above statement.
Item | Description |
---|---|
Relation name | We have stored the data in the relation student. |
Input file path | We are reading the data from the file student_data.txt, located in the /pig_data/ directory of HDFS. |
Storage function | We have used the PigStorage() function, which loads and stores data as structured text files. It takes the delimiter by which each field of a record is separated as a parameter; by default, its value is '\t'. |
schema | We have stored the data using the schema (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray). |
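To illustrate the default delimiter (an aside; student_data_tab.txt is a hypothetical tab-separated copy of the dataset), PigStorage can be called without any argument:

grunt> student_tab = LOAD 'hdfs://localhost:9000/pig_data/student_data_tab.txt' USING PigStorage()
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
-- student_data_tab.txt is hypothetical: same records as student_data.txt, but tab-separated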
Note: The load statement simply loads the data into the specified relation in Pig. To verify the execution of the load statement, you have to use the diagnostic operators, which are discussed in a later section.
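As a quick preview (DUMP and DESCRIBE are two commonly used verification operators): DESCRIBE prints the schema of a relation, and DUMP executes the statements and prints the relation's contents.

grunt> DESCRIBE student;
grunt> DUMP student;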