May 26, 2021 Apache Pig
In general, Apache Pig works on top of Hadoop. It is an analysis tool for analyzing large data sets that exist in the Hadoop File System. To analyze data using Apache Pig, we must first load the data into Apache Pig. This chapter describes how to load data from HDFS into Apache Pig.
In MapReduce mode, Pig reads (loads) data from HDFS and saves the results back to HDFS. So let us start with HDFS and create the following sample data in HDFS.
Student ID | First Name | Last Name | Phone Number | City |
---|---|---|---|---|
001 | Rajiv | Reddy | 9848022337 | Hyderabad |
002 | siddarth | Battacharya | 9848022338 | Kolkata |
003 | Rajesh | Khanna | 9848022339 | Delhi |
004 | Preethi | Agarwal | 9848022330 | Pune |
005 | Trupthi | Mohanthy | 9848022336 | Bhuwaneshwar |
006 | Archana | Mishra | 9848022335 | Chennai |
The above dataset contains the personal details of six students: id, first name, last name, phone number, and city.
First, verify the installation with the Hadoop version command, as shown below.
$ hadoop version
If Hadoop is installed on your system and the PATH variable is set, you will get output similar to the following:
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar
Go to Hadoop's sbin directory and start YARN and the Hadoop DFS (distributed file system), as shown below.
$ cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoop-secondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoop-nodemanager-localhost.localdomain.out
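As an optional sanity check (not part of the original walkthrough), the jps command lists the running Java daemons; after a successful start you should see the HDFS and YARN processes.

$ jps
# expected entries (process IDs will differ): NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager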
In Hadoop DFS, you can use the mkdir command to create a directory. Create a new directory named pig_data in HDFS, as shown below.
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
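To confirm that the directory was created (an optional check), you can list the HDFS root:

$ hdfs dfs -ls hdfs://localhost:9000/
# the listing should contain an entry for /pig_data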
In Pig's input file, each record is kept on a single line. The fields of a record are separated by a delimiter (in our example, we use ","). In the local file system, create an input file student_data.txt containing the data shown below.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
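One way to create this file from the shell (an optional aside; the local directory /home/Hadoop/Pig/Pig_Data/ matches the path used in the put command below) is:

$ mkdir -p /home/Hadoop/Pig/Pig_Data
$ vi /home/Hadoop/Pig/Pig_Data/student_data.txt   # paste the six records shown above and save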
Now use the put command to move the file from the local file system to HDFS, as shown below. (You can also use the copyFromLocal command; a sketch of it follows the put command.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/
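The copyFromLocal alternative mentioned above would look like the following (same source and destination paths assumed):

$ hdfs dfs -copyFromLocal /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/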
Use the cat command to verify that the file has been moved into HDFS, as shown below.
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt
Now you can see the contents of the file, as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
You can use Pig Latin's LOAD operator to load data from the file system (HDFS/Local) into Apache Pig.
The load statement consists of two parts separated by the "=" operator. On the left, we specify the name of the relation in which we want to store the data, and on the right, we define how the data is stored. The syntax of the LOAD operator is given below.
Relation_name = LOAD 'Input file path' USING function as schema;
Description:
relation_name - We have to mention the relation in which we want to store the data.
Input file path - We have to mention the path of the file where the data is stored (in MapReduce mode, this is an HDFS path).
function - We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader); a short TextLoader sketch follows the schema syntax below.
schema - We have to define the schema of the data. The required schema can be defined as follows:
(column1 : data type, column2 : data type, column3 : data type);
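As an illustration of an alternative load function (a sketch, not part of the original text), TextLoader reads each line of the input as a single chararray:

grunt> lines = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING TextLoader() AS (line:chararray);
-- each tuple in 'lines' now holds one full line of the file, e.g. '001,Rajiv,Reddy,9848022337,Hyderabad'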
Note: We can also load the data without specifying a schema. In that case, the columns are addressed positionally as $0, $1, $2, and so on.
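For example (a sketch; the relation name student_ns is hypothetical), loading the same file without a schema and then projecting fields by position could look like this:

grunt> student_ns = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student_ns GENERATE $1, $2;   -- first name and last name, referenced by position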
As an example, let us use the LOAD command to load the data in student_data.txt into Pig, in a relation named student.
First, open a Linux terminal. Then start the Pig Grunt shell in MapReduce mode, as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell, as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000

grunt>
Now, execute the following Pig Latin statement in the Grunt shell to load the data from student_data.txt into Pig.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
The following is a description of the above statement.
Item | Description |
---|---|
Relation name | We have stored the data in the relation student. |
Input file path | We are reading the data from the file student_data.txt, located in the /pig_data/ directory of HDFS. |
Storage function | We have used the PigStorage() function, which loads and stores data as structured text files. It takes the delimiter by which each field of a record is separated as a parameter; by default, its value is '\t'. |
schema | We have stored the data using the schema (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray). |
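To illustrate the default delimiter (an aside; student_data_tab.txt is a hypothetical tab-separated copy of the dataset), PigStorage can be called without any argument:

grunt> student_tab = LOAD 'hdfs://localhost:9000/pig_data/student_data_tab.txt' USING PigStorage()
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
-- student_data_tab.txt is hypothetical: same records as student_data.txt, but tab-separated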
Note: The load statement simply loads the data into the specified relation in Pig. To verify the execution of the load statement, you have to use the diagnostic operators, which are discussed in a later section.
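As a quick preview (DUMP and DESCRIBE are two commonly used verification operators): DESCRIBE prints the schema of a relation, and DUMP executes the statements and prints the relation's contents.

grunt> DESCRIBE student;
grunt> DUMP student;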