May 26, 2021 Apache Pig
In the last chapter, we learned how to load data into Apache Pig. Y ou can use the Store operator to store loaded data in the file system, and this chapter describes how to use the Store operator to store data in Apache Pig.
The syntax of the Store statement is given below.
STORE Relation_name INTO ' required_directory_path ' [USING function];
Let's say we have a file in HDFS that contains the following student_data.txt.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
Use the LOAD operator to read it into the relationship student, as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let's store the relationship in the HDFS directory "/pig_Output/", as shown below.
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');
After executing the store statement, you get the following output. /b10> Create a directory with the specified name and store the data in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. MapReduceLau ncher - 100% complete 2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05 13:05:05 UNKNOWN Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime job_14459_06 1 0 n/a n/a n/a n/a MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature 0 0 0 0 student MAP_ONLY OutPut folder hdfs://localhost:9000/pig_Output/ Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt" Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1443519499159_0006 2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine .mapReduceLayer.MapReduceLau ncher - Success!
You can verify the stored data as shown below.
First, use the ls command to list the files pig_output directory named " , as shown below.
hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/' Found 2 items rw-r--r- 1 Hadoop supergroup 0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS rw-r--r- 1 Hadoop supergroup 224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after the store statement was executed.
Using the cat command, list the contents of a file named part-m-00000, as shown below.
$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000' 1,Rajiv,Reddy,9848022337,Hyderabad 2,siddarth,Battacharya,9848022338,Kolkata 3,Rajesh,Khanna,9848022339,Delhi 4,Preethi,Agarwal,9848022330,Pune 5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 6,Archana,Mishra,9848022335,Chennai