Coding With Fun

Apache Pig Distinct operator


May 26, 2021 Apache Pig


The DISTINCT operator removes redundant (duplicate) tuples from a relation.

Syntax

The syntax of the DISTINCT operator is given below.

grunt> Relation_name2 = DISTINCT Relation_name1;
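
DISTINCT compares whole tuples, so it cannot be applied directly to a subset of fields. A minimal sketch of one common workaround is to project the fields of interest first and then apply DISTINCT. The file name students.txt, the relation names, and the field names below are assumptions made up for illustration.

```pig
-- Hypothetical input: path and schema are assumptions for this sketch.
students = LOAD 'students.txt' USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);

-- Keep only the field that should be unique.
cities = FOREACH students GENERATE city;

-- DISTINCT now removes duplicate one-field tuples.
unique_cities = DISTINCT cities;
```

Projecting before DISTINCT also reduces the amount of data that has to be compared, which is why this ordering is usually preferred.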

Example

Suppose you have a file named student_details.txt in the HDFS directory /pig_data/, as shown below.

student_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai

Load this file into Pig as the relation student_details, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, use the DISTINCT operator to remove the redundant (duplicate) tuples from the student_details relation and store the result in another relation named distinct_data, as shown below.

grunt> distinct_data = DISTINCT student_details;
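
One quick way to check how many duplicates were dropped is to count the tuples before and after DISTINCT. The following is a sketch only; the relation names all_rows, total_count, unique_rows, and unique_count are illustrative and not part of the original example.

```pig
-- Group every tuple into a single bag, then count it.
-- COUNT_STAR counts all tuples, including those with null fields.
all_rows     = GROUP student_details ALL;
total_count  = FOREACH all_rows GENERATE COUNT_STAR(student_details);

-- Same count on the de-duplicated relation.
unique_rows  = GROUP distinct_data ALL;
unique_count = FOREACH unique_rows GENERATE COUNT_STAR(distinct_data);

DUMP total_count;
DUMP unique_count;
```

With the sample file above, and assuming each duplicate line is byte-for-byte identical to its original (no stray trailing spaces), the first count should be 9 and the second 6.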

Verify

Use the DUMP operator to verify the contents of the relation distinct_data, as shown below.

grunt> Dump distinct_data;

Output

It produces the following output, displaying the contents of the relation distinct_data.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata) 
(3,Rajesh,Khanna,9848022339,Delhi) 
(4,Preethi,Agarwal,9848022330,Pune) 
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
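
Pig relations are unordered bags, so the tuples may not always appear in the order listed here. If a stable order matters, a minimal sketch is to sort the result explicitly before dumping it; the relation name sorted_data is an assumption for illustration.

```pig
-- Sort the de-duplicated tuples by id so the output order is predictable.
sorted_data = ORDER distinct_data BY id ASC;

DUMP sorted_data;
```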