Coding With Fun
Home Docker Django Node.js Articles Python pip guide FAQ Policy

Apache Pig Cogroup operator


May 26, 2021 Apache Pig


Table of contents


The COGROUP operator works in the same way as the GROUP operator. /b10> The only difference between the two operators is that the group operator is typically used for a relationship, while the cogroup operator is used for statements that involve two or more relationships.

Use Cogroup to group two relationships

Suppose you have two files in the HDFS directory /pig_data /, namely student_details.txt and employee_details.txt, as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork 
002,BOB,23,Kolkata 
003,Maya,23,Tokyo 
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai

Load these files into Pig, with the relationship names student_details and employee_details, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 
  
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now group the student_details/employee_details relationships by keywordage, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verify

Use the DUMP operator to validate the relationship cogroup_data, as shown below.

grunt> Dump cogroup_data;

Output

It produces the following output, showing the contents of cogroup_data called the file, as shown below.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, 
   {    })  
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },  
   { (6,Maggy,22,Chennai),(1,Robin,22,newyork) })  
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)}, 
   {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) 
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)}, 
   { })  
(25,{   }, 
   {(4,Sara,25,London)})

The cogroup operator groups grouped groups from each relationship by age, each of which describes a specific age value.

For example, if we consider the first group of results, which are grouped by age 21, it contains two packages

  • The first package holds all the metagroups that have a 21-year-old first relationship (student_details the first relationship);

  • The second package has a second relationship (in this case, employee_details) for all metagroups, which are 21 years old.

If the relationship does not have a group with an age value of 21, an empty package is returned.