Coding With Fun
Apache Pig Join operator


May 26, 2021 Apache Pig



The JOIN operator is used to combine records from two or more relations. When performing a join, we declare one field (or a group of fields) from each relation as the key. When these keys match, the two tuples are matched; otherwise the records are dropped. Joins can be of the following types:

  • Self-join
  • Inner-join
  • Outer-join − left join, right join, and full join

This chapter shows, by example, how to use the JOIN operator in Pig Latin. Suppose you have two files in the /pig_data/ directory of HDFS, customers.txt and orders.txt, as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

We load the two files into Pig as the relations customers and orders, as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Now let's perform various join operations on these two relations.

Self-join

A self-join joins a table with itself, as if the table were two relations, temporarily renaming at least one of them. In Apache Pig, a self-join is typically performed by loading the same data multiple times under different aliases (names). So let's load the contents of the file customers.txt as two relations, as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int); 

Syntax

The syntax for performing a self-join using the JOIN operator is given below.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;

Example

Let's perform a self-join on the customers data by joining the two relations customers1 and customers2, as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Use the DUMP operator to verify the relation customers3, as shown below.

grunt> Dump customers3;

Output

It produces the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
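
Joining on id pairs every customer only with itself, so the result above adds no new information. A self-join becomes more useful when the key is a non-unique field. The following is a minimal sketch, assuming the same customers.txt file and schema as above; the aliases c1, c2, same_age, pairs, and result are made up for this illustration.

```pig
-- Load the same file twice under two aliases (same path and schema as above).
c1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
c2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);

-- Join on age instead of id: each customer is paired with every customer of the same age.
same_age = JOIN c1 BY age, c2 BY age;

-- Keep each pair once and drop the rows where a customer is paired with itself.
pairs = FILTER same_age BY c1::id < c2::id;

-- Project readable columns; the :: prefix tells Pig which side a field came from.
result = FOREACH pairs GENERATE c1::name AS name1, c2::name AS name2, c1::age AS age;

DUMP result;
```

With the sample data, only Khilan and Chaitali share an age (25), so the result should contain that single pair.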

Inner Join

The inner join is used quite frequently. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.

Syntax

The following is the syntax for performing an inner join using the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let's perform an inner join on the two relations customers and orders, as shown below.

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Use the DUMP operator to verify the relation customer_orders, as shown below.

grunt> Dump customer_orders;

Output

You get the following output, showing the contents of the relation customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
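
Each joined tuple carries all fields of both relations, and Pig prefixes every field with the name of the relation it came from. The sketch below, assuming the same files and schemas as above, uses DESCRIBE to show those prefixed names and FOREACH to keep only a few columns. The aliases joined and order_summary are hypothetical names chosen for this example.

```pig
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Inner join: only customers that have at least one order survive.
joined = JOIN customers BY id, orders BY customer_id;

-- Print the schema; fields appear as customers::id, orders::oid, and so on.
DESCRIBE joined;

-- Keep one row per order with just the customer name, order id, and amount.
order_summary = FOREACH joined GENERATE
   customers::name AS name,
   orders::oid     AS oid,
   orders::amount  AS amount;

DUMP order_summary;
```

The `relation::field` form is only required when a field name would otherwise be ambiguous, but using it consistently after a join tends to make scripts easier to read.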

Outer Join

Unlike an inner join, an outer join returns all the rows from at least one of the relations. An outer join can be performed in three ways:

  • Left outer join
  • Right outer join
  • Full outer join

Left Outer Join

The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.

Syntax

The syntax for performing a left outer join using the JOIN operator is given below.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let's perform a left outer join on the two relations customers and orders, as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Use the DUMP operator to verify the relation outer_left, as shown below.

grunt> Dump outer_left;

Output

It produces the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,) 
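
The empty fields in the unmatched rows are nulls, which makes a left outer join a convenient way to find rows that have no partner. The sketch below, assuming the same files and schemas as above, lists the customers who have not placed any order. The aliases no_orders and names are hypothetical.

```pig
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Keep every customer; order fields are null where no order matches.
outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

-- A null order id means the customer had no matching order.
no_orders = FILTER outer_left BY orders::oid IS NULL;

-- Keep only the customer columns.
names = FOREACH no_orders GENERATE customers::id, customers::name;

DUMP names;
```

With the sample data this should list Ramesh, Hardik, Komal, and Muffy, the four customers whose rows end in empty fields in the output above.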

Right Outer Join

The right outer join operation returns all rows from the right relation, even if there are no matches in the left relation.

Syntax

The syntax for performing a right outer join using the JOIN operator is given below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Example

Let's perform a right outer join on the two relations customers and orders, as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Use the DUMP operator to verify the relation outer_right, as shown below.

grunt> Dump outer_right;

Output

It produces the following output, displaying the contents of the relation outer_right. Every order in the sample data has a matching customer, so the result is the same as that of the inner join.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
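
A right outer join can also be written as a left outer join with the two relations swapped. The sketch below, assuming the same files and schemas as above, shows both forms side by side; the alias swapped_left is hypothetical. The keyword OUTER is optional in Pig Latin, so RIGHT and RIGHT OUTER mean the same thing.

```pig
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Right outer join: every order is kept, customer fields first.
outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

-- Same rows expressed as a left outer join with the relations swapped;
-- here the order fields come first in each tuple.
swapped_left = JOIN orders BY customer_id LEFT OUTER, customers BY id;

DUMP outer_right;
DUMP swapped_left;
```

Both relations should contain the same matches; only the order of the columns differs.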

Full Outer Join

The full outer join operation returns a row when there is a match in either of the relations, so unmatched rows from both sides are kept.

Syntax

The syntax for performing a full outer join using the JOIN operator is given below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Example

Let's perform a full outer join on the two relations customers and orders, as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Use the DUMP operator to verify the relation outer_full, as shown below.

grunt> Dump outer_full;

Output

It produces the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
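
Because an outer join leaves nulls on the unmatched side, later steps often replace them with a default value. The sketch below, assuming the same files and schemas as above, uses Pig's conditional (bincond) expression to report an amount of 0 for customers without orders. The alias report is hypothetical.

```pig
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Keep unmatched rows from both sides.
outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

-- Replace a null amount with 0 using the (condition ? a : b) expression.
report = FOREACH outer_full GENERATE
   customers::name AS name,
   (orders::amount IS NULL ? 0 : orders::amount) AS amount;

DUMP report;
```

The same pattern works for the left and right outer joins shown earlier.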

Using Multiple Keys

We can perform a JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two relations using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Suppose you have two files in the /pig_data/ directory of HDFS, employee.txt and employee_contact.txt, as shown below.

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

Load the two files into Pig as the relations employee and employee_contact, as shown below.

grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
  
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

Now, let's join the contents of these two relations using the JOIN operator, as shown below.

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

Verification

Use the DUMP operator to verify the relation emp, as shown below.

grunt> Dump emp;

Output

It produces the following output, displaying the contents of the relation emp.

(1,Rajiv,Reddy,21,programmer,3,1,9848022337,[email protected],Hyderabad,3)
(2,siddarth,Battacharya,22,programmer,3,2,9848022338,[email protected],Kolkata,3)
(3,Rajesh,Khanna,22,programmer,3,3,9848022339,[email protected],Delhi,3)
(4,Preethi,Agarwal,21,programmer,3,4,9848022330,[email protected],Pune,3)
(5,Trupthi,Mohanthy,23,programmer,3,5,9848022336,[email protected],Bhuwaneshwar,3)
(6,Archana,Mishra,23,programmer,3,6,9848022335,[email protected],Chennai,3)
(7,Komal,Nayak,24,teamlead,2,7,9848022334,[email protected],trivendram,2)
(8,Bharathi,Nambiayar,24,manager,1,8,9848022333,[email protected],Chennai,1)
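
As the output shows, the key fields id and jobid appear twice in every tuple, once from each relation. A FOREACH step after the join can keep a single copy of each key. The sketch below assumes the same files and schemas as above; the alias emp_flat and the choice of columns are only for illustration.

```pig
employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
   AS (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

-- Two tuples match only when both id and jobid are equal.
emp = JOIN employee BY (id, jobid), employee_contact BY (id, jobid);

-- Keep one copy of the keys plus a few fields from each side.
emp_flat = FOREACH emp GENERATE
   employee::id            AS id,
   employee::jobid         AS jobid,
   employee::firstname     AS firstname,
   employee::lastname      AS lastname,
   employee_contact::phone AS phone,
   employee_contact::city  AS city;

DUMP emp_flat;
```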