May 26, 2021 Apache Pig
The JOIN operator is used to combine records from two or more relations. When performing a join operation, we declare one field (or a group of fields) from each relation as the key. When these keys match, the two corresponding tuples are matched; otherwise the records are dropped. Joins can be of the following types:
1. Self-join
2. Inner join
3. Left outer join
4. Right outer join
5. Full outer join
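The matching rule described above can be sketched in plain Python. This is only an illustration of the semantics, not how Pig executes a join; the function name `inner_join` and the two small relations `a` and `b` are hypothetical and do not come from the tutorial's data files.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Pair every left tuple with every right tuple that has the same key."""
    # Index the right-hand tuples by their key value.
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Emit one concatenated tuple per matching pair; unmatched tuples are dropped.
    return [l + r for l in left for r in index.get(l[left_key], [])]

# Hypothetical two-field relations, used only for illustration.
a = [(1, "x"), (2, "y"), (3, "z")]
b = [(2, "p"), (3, "q"), (3, "r"), (4, "s")]

joined = inner_join(a, b, 0, 0)
for row in joined:
    print(row)
```

Tuples with keys 1 (only in `a`) and 4 (only in `b`) have no partner and so do not appear in the result, while key 3 appears twice because it matches two tuples in `b`.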
This chapter describes, with examples, how to use the JOIN operator in Pig Latin. Suppose you have two files in the /pig_data/ directory of HDFS, customers.txt and orders.txt, as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
We load the two files into Pig as the relations customers and orders, as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);
Now let's perform various join operations on these two relations.
Self-join is used to join a table with itself, as if the table were two relations, temporarily renaming at least one of them. In Apache Pig, to perform a self-join we typically load the same data multiple times under different aliases (names). So let's load the contents of the file customers.txt as two relations, as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
The syntax for performing self-join operations using the JOIN operator is given below.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Let's perform a self-join on the customers data by joining the two relations customers1 and customers2, as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Use the DUMP operator to verify the relation customers3, as shown below.
grunt> Dump customers3;
It produces the following output, which shows the contents of the relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner join is used quite frequently. An inner join returns rows when there is a match in both tables. It creates a new relation by combining the column values of two relations (say A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.
The following is the syntax for performing inner join operations using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Let's perform an inner join on the two relations customers and orders, as shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Use the DUMP operator to verify the relation coustomer_orders, as shown below.
grunt> Dump coustomer_orders;
You get the following output, which shows the contents of the relation coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
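The row-by-row comparison behind an inner join can be mimicked in Python as a nested loop over the tutorial's two data sets. This is a sketch of the semantics only; Pig itself runs the join as a distributed job and does not guarantee the order of the output tuples.

```python
# Records from customers.txt: (id, name, age, address, salary)
customers = [
    (1, "Ramesh", 32, "Ahmedabad", 2000),
    (2, "Khilan", 25, "Delhi", 1500),
    (3, "kaushik", 23, "Kota", 2000),
    (4, "Chaitali", 25, "Mumbai", 6500),
    (5, "Hardik", 27, "Bhopal", 8500),
    (6, "Komal", 22, "MP", 4500),
    (7, "Muffy", 24, "Indore", 10000),
]

# Records from orders.txt: (oid, date, customer_id, amount)
orders = [
    (102, "2009-10-08 00:00:00", 3, 3000),
    (100, "2009-10-08 00:00:00", 3, 1500),
    (101, "2009-11-20 00:00:00", 2, 1560),
    (103, "2008-05-20 00:00:00", 4, 2060),
]

# Compare every customer with every order and keep the pairs
# where customers.id (field 0) equals orders.customer_id (field 2).
customer_orders = [c + o for c in customers for o in orders if c[0] == o[2]]

for row in sorted(customer_orders):
    print(row)
```

Customers 1, 5, 6 and 7 have no orders, so they are absent from the result, and customer 3 appears twice because two orders refer to it.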
Note: Outer join. Unlike an inner join, an outer join returns all the rows from at least one of the relations. An outer join is performed in three ways: left outer join, right outer join, and full outer join.
The left outer join operation returns all rows from the left relation, even if there are no matches in the right relation.
The syntax for performing left outer join operations using the JOIN operator is given below.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Let's perform a left outer join on the two relations customers and orders, as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Use the DUMP operator to verify the relation outer_left, as shown below.
grunt> Dump outer_left;
It produces the following output, which shows the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
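The empty fields in the output above are nulls added for customers without orders. A minimal Python sketch of that padding behaviour follows; the helper `left_outer_join` is hypothetical, and the relations are trimmed to (id, name) and (oid, customer_id, amount) to keep the example short.

```python
def left_outer_join(left, right, left_key, right_key):
    """Keep every left tuple; pad with None when no right tuple matches."""
    width = len(right[0]) if right else 0
    result = []
    for l in left:
        matches = [r for r in right if r[right_key] == l[left_key]]
        if matches:
            result.extend(l + r for r in matches)
        else:
            # No partner on the right: fill the right-hand fields with nulls.
            result.append(l + (None,) * width)
    return result

# Trimmed versions of the tutorial's data.
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (4, "Chaitali"),
             (5, "Hardik"), (6, "Komal"), (7, "Muffy")]
orders = [(102, 3, 3000), (100, 3, 1500), (101, 2, 1560), (103, 4, 2060)]

outer_left = left_outer_join(customers, orders, 0, 1)
for row in outer_left:
    print(row)
```

Every customer appears at least once; the four customers without orders carry `None` in the three order fields.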
The right outer join operation returns all rows in the right table, even if there are no matches in the left table.
Here's the syntax for using the JOIN operator to perform the right outer join operation.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Let's do right outer join on customers and orders, as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Use the DUMP operator to verify the relation outer_right, as shown below.
grunt> Dump outer_right;
It produces the following output, which shows the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
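In this data set every order refers to an existing customer, so the right outer join output is identical to the inner join output. The Python sketch below shows what changes when that is not the case. The helper `right_outer_join` is hypothetical, the relations are trimmed to a few fields, and the order with oid 104 is an invented record that is not part of orders.txt.

```python
def right_outer_join(left, right, left_key, right_key):
    """Keep every right tuple; pad with None when no left tuple matches."""
    width = len(left[0]) if left else 0
    result = []
    for r in right:
        matches = [l for l in left if l[left_key] == r[right_key]]
        if matches:
            result.extend(l + r for l in matches)
        else:
            # No partner on the left: fill the left-hand fields with nulls.
            result.append((None,) * width + r)
    return result

# Trimmed customers (id, name) and orders (oid, customer_id, amount).
customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (4, "Chaitali"),
             (5, "Hardik"), (6, "Komal"), (7, "Muffy")]
orders = [(102, 3, 3000), (100, 3, 1500), (101, 2, 1560), (103, 4, 2060),
          (104, 9, 500)]  # hypothetical order for a non-existent customer 9

outer_right = right_outer_join(customers, orders, 0, 1)
for row in outer_right:
    print(row)
```

The invented order 104 is kept with `None` in the customer fields, whereas an inner join would have dropped it.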
The full outer join operation returns all rows from both relations: rows that match are combined, and rows without a match in the other relation are padded with nulls.
Here's the syntax for using the JOIN operator to perform full outer join.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Let's do full outer join on customers and orders, as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Use the DUMP operator to verify the relation outer_full, as shown below.
grunt> Dump outer_full;
It produces the following output, which shows the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
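A full outer join can be thought of as a left outer join plus any right-hand tuples that found no partner. The Python sketch below illustrates this under the same simplifications as before: `full_outer_join` is a hypothetical helper, and the relations are trimmed to (id, name) and (oid, customer_id, amount).

```python
def full_outer_join(left, right, left_key, right_key):
    """Keep every tuple from both sides, padding the missing side with None."""
    left_width = len(left[0]) if left else 0
    right_width = len(right[0]) if right else 0
    result = []
    matched_right = set()  # positions of right tuples that found a partner
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if l[left_key] == r[right_key]:
                result.append(l + r)
                matched_right.add(i)
                hit = True
        if not hit:
            result.append(l + (None,) * right_width)
    # Right tuples that never matched are emitted with a null left side.
    for i, r in enumerate(right):
        if i not in matched_right:
            result.append((None,) * left_width + r)
    return result

customers = [(1, "Ramesh"), (2, "Khilan"), (3, "kaushik"), (4, "Chaitali"),
             (5, "Hardik"), (6, "Komal"), (7, "Muffy")]
orders = [(102, 3, 3000), (100, 3, 1500), (101, 2, 1560), (103, 4, 2060)]

outer_full = full_outer_join(customers, orders, 0, 1)
for row in outer_full:
    print(row)
```

Because every order in this data matches a customer, no row has a null customer side, which is why the full outer join output above equals the left outer join output.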
We can use multiple keys to perform JOIN operations.
Here's the syntax for performing a JOIN operation on two relations using multiple keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);
Suppose you have two files in the /pig_data/ directory of HDFS, employee.txt and employee_contact.txt, as shown below.
employee.txt
001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001
employee_contact.txt
001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001
Load the two files into Pig as the relations employee and employee_contact, as shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
Now, let's use the JOIN operator to join the contents of these two relations, as shown below.
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Use the DUMP operator to verify the relation emp, as shown below.
grunt> Dump emp;
It produces the following output, which shows the contents of the relation emp. Because id and jobid are loaded as int, the leading zeros in the input files are not preserved.
(1,Rajiv,Reddy,21,programmer,3,1,9848022337,[email protected],Hyderabad,3)
(2,siddarth,Battacharya,22,programmer,3,2,9848022338,[email protected],Kolkata,3)
(3,Rajesh,Khanna,22,programmer,3,3,9848022339,[email protected],Delhi,3)
(4,Preethi,Agarwal,21,programmer,3,4,9848022330,[email protected],Pune,3)
(5,Trupthi,Mohanthy,23,programmer,3,5,9848022336,[email protected],Bhuwaneshwar,3)
(6,Archana,Mishra,23,programmer,3,6,9848022335,[email protected],Chennai,3)
(7,Komal,Nayak,24,teamlead,2,7,9848022334,[email protected],trivendram,2)
(8,Bharathi,Nambiayar,24,manager,1,8,9848022333,[email protected],Chennai,1)
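With multiple keys, two tuples match only when all key fields are equal. One way to picture this in Python is to treat the key as a tuple of field values. The sketch below makes several assumptions: `join_on` is a hypothetical helper, the relations are trimmed to (id, name, jobid) and (id, city, jobid), and the contact record with city "Pune" and jobid 9 is invented to show a partial match being rejected.

```python
def join_on(left, right, left_cols, right_cols):
    """Inner join on a composite key made of several columns."""
    # Index right-hand tuples by the tuple of their key columns.
    index = {}
    for r in right:
        key = tuple(r[c] for c in right_cols)
        index.setdefault(key, []).append(r)
    # A left tuple matches only if ALL of its key columns agree.
    return [l + r
            for l in left
            for r in index.get(tuple(l[c] for c in left_cols), [])]

# Trimmed employee (id, firstname, jobid) and employee_contact (id, city, jobid).
employee = [(1, "Rajiv", 3), (7, "Komal", 2), (8, "Bharathi", 1)]
employee_contact = [
    (1, "Hyderabad", 3),
    (7, "trivendram", 2),
    (8, "Chennai", 1),
    (7, "Pune", 9),  # hypothetical: same id as Komal but a different jobid
]

emp = join_on(employee, employee_contact, (0, 2), (0, 2))
for row in emp:
    print(row)
```

The invented record shares id 7 with an employee but has a different jobid, so it is not joined; joining on id alone would have produced an extra row.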