in Big Data Hadoop & Spark by (19k points)
Please explain the difference between cogroup and full outer join in Spark.

1 Answer

by (33.1k points)

Full Outer Join

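  • A full outer join on RDDs (fullOuterJoin) behaves like a FULL OUTER JOIN in SQL.
  • It returns every record from both sides: records with matching keys are paired, and a record with no match on the other side is still returned, with the missing side left empty (NULL in SQL, None in Spark).
  • A key with several records on both sides produces one row per combination, so the result can be very large.
  • In SQL, FULL JOIN and FULL OUTER JOIN are synonyms.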
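The behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not Spark's implementation: the function name full_outer_join and the sample data are hypothetical, and lists of (key, value) pairs stand in for pair RDDs.

```python
from collections import defaultdict

def full_outer_join(left, right):
    """Join two lists of (key, value) pairs, keeping unmatched keys from both sides."""
    left_by_key = defaultdict(list)
    right_by_key = defaultdict(list)
    for k, v in left:
        left_by_key[k].append(v)
    for k, w in right:
        right_by_key[k].append(w)

    rows = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        # A side with no record for this key contributes a single None,
        # mirroring the NULL padding of a SQL FULL OUTER JOIN.
        lvals = left_by_key.get(k, [None])
        rvals = right_by_key.get(k, [None])
        for v in lvals:
            for w in rvals:
                rows.append((k, (v, w)))
    return rows

# Hypothetical sample data: (owner, pet) and (person, friend).
pets = [("Alice", "cat"), ("Bob", "dog")]
friends = [("Alice", "Cindy"), ("Jane", "Paul")]

joined = full_outer_join(pets, friends)
for row in joined:
    print(row)
```

Bob has no friend record and Jane has no pet record, yet both appear in the result with None on the missing side.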

Group and Co-group

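  • In Pig Latin, the GROUP and COGROUP operators are identical. By convention, GROUP is used in statements involving one relation and COGROUP in statements involving two or more relations. Spark's cogroup on pair RDDs follows the same idea: for each key it returns one record holding a separate collection of values from each input.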

For example:

A = LOAD 'student' AS (name:chararray,age:int,gpa:float);
DUMP A;
(John,18,4.0F)
(Mary,19,3.8F)
(Bill,20,3.9F)
(Joe,18,3.8F)

B = GROUP A BY age;
DUMP B;
(18,{(John,18,4.0F),(Joe,18,3.8F)})
(19,{(Mary,19,3.8F)})
(20,{(Bill,20,3.9F)})
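For readers who do not use Pig, the same grouping can be sketched in plain Python. The helper group_by is a hypothetical name introduced for illustration; it only mimics what GROUP A BY age produces for the student data above.

```python
from collections import defaultdict

# The student tuples from the example: (name, age, gpa).
students = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by(rows, key_index):
    """Collect whole rows into one list per distinct value of the key field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

by_age = group_by(students, 1)
for age in sorted(by_age):
    print(age, by_age[age])
```

Each key maps to a single bag of complete tuples, which is the one-relation case of what COGROUP does for several relations.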

The next example uses COGROUP on two relations:

A = LOAD 'data1' AS (owner:chararray,pet:chararray);
DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)

B = LOAD 'data2' AS (friend1:chararray,friend2:chararray);
DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)

X = COGROUP A BY owner, B BY friend2;
DUMP X;
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})

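In the above example, each output record starts with the key. The first bag holds the tuples from the first relation (A) whose key field matches it, and the second bag holds the matching tuples from the second relation (B). If a relation has no tuples for a key, its bag is empty, as with Jane, who appears only in B. A full outer join would instead flatten the bags into one row per pair of matching tuples, with null on the missing side.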
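To make the contrast with a full outer join concrete, here is a plain-Python sketch that cogroups the owner/friend data and then flattens the result into full-outer-join rows. The function names cogroup and full_outer_join_from_cogroup are hypothetical, and the code models the semantics only; it is not how Spark or Pig implement these operators.

```python
from collections import defaultdict

def cogroup(left, right):
    """Return {key: (left_values, right_values)} for every key in either input."""
    grouped = defaultdict(lambda: ([], []))
    for k, v in left:
        grouped[k][0].append(v)
    for k, w in right:
        grouped[k][1].append(w)
    return dict(grouped)

def full_outer_join_from_cogroup(grouped):
    """Flatten cogroup output into join rows, padding an empty side with None."""
    rows = []
    for k in sorted(grouped):
        lvals, rvals = grouped[k]
        for v in lvals or [None]:
            for w in rvals or [None]:
                rows.append((k, (v, w)))
    return rows

# Data from the COGROUP example, keyed by owner and by friend2.
pets = [("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
        ("Bob", "dog"), ("Bob", "cat")]
friends = [("Alice", "Cindy"), ("Alice", "Mark"), ("Bob", "Paul"), ("Jane", "Paul")]

grouped = cogroup(pets, friends)
joined = full_outer_join_from_cogroup(grouped)
print(len(grouped), "cogroup records;", len(joined), "full outer join rows")
# prints: 3 cogroup records; 9 full outer join rows
```

Cogroup yields one record per key (3 here), while the full outer join yields one row per combination of values (6 for Alice, 2 for Bob, 1 for Jane). A full outer join can therefore be derived from a cogroup, but the grouping cannot be recovered from the join without regrouping.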

I hope this answer helps!
