Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

How to give more column conditions when joining two dataframes. For example I want to run the following :

val Lead_all = Leads.join(Utm_Master, 
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")


I want to join only when these columns match. But above syntax is not valid as cols only takes one string. So how do I get what I want.

1 Answer

0 votes
by (32.3k points)

In Pyspark you can simply specify each condition separately:

val Lead_all = Leads.join(Utm_Master, (Leaddetails.LeadSource == Utm_Master.LeadSource) & (Leaddetails.Utm_Source == Utm_Master.Utm_Source) & (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) & (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))

Just be sure to use operators and parenthesis correctly.

Welcome to Intellipaat Community. Get your technical queries answered by top developers!

30.5k questions

32.6k answers

500 comments

108k users

Browse Categories

...