in Big Data Hadoop & Spark by (11.5k points)

How can I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master, 
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")


I want to join only when all of these columns match. However, the syntax above is not valid, because cols takes only one string. How can I express this join?

1 Answer

by (32.5k points)

In PySpark, you can specify each condition separately and combine them with &:

# PySpark has no val keyword; "left" is the join type requested in the question.
Lead_all = Leads.join(
    Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource)
    & (Leaddetails.Utm_Source == Utm_Master.Utm_Source)
    & (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium)
    & (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign),
    "left",
)
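When the key columns have the same names in both DataFrames, the combined condition can also be built from a list of names instead of being typed out. The following is only a sketch: build_join_condition, leads_df and utm_df are hypothetical names that do not appear in the question, and the helper assumes that indexing a DataFrame by column name returns an object supporting == and &, as PySpark columns do.

```python
from functools import reduce
from operator import and_


def build_join_condition(left, right, columns):
    """Return left[c] == right[c] for every name in columns, combined with &.

    Hypothetical helper, not part of PySpark. columns must be non-empty,
    otherwise reduce raises a TypeError.
    """
    return reduce(and_, [left[c] == right[c] for c in columns])


# Possible usage with two PySpark DataFrames (placeholder names):
# keys = ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"]
# lead_all = leads_df.join(utm_df, build_join_condition(leads_df, utm_df, keys), "left")
```

If the duplicated key columns are not needed in the result, PySpark's join also accepts the list of column names directly as its second argument, for example leads_df.join(utm_df, keys, "left").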

Be sure to wrap each comparison in parentheses and to combine them with & rather than and, because in Python & binds more tightly than ==.
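To see why the parentheses matter, here is a small sketch that needs only the Python standard library. It parses two expressions with placeholder names a, b, c and d (they stand in for columns and are not from the question) and shows how Python groups them.

```python
import ast

# Without parentheses, & is applied before ==, so Python reads this as the
# chained comparison a == (b & c) == d.
loose = ast.parse("a == b & c == d", mode="eval").body

# With parentheses, each comparison is its own operand of &.
grouped = ast.parse("(a == b) & (c == d)", mode="eval").body

print(type(loose).__name__)    # Compare
print(type(grouped).__name__)  # BinOp
```

Only the parenthesized form combines the two comparison results, which is what a multi-column join condition needs.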
