2 views

I am performing data manipulation in python using pandas on a very large dataset, say 100 million rows. I have two dataframes and wish to generate third dataframe as per the conditions mentioned, the scenario is explained below:

dataframe 1:

Col_B and Col_D are of int64 type

Col_A   Col_B   Col_C   Col_D

A       11      B       20

A       11      C       24

B       14      R       32

...      ...    ...      ...

dataframe 2:

Col_Z is of type float64 and remaining columns are of int64

Col_X   Col_Y   Col_P   Col_Q   Col_Z

10      15      16      21      0.99

10      15      17      22      0.89

...     ...     ...     ...      ...

...     ...     ...     ...      ...

11      15      30      35      0.67

...     ...     ...     ...      ...

Condition to be applied: Consider only first row of both the dataframe, for the sake of understanding conditions:

if the value of (Col_B is between the value of Col_X and Col_Y) and value of (Col_D is between the value of Col_P and Col_Q) then return the corresponding value of Col_A, Col_C and Col_Z, otherwise return NaN

Expected Output (Dataframe 3):

Col_A   Col_C   Col_Z

A       B       0.99

NaN     NaN      NaN

B       R       0.67

Note: This output is generated merely considering if there are only these three rows in dataframes but in actual each value of Dataframe 1 has to scan all of the values in Dataframe 2 until desired conditions is achieved.

My Code:

df3 = {}

Col_A = []

Col_C = []

Col_Z = []

for i in df1.iterrows():

value = float(df2[(i > df2['Col_X'].values) &

(i < df2['Col_Y'].values) &

(i > df2['Col_P'].values) &

(i < df2['Col_Q'].values)]['Col_Z'])

if bool(value):

Col_Z.append(value)

Col_A.append(i)

Col_C.append(i)

else:

Col_Z.append(float('NaN'))

Col_A.append(float('NaN'))

Col_C.append(float('NaN'))

This code is working fine uptill the condition is met, as soon as condition does'nt met, it throws a TypeError. Please can any rectify this.

Also, I wanted to know if there is any alternate and efficient way to perform it. Please let me know.

by (41.4k points)

This code will work fine:

df3 = {}

Col_A = []

Col_C = []

Col_Z = []

for i in df1.iterrows():

value = df2[(i > df2['Col_X'].values) &

(i < df2['Col_Y'].values) &

(i > df2['Col_P'].values) &

(i < df2['Col_Q'].values)]['Col_Z']

if value.empty:

Col_Z.append(float('NaN'))

Col_A.append(float('NaN'))

Col_C.append(float('NaN'))

else:

Col_Z.append(value)

Col_A.append(i)

Col_C.append(i)

df3['A'] = Col_A

df3['C'] = Col_C

df3['Z'] = Col_Z

df3 = pd.DataFrame(df3)