Speed up or vectorize pandas apply function - require a conditional application of a function

Question

asked Jul 10, 2019 in Data Science by sourav (17.6k points)

I want to apply a function row-wise to a dataframe that looks like this:

name value
'foo' 2
'bar' 4
'bar' 3
'foo' 1
. .
. .
. .
'bar' 8

Speed is important to me since I am operating on multiple 90GB datasets, so I have been attempting to vectorize the following operation for use in df.apply:

Conditioned on the 'name', I want to plug 'value' into a separate function, perform some arithmetic on the result, and write to a new column 'output'. Something like,

funcs = {'foo': <FunctionObject>, 'bar': <FunctionObject>}
def masterFunc(row):
correctFunction = funcs[row['name']]
row['output'] = correctFunction(row['value']) + 3*row['value']
df.apply(masterFunc, axis=1).

In my real problem, I have 32 different functions that could apply to the 'value' based on the 'name'. Each of those individual functions (fooFunc, barFunc, zooFunc, etc) are already vectorized; they are scipy.interp1d functions built like this:

separateFunc = scipy.interpolate.interp1d(x-coords=[2, 3, 4], y-coords=[3, 5, 7])
#separateFunc is now a math function, y=2x-1. use case:
y = separateFunc(3.5) # y == 6

However, I am not sure how I can vectorize the masterFunc itself. It seems like choosing which function to 'pull out' to apply to the 'value' is very expensive, because it requires a memory access at each iteration (with my current method of storing the functions in hashtables). However, the alternative just seems to be a bunch of if-then statements, which also seems impossible to vectorize. How can I speed this up?

Actual code, with repetitive parts removed for brevity:

interpolationFunctions = {}
#the 'interpolate.emissionsFunctions' are a separate function which does some scipy stuff
interpolationFunctions[2] = interpolate.emissionsFunctions('./roadtype_2_curve.csv')
interpolationFunctions[3] = interpolate.emissionsFunctions('./roadtype_3_curve.csv')
def compute_pollutants(row):
funcs = interpolationFunctions[row['roadtype']]
speed = row['speed']
length = row['length']
row['CO2-Atm'] = funcs['CO2-Atm'](speed)*length*speed*0.00310686368
row['CO2-Eq'] = funcs['CO2-Eq'](speed)*length*speed*0.00310686368
return row

1 Answer

Shlok Pandey · Answer 1 · 2019-07-15T06:18:02+0000

def foo2(x, y):
return x + y

import time as t
# Sample Functions

Refer to the code below,you can run it with different row sizes to compare the results between different methods.

import numpy as np

import pandas as pd

def foo(x):
return x + x

def bar(x):
return x * x
# Sample Functions for multiple columns

def bar2(x, y):
return x * y
# Now,create function dictionary
funcs = {'foo': foo, 'bar': bar}
funcs2 = {'foo': foo2, 'bar': bar2}
n_rows = 1000000
# Generate Sample Data
names = np.random.choice(list(funcs.keys()), size=n_rows)
values = np.random.normal(100, 20, size=n_rows)
df = pd.DataFrame()
df['name'] = names
df['value'] = values
# Create copy for comparison using different methods
df_copy = df.copy()
# Modified original master function
def masterFunc(row, functs):
correctFunction = funcs[row['name']]
return correctFunction(row['value']) + 3*row['value']
t1 = t.time()
df['output'] = df.apply(lambda x: masterFunc(x, funcs), axis=1)
t2 = t.time()
print("Time for all rows/functions: ", t2 - t1)
# For Functions that Can be vectorized using numpy
t3 = t.time()
output_dataframe_list = []
for func_name, func in funcs.items():
df_subset = df_copy.loc[df_copy['name'] == func_name,:]
df_subset['output'] = func(df_subset['value'].values) + 3 * df_subset['value'].values
output_dataframe_list.append(df_subset)
output_df = pd.concat(output_dataframe_list)
t4 = t.time()
print("Time for all rows/functions: ", t4 - t3)
# Using a for loop over numpy array of values is still faster than dataframe apply using
t5 = t.time()
output_dataframe_list2 = []
for func_name, func in funcs2.items():
df_subset = df_copy.loc[df_copy['name'] == func_name,:]
col1_values = df_subset['value'].values
outputs = np.zeros(len(col1_values))
for i, v in enumerate(col1_values):
outputs[i] = func(col1_values[i], col1_values[i]) + 3 * col1_values[i]
df_subset['output'] = np.array(outputs)
output_dataframe_list2.append(df_subset)
output_df2 = pd.concat(output_dataframe_list2)
t6 = t.time()
print("Time for all rows/functions: ", t6 - t5)

If you wish to learn about pandas refer to this Pandas Tutorial.

Speed up or vectorize pandas apply function - require a conditional application of a function

1 Answer

Related questions

Browse Categories