Explore Courses Blog Tutorials Interview Questions
0 votes
in Data Science by (17.6k points)

I want to apply a function row-wise to a dataframe that looks like this:

name  value 

'foo' 2

'bar' 4

'bar' 3

'foo' 1

  .   .

  .   .

  .   .

'bar' 8

Speed is important to me since I am operating on multiple 90GB datasets, so I have been attempting to vectorize the following operation for use in df.apply:

Conditioned on the 'name', I want to plug 'value' into a separate function, perform some arithmetic on the result, and write to a new column 'output'. Something like,

funcs = {'foo': <FunctionObject>, 'bar': <FunctionObject>}

def masterFunc(row):

    correctFunction = funcs[row['name']]

    row['output'] = correctFunction(row['value']) + 3*row['value']

df.apply(masterFunc, axis=1).

In my real problem, I have 32 different functions that could apply to the 'value' based on the 'name'. Each of those individual functions (fooFunc, barFunc, zooFunc, etc) are already vectorized; they are scipy.interp1d functions built like this:

separateFunc = scipy.interpolate.interp1d(x-coords=[2, 3, 4], y-coords=[3, 5, 7])

#separateFunc is now a math function, y=2x-1. use case:

y = separateFunc(3.5) # y == 6

However, I am not sure how I can vectorize the masterFunc itself. It seems like choosing which function to 'pull out' to apply to the 'value' is very expensive, because it requires a memory access at each iteration (with my current method of storing the functions in hashtables). However, the alternative just seems to be a bunch of if-then statements, which also seems impossible to vectorize. How can I speed this up?

Actual code, with repetitive parts removed for brevity:

interpolationFunctions = {}

#the 'interpolate.emissionsFunctions' are a separate function which does some scipy stuff

interpolationFunctions[2] = interpolate.emissionsFunctions('./roadtype_2_curve.csv')

interpolationFunctions[3] = interpolate.emissionsFunctions('./roadtype_3_curve.csv')

def compute_pollutants(row):

    funcs = interpolationFunctions[row['roadtype']]

    speed = row['speed']

    length = row['length']

    row['CO2-Atm'] = funcs['CO2-Atm'](speed)*length*speed*0.00310686368

    row['CO2-Eq'] = funcs['CO2-Eq'](speed)*length*speed*0.00310686368

    return row

1 Answer

0 votes
by (41.4k points)

def foo2(x, y):

    return x + y

import time as t

# Sample Functions

Refer to the code below,you can run it with different row sizes to compare the results between different methods.

import numpy as np

import pandas as pd

def foo(x):

    return x + x

def bar(x):

    return x * x

# Sample Functions for multiple columns

def bar2(x, y):

    return x * y

# Now,create function dictionary

funcs = {'foo': foo, 'bar': bar}

funcs2 = {'foo': foo2, 'bar': bar2}

n_rows = 1000000

# Generate Sample Data

names = np.random.choice(list(funcs.keys()), size=n_rows)

values = np.random.normal(100, 20, size=n_rows)

df = pd.DataFrame()

df['name'] = names

df['value'] = values

# Create copy for comparison using different methods

df_copy = df.copy()

# Modified original master function

def masterFunc(row, functs):

    correctFunction = funcs[row['name']]

    return correctFunction(row['value']) + 3*row['value']

t1 = t.time()

df['output'] = df.apply(lambda x: masterFunc(x, funcs), axis=1)

t2 = t.time()

print("Time for all rows/functions: ", t2 - t1)

# For Functions that Can be vectorized using numpy

t3 = t.time()

output_dataframe_list = []

for func_name, func in funcs.items():

    df_subset = df_copy.loc[df_copy['name'] == func_name,:]

    df_subset['output'] = func(df_subset['value'].values) + 3 * df_subset['value'].values


output_df = pd.concat(output_dataframe_list)

t4 = t.time()

print("Time for all rows/functions: ", t4 - t3)

# Using a for loop over numpy array of values is still faster than dataframe apply using

t5 = t.time()

output_dataframe_list2 = []

for func_name, func in funcs2.items():

    df_subset = df_copy.loc[df_copy['name'] == func_name,:]

    col1_values = df_subset['value'].values

    outputs = np.zeros(len(col1_values))

    for i, v in enumerate(col1_values):

        outputs[i] = func(col1_values[i], col1_values[i]) + 3 * col1_values[i]

    df_subset['output'] = np.array(outputs)


output_df2 = pd.concat(output_dataframe_list2)

t6 = t.time()

print("Time for all rows/functions: ", t6 - t5)

If you wish to learn about pandas refer to this Pandas Tutorial.

Browse Categories