in Python by (19.9k points)

I need to update a column of a pandas DataFrame based on processing a list of selected values (df0['parcels'].values in the code below). The code works, but it is slow because the list of selected values is long (about 45,000 values): it needs 5 hours to complete the task.

Since the processing of each selected value is independent, I would like to parallelize it to improve the speed.

import numpy as np
import pandas as pd
import scipy.stats as ss  # needed for ss.mode below
from scipy.ndimage import distance_transform_edt as edt

# parcels (2-D label array), r_parcels (pixel sizes) and the
# DataFrames df0 / df are defined earlier in my script.
for i in df0['parcels'].values:
    y, x = np.where(parcels == i)
    # crop a window around parcel i, with a 5-pixel margin
    tmp = parcels[np.min(y) - 5:np.max(y) + 6, np.min(x) - 5:np.max(x) + 6]
    dst = edt(tmp, sampling=r_parcels)
    par = tmp[dst <= 20]
    par = par[par != -9999]  # drop the nodata value
    mod, cnt = ss.mode(par)
    df['parcels'] = df['parcels'].replace(i, mod[0])
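For context, the core of each iteration is scipy's Euclidean distance transform: every nonzero cell gets the distance to the nearest zero cell, scaled by the per-axis sampling. A minimal sketch on a made-up array `a` (the values below are purely illustrative, not the real parcel data):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as edt

# 0 marks background; each nonzero cell receives the Euclidean
# distance to the nearest zero cell.
a = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1]])

# sampling gives the physical size of a pixel along each axis,
# as in the question's edt(tmp, sampling=r_parcels) call
d = edt(a, sampling=[1, 1])
# d[0, 1] == 1.0 (one pixel from the zero at (0, 0))
# d[0, 2] == sqrt(2) (diagonal to the zero at (1, 1))
```

A mask such as `a[d <= 20]` then selects the cells within a given physical distance of the background, which is exactly how `par` is built above.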

1 Answer

by (25.1k points)

You can use the Pool class from the multiprocessing package: turn the loop body into a function that returns the replacement value for one parcel, map it over the parcel ids in parallel, then apply all replacements at once.

import numpy as np
import pandas as pd
import scipy.stats as ss  # needed for ss.mode below
from scipy.ndimage import distance_transform_edt as edt
import multiprocessing as mp

def func(i):  # the body of the loop becomes a function
    y, x = np.where(parcels == i)
    tmp = parcels[np.min(y) - 5:np.max(y) + 6, np.min(x) - 5:np.max(x) + 6]
    dst = edt(tmp, sampling=r_parcels)
    par = tmp[dst <= 20]
    par = par[par != -9999]
    mod, cnt = ss.mode(par)
    return mod[0]  # return only the replacement value for parcel i

if __name__ == '__main__':
    num_workers = mp.cpu_count()
    pool = mp.Pool(num_workers)
    # compute the replacement value for every parcel in parallel
    new_vals = pool.map(func, df0['parcels'].values)
    pool.close()
    pool.join()
    # apply all replacements in a single pass instead of one
    # df.replace call per parcel
    mapping = dict(zip(df0['parcels'].values, new_vals))
    df['parcels'] = df['parcels'].replace(mapping)
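The same Pool.map-then-replace pattern can be sketched on toy data. Here NEIGHBOURS and replacement_value are invented stand-ins for the real parcel arrays and for func above; only the parallelization pattern carries over:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical stand-in for the per-parcel neighbourhood: each
# parcel id maps to the values found near it (made-up data).
NEIGHBOURS = {1: [1, 1, 2], 2: [3, 3, 3], 3: [2, 2, 1]}

def replacement_value(i):
    # like func() in the answer: the mode of the neighbourhood
    res = ss.mode(NEIGHBOURS[i])
    return int(np.asarray(res.mode).ravel()[0])

if __name__ == '__main__':
    df = pd.DataFrame({'parcels': [1, 2, 3, 1]})
    with mp.Pool(2) as pool:
        new_vals = pool.map(replacement_value, [1, 2, 3])
    mapping = dict(zip([1, 2, 3], new_vals))
    df['parcels'] = df['parcels'].replace(mapping)
    print(df['parcels'].tolist())  # [1, 3, 2, 1]
```

Returning plain values and building one replacement dict keeps the inter-process traffic small (one scalar per parcel instead of a whole Series) and lets pandas do a single vectorized replace at the end.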
