in Python by (19.9k points)

I need to update a column of a pandas DataFrame based on processing a list of selected values (df0['parcels'].values in the code below). The code works, but it is slow because the list of selected values is long (about 45,000 values): it needs 5 hours to complete the task.

Since the processing of each selected value is independent, I would like to parallelize it to improve the speed.

import numpy as np
import pandas as pd
import scipy.stats as ss  # needed for ss.mode below
from scipy.ndimage import distance_transform_edt as edt

# parcels (2-D label array), r_parcels (pixel sizes) and the
# DataFrames df0 / df are defined earlier in my script.
for i in df0['parcels'].values:
    y, x = np.where(parcels == i)
    # crop a window around parcel i, with a 5-pixel margin
    tmp = parcels[np.min(y) - 5:np.max(y) + 6, np.min(x) - 5:np.max(x) + 6]
    dst = edt(tmp, sampling=r_parcels)
    par = tmp[dst <= 20]
    par = par[par != -9999]  # drop the nodata value
    mod, cnt = ss.mode(par)
    df['parcels'] = df['parcels'].replace(i, mod[0])
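For context, the core of each iteration is scipy's Euclidean distance transform: every nonzero cell gets the distance to the nearest zero cell, scaled by the per-axis sampling. A minimal sketch on a made-up array `a` (the values below are purely illustrative, not the real parcel data):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt as edt

# 0 marks background; each nonzero cell receives the Euclidean
# distance to the nearest zero cell.
a = np.array([[0, 1, 1, 1],
              [0, 0, 1, 1]])

# sampling gives the physical size of a pixel along each axis,
# as in the question's edt(tmp, sampling=r_parcels) call
d = edt(a, sampling=[1, 1])
# d[0, 1] == 1.0 (one pixel from the zero at (0, 0))
# d[0, 2] == sqrt(2) (diagonal to the zero at (1, 1))
```

A mask such as `a[d <= 20]` then selects the cells within a given physical distance of the background, which is exactly how `par` is built above.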

1 Answer

by (25.1k points)

You can use the Pool class from the multiprocessing package: turn the loop body into a function that returns the replacement value for one parcel, map it over the parcel ids in parallel, then apply all replacements at once.

import numpy as np
import pandas as pd
import scipy.stats as ss  # needed for ss.mode below
from scipy.ndimage import distance_transform_edt as edt
import multiprocessing as mp

def func(i):  # the body of the loop becomes a function
    y, x = np.where(parcels == i)
    tmp = parcels[np.min(y) - 5:np.max(y) + 6, np.min(x) - 5:np.max(x) + 6]
    dst = edt(tmp, sampling=r_parcels)
    par = tmp[dst <= 20]
    par = par[par != -9999]
    mod, cnt = ss.mode(par)
    return mod[0]  # return only the replacement value for parcel i

if __name__ == '__main__':
    num_workers = mp.cpu_count()
    pool = mp.Pool(num_workers)
    # compute the replacement value for every parcel in parallel
    new_vals = pool.map(func, df0['parcels'].values)
    pool.close()
    pool.join()
    # apply all replacements in a single pass instead of one
    # df.replace call per parcel
    mapping = dict(zip(df0['parcels'].values, new_vals))
    df['parcels'] = df['parcels'].replace(mapping)
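The same Pool.map-then-replace pattern can be sketched on toy data. Here NEIGHBOURS and replacement_value are invented stand-ins for the real parcel arrays and for func above; only the parallelization pattern carries over:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd
import scipy.stats as ss

# Hypothetical stand-in for the per-parcel neighbourhood: each
# parcel id maps to the values found near it (made-up data).
NEIGHBOURS = {1: [1, 1, 2], 2: [3, 3, 3], 3: [2, 2, 1]}

def replacement_value(i):
    # like func() in the answer: the mode of the neighbourhood
    res = ss.mode(NEIGHBOURS[i])
    return int(np.asarray(res.mode).ravel()[0])

if __name__ == '__main__':
    df = pd.DataFrame({'parcels': [1, 2, 3, 1]})
    with mp.Pool(2) as pool:
        new_vals = pool.map(replacement_value, [1, 2, 3])
    mapping = dict(zip([1, 2, 3], new_vals))
    df['parcels'] = df['parcels'].replace(mapping)
    print(df['parcels'].tolist())  # [1, 3, 2, 1]
```

Returning plain values and building one replacement dict keeps the inter-process traffic small (one scalar per parcel instead of a whole Series) and lets pandas do a single vectorized replace at the end.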
