Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (50.2k points)

I have used rosetta.parallel.pandas_easy to parallelize apply after group by, for example:

from rosetta.parallel.pandas_easy import groupby_to_series_to_frame

df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]},index= ['g1', 'g1', 'g2'])

groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)

However, has anyone figured out how to parallelize a function that returns a dataframe? This code fails for rosetta, as expected.

def tmpFunc(df):

    df['c'] = df.a + df.b

    return df

df.groupby(df.index).apply(tmpFunc)

groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)

1 Answer

0 votes
by (108k points)

The following code you can try as it is not dependent on joblib and this works for me: 

from multiprocessing import Pool, cpu_count

def applyParallel(dfGrouped, func):

    with Pool(cpu_count()) as p:

        ret_list = p.map(func, [group for name, group in dfGrouped])

    return pandas.concat(ret_list)

This can not replace any groupby.apply(), but it will cover the typical cases: e.g. it should cover cases 2 and 3 in the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html

And you can also obtain the behavior of case 1 by giving the argument axis=1 to the final pandas.concat() call.

Browse Categories

...