Why are pandas chunks behaving differently than the actual DataFrame?

Question

asked Jul 29, 2019 in Python by Rajesh Malhotra (19.9k points)

I want to process a CSV file on my local hard disk in chunks using pandas. My processing code works without any error when I run it on the whole dataset. The problem arises when the same code is run on the chunks.

I thought the chunks might be of a different type, so I checked type(chunk), and it is the same as type(whole_dataframe).

What I tried:

import pandas as pd

whole_data = pd.read_csv('data.csv', sep=',', header=0)
whole_data['cuisines'] = whole_data.cuisines.apply(lambda x: ','+x)

This gives me the expected result. But when I try running the same code on chunks as:

for chunk in pd.read_csv('data.csv', sep=',', header=0, chunksize=1000):
    chunk['cuisines'] = chunk.cuisines.apply(lambda x: ','+x)

This gives me an error: TypeError: can only concatenate str (not "float") to str

I expect the output to be the same as the output I got when running the code on the whole dataset.

1 Answer

Anirudh Singh · Answer 1 · 2019-07-29T13:16:17+0000

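You need to convert the values in each chunk to strings. With chunksize, pandas infers column dtypes separately for each chunk, so a chunk whose cuisines values are all missing (or all numeric) is likely being read as float rather than str. You can do it like this:

for chunk in pd.read_csv('data.csv', sep=',', header=0, chunksize=1000):
    # Cast to str first so the concatenation never sees a float
    chunk['cuisines'] = ',' + chunk.cuisines.astype(str)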
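To check whether per-chunk dtype inference is the cause, it may help to print the dtype of the column in each chunk. The sketch below uses a small in-memory CSV as a hypothetical stand-in for data.csv (the name column and the sample rows are assumptions, not the asker's data). It also shows one possible variation on the fix: passing dtype to read_csv so the column is read as strings in every chunk, then filling missing values with an empty string before prepending the comma.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; rows C and D have no cuisines value.
CSV_TEXT = "name,cuisines\nA,Italian\nB,Thai\nC,\nD,\n"

# 1) Inspect the dtype pandas infers for the column in each chunk.
#    The second chunk contains only missing values, so it is read as float64.
chunk_dtypes = [
    str(chunk['cuisines'].dtype)
    for chunk in pd.read_csv(io.StringIO(CSV_TEXT), sep=',', header=0, chunksize=2)
]
print(chunk_dtypes)

# 2) Force the column to be read as strings in every chunk, replace
#    missing values with '', and then prepend the comma.
results = []
reader = pd.read_csv(
    io.StringIO(CSV_TEXT), sep=',', header=0, chunksize=2,
    dtype={'cuisines': str},
)
for chunk in reader:
    chunk['cuisines'] = ',' + chunk['cuisines'].fillna('')
    results.append(chunk)

# Stitch the processed chunks back together for inspection.
processed = pd.concat(results, ignore_index=True)
print(processed['cuisines'].tolist())
```

Whether missing values should become an empty string, stay missing, or be dropped depends on what the later processing expects, so the fillna('') step is only one option.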
