I have a huge CSV file: 600 MB with 11 million rows. I want to produce statistics from it, such as pivots, histograms, and graphs. First I tried reading the file normally:

df = pd.read_csv('Check400_900.csv', sep='\t')

But that didn't work, so I used iterator and chunksize:

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

That worked. For example, I can call print(df.get_chunk(5)), or walk through the entire file with just

for chunk in df:
    print(chunk)

Here is my problem: I don't know how to run code like the following on the entire file rather than on one chunk at a time:

plt.plot()

print(df.head())

print(df.describe())

print(df.dtypes)

customer_group3 = df.groupby('UserID')

y3 = customer_group3.size()

1 Answer

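You can build one large DataFrame by concatenating all the chunks with concat. This is necessary because, with iterator=True and chunksize set, the call

df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

does not return a DataFrame. It returns a pandas.io.parsers.TextFileReader, which yields the file chunk by chunk: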

tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)

print(tp)

#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>

df = pd.concat(tp, ignore_index=True)

I think it's important to pass ignore_index=True to concat. The reason is to avoid duplicate index values in the combined DataFrame.
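To make this concrete, here is a minimal sketch of the concat approach on a tiny in-memory table. SAMPLE_TSV, load_whole and sizes_per_user are hypothetical names made up for the illustration, and the 'Amount' column is invented; only 'UserID' comes from the question. The second function is not part of the answer above: it is a different technique, counting rows per user chunk by chunk, which may help if the concatenated DataFrame does not fit in memory.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'Check1_900.csv': a tiny tab-separated table.
# The 'Amount' column is made up; only 'UserID' appears in the question.
SAMPLE_TSV = "UserID\tAmount\n1\t10\n2\t20\n1\t30\n3\t40\n2\t50\n"


def load_whole(source, chunksize=2):
    """Read a tab-separated source in chunks and concatenate into one DataFrame."""
    reader = pd.read_csv(source, sep="\t", chunksize=chunksize)
    # The reader is iterable, so concat can consume its chunks directly.
    return pd.concat(reader, ignore_index=True)


def sizes_per_user(source, chunksize=2):
    """Count rows per UserID chunk by chunk, never holding the full table."""
    total = pd.Series(dtype="int64")
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        # Add this chunk's counts; users not seen before start from 0.
        total = total.add(chunk.groupby("UserID").size(), fill_value=0)
    return total.astype("int64")


# Approach from the answer: one DataFrame, then the usual whole-frame calls.
df = load_whole(io.StringIO(SAMPLE_TSV))
print(df.describe())
print(df.groupby("UserID").size())

# Alternative (assumption, not from the answer): aggregate incrementally.
print(sizes_per_user(io.StringIO(SAMPLE_TSV)))
```

Once df is a regular DataFrame, calls such as df.head(), df.dtypes and df.groupby('UserID').size() work on the whole file. For the real file you would pass the file name instead of the StringIO object and a larger chunksize.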
