Let's say I have a 10 GB CSV file and I want to get its summary statistics using the DataFrame describe method.

In this case, I first need to create a DataFrame from all 10 GB of CSV data.

Does this mean that all 10 GB will be loaded into memory to calculate the statistics?

1 Answer

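pandas.read_csv itself places no limit on file size, but by default it reads the whole file into memory. So yes: without chunking, all 10 GB would be loaded, and the parsed DataFrame can take even more memory than the file does on disk.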

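Use chunksize=N to read the giant CSV file in pieces. read_csv then returns an iterator (a TextFileReader) that yields DataFrames of up to N rows each; passing iterator=True as well is not necessary.

After that you can calculate your statistics one chunk at a time.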

import pandas as pd

reader = pd.read_csv('some_data.csv', chunksize=2000)  # TextFileReader: an iterator yielding DataFrames of up to 2000 rows

# The reader itself has no describe() method, so call describe() on each chunk
partial_descs = [chunk.describe() for chunk in reader]

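After this, aggregate the partial describe results. count, min and max combine directly (sum of counts, min of mins, max of maxes), and the overall mean is the count-weighted average of the chunk means. The overall std needs more than the per-chunk std values alone, and the percentiles (25%, 50%, 75%) cannot be recovered exactly from per-chunk percentiles.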
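One way to do the aggregation is to skip the per-chunk describe() and accumulate running totals instead. The following is only a sketch under some assumptions: the function name chunked_describe is made up for illustration, only numeric columns are handled, column dtypes are assumed to be inferred consistently across chunks, and percentiles are left out because they cannot be combined exactly. The sum-of-squares formula for std is simple but can lose precision when values are large relative to their spread.

```python
import io

import numpy as np
import pandas as pd


def chunked_describe(source, chunksize=2000):
    """Hypothetical helper: count/mean/std/min/max over numeric columns,
    computed chunk by chunk so the whole file never sits in memory."""
    count = total = sumsq = lo = hi = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Keep numeric columns only; float64 avoids integer overflow in squares
        num = chunk.select_dtypes(include="number").astype("float64")
        if count is None:
            count, total, sumsq = num.count(), num.sum(), (num ** 2).sum()
            lo, hi = num.min(), num.max()
        else:
            count = count + num.count()
            total = total + num.sum()
            sumsq = sumsq + (num ** 2).sum()
            # fmin/fmax ignore NaN, e.g. from a chunk where a column is all missing
            lo = np.fmin(lo, num.min())
            hi = np.fmax(hi, num.max())
    if count is None:
        raise ValueError("no data rows found")
    mean = total / count
    # Sample variance (ddof=1), matching what DataFrame.describe() reports
    var = (sumsq - total ** 2 / count) / (count - 1)
    std = np.sqrt(var.clip(lower=0))
    stats = pd.DataFrame(
        {"count": count, "mean": mean, "std": std, "min": lo, "max": hi}
    )
    return stats.T  # rows are statistics, columns are variables, like describe()


# Small in-memory demo; for a real file pass its path instead of a StringIO
demo_csv = "a,b\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))
print(chunked_describe(io.StringIO(demo_csv), chunksize=3))
```

If approximate percentiles are needed as well, they have to come from a separate approach (for example sampling rows), since this sketch does not attempt them.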

