0 votes
in Data Science by (17.6k points)

Let's say I have a 10 GB CSV file and I want to get its summary statistics using the DataFrame describe method.

In this case, I first need to create a DataFrame from all 10 GB of CSV data.

Does this mean that all 10 GB will be loaded into memory to calculate the statistics?

1 Answer

0 votes
by (41.4k points)

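pandas.read_csv has no built-in limit on file size, but by default it loads the entire file into memory, so yes, all of the data would have to fit in RAM.

To avoid this, pass chunksize=N (optionally together with iterator=True) so that the giant CSV file is read in chunks of N rows.

You can then calculate your statistics chunk by chunk.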

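import pandas as pd

# With chunksize set, read_csv returns a TextFileReader that yields DataFrames of up to 2000 rows each.
reader = pd.read_csv('some_data.csv', chunksize=2000)

# TextFileReader has no describe() method, so describe each chunk separately.
partial_descs = [chunk.describe() for chunk in reader]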

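After this, aggregate the partial describe results: sum the counts, take the minimum of the minimums and the maximum of the maximums, and weight each chunk's mean by its count. The overall standard deviation needs each chunk's count and mean as well as its std, and the quartiles (25%, 50%, 75%) cannot be recovered exactly from per-chunk values.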
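One way to do the aggregation step is to keep running totals per column rather than combining describe outputs afterwards. The following is only a minimal sketch under some assumptions: the helper name chunked_describe and the in-memory sample data are made up for illustration, every numeric column is assumed to parse to the same dtype in every chunk, and quartiles are left out because they cannot be combined exactly across chunks.

```python
import io

import numpy as np
import pandas as pd


def chunked_describe(source, chunksize=2000):
    """Hypothetical helper: count, mean, std, min and max per numeric column,
    computed from running totals so only one chunk is in memory at a time."""
    count = total = total_sq = col_min = col_max = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        num = chunk.select_dtypes(include="number").astype("float64")
        if count is None:
            # First chunk: initialise the running totals.
            count, total, total_sq = num.count(), num.sum(), (num ** 2).sum()
            col_min, col_max = num.min(), num.max()
        else:
            count = count + num.count()
            total = total + num.sum()
            total_sq = total_sq + (num ** 2).sum()
            # concat + min/max skips NaN from chunks where a column is empty.
            col_min = pd.concat([col_min, num.min()], axis=1).min(axis=1)
            col_max = pd.concat([col_max, num.max()], axis=1).max(axis=1)
    if count is None:
        raise ValueError("no data rows were read")
    mean = total / count
    # Sample variance (ddof=1), matching DataFrame.describe().
    var = (total_sq - total ** 2 / count) / (count - 1)
    std = np.sqrt(var.clip(lower=0))  # clip guards against tiny negative rounding errors
    return pd.DataFrame(
        {"count": count, "mean": mean, "std": std, "min": col_min, "max": col_max}
    ).T


# Made-up sample data standing in for the large CSV file.
sample = io.StringIO("a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n")
summary = chunked_describe(sample, chunksize=2)
print(summary)
```

The sum-of-squares formula for the variance is simple but can lose precision when values are large relative to their spread; a streaming method such as Welford's algorithm may be a better fit in that case.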
