
Let's say I have a 10 GB CSV file and I want to get its summary statistics using the DataFrame describe method.

In this case, I first need to create a DataFrame from all 10 GB of CSV data.

Does this mean that all 10 GB will be loaded into memory to calculate the statistics?

1 Answer

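The pandas.read_csv method itself places no limit on file size, but by default it reads the whole file into memory, so a plain read_csv call on a 10 GB file would try to load all of it.

Use iterator=True and chunksize=xyz to load the giant CSV file in chunks of xyz rows instead.

You can then calculate your statistics one chunk at a time.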

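import pandas as pd

# With chunksize set, read_csv returns a TextFileReader that yields DataFrames of up to 2000 rows each.
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=2000)

# describe() is a DataFrame method, so call it on each chunk, not on the reader itself.
partial_descs = [chunk.describe() for chunk in reader]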

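After this, aggregate the partial describe() results from all chunks. The count, min and max rows combine directly (sum, minimum and maximum), the mean needs a count-weighted average, the std needs the chunk counts and means as well, and the percentiles (25%, 50%, 75%) cannot be recovered exactly from per-chunk values.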
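
One way to do the aggregation is sketched below. It is only an illustration, not part of pandas: the helper name describe_in_chunks is made up, it assumes every chunk infers the same numeric columns, and it keeps per-chunk sums and sums of squares instead of merging describe() tables. Percentiles are left out because they cannot be combined exactly, and the sum-of-squares formula can lose precision on columns with very large values.

```python
import io

import numpy as np
import pandas as pd


def describe_in_chunks(source, chunksize=2000):
    """Hypothetical helper: count, mean, std, min and max computed chunk by chunk."""
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Assumption: numeric columns are inferred consistently in every chunk.
        num = chunk.select_dtypes(include="number").astype("float64")
        parts.append(pd.DataFrame({
            "count": num.count(),       # non-missing values per column
            "sum": num.sum(),
            "sumsq": (num ** 2).sum(),
            "min": num.min(),
            "max": num.max(),
        }))

    # Rows are indexed by column name, repeated once per chunk.
    grouped = pd.concat(parts).groupby(level=0, sort=False)
    count = grouped["count"].sum()
    mean = grouped["sum"].sum() / count
    # Sample variance (ddof=1), the same convention describe() uses for std.
    var = (grouped["sumsq"].sum() - count * mean ** 2) / (count - 1)
    return pd.DataFrame({
        "count": count,
        "mean": mean,
        "std": np.sqrt(var.clip(lower=0)),
        "min": grouped["min"].min(),
        "max": grouped["max"].max(),
    }).T


# Small in-memory stand-in for the large CSV file.
sample_csv = "a,b\n1,10\n2,20\n3,30\n4,40\n5,50\n"
summary = describe_in_chunks(io.StringIO(sample_csv), chunksize=2)
print(summary)
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the size of the file.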
