
Let's say I have a 10 GB CSV file and I want to get its summary statistics using the DataFrame describe method.

In this case I first need to create a DataFrame from all 10 GB of CSV data.

import pandas as pd

df = pd.read_csv("target.csv")  # read_csv already returns a DataFrame

df.describe()

Does this mean the entire 10 GB will be loaded into memory in order to calculate the statistics?

1 Answer


pandas.read_csv itself has no file-size limit, but by default it does load the whole file into memory, so a 10 GB CSV can easily exhaust RAM.

Instead, pass chunksize=xyz (or iterator=True) so the file is read in pieces.

You can then calculate your statistics chunk by chunk.

import pandas as pd

reader = pd.read_csv('some_data.csv', chunksize=2000)  # returns a TextFileReader, iterable in chunks of 2000 rows

partial_descs = [chunk.describe() for chunk in reader]  # describe each chunk, not the reader itself

Note that calling describe() directly on the TextFileReader raises an error; you have to iterate over it. After that, aggregate the partial results: counts, sums, mins and maxes combine directly across chunks, while means and standard deviations must be recombined using the per-chunk counts.
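The aggregation step can be sketched as below. This is a minimal example, not the only way to do it: it builds a small stand-in file (the name target.csv and the demo columns are hypothetical) and accumulates running count, sum, and sum of squares per numeric column, from which the overall mean and sample standard deviation (the same definition describe() uses) are recovered.

```python
import numpy as np
import pandas as pd

# Small demo CSV standing in for the 10 GB file (hypothetical data).
pd.DataFrame({'a': range(100), 'b': np.linspace(0.0, 1.0, 100)}).to_csv(
    'target.csv', index=False)

# Accumulators for the running statistics, one entry per numeric column.
count = total = total_sq = col_min = col_max = None

# Each chunk is an ordinary DataFrame of up to 25 rows.
for chunk in pd.read_csv('target.csv', chunksize=25):
    num = chunk.select_dtypes('number')
    if count is None:
        count, total, total_sq = num.count(), num.sum(), (num ** 2).sum()
        col_min, col_max = num.min(), num.max()
    else:
        count += num.count()
        total += num.sum()
        total_sq += (num ** 2).sum()
        col_min = np.minimum(col_min, num.min())
        col_max = np.maximum(col_max, num.max())

mean = total / count
# Sample standard deviation (ddof=1), matching DataFrame.describe().
std = np.sqrt((total_sq - count * mean ** 2) / (count - 1))

summary = pd.DataFrame({'count': count, 'mean': mean, 'std': std,
                        'min': col_min, 'max': col_max})
print(summary)
```

The trade-off of this running-sums approach is that only one chunk is ever resident in memory, at the cost of a second formula for std; quantiles (the 25%/50%/75% rows of describe()) cannot be combined this way and need an approximate streaming algorithm if you want them too.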
