
0 votes
2 views
in Data Science by (18.4k points)
I'm fairly new to Python and data science.

I have a 33 GB CSV dataset, and I want to load it into a DataFrame to do some work on it.

I tried to do it the 'casual' way with pandas.read_csv, and it's taking ages to parse.

I searched on the internet and found this article.

It says that the most efficient way to read a large CSV file is to use csv.DictReader.

So I tried to do that:
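
Something along these lines (just a rough sketch of the csv.DictReader approach, not my exact code, and the file path is a placeholder):

import csv
import pandas

filename = "data.csv"
with open(filename, newline="") as f:
    reader = csv.DictReader(f)   # yields one dict per row, keyed by the header names
    rows = list(reader)          # this still keeps every row in memory

df = pandas.DataFrame(rows)      # every column ends up as object/string dtype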

Can anyone tell me what's the most efficient way to parse a large dataset into pandas?

1 Answer

0 votes
by (36.8k points)
edited by

There is no way to read such a big file quickly, but there are several strategies for dealing with data of this size. These are some that let you implement your code without leaving the comfort of pandas (each one is sketched below):

Sampling

Chunking

Optimising Pandas dtypes

Parallelising Pandas with Dask.

The simplest option is sampling your dataset, and it may be all you need. Sometimes a random subset of a large dataset already contains enough information for the subsequent calculations, so if you don't actually need to process the entire dataset, this is an excellent technique to use. Sample code:

import pandas
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1             # number of data rows in the file (excluding the header)
s = n // 10                                           # size of the sample, here roughly 10% of the rows
skip = sorted(random.sample(range(1, n + 1), n - s))  # data rows to skip; row 0 (the header) is always kept
df = pandas.read_csv(filename, skiprows=skip)
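
If you actually need all of the rows, chunking is the usual next step: pandas.read_csv can return an iterator of smaller DataFrames via its chunksize argument, so you process the file piece by piece and only combine the (much smaller) results. A minimal sketch, assuming you only want an aggregate of one column ("value" is a placeholder name, not from your data):

import pandas

filename = "data.csv"
partial_sums = []
for chunk in pandas.read_csv(filename, chunksize=1_000_000):  # each chunk is a DataFrame of ~1 million rows
    partial_sums.append(chunk["value"].sum())                 # "value" is a placeholder column name
total = sum(partial_sums)

The key point is that only one chunk lives in memory at a time, so peak memory use is bounded by the chunk size rather than the file size.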

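Optimising pandas dtypes matters a lot at this scale: by default read_csv loads numbers as 64-bit and strings as Python objects, so declaring narrower numeric types and the category dtype for low-cardinality text columns can shrink the DataFrame several times over. A sketch, where the column names "id", "price" and "country" are purely illustrative:

import pandas

filename = "data.csv"
dtypes = {
    "id": "int32",          # instead of the default int64
    "price": "float32",     # instead of the default float64
    "country": "category",  # repeated strings are stored only once
}
df = pandas.read_csv(filename, dtype=dtypes, usecols=list(dtypes))  # usecols also drops columns you never use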
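Finally, Dask gives you a DataFrame with a pandas-like API that is split into many partitions and evaluated lazily and in parallel, so the whole 33 GB never has to sit in memory at once. A minimal sketch (the groupby and value columns are again placeholders):

import dask.dataframe as dd

filename = "data.csv"
ddf = dd.read_csv(filename, blocksize="256MB")    # lazily splits the CSV into ~256 MB partitions
result = ddf.groupby("country")["price"].mean()   # builds a task graph; nothing is read yet
print(result.compute())                           # compute() runs the graph across the partitions in parallel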
