
0 votes
2 views
in Data Science by (18.4k points)
I'm fairly new to Python and data science.

I have a 33 GB CSV dataset, and I want to load it into a DataFrame to do some work on it.

I tried to do it the 'casual' way with pandas.read_csv, and it's taking ages to parse.

I searched on the internet and found this article.

It says that the most efficient way to read a large CSV file is to use csv.DictReader.

So I tried to do that:
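
Something along these lines (just a rough sketch of the csv.DictReader approach, not my exact code, and the file path is a placeholder):

import csv
import pandas

filename = "data.csv"
with open(filename, newline="") as f:
    reader = csv.DictReader(f)   # yields one dict per row, keyed by the header names
    rows = list(reader)          # this still keeps every row in memory

df = pandas.DataFrame(rows)      # every column ends up as object/string dtype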

Can anyone tell me what's the most efficient way to parse a large dataset into pandas?

1 Answer

0 votes
by (36.8k points)
edited by

There is no way to read such a big file quickly, but there are several strategies for dealing with data of this size. These are some that let you implement your code without leaving the comfort of pandas (each one is sketched below):

Sampling

Chunking

Optimising Pandas dtypes

Parallelising Pandas with Dask.

The simplest option is sampling your dataset, and it may be all you need. Sometimes a random subset of a large dataset already contains enough information for the subsequent calculations, so if you don't actually need to process the entire dataset, this is an excellent technique to use. Sample code:

import pandas
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1             # number of data rows in the file (excluding the header)
s = n // 10                                           # size of the sample, here roughly 10% of the rows
skip = sorted(random.sample(range(1, n + 1), n - s))  # data rows to skip; row 0 (the header) is always kept
df = pandas.read_csv(filename, skiprows=skip)
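
If you actually need all of the rows, chunking is the usual next step: pandas.read_csv can return an iterator of smaller DataFrames via its chunksize argument, so you process the file piece by piece and only combine the (much smaller) results. A minimal sketch, assuming you only want an aggregate of one column ("value" is a placeholder name, not from your data):

import pandas

filename = "data.csv"
partial_sums = []
for chunk in pandas.read_csv(filename, chunksize=1_000_000):  # each chunk is a DataFrame of ~1 million rows
    partial_sums.append(chunk["value"].sum())                 # "value" is a placeholder column name
total = sum(partial_sums)

The key point is that only one chunk lives in memory at a time, so peak memory use is bounded by the chunk size rather than the file size.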

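Optimising pandas dtypes matters a lot at this scale: by default read_csv loads numbers as 64-bit and strings as Python objects, so declaring narrower numeric types and the category dtype for low-cardinality text columns can shrink the DataFrame several times over. A sketch, where the column names "id", "price" and "country" are purely illustrative:

import pandas

filename = "data.csv"
dtypes = {
    "id": "int32",          # instead of the default int64
    "price": "float32",     # instead of the default float64
    "country": "category",  # repeated strings are stored only once
}
df = pandas.read_csv(filename, dtype=dtypes, usecols=list(dtypes))  # usecols also drops columns you never use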
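Finally, Dask gives you a DataFrame with a pandas-like API that is split into many partitions and evaluated lazily and in parallel, so the whole 33 GB never has to sit in memory at once. A minimal sketch (the groupby and value columns are again placeholders):

import dask.dataframe as dd

filename = "data.csv"
ddf = dd.read_csv(filename, blocksize="256MB")    # lazily splits the CSV into ~256 MB partitions
result = ddf.groupby("country")["price"].mean()   # builds a task graph; nothing is read yet
print(result.compute())                           # compute() runs the graph across the partitions in parallel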
