
I have a 1.5 GB list of pandas DataFrames.

Which is the better approach for loading this data in Python: pickle (via cPickle), HDF5, or something else?

First, it is fine if "dumping" the data takes a long time, since I only do this once.

I am also not concerned with file size on disk.

Question: What I am concerned about is loading the data into memory as quickly as possible.

1 Answer


I would consider only two storage formats: HDF5 (PyTables) and Feather.
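Since the question is about a list of DataFrames, while both formats store one frame per key or per file, here is a minimal sketch of one way to round-trip such a list. It assumes PyTables (for HDF5) and pyarrow (for Feather) are installed; the helper names and the zero-padded `df_00000` naming scheme are hypothetical, chosen only for illustration.

```python
import os

import pandas as pd


def save_frames_hdf(frames, path):
    """Write each DataFrame under its own key in one HDF5 file (needs PyTables)."""
    with pd.HDFStore(path, mode="w") as store:
        for i, df in enumerate(frames):
            # Zero-padded keys (hypothetical naming) so sorting restores list order.
            store.put(f"df_{i:05d}", df, format="fixed")


def load_frames_hdf(path):
    """Read every key back, in sorted key order, as a list of DataFrames."""
    with pd.HDFStore(path, mode="r") as store:
        return [store[key] for key in sorted(store.keys())]


def save_frames_feather(frames, directory):
    """Write one Feather file per DataFrame (needs pyarrow).

    Feather expects a default RangeIndex and string column names,
    so call reset_index() first if your frames use a custom index.
    """
    os.makedirs(directory, exist_ok=True)
    for i, df in enumerate(frames):
        df.to_feather(os.path.join(directory, f"df_{i:05d}.feather"))


def load_frames_feather(directory):
    """Read all .feather files in the directory, in sorted file-name order."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".feather"))
    return [pd.read_feather(os.path.join(directory, n)) for n in names]
```

If all frames share the same columns, concatenating them into one DataFrame before saving may be simpler, but that depends on your data.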

Here are the results of my read and write comparison for a DataFrame of shape 4,000,000 x 6 (183.1 MB in memory, 492 MB as uncompressed CSV).

The comparison covers the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 with various compression settings. The read_s and write_s columns are times in seconds:

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

Your results may differ, because all of my data was of the datetime dtype. It is always better to run such a comparison on your real data, or at least on similar data.
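To run a comparison like this on your own data, something along the following lines could serve as a starting point. It is a rough sketch, not the script that produced the table above: the `time_call`, `benchmark`, and `make_sample` helpers are hypothetical names, the HDF5 and Feather rows are skipped when PyTables or pyarrow is not installed, and repeated reads hit the OS file cache, so treat the numbers as relative rather than absolute.

```python
import os
import time

import numpy as np
import pandas as pd


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def make_sample(n_rows, n_cols=6, seed=0):
    """Build a synthetic all-datetime frame (a stand-in for your real data)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        f"c{i}": pd.to_datetime(rng.integers(0, 10**9, size=n_rows), unit="s")
        for i in range(n_cols)
    })


def benchmark(df, directory):
    """Time one write and several reads of `df` for each available format."""
    candidates = {
        "pickle": (lambda p: df.to_pickle(p), pd.read_pickle),
        "hdf_fixed": (
            lambda p: df.to_hdf(p, key="df", mode="w", format="fixed"),
            lambda p: pd.read_hdf(p, key="df"),
        ),
        "feather": (lambda p: df.to_feather(p), pd.read_feather),
    }
    rows = []
    for name, (write, read) in candidates.items():
        path = os.path.join(directory, f"sample.{name}")
        try:
            write_s = time_call(lambda: write(path), repeats=1)
            read_s = time_call(lambda: read(path))
        except ImportError:
            continue  # optional dependency (PyTables or pyarrow) is missing
        rows.append({
            "storage": name,
            "read_s": read_s,
            "write_s": write_s,
            "size_mb": os.path.getsize(path) / 1e6,
        })
    return pd.DataFrame(rows).set_index("storage")
```

Calling `benchmark(your_df, some_directory)` returns a small DataFrame indexed by storage format; replace `make_sample` with one of your real frames to get numbers that reflect your dtypes.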
