
Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

# 100 rows x 3000 columns of random floats (300k values)
matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Save also to CSV:

mydf.to_csv('myfile.csv', sep=':')

The result is:

  • myfile.csv is 5.6 MB big

  • myfile.h5 is 11 MB big

The difference grows bigger as the datasets get larger.

I have tried with other compression methods and levels. Is this a bug? (I am using Pandas 0.11 and the latest stable version of HDF5 and Python).

1 Answer


Your sample is really too small. HDF5 has a fair amount of fixed overhead, which dominates at small sizes (even 300k entries is on the smaller side). Floats are also represented more efficiently in binary than as text.
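To make the binary-versus-text point concrete, here is a minimal sketch using only the standard library. It is not how pandas or HDF5 store data internally; it just compares the raw size of 1000 float64 values packed as binary with the same values written out as text, using ':' as the separator as in the question's CSV.

```python
import random
import struct

random.seed(0)
values = [random.random() for _ in range(1000)]

# Binary: every float64 occupies exactly 8 bytes.
binary_size = len(struct.pack(f"{len(values)}d", *values))

# Text: repr() gives the shortest string that round-trips the value,
# and each value is followed by a one-character separator.
text_size = len(":".join(repr(v) for v in values).encode("ascii"))

print("binary bytes:", binary_size)
print("text bytes:  ", text_size)
```

The binary size is always 8 bytes per value, while the text size depends on how many digits each value needs. This only covers the raw payload; the per-file overhead that HDF5 adds is a separate effect.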

In addition, HDF5 is row-based. You get MUCH better efficiency from tables that are fairly long but not too wide. Your example (100 rows by 3000 columns) is therefore not very efficient in HDF5 at all; store it transposed in this case.
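A rough sketch of what storing it transposed could look like. The column names `c0`..`c99`, the `write_hdf` helper and the output file name are made up for illustration, and the string column from the question is left out so the transposed frame stays purely numeric. Writing the file needs the optional PyTables package, so the helper is only defined here, not called.

```python
import numpy as np
import pandas as pd

# Same shape as in the question: 100 rows by 3000 columns (short and wide).
matrix = np.random.random((100, 3000))

# Transposed layout: 3000 rows by 100 columns (long and narrow).
# The column names are hypothetical placeholders.
long_df = pd.DataFrame(matrix.T, columns=[f"c{i}" for i in range(matrix.shape[0])])


def write_hdf(df, path, key="mydf"):
    """Write df to an HDF5 file with the question's compression settings."""
    with pd.HDFStore(path, mode="w", complevel=9, complib="bzip2") as store:
        store[key] = df


print(long_df.shape)  # (3000, 100)
```

Calling `write_hdf(long_df, 'myfile_long.h5')` and comparing the file size with `os.path.getsize` against the CSV would show whether the transposed layout helps on your data. With only 300k values the fixed overhead may still be noticeable.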

...