in Data Science by (50.2k points)

The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes and/or characteristics.

This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.

I am now trying to come up with similar functionality in Python 3.6. So far, I have looked online and consulted several posts, but none of them exactly addresses what I want to do.

For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.

For example:

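import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

# The dictionary is saved as a pickled 0-d object array
np.savez("whatever_name.npz", a=a, d=d)

# Newer NumPy versions need allow_pickle=True to read the dictionary back
data = np.load("whatever_name.npz", allow_pickle=True)
arr = data['a']
dic = data['d'].item()  # unwrap the 0-d object array into the original dict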

However, the question remains:

Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?

I am particularly interested in hearing about the particular pros and cons of any suggestions you may have with examples. The fewer dependencies, the better.

1 Answer

by (108k points)

The advantages of HDF5 are that it is portable (files can be read outside of Python), supports native compression, offers out-of-memory capabilities, and has metadata support. When storing a DataFrame in an HDF5 file, you can also store a dictionary of metadata alongside it. This can hold your notes to self, or actual metadata that does not need to live in the DataFrame itself. You can use it for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}. In this context, there is no difference between using a NumPy array and a DataFrame.
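As a rough sketch of the idea, here is how the array from the question could be written with h5py, with each note attached as an HDF5 attribute of the dataset. The file name, the dataset name "a", and the use of a temporary directory are assumptions made only for illustration. Note that this needs h5py as an extra dependency, and that attribute values come back as NumPy scalars rather than plain Python types.

```python
import os
import tempfile

import h5py
import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
notes = {"is_agl": True, "scale_factor": 100, "already_corrected": False}

# Hypothetical file location, chosen only for this example
path = os.path.join(tempfile.mkdtemp(), "whatever_name.h5")

# Write the array with gzip compression and attach each note as an attribute
with h5py.File(path, "w") as f:
    dset = f.create_dataset("a", data=a, compression="gzip")
    for key, value in notes.items():
        dset.attrs[key] = value

# Read the array and its notes back
with h5py.File(path, "r") as f:
    arr = f["a"][...]
    restored = {key: f["a"].attrs[key] for key in f["a"].attrs}
```

Attributes work best for small scalar or string values. A nested dictionary would probably have to be serialized first (for example to a JSON string) before being stored this way.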

You can refer to the h5py documentation for details:

http://docs.h5py.org/en/latest/index.html
