Saving in a file an array or DataFrame together with other information

Question

asked Sep 12, 2019 in Data Science by ashely (50.2k points)

The statistical software Stata allows short text snippets to be saved within a dataset. This is accomplished using notes, characteristics, or both.

This is a feature of great value to me as it allows me to save a variety of information, ranging from reminders and to-do lists to information about how I generated the data, or even what the estimation method for a particular variable was.

I am now trying to come up with similar functionality in Python 3.6. So far, I have looked online and consulted several posts, which however do not exactly address what I want to do.

For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.

For example:

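import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}
np.savez("whatever_name.npz", a=a, d=d)  # the dict is stored as a pickled 0-d object array
data = np.load("whatever_name.npz", allow_pickle=True)  # recent NumPy versions refuse pickled objects by default
arr = data['a']
dic = data['d'].item()  # unwrap the 0-d object array back into a dict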

However, the question remains:

Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?

I am particularly interested in hearing about the particular pros and cons of any suggestions you may have with examples. The fewer dependencies, the better.

1 Answer

vinita · Answer 1 · 2019-09-12T14:08:37+0000

HDF5 has several advantages: it is portable (files can be read outside of Python), it offers native compression, it can work with data that does not fit in memory, and it supports metadata. When you store a DataFrame in an HDF5 file, you can also store a dictionary of metadata alongside it, which can hold your notes to self or actual metadata that does not need to live in the DataFrame itself. You can use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}. In this context, there is no difference between using a NumPy array and a DataFrame.

You can refer to the following link for details:

http://docs.h5py.org/en/latest/index.html
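
As a rough illustration of the idea, here is a minimal sketch using h5py, which attaches metadata to a dataset through its attrs mapping. The file name example.h5, the dataset name "a", and the "note" entry are made up for this example, and the sketch assumes the metadata values are simple scalars or strings (nested dictionaries would need to be serialized first, for instance as JSON).

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.array([[2, 4], [6, 8], [10, 12]])
# Hypothetical metadata; flag names are taken from the answer above.
notes = {"is_agl": True, "scale_factor": 100, "already_corrected": False}

# Write to a temporary directory so the sketch leaves no file behind in the cwd.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Store the array as a compressed dataset.
    dset = f.create_dataset("a", data=arr, compression="gzip")
    # Attach each metadata entry as an HDF5 attribute on the dataset.
    for key, value in notes.items():
        dset.attrs[key] = value
    dset.attrs["note"] = "generated for illustration"

with h5py.File(path, "r") as f:
    loaded = f["a"][...]  # read the full array back into memory
    # Attributes come back as NumPy scalars or str, keyed by name.
    loaded_notes = {key: f["a"].attrs[key] for key in f["a"].attrs}

print(loaded)
print(loaded_notes)
```

The trade-off is an extra dependency (h5py) in exchange for a file that tools outside Python can open and that does not rely on pickling.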
