in Big Data Hadoop & Spark by (11.4k points)

NumPy seems to lack built-in support for 3-byte and 6-byte types, aka uint24 and uint48. I have a large data set using these types and want to feed it to numpy. What I currently do (for uint24):

import numpy as np
dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
# I would like to be able to write
#  dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
#  dt = np.dtype([('head', '<u2'), ('data', '<u6')])
a = np.memmap("filename", mode='r', dtype=dt)
# convert 3 x 2-byte words to 2 x 3-byte values
# w1 is the least significant word, w3 the most significant
w1, w2, w3 = a['data'].swapaxes(0, 1)
a2 = np.empty((2, a.size), dtype='u4')
# low 3 bytes: the low byte of w2 sits above w1
a2[0] = w2 % 256
a2[0] <<= 16
a2[0] += w1
# high 3 bytes: w3 sits above the high byte of w2
a2[1] = w3
a2[1] <<= 8
a2[1] += w2 >> 8
# a2 now holds the "uint24" matrix

While this works for a 100 MB input, it looks inefficient (think of hundreds of GB of data). Is there a more efficient way? For example, a special kind of read-only view that masks part of the data would be useful (a kind of "uint64 with the two most significant bytes always zero" type). I only need read-only access to the data.

1 Answer

by (32.3k points)

As far as I know, there is no direct way to do what you're asking for (it would require unaligned access, which is highly inefficient on some architectures). But here is an efficient way to copy the data into an in-process array, shown for big-endian 6-byte (uint48) values:

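# map the file as raw bytes
a = np.memmap("filename", mode='r', dtype=np.dtype('>u1'))
# one zeroed 8-byte slot per 6-byte value (integer division, so the shape is an int)
e = np.zeros(a.size // 6, np.dtype('>u8'))
# copy the three 16-bit words of each value into words 1..3 of its slot;
# word 0 (the most significant) stays zero
for i in range(3):
    e.view(dtype='>u2')[i + 1::4] = a.view(dtype='>u2')[i::3]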
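
To make the word shuffling concrete, here is a small self-contained sketch of the same copy on synthetic data. The in-memory `raw` bytes stand in for the memory-mapped file, and the sample `values` are made up for illustration; the layout is assumed to be packed big-endian uint48 with no header fields.

```python
import numpy as np

# Synthetic stand-in for the file: three packed big-endian uint48 values
# (hypothetical sample data, not taken from the question).
values = [1, 0xFFFFFFFFFFFF, 0x123456789ABC]
raw = b"".join(v.to_bytes(6, "big") for v in values)
a = np.frombuffer(raw, dtype=">u1")

# One zero-initialised 8-byte slot per 6-byte input value.
e = np.zeros(a.size // 6, dtype=">u8")

# View both arrays as 16-bit words: each input value spans 3 words,
# each output slot spans 4 words, and the first word of a slot stays zero.
src = a.view(">u2")
dst = e.view(">u2")
for i in range(3):
    dst[i + 1::4] = src[i::3]

print(e.tolist())
```

Only three strided assignments are executed regardless of the input size, so the per-element work stays inside NumPy rather than in a Python loop. For little-endian data the zero word would have to go at the other end of each slot.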

You can get unaligned access using the strides constructor parameter:

# the buffer is assumed to be the byte-level memmap `a` from above
e = np.ndarray((a.size - 2) // 6, np.dtype('<u8'), a, strides=(6,))

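However, with this layout each 8-byte element overlaps the next one by two bytes, so to actually use it you have to mask out the two high bytes on access. Note also that the element count (a.size - 2) // 6 leaves out the last value whenever the buffer ends exactly on a 6-byte boundary, because an 8-byte read there would run past the end.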
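
A minimal sketch of the strided view plus masking, again on synthetic data. The assumptions: the values are packed little-endian, the buffer holds nothing else, and the sample values are invented for illustration. The uint24 variant at the end is an extrapolation of the same idea rather than something stated in the answer.

```python
import numpy as np

# Synthetic packed little-endian uint48 data (hypothetical sample values).
values = [1, 0xFFFFFFFFFFFF, 0x123456789ABC, 42]
raw = b"".join(v.to_bytes(6, "little") for v in values)
a = np.frombuffer(raw, dtype="u1")

# Overlapping view: 8-byte elements placed every 6 bytes. The last packed
# value is not covered here, since reading 8 bytes from its offset would
# run past the end of the buffer.
n = (a.size - 2) // 6
view48 = np.ndarray(n, dtype="<u8", buffer=a, strides=(6,))

# The two high bytes of each element belong to the next value: mask them off.
decoded48 = view48 & np.uint64(0xFFFFFFFFFFFF)

# Same idea for uint24: 4-byte elements every 3 bytes, keep the low 24 bits.
values24 = [1, 0xABCDEF, 7]
raw24 = b"".join(v.to_bytes(3, "little") for v in values24)
b = np.frombuffer(raw24, dtype="u1")
view24 = np.ndarray((b.size - 1) // 3, dtype="<u4", buffer=b, strides=(3,))
decoded24 = view24 & np.uint32(0xFFFFFF)

print(decoded48.tolist(), decoded24.tolist())
```

The view itself copies nothing; the masking step is what allocates a new array, so on very large files it is probably best applied to slices of the view rather than to the whole thing. The final value can be picked up separately, for example by decoding the last few bytes by hand.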
