
I wrote a for loop that enumerates a multidimensional ndarray containing n rows, each holding 28x28 pixel values.

I am looking for the index of each row that is duplicated, together with the indices of its duplicates, without redundancies.

I found this code here (thanks unutbu) and modified it to read the ndarray. It works 70% of the time, but the other 30% of the time it identifies the wrong images as duplicates.

How can it be improved to detect the correct rows?

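import collections
import numpy as np

def overlap_same(arr):
    seen = []  # distinct images encountered so far
    dups = collections.defaultdict(list)
    for i, item in enumerate(arr):
        for j, orig in enumerate(seen):
            if np.array_equal(item, orig):
                dups[j].append(i)
                break
        else:
            # no break: item matched nothing in seen
            seen.append(item)
    return dups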

For example, overlap_same(train) returns:

defaultdict(<type 'list'>, {34: [1388], 35: [1815], 583: [3045], 3208:
[4426], 626: [824], 507: [4438], 188: [338, 431, 540, 757, 765, 806,
808, 834, 882, 1515, 1539, 1715, 1725, 1789, 1841, 2038, 2081, 2165,
2170, 2300, 2455, 2683, 2733, 2957, 3290, 3293, 3311, 3373, 3446, 3542,
3565, 3890, 4110, 4197, 4206, 4364, 4371, 4734, 4851]})

Plotting some samples of the correct case with matplotlib gives:

import matplotlib.pyplot as plt

fig = plt.figure()
a = fig.add_subplot(1, 2, 1)
plt.imshow(train[35])
a.set_title('train[35]')
a = fig.add_subplot(1, 2, 2)
plt.imshow(train[1815])
a.set_title('train[1815]')
plt.show()

[Figure: train data 35 vs 1815]

which is correct.

However:

fig = plt.figure()
a = fig.add_subplot(1, 2, 1)
plt.imshow(train[3208])
a.set_title('train[3208]')
a = fig.add_subplot(1, 2, 2)
plt.imshow(train[4426])
a.set_title('train[4426]')
plt.show()

This pairing is incorrect: the two images do not match.

Sample data (train[:3])

array([[[-0.5       , -0.5       , -0.5       , ...,  0.48823529,
      0.5       ,  0.17058824],
    [-0.5       , -0.5       , -0.5       , ...,  0.48823529,
      0.5       , -0.0372549 ],
    [-0.5       , -0.5       , -0.5       , ...,  0.5       ,
      0.47647059, -0.24509804],
    ...,
    [-0.49215686,  0.34705883,  0.5       , ..., -0.5       ,
     -0.5       , -0.5       ],
    [-0.31176472,  0.44901961,  0.5       , ..., -0.5       ,
     -0.5       , -0.5       ],
    [-0.11176471,  0.5       ,  0.49215686, ..., -0.5       ,
     -0.5       , -0.5       ]],

   [[-0.24509804,  0.2764706 ,  0.5       , ...,  0.5       ,
      0.25294119, -0.36666667],
    [-0.5       , -0.47254902, -0.02941176, ...,  0.20196079,
     -0.46862745, -0.5       ],
    [-0.49215686, -0.5       , -0.5       , ..., -0.47647059,
     -0.5       , -0.49607843],
    ...,
    [-0.49215686, -0.49607843, -0.5       , ..., -0.5       ,
     -0.5       , -0.49215686],
    [-0.5       , -0.5       , -0.26862746, ...,  0.13137256,
     -0.46470588, -0.5       ],
    [-0.30000001,  0.11960784,  0.48823529, ...,  0.5       ,
      0.28431374, -0.24117647]],

   [[-0.5       , -0.5       , -0.5       , ..., -0.5       ,
     -0.5       , -0.5       ],
    [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
     -0.5       , -0.5       ],
    [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
     -0.5       , -0.5       ],
    ...,
    [-0.5       , -0.5       , -0.5       , ...,  0.48431373,
      0.5       ,  0.31568629],
    [-0.5       , -0.49215686, -0.5       , ...,  0.49215686,
      0.5       ,  0.04901961],
    [-0.5       , -0.5       , -0.5       , ...,  0.04117647,
     -0.17450981, -0.45686275]]], dtype=float32)

1 Answer


Your problem can be solved efficiently with the numpy_indexed package, which provides grouping and set operations on ndarrays.

The code below finds the unique images:

import numpy as np
import numpy_indexed as npi

unique_training_images = npi.unique(train)
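If installing an extra package is not an option, plain NumPy can probably do a similar job. The sketch below assumes NumPy 1.13 or later (for the axis argument of np.unique) and uses a small made-up array, demo, as a stand-in for train; neither name comes from the original post.

```python
import numpy as np

# Hypothetical stand-in for `train`: four tiny 1x2 "images".
# Images 0 and 2 are identical; the other two are distinct.
demo = np.array(
    [[[0.5, -0.5]],
     [[0.1, 0.2]],
     [[0.5, -0.5]],
     [[-0.5, -0.5]]],
    dtype=np.float32,
)

# axis=0 treats each demo[i] as one item, so whole images are compared.
# return_index gives, for every unique image, the index of its first
# occurrence in the input.
unique_images, first_index = np.unique(demo, axis=0, return_index=True)

print(unique_images.shape)   # one entry per distinct image
print(sorted(first_index))   # first-occurrence indices in the input
```

Note that np.unique returns the unique images in sorted order rather than in order of appearance, which is why first_index is needed to map them back to rows of the input.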

To find all the indices belonging to each unique group, use:

indices = npi.group_by(train).split(np.arange(len(train)))
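As for why the loop in the question reports wrong pairs: judging from the code alone, j is a position in seen, not in arr. Because seen holds only first occurrences, the two stop agreeing as soon as one duplicate has been skipped. That is consistent with the output shown, where the keys 34 and 35 (before the first skipped duplicate, 338) are right and a later key such as 3208 is not. A minimal sketch of one way to keep the original indices follows; the name find_duplicates is made up for illustration, and it compares images byte for byte, which is an assumption that differs slightly from np.array_equal (for example, 0.0 and -0.0 count as different).

```python
import collections
import numpy as np

def find_duplicates(arr):
    """Map the index of each first occurrence to the indices of its exact copies.

    Hypothetical helper, not part of any library. Images are compared by
    their raw bytes, so only bit-identical images are treated as duplicates.
    """
    first_seen = {}  # raw bytes of an image -> index in arr where it first appeared
    dups = collections.defaultdict(list)
    for i, item in enumerate(arr):
        key = np.ascontiguousarray(item).tobytes()
        if key in first_seen:
            # Record against the ORIGINAL index of the first occurrence.
            dups[first_seen[key]].append(i)
        else:
            first_seen[key] = i
    return dict(dups)

# Tiny made-up example: images 0 and 1 match, and images 2 and 3 match.
a = np.zeros((2, 2), dtype=np.float32)
b = np.ones((2, 2), dtype=np.float32)
pairs = find_duplicates(np.stack([a, a, b, b]))
print(pairs)
```

On this input the loop from the question would file index 3 under key 1 (the position of b in seen), whereas the sketch files it under key 2, the row where b first appears. Using a dict lookup also avoids comparing every image against every earlier one.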
