
I wrote a for loop that enumerates a multidimensional ndarray containing n images, each of 28x28 pixel values.

I am looking for the index of each image that is duplicated and the indices of its duplicates, without redundancies.

I found this code here (thanks unutbu) and modified it to read the ndarray. It works about 70% of the time, but about 30% of the time it identifies the wrong images as duplicates.

How can it be improved to detect the correct rows?

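import collections

import numpy as np

def overlap_same(arr):
    seen = []                              # first occurrences found so far
    dups = collections.defaultdict(list)   # key -> indices of later copies
    for i, item in enumerate(arr):
        for j, orig in enumerate(seen):
            if np.array_equal(item, orig):
                dups[j].append(i)
                break
        else:
            # no match in seen, so this item is new
            seen.append(item)
    return dups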

e.g. overlap_same(train) returns:

defaultdict(<type 'list'>, {34: [1388], 35: [1815], 583: [3045], 3208:
[4426], 626: [824], 507: [4438], 188: [338, 431, 540, 757, 765, 806,
808, 834, 882, 1515, 1539, 1715, 1725, 1789, 1841, 2038, 2081, 2165,
2170, 2300, 2455, 2683, 2733, 2957, 3290, 3293, 3311, 3373, 3446, 3542,
3565, 3890, 4110, 4197, 4206, 4364, 4371, 4734, 4851]})

Plotting a sample of a correct case with matplotlib gives:

fig = plt.figure()
a = fig.add_subplot(1, 2, 1)
plt.imshow(train[35])
a.set_title('train[35]')
a = fig.add_subplot(1, 2, 2)
plt.imshow(train[1815])
a.set_title('train[1815]')
plt.show()

[Figure: train data 35 vs 1815]

which is correct.

However:

fig = plt.figure()
a = fig.add_subplot(1, 2, 1)
plt.imshow(train[3208])
a.set_title('train[3208]')
a = fig.add_subplot(1, 2, 2)
plt.imshow(train[4426])
a.set_title('train[4426]')
plt.show()

This result is incorrect: the two images do not match.

Sample data (train[:3])

array([[[-0.5       , -0.5       , -0.5       , ...,  0.48823529,
          0.5       ,  0.17058824],
        [-0.5       , -0.5       , -0.5       , ...,  0.48823529,
          0.5       , -0.0372549 ],
        [-0.5       , -0.5       , -0.5       , ...,  0.5       ,
          0.47647059, -0.24509804],
        ...,
        [-0.49215686,  0.34705883,  0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.31176472,  0.44901961,  0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.11176471,  0.5       ,  0.49215686, ..., -0.5       ,
         -0.5       , -0.5       ]],

       [[-0.24509804,  0.2764706 ,  0.5       , ...,  0.5       ,
          0.25294119, -0.36666667],
        [-0.5       , -0.47254902, -0.02941176, ...,  0.20196079,
         -0.46862745, -0.5       ],
        [-0.49215686, -0.5       , -0.5       , ..., -0.47647059,
         -0.5       , -0.49607843],
        ...,
        [-0.49215686, -0.49607843, -0.5       , ..., -0.5       ,
         -0.5       , -0.49215686],
        [-0.5       , -0.5       , -0.26862746, ...,  0.13137256,
         -0.46470588, -0.5       ],
        [-0.30000001,  0.11960784,  0.48823529, ...,  0.5       ,
          0.28431374, -0.24117647]],

       [[-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        [-0.5       , -0.5       , -0.5       , ..., -0.5       ,
         -0.5       , -0.5       ],
        ...,
        [-0.5       , -0.5       , -0.5       , ...,  0.48431373,
          0.5       ,  0.31568629],
        [-0.5       , -0.49215686, -0.5       , ...,  0.49215686,
          0.5       ,  0.04901961],
        [-0.5       , -0.5       , -0.5       , ...,  0.04117647,
         -0.17450981, -0.45686275]]], dtype=float32)

1 Answer


This problem can be solved efficiently with the numpy_indexed package, which offers grouping and uniqueness operations on ndarrays.

The following code finds the unique images:

import numpy as np
import numpy_indexed as npi

unique_training_images = npi.unique(train)
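If installing an extra package is not an option, plain NumPy can do a similar job. The sketch below is an assumption-laden illustration rather than a drop-in replacement: `unique_images` and `demo` are hypothetical names, and `demo` is a tiny stand-in for `train`. It relies on `np.unique` with `axis=0`, which treats each `images[i]` as one item and compares items exactly, so images that differ by even one floating-point bit count as different.

```python
import numpy as np

def unique_images(images):
    """Return the distinct images and the index of each one's first occurrence."""
    images = np.asarray(images)
    # axis=0 makes np.unique compare whole images[i] blocks, not single pixels.
    uniq, first_idx = np.unique(images, axis=0, return_index=True)
    return uniq, first_idx

# Tiny stand-in for `train`: four 2x2 "images", two of them repeated.
a = np.zeros((2, 2), dtype=np.float32)
b = np.ones((2, 2), dtype=np.float32)
demo = np.stack([a, b, a, b])

uniq, first_idx = unique_images(demo)
print(uniq.shape, sorted(first_idx.tolist()))
```

Note that `np.unique` returns the unique images in sorted order, not in order of appearance, so use `first_idx` to map them back to positions in the input.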

To find the indices that belong to each group of identical images, use:

indices = npi.group_by(train).split(np.arange(len(train)))
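As for why the loop in the question reports wrong pairs: in `dups[j].append(i)`, `j` is a position in `seen`, not in `arr`. The two agree only until the first duplicate is skipped; after that every key is shifted. A minimal sketch of a repaired loop, assuming exact equality is the intended test, stores the original index next to each first occurrence (`first_idx` and `demo` are names introduced here for illustration):

```python
import collections

import numpy as np

def overlap_same(arr):
    """Map the index of each first occurrence to the indices of its later copies."""
    seen = []  # (original index, item) pairs, one per distinct item
    dups = collections.defaultdict(list)
    for i, item in enumerate(arr):
        for first_idx, orig in seen:
            if np.array_equal(item, orig):
                # Key by the index in arr, not by the position in seen.
                dups[first_idx].append(i)
                break
        else:
            seen.append((i, item))
    return dups

# Small stand-in for `train`: rows 2, 4 and 5 repeat rows 0, 3 and 1.
demo = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [5, 6], [3, 4]])
result = dict(overlap_same(demo))
print(result)
```

On this input the original version would file row 4 under key 2, because `[5, 6]` sits at position 2 of `seen` even though it first appears at index 3 of `arr`. The loop is still quadratic in the number of distinct images, so the grouping approach above is preferable for large arrays.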
