Find indices of duplicated rows in python ndarray

Question

asked Jul 11, 2019 in Data Science by sourav (17.6k points)

I wrote a for loop that enumerates a multidimensional ndarray containing n rows of 28x28 pixel values.

For each row that is duplicated, I am looking for its index and the indices of its duplicates, without redundancies.

I found this code here (thanks unutbu) and modified it to read the ndarray. It works about 70% of the time, but about 30% of the time it identifies the wrong images as duplicates.

How can it be improved to detect the correct rows?

import collections
import numpy as np

def overlap_same(arr):
    seen = []
    dups = collections.defaultdict(list)
    for i, item in enumerate(arr):
        for j, orig in enumerate(seen):
            if np.array_equal(item, orig):
                dups[j].append(i)
                break
        else:
            seen.append(item)
    return dups

e.g. overlap_same(train) returns:

defaultdict(<type 'list'>, {34: [1388], 35: [1815], 583: [3045], 3208:
[4426], 626: [824], 507: [4438], 188: [338, 431, 540, 757, 765, 806,
808, 834, 882, 1515, 1539, 1715, 1725, 1789, 1841, 2038, 2081, 2165,
2170, 2300, 2455, 2683, 2733, 2957, 3290, 3293, 3311, 3373, 3446, 3542,
3565, 3890, 4110, 4197, 4206, 4364, 4371, 4734, 4851]})

Plotting a sample of the correct case with matplotlib:

import matplotlib.pyplot as plt

fig = plt.figure()
a = fig.add_subplot(1, 2, 1)
plt.imshow(train[35])
a.set_title('train[35]')
a = fig.add_subplot(1, 2, 2)
plt.imshow(train[1815])
a.set_title('train[1815]')
plt.show()

which is correct

However:

fig = plt.figure()
a=fig.add_subplot(1,2,1)
plt.imshow(train[3208])
a.set_title('train[3208]')
a=fig.add_subplot(1,2,2)
plt.imshow(train[4426])
a.set_title('train[4426]')
plt.show()


is incorrect as they do not match

Sample data (train[:3])

array([[[-0.5 , -0.5 , -0.5 , ..., 0.48823529,
0.5 , 0.17058824],
[-0.5 , -0.5 , -0.5 , ..., 0.48823529,
0.5 , -0.0372549 ],
[-0.5 , -0.5 , -0.5 , ..., 0.5 ,
0.47647059, -0.24509804],
...,
[-0.49215686, 0.34705883, 0.5 , ..., -0.5 ,
-0.5 , -0.5 ],
[-0.31176472, 0.44901961, 0.5 , ..., -0.5 ,
-0.5 , -0.5 ],
[-0.11176471, 0.5 , 0.49215686, ..., -0.5 ,
-0.5 , -0.5 ]],
[[-0.24509804, 0.2764706 , 0.5 , ..., 0.5 ,
0.25294119, -0.36666667],
[-0.5 , -0.47254902, -0.02941176, ..., 0.20196079,
-0.46862745, -0.5 ],
[-0.49215686, -0.5 , -0.5 , ..., -0.47647059,
-0.5 , -0.49607843],
...,
[-0.49215686, -0.49607843, -0.5 , ..., -0.5 ,
-0.5 , -0.49215686],
[-0.5 , -0.5 , -0.26862746, ..., 0.13137256,
-0.46470588, -0.5 ],
[-0.30000001, 0.11960784, 0.48823529, ..., 0.5 ,
0.28431374, -0.24117647]],
[[-0.5 , -0.5 , -0.5 , ..., -0.5 ,
-0.5 , -0.5 ],
[-0.5 , -0.5 , -0.5 , ..., -0.5 ,
-0.5 , -0.5 ],
[-0.5 , -0.5 , -0.5 , ..., -0.5 ,
-0.5 , -0.5 ],
...,
[-0.5 , -0.5 , -0.5 , ..., 0.48431373,
0.5 , 0.31568629],
[-0.5 , -0.49215686, -0.5 , ..., 0.49215686,
0.5 , 0.04901961],
[-0.5 , -0.5 , -0.5 , ..., 0.04117647,
-0.17450981, -0.45686275]]], dtype=float32)

1 Answer

Shlok Pandey · Answer 1 · 2019-07-20T05:46:54+0000

Your problem can be solved efficiently with the numpy_indexed package, which provides grouping and uniqueness operations for ndarrays.

The code below finds the unique images:

import numpy_indexed as npi
unique_training_images = npi.unique(train)
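If installing numpy_indexed is not an option, a similar result can probably be had with plain NumPy. The sketch below is not part of the original answer. It assumes train has shape (n, 28, 28); unique_images and demo are made-up names for illustration. Each image is flattened into one row so that np.unique with axis=0 compares whole images, and return_index reports where each distinct image first appears.

```python
import numpy as np

def unique_images(images):
    """Return the distinct images and the index of each one's first occurrence."""
    images = np.asarray(images)
    # Flatten every image to a single row so axis=0 compares whole images.
    flat = images.reshape(len(images), -1)
    _, first_idx = np.unique(flat, axis=0, return_index=True)
    # np.unique sorts by row content; sort the indices to restore input order.
    first_idx = np.sort(first_idx)
    return images[first_idx], first_idx

# Tiny stand-in for `train` (assumption): four 2x2 "images", image 2 repeats image 0.
demo = np.array([[[0, 1], [2, 3]],
                 [[4, 5], [6, 7]],
                 [[0, 1], [2, 3]],
                 [[8, 9], [1, 0]]], dtype=np.float32)

unique_demo, first_idx = unique_images(demo)
print(first_idx)          # indices of the images that are kept
print(unique_demo.shape)  # (number of distinct images, 2, 2)
```

The comparison is exact, so two images that differ by even one float32 rounding step count as different.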

To find all the indices that belong to each unique group, use:

indices = npi.group_by(train).split(np.arange(len(train)))
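As for why the loop in the question is only right part of the time: the key used in dups[j] is a position in seen, not in arr. The two agree only until the first duplicate is skipped. After that, seen falls behind arr and every later key points at the wrong image. That would fit the output shown, where the small keys (34, 35, 188) come before the first duplicate at index 338 and look correct, while 3208 does not. A minimal sketch of one possible fix follows, assuming all images share the same shape and dtype. The name duplicate_indices and the demo array are hypothetical. It remembers the original index of each first occurrence and uses the raw bytes of an image as a dictionary key, which also avoids the quadratic inner loop.

```python
from collections import defaultdict
import numpy as np

def duplicate_indices(images):
    """Map the index of each first occurrence to the indices of its later copies."""
    first_seen = {}           # raw bytes of an image -> index of its first occurrence
    dups = defaultdict(list)  # first-occurrence index -> indices of later duplicates
    for i, image in enumerate(images):
        key = np.asarray(image).tobytes()
        if key in first_seen:
            dups[first_seen[key]].append(i)
        else:
            first_seen[key] = i
    return dict(dups)

# Tiny stand-in for `train` (assumption): images 1 and 3 are copies of 0 and 2.
demo = np.array([[[0, 1], [2, 3]],
                 [[0, 1], [2, 3]],
                 [[4, 5], [6, 7]],
                 [[4, 5], [6, 7]]], dtype=np.float32)

result = duplicate_indices(demo)
print(result)
```

On this demo the loop from the question would file index 3 under key 1 (the position of the second distinct image in seen) rather than under its real index 2. Note that comparing bytes is slightly stricter than np.array_equal: 0.0 and -0.0 are treated as different, and identical NaN bit patterns as equal.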

