find a subset of predictions which maximize the number of correct majority decision

Question

asked Jul 13, 2019 in Data Science by sourav (17.6k points)

Assuming I have a multi-class sequence prediction problem, with the correct answer:

gold = [1,2,1,0,2,2]

There are N models which gives different predictions:

pred1 = [1,2,2,0,2,2]
pred2 = [2,2,1,1,1,1]
pred3 = [1,2,1,0,2,1]
pred4 = [1,1,0,2,1,2]
pred5 = [2,2,1,0,1,2]

I would like to find a subset of the predictions (e.g. [pred1, pred3, pred5]), such that if I take the most common item at each position, the number of correct voting results is maximized.

In the actually problem, the length of sequence > 10000, and N > 100, is there any efficient way to find the subset?

Currently I am simply randomly sampling the subset, since I couldn't find an exact search algorithm (nor do I know if it exists). Some heuristics could help to reduce the computation, e.g. remove the unanimous predictions, but not complexity-wise.

If an efficient answer doesn't exist, I would also like to have a solution on a relaxed problem: namely binary instead of multi-class prediction, such that the most common prediction is the majority (>50%) one.

(Solutions with Python/Numpy native functions are appreciated)

1 Answer

Shlok Pandey · Answer 1 · 2019-07-20T10:52:50+0000

The first step is to sort the lists:

gold = [1,2,1,0,2,2]

preds = [
[1,2,2,0,2,2],
[2,2,1,1,1,1],
[1,2,1,0,2,1],
[1,1,0,2,1,2],
[2,2,1,0,1,2]
]

def count_correct(pred, goal=gold):
return sum(1 for a,b in zip(pred, goal) if a==b)

print(sorted(preds, key=count_correct, reverse=True))
# [[1, 2, 2, 0, 2, 2], [1, 2, 1, 0, 2, 1], [2, 2, 1, 0, 1, 2], [2, 2, 1, 1, 1, 1], [1, 1, 0, 2, 1, 2]]
# the predictions that are correct, calculate the indices for them, and then try to look for a set cover:
def correct_ids(pred, goal=gold):
return [i for i,(a,b) in enumerate(zip(pred,goal)) if a==b]
print([correct_ids(pred) for pred in preds])
# [[0, 1, 3, 4, 5], [1, 2], [0, 1, 2, 3, 4], [0, 5], [1, 2, 3, 5]]

If you wish to learn more about how to use python for data science, then go through data science python programming course by Intellipaat for more insights.

find a subset of predictions which maximize the number of correct majority decision

1 Answer

Related questions

Browse Categories