
Suppose I have a multi-class sequence prediction problem with the following gold (correct) answer:

gold = [1,2,1,0,2,2]

There are N models, each of which gives different predictions:

pred1 = [1,2,2,0,2,2]

pred2 = [2,2,1,1,1,1]

pred3 = [1,2,1,0,2,1]

pred4 = [1,1,0,2,1,2]

pred5 = [2,2,1,0,1,2]

I would like to find a subset of the predictions (e.g. [pred1, pred3, pred5]) such that, taking the most common item at each position, the number of positions voted correctly is maximized.
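For concreteness, the voting objective above can be scored with NumPy roughly like this (a minimal sketch; `vote_score` is a name I made up, and ties are broken toward the smaller class label by `argmax`):

```python
import numpy as np

gold = np.array([1, 2, 1, 0, 2, 2])
preds = np.array([
    [1, 2, 2, 0, 2, 2],
    [2, 2, 1, 1, 1, 1],
    [1, 2, 1, 0, 2, 1],
    [1, 1, 0, 2, 1, 2],
    [2, 2, 1, 0, 1, 2],
])

def vote_score(rows):
    # majority-vote the chosen rows at each position, then count matches with gold
    votes = preds[rows]
    n_classes = preds.max() + 1
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    majority = counts.argmax(axis=0)  # ties go to the smaller class label
    return int((majority == gold).sum())

print(vote_score([0, 2, 4]))  # subset [pred1, pred3, pred5] -> 6 correct
```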

In the actual problem, the sequence length is > 10000 and N > 100. Is there an efficient way to find such a subset?

Currently I am simply sampling subsets at random, since I couldn't find an exact search algorithm (nor do I know whether one exists). Some heuristics can reduce the computation, e.g. removing positions where all predictions agree, but they don't change the complexity.
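The random-sampling baseline described above could look like the following sketch (the scoring helper is restated so the snippet runs on its own; the seed and the 100-sample budget are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the search is reproducible

gold = np.array([1, 2, 1, 0, 2, 2])
preds = np.array([
    [1, 2, 2, 0, 2, 2],
    [2, 2, 1, 1, 1, 1],
    [1, 2, 1, 0, 2, 1],
    [1, 1, 0, 2, 1, 2],
    [2, 2, 1, 0, 1, 2],
])

def vote_score(rows):
    # majority-vote the chosen rows per position and count matches with gold
    votes = preds[rows]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(preds.max() + 1)])
    return int((counts.argmax(axis=0) == gold).sum())

best_score, best_rows = -1, ()
for _ in range(100):
    k = int(rng.integers(1, len(preds) + 1))          # random subset size
    rows = rng.choice(len(preds), size=k, replace=False)
    score = vote_score(rows)
    if score > best_score:
        best_score, best_rows = score, tuple(sorted(rows.tolist()))

print(best_score, best_rows)
```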

If no efficient exact algorithm exists, I would also accept a solution to a relaxed problem: binary instead of multi-class prediction, so that the most common prediction is the majority (>50%) one.
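In the binary relaxation, scoring a subset reduces to thresholding a 0/1 correctness matrix: with a subset of size k, the vote at position i is correct exactly when more than k/2 of the chosen predictions are correct there. A sketch on made-up toy data (the question gives no binary example):

```python
import numpy as np

# toy binary data, invented for illustration only
gold = np.array([0, 1, 1, 0, 1])
preds = np.array([
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
])

C = (preds == gold).astype(int)  # correctness indicators, one row per model

def vote_score(rows):
    # the majority vote is correct exactly where > half of the chosen rows are correct;
    # with strict ">", exact ties (even k) count as wrong, matching the >50% requirement
    k = len(rows)
    return int((C[rows].sum(axis=0) * 2 > k).sum())

print(vote_score([0, 1, 2]))  # -> 5
```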

(Solutions with Python/Numpy native functions are appreciated)


The first step is to sort the predictions by how many positions they get correct:

gold = [1,2,1,0,2,2]

preds = [
    [1,2,2,0,2,2],
    [2,2,1,1,1,1],
    [1,2,1,0,2,1],
    [1,1,0,2,1,2],
    [2,2,1,0,1,2]
]

def count_correct(pred, goal=gold):
    # number of positions where the prediction matches the gold sequence
    return sum(1 for a, b in zip(pred, goal) if a == b)

print(sorted(preds, key=count_correct, reverse=True))
# [[1, 2, 2, 0, 2, 2], [1, 2, 1, 0, 2, 1], [2, 2, 1, 0, 1, 2], [2, 2, 1, 1, 1, 1], [1, 1, 0, 2, 1, 2]]

Then, for each prediction, compute the indices of the positions it gets correct, and try to find a set cover:

def correct_ids(pred, goal=gold):
    # indices of the positions where the prediction agrees with the gold sequence
    return [i for i, (a, b) in enumerate(zip(pred, goal)) if a == b]

print([correct_ids(pred) for pred in preds])
# [[0, 1, 3, 4, 5], [1, 2], [0, 1, 2, 3, 4], [0, 5], [1, 2, 3, 5]]
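The set-cover step could be done greedily, as in this self-contained sketch. Note it is only a heuristic starting point for the voting problem: covering a position does not by itself guarantee winning the majority vote there.

```python
gold = [1, 2, 1, 0, 2, 2]
preds = [
    [1, 2, 2, 0, 2, 2],
    [2, 2, 1, 1, 1, 1],
    [1, 2, 1, 0, 2, 1],
    [1, 1, 0, 2, 1, 2],
    [2, 2, 1, 0, 1, 2],
]

# set of positions each prediction gets right
id_sets = [{i for i, (a, b) in enumerate(zip(p, gold)) if a == b} for p in preds]

uncovered = set(range(len(gold)))
chosen = []
while uncovered:
    # greedily pick the prediction covering the most still-uncovered positions
    best = max(range(len(preds)), key=lambda j: len(id_sets[j] & uncovered))
    if not id_sets[best] & uncovered:
        break  # no prediction covers any of the remaining positions
    chosen.append(best)
    uncovered -= id_sets[best]

print(chosen)  # -> [0, 1]
```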
