in Machine Learning by (19k points)

My question: How to train a classifier with only positive and neutral data?

I am building a personalized article recommendation system for educational purposes. The data I use is from Instapaper.

Datasets

I only have positive data:

- Articles that I have read and "liked", regardless of read/unread status

And neutral data (because I have expressed interest in it, but I may not end up liking it):

- Articles that are unread
- Articles that I have marked as read but did not "like"

The data I do not have is negative data:

- Articles that I did not send to Instapaper to read later (I am not interested, although I have browsed that page/article)
- Articles that I might not even have clicked into, and that I may or may not have archived

My problem

In such a problem, negative data is basically missing. I have thought of the following solutions but have not settled on either of them yet:

1) Feed a number of negative examples to the classifier
Pros: Immediate negative data to teach the classifier
Cons: As the number of articles I like increases, the effect of the negative data on the classifier diminishes

2) Turn the "neutral" data into negative data
Pros: Now I have all the positive and (new) negative data I need
Cons: Although the neutral data is only of mild interest to me, I would still like to get recommendations for such articles, perhaps as a lower-value class

1 Answer

by (33.1k points)

There is a Spy EM algorithm that might help solve this problem.

It is a text classification system that learns from a set of positive and unlabeled examples. It is based on a "spy" technique, naive Bayes, and the EM algorithm.

The basic idea is to merge your positive set with a large set of random documents, holding out a few of the true positives as "spies". Initially, treat all the random documents as the negative class and train a naive Bayes classifier on that set. Some of the random documents will actually be positive, so relabel as positive any document that scores higher than the lowest-scoring held-out true positive document. Then repeat this process until the labels stabilize.
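
The loop described above can be sketched roughly as follows, in Python with only the standard library. This is a simplified illustration under stated assumptions, not the reference Spy EM implementation: the function names (`train_nb`, `score`, `spy_relabel`), the parameters (`spy_frac`, `max_iter`), the choice to place the spies in the negative class, and the tokenized-list input format are all hypothetical, and the EM stage of the algorithm is omitted.

```python
import math
import random
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit a two-class multinomial naive Bayes model on tokenized docs."""
    vocab = {w for d in docs for w in d}
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    model = {"vocab": vocab, "prior": {}, "loglik": {}}
    for y in (0, 1):
        # Laplace smoothing so unseen words never get zero probability.
        total = sum(counts[y].values()) + alpha * len(vocab)
        model["prior"][y] = math.log((n_docs[y] + alpha) / (len(docs) + 2 * alpha))
        model["loglik"][y] = {
            w: math.log((counts[y][w] + alpha) / total) for w in vocab
        }
    return model


def score(model, doc):
    """Log-odds that doc is positive; higher means more positive."""
    s = model["prior"][1] - model["prior"][0]
    for w in doc:
        if w in model["vocab"]:
            s += model["loglik"][1][w] - model["loglik"][0][w]
    return s


def spy_relabel(positives, unlabeled, spy_frac=0.3, max_iter=10, seed=0):
    """Return one bool per unlabeled doc: True if relabeled as positive."""
    if len(positives) < 2:
        raise ValueError("need at least two positive documents")
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    # Assumption: hold out a fraction of the positives as "spies" and hide
    # them among the documents that are treated as negative.
    n_spies = max(1, int(len(pos) * spy_frac))
    spies, train_pos = pos[:n_spies], pos[n_spies:]
    is_pos = [False] * len(unlabeled)  # start with everything as negative
    for _ in range(max_iter):
        pos_docs = train_pos + [d for d, p in zip(unlabeled, is_pos) if p]
        neg_docs = spies + [d for d, p in zip(unlabeled, is_pos) if not p]
        model = train_nb(
            pos_docs + neg_docs, [1] * len(pos_docs) + [0] * len(neg_docs)
        )
        # Threshold: the lowest score among the held-out true positives.
        threshold = min(score(model, s) for s in spies)
        new_is_pos = [
            p or score(model, d) > threshold for d, p in zip(unlabeled, is_pos)
        ]
        if new_is_pos == is_pos:  # labels have stabilized
            break
        is_pos = new_is_pos
    return is_pos


# Toy usage: "liked" articles versus a pool of random documents.
liked = [
    ["python", "ml"], ["ml", "model"], ["python", "model"],
    ["model", "data"], ["data", "python"], ["ml", "data"],
]
pool = [
    ["python", "ml", "model", "data"],
    ["football", "score", "goal"],
    ["recipe", "cake", "oven"],
    ["football", "goal", "league"],
]
relabeled = spy_relabel(liked, pool)
```

In this toy run, `relabeled` marks which documents in `pool` ended up in the positive class. For a recommender, the final `score` value could also be used to rank articles rather than only to label them, which may suit the "lower-value class" idea raised in the question.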

Hope this answer helps.
