
I asked a similar question a couple of weeks ago, but I did not ask it correctly. So I am asking it again here with more details, and I would like a more AI-oriented answer.

I have a list of product descriptions, several of which refer to more or less the same product. For instance, in the list below, most of the entries are Seagate hard drives.

1. Seagate Hard Drive 500Go

2. Seagate Hard Drive 120Go for laptop

3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive

4. New and shinny 500Go hard drive from Seagate

5. Seagate Barracuda 7200.12

6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail

7. GE Spacemaker Laudry

8. Mazda3 2010

9. Mazda3 2009 2.3L

To a human being, hard drives 3 and 5 are the same. We could go a little further and treat products 1, 3, 4 and 5 as the same, and put products 2 and 6 in other categories.

In my previous question, someone suggested that I use feature extraction. It works very well when we have a small dataset of predefined descriptions (all hard drives), but what about all the other kinds of descriptions? I don't want to start writing regex-based feature extractors for every description my application could face; that doesn't scale. Is there a machine learning algorithm that could help me achieve this? The range of descriptions I can get is very wide: line 1 could be a fridge, and the next line a hard drive. Should I take the neural network path? What should my inputs be?

Thank you for the help!

1 Answer


You should try both clustering and classification. The categories seem open-ended enough to suggest that clustering may fit the problem better. For the input representation, you can try extracting word and character n-grams. Your similarity measure could be the count of common n-grams, or something more sophisticated. You may need to label the resulting clusters manually.
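As a rough illustration of the n-gram idea, here is a minimal sketch in plain Python. It is only one possible reading of the suggestion above: the function names (`char_ngrams`, `jaccard`, `cluster`), the choice of character trigrams, the Jaccard ratio as a normalized "count of common n-grams", the single-link grouping, and the `threshold=0.2` cutoff are all assumptions made for the example, not something prescribed by the answer.

```python
import re


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a normalized description.

    Normalization (an assumption of this sketch): lowercase, keep runs of
    letters, digits and dots, and join those runs with single spaces.
    """
    normalized = " ".join(re.findall(r"[a-z0-9.]+", text.lower()))
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Shared n-grams divided by all distinct n-grams (0.0 if either set is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(descriptions, threshold=0.2, n=3):
    """Group descriptions with single-link clustering.

    Any two descriptions whose Jaccard similarity is >= threshold are put
    in the same cluster, and clusters are merged transitively (union-find).
    The threshold is a hypothetical starting value that needs tuning.
    """
    grams = [char_ngrams(d, n) for d in descriptions]
    parent = list(range(len(descriptions)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(descriptions)):
        for j in range(i + 1, len(descriptions)):
            if jaccard(grams[i], grams[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, description in enumerate(descriptions):
        clusters.setdefault(find(i), []).append(description)
    return list(clusters.values())


products = [
    "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive",
    "Seagate Barracuda 7200.12",
    "GE Spacemaker Laudry",
    "Mazda3 2010",
    "Mazda3 2009 2.3L",
]
for group in cluster(products):
    print(group)
```

The pairwise comparison is quadratic in the number of descriptions, so this only suits small lists. Single-link merging can also chain loosely related items together, so the threshold has to be tuned on real data, and the resulting groups would still need manual labels, as noted above.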


Hope this answer helps you!
