

I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not the data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.

  • What features can I use in a machine-learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.

  • If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.

  • I was thinking of using decision trees or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). 

Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)

Some context

We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).

We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.

Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.

We intend to use the Orange machine learning package.

1 Answer


According to the context you have presented, this is a supervised learning problem. Therefore, you are doing classification, not clustering.

I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to map each new token to an integer ID, and treat the presence of a token on a page as a binary feature.
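As a rough sketch of these features, assuming plain Python with only the standard library: the function names (`tokenize`, `build_vocabulary`, `presence_features`) are made up for illustration, and the `\w+` tokenizer is an assumption that only suits languages that separate words with whitespace or punctuation.

```python
import re

# Assumption: runs of Unicode word characters are good enough as tokens.
# Languages written without spaces between words would need a real segmenter.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Lowercase the text and split it into Unicode word tokens."""
    return TOKEN_RE.findall(text.lower())

def build_vocabulary(documents):
    """Map every distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for text in documents:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def presence_features(text, vocab):
    """Return the set of token IDs present in the text (unknown tokens are skipped)."""
    return {vocab[token] for token in tokenize(text) if token in vocab}
```

Building the vocabulary from the training pages only, and skipping unknown tokens at prediction time, keeps the feature space fixed once training is done.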

Next, use the simplest algorithm. Naive Bayes is a natural first choice, but if you have an easy way to run an SVM, that is fine too.
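To make the Naive Bayes option concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier over token-presence features, written in plain Python. The class name and the add-one smoothing are my own choices for illustration, not something a particular package prescribes; in practice you would more likely use the implementation in whatever toolkit you already have.

```python
import math
from collections import Counter, defaultdict

class BernoulliNaiveBayes:
    """Illustrative Bernoulli Naive Bayes; each example is a set of present features."""

    def fit(self, feature_sets, labels):
        self.vocab = set().union(*feature_sets)   # every feature seen in training
        self.class_counts = Counter(labels)       # number of pages per class
        self.total = len(labels)
        self.feature_counts = defaultdict(Counter)
        for features, label in zip(feature_sets, labels):
            # Count, per class, how many pages contain each feature.
            self.feature_counts[label].update(features)
        return self

    def _log_score(self, features, label):
        n = self.class_counts[label]
        score = math.log(n / self.total)          # log prior of the class
        for f in self.vocab:
            # Add-one smoothed estimate of P(feature present | class).
            p = (self.feature_counts[label][f] + 1) / (n + 2)
            score += math.log(p if f in features else 1 - p)
        return score

    def predict(self, features):
        """Return the class with the highest log posterior score."""
        return max(self.class_counts, key=lambda c: self._log_score(features, c))
```

The features can be integer token IDs or raw token strings; anything hashable works, since the model only checks set membership.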

Now compare your results with a baseline, for example assigning the most frequent class to every page.
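The baseline comparison can be sketched in a few lines. The helper names `majority_class` and `accuracy` are hypothetical, and the labels in the usage note are invented purely to show the calls.

```python
from collections import Counter

def majority_class(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For example, with training labels `["news", "news", "shop", "blog"]` the baseline predicts `"news"` for every page; computing `accuracy` for both the baseline predictions and your classifier's predictions on the same held-out pages shows whether the classifier has learned anything beyond the class distribution.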

If the simplest approach is not good enough, you can start iterating over algorithms and features.
