0 votes
in AI and Deep Learning by (50.2k points)

I am currently working on a neural network-based approach to short document classification. Since the documents I am working with are usually only around ten words long, standard statistical document classification methods are of limited use. Because of this, I am attempting to implement some form of automated synonym detection for the matches provided in the training data. More specifically, my question is about resolving a situation such as the following:

Say I have a classification of "Involving Food" and one of "Involving Spheres", and a data set as follows:

"Eating Apples" (Food); "Eating Marbles" (Spheres);

"Eating Oranges" (Food, Spheres); "Throwing Baseballs" (Spheres);

"Throwing Apples" (Food); "Throwing Balls" (Spheres);

"Spinning Apples" (Food); "Spinning Baseballs";

I am looking for an incremental method that would move towards the following linkages:

Eating --> Food 

Apples --> Food 

Marbles --> Spheres 

Oranges --> Food, Spheres 

Throwing --> Spheres 

Baseballs --> Spheres 

Balls --> Spheres 

Spinning --> Neutral 

Involving --> Neutral

I do realize that in this specific case some of these are slightly suspect matches, but they illustrate the problem I am having. My first thought was to increment a word's score for a category whenever it appears alongside the words of that category, but then I would end up incidentally linking everything to the word "Involving". I then thought I could instead decrement a word for appearing in conjunction with multiple synonyms, or with non-synonyms, but then I would lose the link between "Eating" and "Food". Does anyone have any idea how I could put together an algorithm that would move me in the direction indicated above?
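To make the failure mode concrete, the naive increment scheme I had in mind looks something like the sketch below (the phrases and labels follow the data set above; this is just an illustration, not a proposed solution):

```python
# Naive co-occurrence scoring: increment a word's score for a category
# whenever it appears in a document carrying that label.
# "Spinning Baseballs" is left unlabeled, as in the data set above.
from collections import defaultdict

data = [
    ("Eating Apples", {"Food"}),
    ("Eating Marbles", {"Spheres"}),
    ("Eating Oranges", {"Food", "Spheres"}),
    ("Throwing Baseballs", {"Spheres"}),
    ("Throwing Apples", {"Food"}),
    ("Throwing Balls", {"Spheres"}),
    ("Spinning Apples", {"Food"}),
    ("Spinning Baseballs", set()),
]

scores = defaultdict(lambda: defaultdict(int))
for phrase, labels in data:
    for word in phrase.lower().split():
        for label in labels:
            scores[word][label] += 1

# "apples" links strongly to Food, but a word that co-occurs with every
# label (like "Involving" in the category names) would score everywhere.
print(dict(scores["apples"]))    # {'Food': 3}
print(dict(scores["throwing"]))  # Spheres outweighs Food, but only just
```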

1 Answer

0 votes
by (108k points)

Classification is the task of assigning category labels, taken from a predefined set of categories (classes), to instances in a dataset. In the classical supervised learning paradigm, the task is approached by providing a learning algorithm with a training set of manually labeled examples. In practice, it is not always easy to apply this scheme to Natural Language Processing (NLP) tasks. For example, supervised systems for text categorization (TC) require large amounts of hand-labeled text, and in many applied settings it is quite difficult to collect hand-labeled data. On the other hand, unlabeled text collections are generally easy to obtain, even under the requirement that they come from the same distribution from which the test documents are sampled.

You can follow an unsupervised bootstrapping approach. The algorithm goes as follows:

The key assumption is that if two words are synonyms, they will appear in similar contexts in your collection of data.

Maintain two lists: the first contains words that accompany food items, and the second contains the food items themselves.

In the supervised part, you seed one of the lists; for example, you can place the word "orange" on the food-item list. Then let the computer continue on its own.

In the unsupervised part, the algorithm first finds all the words in the collection that appear just before "orange" and sorts them by frequency of occurrence. It then takes the top two words from the sorted list and adds them to the list of words that accompany food items. For example, "eating" and "delicious" might be the top two.

Now use that list to find the next top food words, by ranking the words that appear immediately to the right of each word in the list. Continue expanding each list in this way until you are happy with the results.

As you proceed, you may have to remove entries from the lists that are clearly not suitable.
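The loop described above can be sketched as follows. The toy corpus and the seed word are illustrative assumptions; a real run would use your own training phrases:

```python
# Sketch of the bootstrapping loop: alternate between expanding the
# context-word list (left neighbors of known food words) and the
# food-word list (right neighbors of known context words).
from collections import Counter

corpus = [
    "eating apples", "eating marbles", "eating oranges",
    "throwing baseballs", "throwing apples", "throwing balls",
    "spinning apples", "spinning baseballs", "delicious oranges",
]

def ranked_neighbors(targets, side, n=2):
    """Rank words appearing immediately left/right of any target word."""
    counts = Counter()
    for phrase in corpus:
        tokens = phrase.split()
        for i, tok in enumerate(tokens):
            if tok in targets:
                j = i - 1 if side == "left" else i + 1
                if 0 <= j < len(tokens):
                    counts[tokens[j]] += 1
    return [w for w, _ in counts.most_common(n)]

food_words = ["oranges"]  # supervised seed ("orange", pluralized to match the corpus)
context_words = []        # words that accompany food items

for _ in range(2):  # a couple of bootstrap rounds
    for w in ranked_neighbors(food_words, "left"):
        if w not in context_words:
            context_words.append(w)
    for w in ranked_neighbors(context_words, "right"):
        if w not in food_words:
            food_words.append(w)

print(context_words)  # "throwing" sneaks in here, and would be pruned by hand
print(food_words)
```

Note that a noisy context word like "throwing" can enter the context list through "throwing apples"; this is exactly the kind of entry the final manual-pruning step is meant to remove.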

For more information, you can refer to the paper "Improving Text Categorization Bootstrapping via Unsupervised Learning" at the following link:
