
I have a big table in my database containing a lot of words from various texts, stored in text order. I want to find how often a given set of words appears together.

Example: Suppose these 4 words appear in many texts: United | States | of | America. The result I want would be:

United States: 50

United States of: 45

United States of America: 40

(This is only an example with 4 words, but there can be fewer or more than 4.)

Is there an algorithm that can do this, or something similar?

Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.

Table Structure

I have two tables. Token has id and text. The text is UNIQUE, and each entry in this table represents a different word.

TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text.

It has textblockid, the block of text the token belongs to; sentence, the sentence the token is in; position, the token's position inside the sentence; and tokenid, a reference to the Token table.

1 Answer


This is a typical use case for Markov chains. Estimate a Markov model from your text corpus and look for high probabilities in the transition table. Since each entry is the probability that one word will follow another, phrases show up as high transition probabilities.
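
As a rough illustration of the transition table, here is a minimal Python sketch that estimates first-order probabilities P(next word | current word) from tokenized sentences. The function name `transition_table` and the three sample sentences are made up for this example; real input would come from your own tables.

```python
from collections import Counter


def transition_table(sentences):
    """Estimate first-order transition probabilities P(next | current).

    `sentences` is a list of token lists. Pairs are only formed inside a
    sentence, so the last word of one sentence never links to the first
    word of the next.
    """
    pair_counts = Counter()   # how often (current, next) occurs
    start_counts = Counter()  # how often `current` is followed by any word
    for tokens in sentences:
        for current, nxt in zip(tokens, tokens[1:]):
            pair_counts[(current, nxt)] += 1
            start_counts[current] += 1
    return {
        pair: count / start_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Hypothetical sample data, for illustration only.
sentences = [
    ["United", "States", "of", "America"],
    ["United", "States", "of", "Mexico"],
    ["United", "Nations"],
]
table = transition_table(sentences)

# List the transitions from most to least probable.
for (current, nxt), prob in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{current} -> {nxt}: {prob:.2f}")
```

With this sample, "States -> of" has probability 1.0 and "United -> States" about 0.67, so the pairs that form a fixed phrase rise to the top of the list.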

By also counting the number of times the phrase-start word appears in the texts, you can derive absolute numbers: multiplying that count by a transition probability gives back the number of times the corresponding word pair occurred.

Also, the following link describes algorithms for pattern recognition in several sequences (consensus and alignment):
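
Absolute phrase counts can also be read directly from the two tables in the question with a self-join on consecutive positions. The sketch below is only one possible way to do it, not a tuned solution. It uses Python's built-in `sqlite3` module with an in-memory database; the column types, the sample rows, the helper name `phrase_count`, and the assumption that `position` is a gap-free integer inside each sentence are all mine, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema modeled on the question; the column types are assumptions.
conn.executescript("""
CREATE TABLE Token (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
CREATE TABLE TextBlockHasToken (
    textblockid INTEGER, sentence INTEGER, position INTEGER, tokenid INTEGER
);
""")

# Hypothetical sample data: (textblockid, sentence, words).
sample = [
    (1, 1, ["United", "States", "of", "America"]),
    (1, 2, ["United", "States", "of", "Mexico"]),
    (2, 1, ["United", "Nations"]),
]
for block, sentence, words in sample:
    for position, word in enumerate(words):
        conn.execute("INSERT OR IGNORE INTO Token (text) VALUES (?)", (word,))
        token_id = conn.execute(
            "SELECT id FROM Token WHERE text = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO TextBlockHasToken VALUES (?, ?, ?, ?)",
            (block, sentence, position, token_id),
        )


def phrase_count(conn, words):
    """Count how often `words` occur consecutively inside one sentence."""
    joins, conditions = [], []
    for i in range(len(words)):
        if i > 0:
            # Row i must sit in the same sentence, i positions after row 0.
            joins.append(
                f"JOIN TextBlockHasToken h{i} "
                f"ON h{i}.textblockid = h0.textblockid "
                f"AND h{i}.sentence = h0.sentence "
                f"AND h{i}.position = h0.position + {i}"
            )
        joins.append(f"JOIN Token t{i} ON t{i}.id = h{i}.tokenid")
        conditions.append(f"t{i}.text = ?")
    sql = (
        "SELECT COUNT(*) FROM TextBlockHasToken h0 "
        + " ".join(joins)
        + " WHERE "
        + " AND ".join(conditions)
    )
    # The words themselves are passed as bound parameters, not spliced in.
    return conn.execute(sql, list(words)).fetchone()[0]


phrase = ["United", "States", "of", "America"]
for n in range(2, len(phrase) + 1):
    print(" ".join(phrase[:n]), phrase_count(conn, phrase[:n]))
```

On the sample rows this prints 2 for "United States", 2 for "United States of", and 1 for "United States of America". On a big table, an index on (textblockid, sentence, position) would likely be needed for the self-joins to stay fast.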

https://dornsife.usc.edu/assets/sites/516/docs/papers/msw_papers/msw-055.pdf
