# Which algorithm can I use to find common adjacent words (pattern recognition)?

I have a big table in my database containing a lot of words from various texts, stored in the order they appear in the text. I want to find how often a given sequence of words appears together.

Example: suppose these 4 words appear in many texts: United | States | of | America. I would get this result:

United States: 50

United States of: 45

United States of America: 40

(This is only an example with 4 words; a phrase can have fewer or more than 4.)

Is there an algorithm that can do this, or something similar?

Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.

Table Structure

I have two tables. Token has id and text. The text is UNIQUE, and each entry in this table represents a different word.

TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text.

It has textblockid, the block of text the token belongs to; sentence, the sentence the token is in; position, the token's position inside the sentence; and tokenid, a reference to the Token table.

## 1 Answer

This is a typical use case for Markov chains. Estimate the Markov model from your textbase and find high probabilities in the transition table. Since these indicate probabilities that one word will follow another, phrases will show up as high transition probabilities.
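As a rough illustration of the transition table, here is a minimal Python sketch of a first-order model. It assumes the tokens have already been read from the database into one list per sentence; the function name `transition_probabilities` and the toy sentences are made up for this example.

```python
from collections import Counter, defaultdict

def transition_probabilities(sentences):
    """Estimate first-order transition probabilities P(next | current)."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        # Walk over every pair of adjacent words in the sentence.
        for current, nxt in zip(tokens, tokens[1:]):
            pair_counts[current][nxt] += 1
    probs = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probs[current] = {word: n / total for word, n in followers.items()}
    return probs

# Toy data (hypothetical), one token list per sentence.
sentences = [
    ["United", "States", "of", "America"],
    ["United", "States", "of", "America"],
    ["United", "Kingdom"],
    ["States", "of", "matter"],
]

probs = transition_probabilities(sentences)

# List the transitions from most to least probable.
ranked = sorted(
    ((p, a, b) for a, row in probs.items() for b, p in row.items()),
    reverse=True,
)
for p, a, b in ranked:
    print(f"{a} -> {b}: {p:.2f}")
```

Transitions with a probability close to 1 are candidates for fixed phrases. On a real corpus you would probably also want to ignore words that occur only a handful of times, since a word seen once always has a transition probability of 1.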

By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
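To make that concrete, a small Python sketch under the same kind of assumption (tokens already loaded as one list per sentence; all names and the toy data are hypothetical). The absolute count of a two-word phrase is the number of times the first word starts a transition multiplied by the transition probability.

```python
from collections import Counter

# Toy data (hypothetical), one token list per sentence.
sentences = [
    ["United", "States", "of", "America"],
    ["United", "States", "of", "America"],
    ["United", "Kingdom"],
    ["States", "of", "matter"],
]

start_counts = Counter()  # how often a word is followed by another word
pair_counts = Counter()   # how often each adjacent pair occurs
for tokens in sentences:
    for current, nxt in zip(tokens, tokens[1:]):
        start_counts[current] += 1
        pair_counts[(current, nxt)] += 1

def transition_probability(first, second):
    """P(second | first) estimated from the counts above."""
    return pair_counts[(first, second)] / start_counts[first]

def derived_count(first, second):
    """Absolute phrase count = count(first) * P(second | first)."""
    return round(start_counts[first] * transition_probability(first, second))

print("United States:", derived_count("United", "States"))
```

Note that for phrases longer than two words, multiplying first-order probabilities along the chain only gives an estimate, because each step ignores the words before the current one. Exact counts for longer phrases need the longer sequences to be counted directly (a higher-order model).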

Also, the following link describes algorithms for pattern recognition across several sequences (consensus and alignment):

https://dornsife.usc.edu/assets/sites/516/docs/papers/msw_papers/msw-055.pdf

