How to find similar messages in a large database

Question

asked Aug 27, 2019 in AI and Deep Learning by ashely (50.2k points)

I have a database with 2,000,000 messages. When a user receives a message, I need to find relevant messages in my database based on the words they have in common.

I tried running a batch process to summarize my database: 1 - Store all words (except stop words such as an, a, the, of, for...) of all messages. 2 - Create an association between each message and the words it contains (I also store how often each word appears in the message).

Then, when I receive a message: 1 - I parse its words (the same way as in the first step of my batch process). 2 - I query the database to fetch messages sorted by the number of matching words.
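To make the two steps concrete, here is a minimal in-memory sketch of the scheme described above, written as an inverted index in Python. The names `STOP_WORDS`, `tokenize`, and `WordIndex` are hypothetical, the stop-word list only contains the examples given in the question, and a real setup would keep the postings in the database rather than in a dictionary.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list: only the examples named in the question.
STOP_WORDS = {"a", "an", "the", "of", "for"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


class WordIndex:
    """Inverted index: word -> {message_id: frequency in that message}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, message_id, text):
        # Batch step: associate the message with each word it contains.
        for word, freq in Counter(tokenize(text)).items():
            self.postings[word][message_id] = freq

    def similar(self, text, limit=10):
        """Return (message_id, shared_word_count) pairs, best match first."""
        shared = Counter()
        # Query step: only visit the postings of words in the new message.
        for word in set(tokenize(text)):
            for message_id in self.postings.get(word, ()):
                shared[message_id] += 1
        return shared.most_common(limit)


index = WordIndex()
index.add(1, "The quick brown fox")
index.add(2, "A quick brown dog for the park")
index.add(3, "Stock prices fell")
matches = index.similar("quick brown dog")
```

With this layout the query cost depends on how many messages contain the words of the incoming message, not on the total number of stored messages.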

However, both updating my word base and querying for similar messages are very heavy and slow. The word base update takes ~1.2111 seconds for a 3000-byte message. The similar-messages query takes ~9.8 seconds for a message of the same size.

The database tuning has already been done and the code works fine.

I need a better algorithm to do this.

Any ideas?

1 Answer

vinita · Answer 1 · 2019-08-27T10:02:29+0000

I would suggest the book "Collective Intelligence". Its examples are written in Python, but it covers enough theory for you to implement the ideas in another language. The very first chapter of the book deals with what you are trying to do.
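As an illustration of the kind of algorithm that could replace plain word-overlap counting, here is a minimal sketch of TF-IDF weighting with cosine similarity over an inverted index. This is a common general technique and not necessarily the one the book presents. The class name `TfidfIndex`, the input shape (a dict of message id to text), and the smoothed IDF formula are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TfidfIndex:
    def __init__(self, messages):
        # messages: dict of message_id -> text (hypothetical input shape)
        n = len(messages)
        counts = {mid: Counter(tokenize(t)) for mid, t in messages.items()}
        # Document frequency: in how many messages each word appears.
        df = Counter(w for c in counts.values() for w in c)
        # Smoothed IDF (an assumed variant); rare words weigh more.
        self.idf = {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}
        # word -> list of (message_id, length-normalized TF-IDF weight)
        self.postings = defaultdict(list)
        for mid, c in counts.items():
            vec = {w: f * self.idf[w] for w, f in c.items()}
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for w, v in vec.items():
                self.postings[w].append((mid, v / norm))

    def similar(self, text, limit=10):
        """Return (message_id, cosine_score) pairs, best match first."""
        c = Counter(tokenize(text))
        # Words never seen in the indexed messages are ignored.
        vec = {w: f * self.idf[w] for w, f in c.items() if w in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm == 0:
            return []
        scores = defaultdict(float)
        for w, v in vec.items():
            for mid, weight in self.postings[w]:
                scores[mid] += (v / norm) * weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]


msgs = {
    1: "server crashed after the database update",
    2: "the database update made the server slow",
    3: "lunch menu for the week",
}
tfidf = TfidfIndex(msgs)
ranked = tfidf.similar("database update crashed the server")
```

Because the message weights are normalized once at indexing time, a query only does arithmetic on the postings of its own words. Frequent words such as "the" still match but contribute little to the score, so an explicit stop-word list becomes less critical.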
