
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:

Record X may contain a log entry like:

Change Transaction ABC123 Assigned To Server US91

And Record Y may contain a log entry like:

Change Transaction XYZ789 Assigned To Server GB47

To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are different but that have other records they are similar to.

What I'm trying to determine is the best way to group similar items together and say, with XX% certainty, that Record X and Record Y are probably of the same nature. Or, put another way: the system would look at Record Y and say that, based on its content, it is most like Record X as opposed to all other records.

I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however, we have two additional challenges:

The content is machine-generated - not human-generated

As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
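To make the Levenshtein idea concrete, here is a minimal sketch of a normalized edit-distance comparison between the two example records. The function name and the similarity normalization are illustrative; a pure-Python dynamic-programming loop like this is fine for comparing a pair of strings, but it is exactly the kind of pairwise computation that will not scale to hundreds of millions of rows.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

x = "Change Transaction ABC123 Assigned To Server US91"
y = "Change Transaction XYZ789 Assigned To Server GB47"

dist = levenshtein(x, y)                      # 10: the two ID fields differ
similarity = 1 - dist / max(len(x), len(y))   # crude normalization to [0, 1]
```

Here `similarity` comes out around 0.8, which matches the human intuition that the two records are "mostly the same line with different IDs".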

Thanks for your input!

1 Answer


There's a scale issue in your question because you don't want to start comparing each record to every other record in the DB. You should look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match on that list.
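A minimal sketch of that single-pass approach, with illustrative names and a token-overlap (Jaccard) score standing in for whatever scoring function you settle on: each record is compared only against the current list of known types, so the work is roughly O(records × types) instead of O(records²).

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]; a stand-in for a real scoring function."""
    return len(a & b) / len(a | b) if a | b else 1.0

def classify(records, threshold=0.5):
    known_types = []   # one representative token set per discovered type
    assignments = []   # (record, type_index, score)
    for rec in records:
        tokens = set(rec.split())
        best_idx, best_score = -1, 0.0
        for idx, proto in enumerate(known_types):
            score = jaccard(tokens, proto)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_score >= threshold:
            assignments.append((rec, best_idx, best_score))
        else:
            known_types.append(tokens)   # no match: discovered a new type
            assignments.append((rec, len(known_types) - 1, 1.0))
    return known_types, assignments
```

With the two example records, the second one scores high enough against the type created from the first and is grouped with it, while an unrelated log line starts a new type.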

The scoring part will surely draw some good responses here; your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of Soundex match, maybe? Or, if you can figure out how to "discover" which parts of new records change, you could define your known types as regular expressions.
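One illustrative take on "discovering" the changing parts, since the records are machine-generated: mask any token containing a digit (transaction IDs, server names) and treat the result as a type signature. Both example records then collapse to the same template, which could also be mechanically turned into a regular expression. The masking rule here is an assumption that would need tuning against real data.

```python
import re

def signature(record: str) -> str:
    """Replace tokens containing digits with a placeholder to form a template."""
    return " ".join("<VAR>" if re.search(r"\d", tok) else tok
                    for tok in record.split())

sig_x = signature("Change Transaction ABC123 Assigned To Server US91")
sig_y = signature("Change Transaction XYZ789 Assigned To Server GB47")
# Both collapse to: "Change Transaction <VAR> Assigned To Server <VAR>"
```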

At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, you've likely found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
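That three-way decision can be sketched as a pair of score thresholds (the cutoff values here are purely illustrative): high-scoring records are accepted, mid-scoring ones are kept with their score so they can be re-checked against types discovered later in the pass, and low-scoring ones seed a new type.

```python
HIGH, LOW = 0.8, 0.5   # illustrative confidence cutoffs

def decide(score: float) -> str:
    if score >= HIGH:
        return "match"        # accept with high confidence
    if score >= LOW:
        return "weak-match"   # keep the score; revisit after the full pass
    return "new-type"         # likely a new "known type" to add to the list
```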

If you wish, you can learn more about NLP by joining Intellipaat's NLP Training.
