
What's the best way to store and search a database of natural language sentence structure trees?

Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parses for arbitrary sentences. What I'd like to do is create a tool that can extract all the docstrings from my source code, generate these trees for all sentences in the docstrings, store the trees and their associated function names in a database, and then allow a user to search the database using natural language queries.

So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:

    (TOP
      (S
        (NP (DT This))
        (VP
          (VBZ uploads)
          (NP (NNS files))
          (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
        (. .)))

If someone entered the query "How can I upload files?", equating to the tree:

    (TOP
      (SBARQ
        (WHADVP (WRB How))
        (SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
        (. ?)))

how would I store and query these trees in a SQL database?

I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.

And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed out entries that have similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but describes completely different behavior.

1 Answer


Relational databases are not a natural fit for storing knowledge. What you need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data as triples <subject, predicate, object>, so your phrase would be stored as <upload_files(), upload, file>. There are many tools and methods for searching such KBs (Prolog, for example, is a language designed for this kind of query). So all you have to do is translate sentences from natural language into KB triples or an ontology graph, translate the user query into incomplete triples (your question would look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help with the translation; the rest depends on the concrete techniques and technologies you decide to use.
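To make the triple idea concrete, here is a minimal in-memory sketch in Python. It is not a real knowledge base engine: the `TripleStore` class, the `check_uploads()` function name, and the triples chosen for it are all made up for illustration, and it assumes only the main clause of each docstring sentence has already been reduced to a lemmatized triple by some earlier step. `None` plays the role of the `?` in an incomplete triple.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Toy triple store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list:
        """Return all triples that fit the pattern; None is the '?' wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


kb = TripleStore()
# "This uploads files to a remote machine."
kb.add("upload_files()", "upload", "file")
# "Checks a remote machine to find a user that uploads files."
# (assumed main-clause triples for a hypothetical check_uploads() function)
kb.add("check_uploads()", "check", "machine")
kb.add("check_uploads()", "find", "user")

# "How can I upload files?"  ->  <?, upload, file>
hits = kb.match(predicate="upload", obj="file")
print([t.subject for t in hits])
```

Because the second docstring contributes no triple whose predicate is "upload" at the function level, the incomplete triple only finds `upload_files()`, which is the filtering a plain keyword search would not give you.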

You need to choose a different approach, one that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but doing a meaningful comparison of parse trees as part of a search is going to be very difficult. If you only want to take advantage of a little knowledge about grammatical relations, parse trees are too complicated.
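As a rough illustration of why storage itself is the easy part, the sketch below parses bracketed trees and stores them as an adjacency list in SQLite using Python's standard `sqlite3` module. The table layout, the helper names, and the `check_uploads()` function are assumptions of mine, and the second tree is written by hand rather than taken from OpenNLP output. The one query shown only finds the main-clause verb by surface form; it does not attempt general tree comparison, which is the hard part, and matching "upload" against "uploads" would still need lemmatization.

```python
import re
import sqlite3


def parse_bracketed(text):
    """Parse a Penn Treebank style string into (label, word, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]
        pos += 1
        word, children = None, []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                word = tokens[pos]     # leaf token under a POS tag
                pos += 1
        pos += 1                       # consume ")"
        return (label, word, children)

    return node()


def store(conn, function_name, tree, parent_id=None):
    """Insert one node per row; parent_id links a node to its parent."""
    label, word, children = tree
    cur = conn.execute(
        "INSERT INTO node (function, parent_id, label, word) VALUES (?, ?, ?, ?)",
        (function_name, parent_id, label, word),
    )
    node_id = cur.lastrowid
    for child in children:
        store(conn, function_name, child, node_id)


# A verb counts as "main" here if it sits at TOP > S > VP > VB*.
MAIN_VERB_SQL = """
SELECT v.function
FROM node v
JOIN node vp  ON v.parent_id  = vp.id  AND vp.label  = 'VP'
JOIN node s   ON vp.parent_id = s.id   AND s.label   = 'S'
JOIN node top ON s.parent_id  = top.id AND top.label = 'TOP'
WHERE v.label LIKE 'VB%' AND LOWER(v.word) = ?
"""


def functions_with_main_verb(conn, verb_form):
    return [row[0] for row in conn.execute(MAIN_VERB_SQL, (verb_form,))]


UPLOAD_TREE = (
    "(TOP (S (NP (DT This)) (VP (VBZ uploads) (NP (NNS files)) "
    "(PP (TO to) (NP (DT a) (JJ remote) (NN machine)))) (. .)))"
)
# Hand-written parse of "Checks a remote machine to find a user that
# uploads files." (an assumption, not actual parser output).
CHECK_TREE = (
    "(TOP (S (VP (VBZ Checks) (NP (DT a) (JJ remote) (NN machine)) "
    "(S (VP (TO to) (VP (VB find) (NP (NP (DT a) (NN user)) "
    "(SBAR (WHNP (WDT that)) (S (VP (VBZ uploads) (NP (NNS files))))))))))) "
    "(. .)))"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, function TEXT, "
    "parent_id INTEGER, label TEXT, word TEXT)"
)
store(conn, "upload_files()", parse_bracketed(UPLOAD_TREE))
store(conn, "check_uploads()", parse_bracketed(CHECK_TREE))

print(functions_with_main_verb(conn, "uploads"))
```

In the second sentence "uploads" sits inside a relative clause (under SBAR), so the main-verb query skips it. Anything more flexible than this fixed path quickly turns into many self-joins or recursive queries, which is where the triple representation becomes more practical.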
