What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parses for arbitrary sentences. What I'd like to do is create a tool that extracts all the docstrings from my source code, generates these trees for every sentence in the docstrings, stores the trees with their associated function names in a database, and then lets a user search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(NP (DT This))
(VP (VBZ uploads)
    (NP (NNS files))
    (PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
If someone entered the query "How can I upload files?", which parses to the tree:
(WHADVP (WRB How))
(SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
how would I store and query these trees in a SQL database?
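To make the storage question concrete, here is a minimal sketch of one naive layout I can picture: an adjacency-list table in SQLite, one row per tree node. The table name `node`, its columns, and the helpers `parse`, `store`, and `find_functions` are all hypothetical names made up for illustration, and the `LIKE 'upload%'` prefix match is only a crude stand-in for real lemmatization.

```python
import re
import sqlite3

def parse(text):
    """Parse a bracketed string into a list of (label, children) tuples.
    A preterminal such as (NNS files) becomes ("NNS", ["files"])."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]          # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                         # consume ")"
        return (label, children)

    forest = []
    while pos < len(tokens):
        forest.append(node())
    return forest

# Hypothetical schema: each node points at its parent (adjacency list).
SCHEMA = """
CREATE TABLE node (
    id     INTEGER PRIMARY KEY,
    func   TEXT NOT NULL,
    parent INTEGER REFERENCES node(id),
    label  TEXT NOT NULL,
    word   TEXT
);
CREATE INDEX node_parent ON node(parent);
"""

def store(db, func, forest, parent=None):
    """Insert every node of the forest, recording its parent row id."""
    for label, children in forest:
        word = children[0].lower() if isinstance(children[0], str) else None
        cur = db.execute(
            "INSERT INTO node (func, parent, label, word) VALUES (?, ?, ?, ?)",
            (func, parent, label, word),
        )
        if word is None:
            store(db, func, children, cur.lastrowid)

def find_functions(db, verb, noun):
    """Functions with a VP that directly holds the verb and an NP holding the noun."""
    sql = """
        SELECT DISTINCT vp.func
        FROM node vp
        JOIN node v  ON v.parent  = vp.id AND v.label LIKE 'VB%'
        JOIN node np ON np.parent = vp.id AND np.label = 'NP'
        JOIN node n  ON n.parent  = np.id AND n.label LIKE 'NN%'
        WHERE vp.label = 'VP' AND v.word LIKE ? AND n.word = ?
    """
    return [row[0] for row in db.execute(sql, (verb + "%", noun))]

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
DOC = ("(NP (DT This)) (VP (VBZ uploads) (NP (NNS files)) "
       "(PP (TO to) (NP (DT a) (JJ remote) (NN machine))))")
store(db, "upload_files", parse(DOC))
print(find_functions(db, "upload", "files"))
```

This only matches a verb and its object inside the same VP. It says nothing about whether that VP is the main clause, and every extra level of structure costs another self-join, which is part of why I doubt it scales.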
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed out entries that have similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but describes completely different behavior.
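As a rough illustration of the kind of structural filtering I mean, the sketch below compares only the main-clause verb and its direct-object nouns. It is not my actual proof-of-concept script. The helper names `parse`, `lemma`, and `main_action` are hypothetical, the `lemma` function is a crude placeholder for a real lemmatizer, and the parse of the "Checks a remote machine..." sentence is written by hand as an assumption about its structure, not copied from OpenNLP output.

```python
import re

def parse(text):
    """Parse a bracketed string into a list of (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]          # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                         # consume ")"
        return (label, children)

    forest = []
    while pos < len(tokens):
        forest.append(node())
    return forest

def lemma(tag, word):
    """Crude placeholder: lowercase, and strip the final 's' of a VBZ verb."""
    word = word.lower()
    return word[:-1] if tag == "VBZ" and word.endswith("s") else word

def main_action(forest):
    """Return (verb, object nouns) of the outermost clause, or None.
    Only clause wrappers and VPs are descended, never NP or SBAR."""
    for label, children in forest:
        phrases = [c for c in children if isinstance(c, tuple)]
        if label in ("S", "SQ", "SBARQ"):
            found = main_action(phrases)
            if found:
                return found
        elif label == "VP":
            verbs = [c for c in phrases if c[0].startswith("VB")]
            if not verbs:                # e.g. (VP (TO to) (VP ...))
                return main_action(phrases)
            tag, (word,) = verbs[0]
            nouns = frozenset(
                leaf[1][0].lower()
                for np in phrases if np[0] == "NP"
                for leaf in np[1] if leaf[0].startswith("NN")
            )
            return (lemma(tag, word), nouns)
    return None

DOC = ("(NP (DT This)) (VP (VBZ uploads) (NP (NNS files)) "
       "(PP (TO to) (NP (DT a) (JJ remote) (NN machine))))")
QUERY = ("(WHADVP (WRB How)) "
         "(SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))")
# Hand-written parse (an assumption, not parser output).
DISTRACTOR = """
(VP (VBZ Checks)
    (NP (DT a) (JJ remote) (NN machine))
    (S (VP (TO to)
           (VP (VB find)
               (NP (NP (DT a) (NN user))
                   (SBAR (WHNP (WDT that))
                         (S (VP (VBZ uploads) (NP (NNS files))))))))))
"""

for name, text in [("doc", DOC), ("query", QUERY), ("distractor", DISTRACTOR)]:
    print(name, main_action(parse(text)))
```

Under these assumptions the docstring and the query both reduce to the verb "upload" with the object "files", while the distractor reduces to "check" with the object "machine", because its "uploads files" sits inside a relative clause that the search never descends into.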