0 votes
in AI and Deep Learning by (50.2k points)

Like lots of you guys on SO, I often write in several languages. And when it comes to planning stuff (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I was taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, Perl, and Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.

Common idioms might include Java-like braces for scope, Pythonic list comprehensions or indentation, C++-like inheritance, C#-style lambdas, Matlab-like slices, and matrix operations.

I noticed that it's actually quite easy for people to recognize exactly what I'm trying to do, and quite easy for people to intelligently translate into other languages. Of course, that step involves considering the corner cases and the moments where each language behaves idiosyncratically.

But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if, etc. Clearly, I'd have to exclude many 'odd' languages like Lisp, APL derivatives, but...

So my questions are,

  1. Does code already exist that recognizes the programming language of a text file? (Surely this must be a less complicated task than Eclipse's syntax trees or Google Translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?

  2. Is it theoretically possible to create a single interpreter or compiler that recognizes what language idiom you're using at any moment and (maybe "intelligently") executes it or translates it to a runnable form, flagging the corner cases where my syntax is ambiguous with regard to behavior? Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognizing funny operators (like *pointer vs *kwargs), and knowing when to use list vs array-like representations.

  3. Is there any language or interpreter in existence that can manage this kind of flexible interpreting?

  4. Have I missed an obvious obstacle to this being possible?

1 Answer

0 votes
by (108k points)

Yes, code exists that recognizes the programming language of a text file. As for a flexible interpreter, one approach is to implement a simple pseudocode interpreter that has very basic functionality and a contiguous memory store. Here is how it works:

You feed the program a CRLF-separated list of valid instructions. The file, named “pseudocode.txt”, should be in the same directory as the program.

The program reads, parses, and executes the instructions in that file.

Any errors that are identified are reported; otherwise the program runs until it reaches a STOP instruction or EOF. The instruction set is very limited: there are about 18 operations, each of which can take up to 2 operands. An operand written in [brackets] refers to one of the 1000 available memory locations ([0] to [999]).
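To make the idea concrete, here is a minimal sketch of such an interpreter in Python. The answer does not list the actual 18 operations, so the instruction names used here (SET, ADD, PRINT, STOP) are hypothetical stand-ins, and the sketch takes the program as a string rather than reading “pseudocode.txt”. Only the bracketed memory references, the 1000-cell store, the error reporting, and the stop-at-STOP-or-EOF behavior come from the description above.

```python
MEMORY_SIZE = 1000  # memory locations [0] to [999], as described above


class PseudoError(Exception):
    """Raised for any malformed instruction or operand."""


def address(token):
    """Parse a bracketed memory reference such as "[42]" into an index."""
    if not (token.startswith("[") and token.endswith("]")):
        raise PseudoError(f"expected a memory reference, got {token!r}")
    try:
        index = int(token[1:-1])
    except ValueError:
        raise PseudoError(f"bad memory reference: {token!r}") from None
    if not 0 <= index < MEMORY_SIZE:
        raise PseudoError(f"memory location out of range: {token!r}")
    return index


def resolve(token, memory):
    """Return an operand's value: [n] reads memory, anything else is an integer literal."""
    if token.startswith("["):
        return memory[address(token)]
    try:
        return int(token)
    except ValueError:
        raise PseudoError(f"bad operand: {token!r}") from None


def run(source):
    """Execute the program text and return the list of printed values."""
    memory = [0] * MEMORY_SIZE
    output = []
    # splitlines() accepts CRLF as well as bare LF line endings.
    for lineno, line in enumerate(source.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        op, args = parts[0].upper(), parts[1:]
        try:
            # Hypothetical instruction names; the real set is not given.
            if op == "STOP":
                break
            elif op == "SET" and len(args) == 2:
                memory[address(args[0])] = resolve(args[1], memory)
            elif op == "ADD" and len(args) == 2:
                memory[address(args[0])] += resolve(args[1], memory)
            elif op == "PRINT" and len(args) == 1:
                output.append(resolve(args[0], memory))
            else:
                raise PseudoError(f"unknown or malformed instruction: {line!r}")
        except PseudoError as exc:
            raise PseudoError(f"line {lineno}: {exc}") from None
    return output
```

For example, `run("SET [0] 5\r\nADD [0] 7\r\nPRINT [0]\r\nSTOP")` stores 5 in location 0, adds 7 to it, and returns the printed values as a list. Anything after STOP is ignored, and a bad line raises `PseudoError` with its line number.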

To detect which programming language a snippet is written in, you can use the same method that spam filters use (Bayesian spam filtering). You split the snippet into words. Then you compare the occurrences of these words with known snippets and compute the probability that the snippet is written in language X, for every language you're interested in.

If you have a mechanism for adding new languages, just train the detector with a few snippets in the new language (you could feed it an open source project). This way the detector learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
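The detection approach described above can be sketched as a small naive Bayes classifier in Python. This is an illustration under assumptions rather than a production detector: the class name `LanguageDetector`, the tokenizer regex, the add-one smoothing, and the tiny training snippets are all choices made for this example and are not part of the original answer.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers/keywords, plus single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(snippet):
    """Split a code snippet into word-like tokens."""
    return TOKEN_RE.findall(snippet)


class LanguageDetector:
    """Naive Bayes over tokens, the same idea as Bayesian spam filtering."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # language -> token counts
        self.snippet_counts = Counter()          # language -> number of snippets
        self.vocabulary = set()

    def train(self, language, snippet):
        """Record the tokens of one known snippet for the given language."""
        tokens = tokenize(snippet)
        self.word_counts[language].update(tokens)
        self.snippet_counts[language] += 1
        self.vocabulary.update(tokens)

    def scores(self, snippet):
        """Return a log-probability score per trained language."""
        tokens = tokenize(snippet)
        total_snippets = sum(self.snippet_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for language, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Prior: fraction of training snippets in this language.
            score = math.log(self.snippet_counts[language] / total_snippets)
            for token in tokens:
                # Add-one smoothing so unseen tokens do not zero the score.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            result[language] = score
        return result

    def detect(self, snippet):
        """Return the most probable language for the snippet."""
        scores = self.scores(snippet)
        return max(scores, key=scores.get)


detector = LanguageDetector()
detector.train(
    "csharp",
    'using System;\nclass Program { static void Main() { System.Console.WriteLine("hi"); } }',
)
detector.train("ruby", 'def greet(name)\n  puts "hi #{name}"\nend\ngreet("world")')
```

With this training data, `detector.detect('puts "hello"')` picks `"ruby"` and `detector.detect("using System;")` picks `"csharp"`. Adding a language is just more calls to `train`, and in practice you would feed it far more code per language than the two toy snippets used here.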
