in Machine Learning by (4.2k points)
Let's imagine I have two English-language texts written by the same person. Is it possible to apply some Markov chain algorithm to analyse each one, create some kind of fingerprint based on statistical data, and compare the fingerprints obtained from different texts? Let's say we have a library of 100 texts. One person wrote text number 1 and one other text in the library, and we need to guess which one by analysing his or her writing style. Is there any known algorithm that does this? Can Markov chains be applied here?

1 Answer

by (6.8k points)

It is absolutely possible, and indeed the record of success in identifying an author from a text, or from some portion of it, is impressive.

A couple of representative studies (warning: links are to pdf files):

• Quantitative Analysis of Literary Styles

• Stylogenetics: Clustering-based stylistic analysis of literary corpora

To aid your web-search, this discipline is often called Stylometry (and occasionally, Stylogenetics).

So the two most important questions are, I suppose: which classifiers are useful for this purpose, and what data is fed to the classifier?

What I still find surprising is how little data is required to achieve very accurate classification. Often the data is just a word frequency list. (A directory of word frequency lists is available online here.)

For instance, one data set widely used in machine learning, and available from a number of places on the Web, comprises data from four authors: Shakespeare, Jane Austen, Jack London, and Milton. Their works were divided into 872 pieces (corresponding roughly to chapters), in other words about 220 substantial pieces of text for each of the four authors; each piece becomes a single data point in the data set. Next, a word-frequency scan was performed on each text, and the 70 most common words were kept for the study; the rest of the frequency-scan results were discarded. Here are the first 20 words of that 70-word list.

['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from']
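The word-selection step described above can be sketched roughly as follows. This is only an illustration, not the procedure the data set's authors actually used: the function name `most_common_words`, the simple regex tokenizer, and the alphabetical sort of the result are all assumptions.

```python
import re
from collections import Counter

def most_common_words(texts, k=70):
    """Return the k most frequent words across all texts, sorted alphabetically.

    Hypothetical helper: the tokenizer (lowercase letters and apostrophes)
    is an assumption, not the one used to build the original data set.
    """
    totals = Counter()
    for text in texts:
        totals.update(re.findall(r"[a-z']+", text.lower()))
    # Keep only the k most frequent words and discard the rest of the scan.
    return sorted(word for word, _ in totals.most_common(k))

# Toy corpus, made up for illustration.
texts = ["the cat and the dog", "the dog and a bird"]
print(most_common_words(texts, k=3))
```

With real chapters you would pass all 872 pieces and leave `k` at 70.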

Each data point, then, is just the count of each of the 70 words in one of the 872 chapters.

[78, 34, 21, 45, 76, 9, 23, 12, 43, 54, 110, 21, 45, 59, 87, 59, 34, 104, 93, 40]
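A count vector like this one could be produced along the following lines. This is a minimal sketch under stated assumptions: `fingerprint` is a hypothetical name, the tokenizer is a simple regex, and only the 20 words quoted in this answer are used rather than the full 70-word list.

```python
import re
from collections import Counter

# The 20 words quoted in this answer; the full list has 70 entries.
WORDS = ['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be',
         'been', 'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from']

def fingerprint(text, vocabulary=WORDS):
    """Count how often each vocabulary word occurs in text, in vocabulary order."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, so every slot is filled.
    return [counts[word] for word in vocabulary]

# Made-up sentence standing in for a chapter.
sample = "All is well, and all shall be well, as it has been for a time."
print(fingerprint(sample))
```

Each chapter run through such a function yields one row of the data set.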

Each of these data points is one instance of the author's literary fingerprint.
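One simple way to compare such fingerprints is cosine similarity with a nearest-match rule. This is an assumption made purely for illustration: the studies mentioned above are not claimed to use this method, and the names `cosine_similarity` and `closest_author` as well as the toy counts are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against all-zero vectors, which have no direction.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def closest_author(unknown, known):
    """Return the author whose fingerprint is most similar to the unknown one."""
    return max(known, key=lambda name: cosine_similarity(unknown, known[name]))

# Made-up four-word fingerprints, not taken from any real data set.
known = {"author_a": [10, 2, 1, 7], "author_b": [3, 9, 8, 1]}
unknown = [21, 5, 3, 13]
print(closest_author(unknown, known))
```

Cosine similarity depends only on the proportions of the counts, so a long chapter and a short one by the same author can still match closely.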

