+1 vote
in Machine Learning by (4.2k points)

I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation. For example:

  1. I record myself saying "Good morning"
  2. I let a foreign student record "Good morning"
  3. Compare his recording to mine to see if his pronunciation was good enough.

I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?

1 Answer

+2 votes
by (6.8k points)

This kind of problem is usually solved using machine learning techniques.

Break the signal down into a sequence of 20 ms or 50 ms frames and extract features from each frame. MFCCs are generally good for this kind of application, though there are features more specific to voice detection, such as 4 Hz modulation energy (roughly the rate at which people speak) and zero-crossing rate.
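To make the framing step concrete, here is a minimal stdlib-only sketch: split a sample sequence into fixed-length frames and compute a zero-crossing rate per frame. The sample rate and frame length are illustrative assumptions, and a real pipeline would extract MFCCs with a DSP library rather than hand-rolled features.

```python
import math

def frame_signal(signal, sample_rate=16000, frame_ms=20):
    """Split a sequence of samples into non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy input: one second of a 440 Hz sine at 16 kHz
sig = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = frame_signal(sig)               # 50 frames of 20 ms each
zcrs = [zero_crossing_rate(f) for f in frames]
```

A periodic tone like this gives a low, stable zero-crossing rate per frame; noisy or unvoiced frames would score much higher, which is what makes the feature useful for voice detection.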

Then, using a training set of audio you have manually labeled as speech / non-speech, train a classifier (Gaussian mixture models, SVM...) on the frame features.
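The training step can be sketched with a deliberately tiny model: one Gaussian per class fitted to a single 1-D feature, with classification by higher likelihood. This is a toy stand-in for a real GMM or SVM (in practice you would reach for a library such as scikit-learn), and the training values below are made up for illustration.

```python
import math

def fit_gaussian(values):
    """Return (mean, std) of a list of feature values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var) or 1e-6  # avoid a degenerate zero std

def log_likelihood(x, mean, std):
    """Log-density of x under a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def classify(x, speech_model, nonspeech_model):
    """Return 1 for speech, 0 for non-speech, whichever model is more likely."""
    return 1 if log_likelihood(x, *speech_model) > log_likelihood(x, *nonspeech_model) else 0

# Hypothetical labeled frame features (e.g. frame energies)
speech_frames = [0.8, 0.9, 1.1, 1.0, 0.7]
nonspeech_frames = [0.1, 0.05, 0.2, 0.15, 0.1]
speech_model = fit_gaussian(speech_frames)
nonspeech_model = fit_gaussian(nonspeech_frames)
```

The same structure scales up: replace the single feature with an MFCC vector and the single Gaussian with a mixture, and you have the GMM classifier the answer describes.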

This lets you classify unlabelled frames into speech/non-speech categories. The last step is smoothing the decisions (a frame classified as non-speech surrounded by hundreds of speech frames is most likely a classification error), for example with an HMM or just a median filter.
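The median-filter variant of the smoothing step is short enough to show in full: each per-frame decision is replaced by the median of its neighborhood, which flips isolated outliers. The window size is an illustrative choice.

```python
from statistics import median

def median_smooth(decisions, window=5):
    """Replace each 0/1 decision by the median of its local neighborhood."""
    half = window // 2
    smoothed = []
    for i in range(len(decisions)):
        lo, hi = max(0, i - half), min(len(decisions), i + half + 1)
        smoothed.append(int(median(decisions[lo:hi])))
    return smoothed

raw = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
print(median_smooth(raw))  # the isolated 0 at index 3 is flipped to 1
```

An HMM does the same job more flexibly by penalizing state transitions, but for a first prototype a median filter is hard to beat.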

A few references:

  - Robust speech/music classification in audio documents (Pinquier et al.)
  - Speech/music discrimination for multimedia system applications (El-Maleh et al.)
  - A comparison of features for speech/music discrimination (Carey et al.)

Note that the features and classification techniques they describe are also relevant for the 1-class problem of detecting speech (instead of discriminating speech vs. something else). In that case, you can use 1-class modeling techniques such as a 1-class SVM, or just take the likelihood score of a GMM trained on speech data as a "speechiness" measure.
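The 1-class idea can be sketched as follows: fit a Gaussian to speech-only training data, use its log-likelihood as a speechiness score, and threshold that score instead of comparing against a second class. The model parameters and the 2-sigma threshold here are illustrative assumptions.

```python
import math

def speechiness(x, mean, std):
    """Log-likelihood of feature x under a 1-D Gaussian speech model."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

# Hypothetical parameters, as if fitted on speech-only frames
speech_mean, speech_std = 0.9, 0.15

# Accept anything within ~2 standard deviations of the speech model
threshold = speechiness(speech_mean - 2 * speech_std, speech_mean, speech_std)

def is_speech(x):
    return speechiness(x, speech_mean, speech_std) >= threshold
```

The appeal of the 1-class setup is that you never need labeled examples of "everything that is not speech", only a decent speech corpus.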

If, on the other hand, your problem really is discriminating speech vs. something else (say, music), you could also use unsupervised approaches that focus on detecting the boundaries between homogeneous audio segments, instead of characterizing the content itself.
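One simple unsupervised boundary detector, sketched under assumed window and threshold values: slide two adjacent windows over the per-frame features and flag positions where their means differ sharply, since a large jump suggests a transition between two homogeneous segments.

```python
def boundary_scores(features, window=3):
    """Distance between the means of the windows left and right of each position."""
    scores = []
    for i in range(window, len(features) - window):
        left = sum(features[i - window:i]) / window
        right = sum(features[i:i + window]) / window
        scores.append((i, abs(right - left)))
    return scores

def detect_boundaries(features, window=3, threshold=0.5):
    """Positions where the left/right window means differ by more than threshold."""
    return [i for i, s in boundary_scores(features, window) if s > threshold]

# Toy per-frame features with an abrupt change around index 4
feats = [0.1, 0.1, 0.2, 0.1, 0.9, 1.0, 0.9, 1.0]
```

Full-scale versions of this idea (e.g. BIC-based segmentation) use multivariate features and a statistical distance rather than a mean difference, but the sliding-window structure is the same.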

