
in AI and Deep Learning by (50.2k points)

I'm developing an iOS app that does voice-based AI; i.e. it's meant to take voice input from the microphone, turn it into text, send it to an AI agent, then output the returned text through the speaker. I've got everything working, though I'm currently using a button to start and stop recording the speech (SpeechKit for voice recognition, API.AI for the AI, Amazon Polly for the output).

The missing piece is keeping the microphone always on and automatically starting and stopping the recording of the user's voice as they begin and end talking. This app is being developed for an unorthodox context where the user will have no access to the screen (but they will have a high-end shotgun mic for recording their speech).

My research suggests this piece of the puzzle is known as 'Voice Activity Detection' and seems to be one of the hardest steps in the whole voice-based AI system.

I'm hoping someone can either supply some straightforward (Swift) code to implement this myself or point me in the direction of some decent libraries / SDKs that I can implement in this project.

1 Answer

by (107k points)

In a speech transcription, four primary actors are involved (a minimal code sketch follows this list):

  • SFSpeechRecognizer is the primary controller in the framework. Its most important job is to generate recognition tasks and return results. It also handles authorization and configures locales.

  • SFSpeechRecognitionRequest is the base class for recognition requests. Its job is to point the SFSpeechRecognizer to an audio source from which transcription should occur. There are two concrete subclasses: SFSpeechURLRecognitionRequest, for reading from a file, and SFSpeechAudioBufferRecognitionRequest, for reading from a buffer.
  • SFSpeechRecognitionTask objects are created when a request is kicked off by the recognizer. They are used to track the progress of transcription or cancel it.

  • SFSpeechRecognitionResult objects contain the transcription of a chunk of the audio. Each result typically corresponds to a single word.
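To see how these pieces fit together, here is a minimal sketch of live transcription from the microphone using SFSpeechAudioBufferRecognitionRequest fed by an AVAudioEngine tap. The silence timeout at the end is only a crude stand-in for real voice activity detection (it treats the utterance as finished when no new results arrive for about two seconds); the LiveTranscriber class name, the two-second value, and that restart logic are illustrative assumptions, not part of the Speech framework.

import Speech
import AVFoundation

// Minimal live-transcription sketch. Assumes NSSpeechRecognitionUsageDescription and
// NSMicrophoneUsageDescription are in Info.plist and that speech recognition
// authorization has already been granted via SFSpeechRecognizer.requestAuthorization.
final class LiveTranscriber {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?
    private var silenceTimer: Timer?

    func start() throws {
        // In a real app, also configure AVAudioSession (e.g. the .record category) first.
        // Buffer-based request fed from the microphone tap.
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        self.request = request

        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        // The recognizer creates a task; each result carries the best transcription so far.
        task = recognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                print(result.bestTranscription.formattedString)
                // Crude end-of-utterance heuristic (an assumption, not a framework feature):
                // every new result restarts a 2-second timer; when it fires, end the audio
                // so the task delivers a final result.
                self?.resetSilenceTimer()
            }
            if error != nil || (result?.isFinal ?? false) {
                self?.stop()
            }
        }
    }

    private func resetSilenceTimer() {
        DispatchQueue.main.async { [weak self] in
            self?.silenceTimer?.invalidate()
            self?.silenceTimer = Timer.scheduledTimer(withTimeInterval: 2.0, repeats: false) { _ in
                self?.request?.endAudio()
            }
        }
    }

    func stop() {
        silenceTimer?.invalidate()
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.cancel()
        request = nil
        task = nil
    }
}

Usage is roughly: create a LiveTranscriber and call try transcriber.start(). In your scenario, stop() (or the point where the final result arrives) is where you would hand the finished transcription off to API.AI and then speak the reply with Polly.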

To learn how to transcribe live or pre-recorded audio in your iOS app with the same engine used by Siri, refer to this speech recognition tutorial for iOS: https://www.raywenderlich.com/573-speech-recognition-tutorial-for-ios#toc-anchor-003
