Python Speech Recognition Introduction And Practice

Amazon’s huge success with Alexa suggests that some degree of voice support will soon be a basic expectation of everyday technology. Python programs that integrate speech recognition offer a level of interactivity and accessibility that few other technologies can match. Best of all, adding speech recognition to a Python program is straightforward.

1. An Overview Of How Speech Recognition Works.

Speech recognition originated in research done at Bell Labs in the early 1950s. Early systems could recognize only a single speaker and a vocabulary of about a dozen words. Modern systems have come a long way: they can recognize multiple speakers and have large vocabularies covering many languages.

The first part of speech recognition, of course, is speech. Through a microphone, speech is converted from physical sound to electrical signal, and then to data through an analog-to-digital converter. Once digitized, several models can be used to transcribe audio into text.

Most modern speech recognition systems rely on hidden Markov models (HMMs). The approach assumes that a speech signal, viewed on a very short time scale (say, 10 milliseconds), can be reasonably approximated as a stationary process, that is, a process whose statistical properties do not change over time.
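The short-time view described above can be sketched in a few lines. This is a minimal illustration that uses only the standard library; the function name frame_signal and the 16 kHz sample rate are assumptions made for the example and are not part of any speech library. It splits a list of samples into consecutive 10 ms frames, the kind of unit over which a recognizer would compute its features.

```python
def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped. Real systems usually use overlapping,
    windowed frames; this sketch keeps only the basic idea.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# One second of silent audio at an assumed 16 kHz sample rate.
one_second = [0.0] * 16000
frames = frame_signal(one_second, sample_rate=16000)
print(len(frames), len(frames[0]))  # 100 frames of 160 samples each
```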

Many modern speech recognition systems use neural networks to simplify the speech signal through feature transformation and dimensionality reduction before HMM recognition. A voice activity detector (VAD) can also be used to reduce the audio signal to only the portions that are likely to contain speech.
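To make the idea of a voice activity detector concrete, here is a deliberately naive sketch that keeps only the frames whose energy exceeds a threshold. The helper names rms and speech_frames and the threshold value are assumptions chosen for illustration; production VADs are considerably more sophisticated than a fixed energy threshold.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(frames, threshold):
    """Return the indices of non-empty frames whose RMS energy exceeds threshold."""
    return [i for i, frame in enumerate(frames) if frame and rms(frame) > threshold]


# Three toy frames: silence, a loud burst, then near-silence.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, -0.5, 0.5, -0.5],
    [0.01, -0.01, 0.01, -0.01],
]
print(speech_frames(frames, threshold=0.1))  # only frame 1 is kept
```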

Fortunately for Python users, several speech recognition services are available online through an API, and most of them also provide a Python SDK.

2. Select The Python Speech Recognition Package.

Several off-the-shelf speech recognition packages are available on PyPI, including the following.

apiai
google-cloud-speech
pocketsphinx
SpeechRecognition
watson-developer-cloud
wit

Some packages (such as wit and apiai) offer built-in features that go beyond basic speech recognition, such as natural language processing for identifying a speaker’s intent. Others, such as google-cloud-speech, focus solely on speech-to-text conversion. SpeechRecognition stands out for its ease of use.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving it easy. Instead of writing scripts to access the microphone and process audio files from scratch, you can be up and running in a few minutes.

The SpeechRecognition library wraps several popular speech APIs, which makes it extremely flexible. One of them, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library, so it can be used without signing up. This flexibility and ease of use make SpeechRecognition an excellent choice for Python projects.

3. Install SpeechRecognition.

SpeechRecognition is compatible with Python 2.6, 2.7, and 3.3+, but using it with Python 2 requires some additional installation steps. This tutorial assumes Python 3.3+ throughout.

Install SpeechRecognition from the terminal with pip.

$ pip install SpeechRecognition

When the installation completes, open an interpreter session and enter the following to verify the installation.

$ python
>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'

Note: do not close this session. You will use it in the next few steps.

SpeechRecognition works out of the box if all you need is to process existing audio files. Specific use cases, however, require additional dependencies. In particular, the PyAudio package must be installed to capture microphone input.
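A quick way to see which of these dependencies are present is to ask Python whether each module can be found. The sketch below uses only the standard library; the helper name check_dependencies is an assumption made for this example. The module names are the import names of the packages (speech_recognition, pyaudio, and pocketsphinx for offline recognition).

```python
import importlib.util


def check_dependencies(module_names):
    """Map each top-level module name to True if it can be imported, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# speech_recognition: the library itself; pyaudio: microphone input;
# pocketsphinx: offline recognition with CMU Sphinx.
status = check_dependencies(["speech_recognition", "pyaudio", "pocketsphinx"])
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

The check only reports whether a module is importable. It does not verify versions or that a microphone is actually available.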

4. Speech Recognizer Class.

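The core of SpeechRecognition is the Recognizer class, whose main purpose is to recognize speech. To follow along, create an instance in the interpreter session with r = sr.Recognizer(); the examples below call its methods through r. Each instance comes with a variety of settings and with seven methods for recognizing speech from an audio source, one per supported API: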

recognize_bing(): Microsoft Bing Speech
recognize_google(): Google Web Speech API
recognize_google_cloud(): Google Cloud Speech - requires installation of the google-cloud-speech package
recognize_houndify(): Houndify by SoundHound
recognize_ibm(): IBM Speech to Text
recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
recognize_wit(): Wit.ai

Of these seven methods, only recognize_sphinx() works offline, using the CMU Sphinx engine. The other six require an internet connection.
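The list above can be captured in a small lookup table, which is handy if a program needs to choose an engine at run time. This is only a sketch: the ENGINES table and the helper get_recognize_method are hypothetical names introduced here, while the recognize_*() method names are the ones listed above.

```python
# Engine name -> (Recognizer method name, needs an internet connection).
ENGINES = {
    "bing": ("recognize_bing", True),
    "google": ("recognize_google", True),
    "google_cloud": ("recognize_google_cloud", True),
    "houndify": ("recognize_houndify", True),
    "ibm": ("recognize_ibm", True),
    "sphinx": ("recognize_sphinx", False),
    "wit": ("recognize_wit", True),
}


def get_recognize_method(recognizer, engine, online=True):
    """Return the bound recognize_*() method for engine.

    Raises ValueError for an unknown engine, or for an engine that needs
    the network when online is False.
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    method_name, needs_network = ENGINES[engine]
    if needs_network and not online:
        raise ValueError(f"{engine!r} requires an internet connection")
    return getattr(recognizer, method_name)


# Typical use, assuming SpeechRecognition and PocketSphinx are installed:
#   import speech_recognition as sr
#   recognize = get_recognize_method(sr.Recognizer(), "sphinx", online=False)
#   text = recognize(audio)
```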

SpeechRecognition ships with a default API key for the Google Web Speech API, so you can use it right away. The other five online APIs require authentication with either an API key or a username/password combination, so this article uses the Google Web Speech API.

Now try calling recognize_google() on the Recognizer instance r without any arguments. The call fails, because the method needs audio data to work on.

>>> r.recognize_google()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

All seven recognize_*() methods of the Recognizer class require an audio_data argument, and in each case audio_data must be an instance of SpeechRecognition’s AudioData class. There are two ways to create an AudioData instance: from an audio file or from audio recorded by a microphone.

5. Use Of Audio Files.

First, download an audio file and save it to the directory in which your Python interpreter session is running. The AudioFile class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the contents of the file.

5.1 Supported File Types.

SpeechRecognition currently supports the following file types.

WAV: must be in PCM/LPCM format.
AIFF.
AIFF-C.
FLAC: must be native FLAC format; OGG-FLAC is not supported.

If you are working on x86-based Linux, macOS, or Windows, you should be able to work with FLAC files without any extra setup. On other platforms, you need to install a FLAC encoder and make sure you have access to the flac command-line tool.
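Because WAV files must be PCM encoded, it can be useful to check a file before handing it to AudioFile. The sketch below relies on the standard library wave module, which refuses to open files that are not valid PCM WAV data; the helper name is_pcm_wav and the demo file name tone.wav are assumptions made for this example.

```python
import os
import tempfile
import wave


def is_pcm_wav(path):
    """Return True if path can be opened as a PCM WAV file by the wave module."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, EOFError):
        # wave.Error: not RIFF/WAVE data or an unsupported encoding.
        # EOFError: the file is too short to contain a header.
        return False


# Demo: write a tiny 16-bit mono PCM file and check it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tone.wav")
    with wave.open(path, "wb") as out:
        out.setnchannels(1)      # mono
        out.setsampwidth(2)      # 16-bit samples
        out.setframerate(16000)  # 16 kHz
        out.writeframes(b"\x00\x00" * 160)
    print(is_pcm_wav(path))
```

A missing file still raises the usual OSError, which is left for the caller to handle.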

5.2 Use record() To Get Data From Audio File.

In the interpreter session, type the following to process the contents of the “harvard.wav” file:

>>> harvard = sr.AudioFile('harvard.wav')
>>> with harvard as source:
...     audio = r.record(source)
...

The code above opens the file through the context manager, reads its contents, and stores the data in the AudioFile instance called source. The record() method then reads the data from the entire file into an AudioData instance, which you can confirm by checking the type of audio.

>>> type(audio)
<class 'speech_recognition.AudioData'>

You can now call recognize_google() to try to recognize the speech in the audio.

>>> r.recognize_google(audio)
'the stale smell of old beer lingers it takes heat
to bring out the odor a cold dip restores health and
zest a salt pickle taste fine with ham tacos al
Pastore are my favorite a zestful food is the hot
cross bun'

You have now transcribed your first audio file.
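The interpreter steps above can be collected into a small reusable function. This is a sketch rather than a definitive implementation: the function name transcribe_file is hypothetical, and it uses only the calls already shown (Recognizer, AudioFile, record() and recognize_google()). It needs the SpeechRecognition package and an internet connection to run against a real file.

```python
def transcribe_file(path):
    """Transcribe a whole audio file with the Google Web Speech API.

    The import sits inside the function so that this module can still be
    loaded on machines where SpeechRecognition is not installed.
    """
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        # record() with no arguments reads the file to the end.
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)


# Typical use, assuming harvard.wav is in the working directory:
#   print(transcribe_file("harvard.wav"))
```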

5.3 Capture Audio Clips Using Offset And Duration.

What if you only want to capture part of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after the specified number of seconds.

For example, the following code captures only the speech in the first four seconds of the file:

>>> with harvard as source:
...     audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'

When record() is called inside a with block, the file stream moves forward. This means that if you record for four seconds and then record for four seconds again, the first call returns the first four seconds of audio (audio1 below) and the second call returns the next four seconds (audio2).

>>> with harvard as source:
...     audio1 = r.record(source, duration=4)
...     audio2 = r.record(source, duration=4)
...
>>> r.recognize_google(audio1)
'the stale smell of old beer lingers'
>>> r.recognize_google(audio2)
'it takes heat to bring out the odor a cold dip'
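Because each record() call continues where the previous one stopped, a file can be split into consecutive chunks with a simple loop. The helper below is a hypothetical sketch (record_chunks is not part of SpeechRecognition); it expects a Recognizer and a source that has already been opened in a with block.

```python
def record_chunks(recognizer, source, chunk_seconds, max_chunks):
    """Record up to max_chunks consecutive chunks from an open audio source.

    Each call to recognizer.record() starts where the previous one stopped,
    so the chunks do not overlap.
    """
    return [
        recognizer.record(source, duration=chunk_seconds)
        for _ in range(max_chunks)
    ]


# Typical use, assuming SpeechRecognition is installed and harvard.wav exists:
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.AudioFile("harvard.wav") as source:
#       chunks = record_chunks(r, source, chunk_seconds=4, max_chunks=2)
#   for chunk in chunks:
#       print(r.recognize_google(chunk))
```

Fixed-length chunks can cut words in half, so this works best when the pauses in the recording are known in advance.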

In addition to a recording duration, you can give record() a starting point with the offset keyword argument. Its value is the number of seconds from the beginning of the file to skip before recording starts.

For example, to capture only the second phrase in the file, start with an offset of four seconds and record for three seconds.

>>> with harvard as source:
...     audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'

The offset and duration keyword arguments are very useful for splitting an audio file when you know the structure of the speech in the file in advance. Used carelessly, however, they can produce poor transcriptions.

>>> with harvard as source:
...     audio = r.record(source, offset=4.7, duration=2.8)
...
>>> r.recognize_google(audio)
'Mesquite to bring out the odor Aiko'

The recording above starts at 4.7 seconds, so the “it t” at the beginning of the phrase “it takes heat to bring out the odor” is cut off. The API receives only “akes heat”, which it matches to “Mesquite”. Similarly, the recording ends after capturing only “a co”, the beginning of the phrase “a cold dip restores health and zest”, which is mismatched as “Aiko”.

Noise is another major cause of inaccurate transcriptions. The examples above work well because the audio file is reasonably clean, but real-world audio is rarely free of noise unless it has been processed beforehand.

5.4 The Influence Of Noise On Speech Recognition.

Noise is a fact of life. All real-world recordings contain some degree of noise, and unhandled noise can wreck the accuracy of speech recognition applications.

To get a feel for how noise affects speech recognition, download the jackhammer.wav file and save it to the working directory of your interpreter session. In this file, the phrase “the stale smell of old beer lingers” is spoken over the sound of a loud jackhammer.

5.4.1 What happens when you try to transcribe this file?

>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell of old gear vendors'

So what can you do about it? One thing you can try is the adjust_for_ambient_noise() method of the Recognizer class.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'still smell of old beer vendors'

That is much closer to the actual phrase, but it is still not accurate, and the word “the” at the beginning is missing. Why?

By default, adjust_for_ambient_noise() reads the first second of the file stream and calibrates the recognizer to the noise level of that audio. That first second is therefore consumed before record() is called to capture the data.

You can change the length of audio that adjust_for_ambient_noise() analyzes with the duration keyword argument, which is given in seconds and defaults to 1. Try lowering this value to 0.5.

>>> with jackhammer as source:
...     r.adjust_for_ambient_noise(source, duration=0.5)
...     audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell like old Beer Mongers'

This time you get the “the” at the beginning of the phrase, but there are some new problems. Sometimes the signal is simply too noisy for the effect of the noise to be removed.

If you run into these problems often, you may need to preprocess the audio. This can be done with audio editing software or with a Python package (such as SciPy) that can apply filters to the files.
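As a rough idea of what such a filter does, here is a first-order high-pass filter written in plain Python. It is a swapped-in, simplified stand-in for what a package like SciPy would provide, not SciPy's API; the function name high_pass, the 16 kHz sample rate and the 300 Hz cutoff are all assumptions made for the example. A filter this simple removes low-frequency rumble and constant offsets. It would not rescue the jackhammer recording, whose noise overlaps the frequencies of speech.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """Apply a first-order (RC-style) high-pass filter to a list of samples."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    filtered = [samples[0]]
    for i in range(1, len(samples)):
        # Pass changes between samples and let the slowly varying part decay.
        filtered.append(alpha * (filtered[-1] + samples[i] - samples[i - 1]))
    return filtered


# A constant offset (0 Hz) fades toward zero after filtering.
flat = [1.0] * 2000
print(abs(high_pass(flat, sample_rate=16000, cutoff_hz=300)[-1]) < 1e-6)
```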

When working with noisy files, it can help to look at the actual API response. Most APIs return a JSON string containing many possible transcriptions, but recognize_google() returns only the most likely transcription unless you ask for the full response.

To get the full response, set the show_all keyword argument of recognize_google() to True.

>>> r.recognize_google(audio, show_all=True)
{'alternative': [
{'transcript': 'the snail smell like old Beer Mongers'}, 
{'transcript': 'the still smell of old beer vendors'}, 
{'transcript': 'the snail smell like old beer vendors'},
{'transcript': 'the stale smell of old beer vendors'}, 
{'transcript': 'the snail smell like old beermongers'}, 
{'transcript': 'destihl smell of old beer vendors'}, 
{'transcript': 'the still smell like old beer vendors'}, 
{'transcript': 'bastille smell of old beer vendors'}, 
{'transcript': 'the still smell like old beermongers'}, 
{'transcript': 'the still smell of old beer venders'}, 
{'transcript': 'the still smelling old beer vendors'}, 
{'transcript': 'musty smell of old beer vendors'}, 
{'transcript': 'the still smell of old beer vendor'}
], 'final': True}

As you can see, recognize_google() returns a dictionary whose ‘alternative’ key points to a list of possible transcripts. The structure of this response varies from API to API and is mainly useful for debugging.
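Pulling the candidate transcripts out of such a response takes only a few lines. The helper below is a hypothetical sketch (list_transcripts is not part of SpeechRecognition) that assumes the response has the shape shown above, and that returns an empty list when the response is not a dictionary or has no alternatives.

```python
def list_transcripts(response):
    """Return every transcript in a show_all=True response from recognize_google().

    An empty list is returned when the response is not a dictionary or has
    no 'alternative' entries.
    """
    if not isinstance(response, dict):
        return []
    return [
        alternative["transcript"]
        for alternative in response.get("alternative", [])
        if "transcript" in alternative
    ]


# A shortened copy of the response shown above.
response = {
    "alternative": [
        {"transcript": "the snail smell like old Beer Mongers"},
        {"transcript": "the still smell of old beer vendors"},
        {"transcript": "the stale smell of old beer vendors"},
    ],
    "final": True,
}
for text in list_transcripts(response):
    print(text)
```

With the full list in hand, a program could, for example, prefer a candidate that contains an expected keyword over the first one returned.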

6. How To Use The Microphone.

To access the microphone with SpeechRecognition, you must install the PyAudio package. Close the current interpreter session and do the following.

6.1 Install PyAudio.

6.1.1 For Windows.

Windows users can directly call pip to install PyAudio.

$ pip install pyaudio

6.1.2 For Debian Linux.

If you are using Debian based Linux (such as Ubuntu), you can use apt to install PyAudio.

$ sudo apt-get install python-pyaudio python3-pyaudio

Once the installation completes, you may still need to run pip install pyaudio, especially if you are working in a virtual environment.

6.1.3 For macOS.

macOS users first need to install PortAudio with Homebrew and then install PyAudio with pip.

$ brew install portaudio
$ pip install pyaudio

6.2 Installation Testing.

Once you have PyAudio installed, you can test the installation from the console.

$ python -m speech_recognition

Make sure your default microphone is on and unmuted. If the installation worked, you should see something like the following.

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Speak into the microphone and observe how well SpeechRecognition transcribes your speech.

6.3 Use Python Microphone Class.

Open another interpreter session and create an instance of the Recognizer class. This time, instead of an audio file, the source will be the default system microphone, which you can access by creating an instance of the Microphone class.

>>> import speech_recognition as sr
>>> r = sr.Recognizer()
>>> mic = sr.Microphone()

If your system has no default microphone, or you want to use a microphone other than the default, you need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.

>>> sr.Microphone.list_microphone_names()
['HDA Intel PCH: ALC272 Analog (hw:0,0)',
'HDA Intel PCH: HDMI 0 (hw:0,3)',
'sysdefault',
'front',
'pulse',
'dmix', 
'default']

The device index of a microphone is the index of its name in the list returned by list_microphone_names(). In the output above, for example, the microphone named “front” has index 3 in the list, so you can create the microphone instance with the code below.

>>> mic = sr.Microphone(device_index=3)
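Device indexes differ from machine to machine, so it is often safer to look the index up by name than to hard-code it. The helper below is a hypothetical sketch (find_microphone_index is not part of SpeechRecognition) that searches a list of names such as the one returned by list_microphone_names().

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device name containing keyword, or None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        if keyword in name.lower():
            return index
    return None


# The device list from the example output above.
names = [
    "HDA Intel PCH: ALC272 Analog (hw:0,0)",
    "HDA Intel PCH: HDMI 0 (hw:0,3)",
    "sysdefault",
    "front",
    "pulse",
    "dmix",
    "default",
]
print(find_microphone_index(names, "front"))  # 3

# Typical use, assuming SpeechRecognition and PyAudio are installed:
#   import speech_recognition as sr
#   index = find_microphone_index(sr.Microphone.list_microphone_names(), "front")
#   mic = sr.Microphone(device_index=index)
```

The match is a case-insensitive substring test, so a keyword such as "default" would match "sysdefault" first. Pick a keyword that is specific enough for your device list.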

6.4 Use listen() Function To Get Input Data From The Microphone.

Once the microphone instance is ready, you can capture input from it. Like the AudioFile class, Microphone is a context manager.

The listen() method of the Recognizer class captures microphone input. It takes an audio source as its first argument and records input from the source until silence is detected.

>>> with mic as source:
...     audio = r.listen(source)
...

After executing the with block, try saying “hello” into the microphone. Wait for the interpreter to display the prompt again. Once the “>>>” prompt returns, you can recognize the speech with the code below.

>>> r.recognize_google(audio)
'hello'

If the prompt never returns, your microphone is probably picking up too much ambient noise. Press Ctrl + C to interrupt the process and get the prompt back.

To handle ambient noise, call the adjust_for_ambient_noise() method of the Recognizer class, just as you did with the noisy audio file. Because microphone input is far less predictable than an audio file, it is a good idea to do this every time you listen for microphone input.

>>> with mic as source:
...     r.adjust_for_ambient_noise(source)
...     audio = r.listen(source)
...

After running the code above, wait a moment and try saying “hello” to the microphone. Again, you must wait for the interpreter prompt to return before attempting to recognize speech.

Remember that adjust_for_ambient_noise() analyzes the audio source for one second by default. If that is too long for your application, you can adjust it with the duration keyword argument.
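The calibrate-then-listen pattern can be wrapped in one helper so that it is applied consistently. This is a sketch under stated assumptions: the function name listen_once and its default values are hypothetical. The phrase_time_limit argument, which is not discussed above, is an optional parameter of listen() that caps the length of the recorded phrase in seconds.

```python
def listen_once(recognizer, source, calibration_seconds=0.5, phrase_time_limit=None):
    """Calibrate for ambient noise, then capture one phrase from an open source."""
    # Sample the background noise for a short, configurable time.
    recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
    # Record until silence is detected or the optional limit is reached.
    return recognizer.listen(source, phrase_time_limit=phrase_time_limit)


# Typical use, assuming SpeechRecognition and PyAudio are installed:
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.Microphone() as source:
#       audio = listen_once(r, source, phrase_time_limit=10)
#   print(r.recognize_google(audio))
```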

6.5 Process Speech That Is Difficult To Recognize.

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get a result like the one below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jerry/real_python/speech_recognition_primer/venv/lib/python3.5/site-packages/speech_recognition/__init__.py", line 858, in recognize_google
    if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
speech_recognition.UnknownValueError

Audio that the API cannot match to text raises an UnknownValueError exception, so calls to the API should be wrapped in try and except blocks to handle it. The API does its best to turn any sound into text: a short grunt might be recognized as “how”, and a cough, hand clap, or tongue click might either be converted into text or raise the exception.
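A try and except wrapper along these lines might look as follows. It is a minimal sketch: the function name safe_transcribe and the error messages are hypothetical. Besides UnknownValueError it also catches RequestError, which is not discussed above; that is the exception SpeechRecognition raises when the API cannot be reached or rejects the request.

```python
def safe_transcribe(recognizer, audio):
    """Return a (text, error) pair; exactly one of the two is None."""
    # Imported here so this module loads even without SpeechRecognition.
    import speech_recognition as sr

    try:
        return recognizer.recognize_google(audio), None
    except sr.UnknownValueError:
        # The API could not match the audio to any text.
        return None, "speech was unintelligible"
    except sr.RequestError as error:
        # The API was unreachable or refused the request.
        return None, f"API request failed: {error}"


# Typical use, with r and audio created as in the examples above:
#   text, error = safe_transcribe(r, audio)
#   print(text if error is None else "Sorry: " + error)
```

Returning the error as a value instead of letting the exception propagate keeps an interactive loop running when a single utterance fails.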