Audio quality is extremely important for satisfactory results from any speech processing technology, be it simple voice activity detection, speech transcription, voice biometrics, or anything else.
There are two main aspects of audio quality:
- technical quality of the audio data (format, codec, bitrate, SNR, …)
- sound quality of the actual content (background noise, reverberations, …)
Using an inappropriate audio codec, heavy compression, too low a bitrate, etc. can damage or even completely destroy parts of the audio signal that speech technologies rely on.
Commonly used lossy compression schemes exploit the perceptual limits of human hearing and can, for example, remove frequencies that are masked by other frequencies.
Therefore, to get satisfactory results from speech technologies, use an appropriate audio format.
ⓘ TIP: Tools like MediaInfo can easily give you technical information about your audio files.
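For uncompressed WAV files, the same basic parameters can also be read programmatically. The following is a minimal sketch using Python's standard `wave` module; the helper name `describe_wav` and the returned dictionary keys are illustrative assumptions, and the approach only works for PCM WAV files, not for compressed formats.

```python
import wave


def describe_wav(path):
    """Return basic technical parameters of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        sample_rate = wav.getframerate()   # samples per second, per channel
        frames = wav.getnframes()          # number of sample frames
    return {
        "channels": channels,
        "bits_per_sample": sample_width * 8,
        "sample_rate_hz": sample_rate,
        "duration_s": frames / sample_rate if sample_rate else 0.0,
        # Raw PCM bitrate: channels x bits per sample x sample rate.
        "bitrate_kbps": channels * sample_width * 8 * sample_rate / 1000,
    }
```

Calling `describe_wav("recording.wav")` on a hypothetical file returns the channel count, bit depth, sample rate, duration and raw bitrate in one dictionary.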
Set your PBX, media server or recording device to one of these formats (in order of preference):
The lossy MP3 format is not preferred. If MP3 really has to be used, the bitrate must be at least 32 kbit/s per channel. Stereo audio must use full stereo encoding, not joint stereo¹.
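The MP3 rule above is simple arithmetic on the total bitrate and the channel count. As a rough sketch, it could be checked like this; the function name `mp3_settings_ok` and its parameters are hypothetical, and the values passed in would have to come from whatever tool reports your file's properties.

```python
MIN_KBPS_PER_CHANNEL = 32  # minimum per-channel bitrate stated in the text


def mp3_settings_ok(total_kbps, channels, joint_stereo=False):
    """Check MP3 settings against the minimum requirements described above."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    # The requirement is per channel, so divide the total bitrate.
    if total_kbps / channels < MIN_KBPS_PER_CHANNEL:
        return False
    # Two-channel audio must use full stereo, not joint stereo.
    if channels == 2 and joint_stereo:
        return False
    return True
```

For example, a 64 kbit/s full-stereo file passes (32 kbit/s per channel), while a 48 kbit/s stereo file does not (24 kbit/s per channel).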
Do not push for the smallest possible audio file sizes in an attempt to squeeze the maximum number of recordings into minimal storage space.
If you really have to use MP3, refrain from using joint-stereo encoding¹ for 2-channel audio; use full stereo instead.
NOTE: If the audio was already heavily compressed, converting it to one of the “okay formats” does NOT restore the information lost during the original compression, so there is no point in trying.
¹ The joint-stereo encoding, which MP3 encoders commonly use by default, is tailored for music, where both channels usually contain almost the same signal. Using joint-stereo encoding for telephony stereo, where each channel contains a completely different signal (when one side speaks, the other side is silent), actually cripples the audio further.
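To illustrate why the channel content matters, the sketch below applies a plain sum/difference (mid/side) transform, which is one of the techniques joint stereo relies on, to two synthetic cases. This is a simplified illustration under stated assumptions, not a model of a real MP3 encoder: the test tone, the 0.95 gain and all names are made up for the example.

```python
import math


def mid_side(left, right):
    """Convert left/right samples to mid (sum) and side (difference) samples."""
    scale = 1 / math.sqrt(2)
    mid = [(l + r) * scale for l, r in zip(left, right)]
    side = [(l - r) * scale for l, r in zip(left, right)]
    return mid, side


def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)


# Synthetic 440 Hz tone sampled at 8 kHz (illustrative signal only).
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]

# Music-like stereo: both channels carry almost the same signal.
music_mid, music_side = mid_side(tone, [0.95 * s for s in tone])

# Telephony stereo: one party speaks, the other channel is silent.
phone_mid, phone_side = mid_side(tone, [0.0] * len(tone))

music_ratio = energy(music_side) / energy(music_mid)
phone_ratio = energy(phone_side) / energy(phone_mid)
print(f"side/mid energy, music-like: {music_ratio:.4f}")
print(f"side/mid energy, telephony:  {phone_ratio:.4f}")
```

In the music-like case the side signal is nearly empty, which is the situation the transform is designed to exploit. In the telephony case the side signal carries as much energy as the mid signal, so that assumption no longer holds.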
Quality of the actual audio content is just as important as the technical quality.
Parasitic sounds like room reverberations, background noise (cars on the street, a dog barking nearby), ambient voices (people talking in the office, a TV playing in the room) or compression artifacts reduce the effectiveness of speech technologies (precision of speaker identification, transcription accuracy, etc.).
Therefore, it is essential that the audio is as clean as possible.
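One rough way to put a number on how clean a recording is, is to compare the level of a segment containing speech with the level of a segment containing only background noise. The sketch below assumes you have already picked those two segments by hand as lists of sample values; the function names are illustrative. Because the speech segment also contains the noise, the result is only an approximation of the true signal-to-noise ratio.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_segment, noise_segment):
    """Approximate SNR in dB from a speech segment and a noise-only segment."""
    speech_rms = rms(speech_segment)
    noise_rms = rms(noise_segment)
    if noise_rms == 0:
        return math.inf   # no measurable noise
    if speech_rms == 0:
        return -math.inf  # no measurable speech
    # Amplitude ratio expressed in decibels.
    return 20 * math.log10(speech_rms / noise_rms)
```

A speech segment whose RMS level is ten times that of the noise segment gives about 20 dB; higher values indicate cleaner audio.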
Capture the sound as close to the source as possible, to minimize ambient sounds and noise, reverberations, and artifacts caused by repeated re-encoding during transfer.
Store the audio in an appropriate format (see above) to avoid distorting the sound with compression artifacts.
In general, the following recording methods or sources negatively affect the sound quality:
These are usually made to capture every possible sound, including those undesired for speech processing – office ambient noise and reverberations, other people talking, TV playing in the background, etc.
Also, do not store the recorded audio in compressed formats. Typically, surveillance cameras, smartphones or bugs tend to use heavily compressed formats by default.