Speech Engine version 3.32 and later includes a new Speech To Text (STT) feature called Preferred phrases.
This article explains what the feature is good for and how it works internally, and gives some tips for practical implementation.
What are preferred phrases
In speech transcription tasks, there may be situations where similar-sounding words get confused, e.g. “WiFi” vs. “HiFi”, “route” vs. “root”, “cell” vs. “sell”, etc.
Normally, the language model part of the Speech To Text does its job here and, in the context of a longer phrase or an entire sentence, prefers the correct word:
|× I’m going to cell my car.|Hmmm, such a sentence does not sound like common English…|
|√ I’m going to sell my car.|…but this one is very common. So this sentence is a clear winner.|
However, there are sentences which still make sense in both cases:
|√ My WiFi stopped working.|This sentence makes sense…|
|√ My HiFi stopped working.|…and this one makes sense too. So which one is the winner?!|
And this is where the preferred phrases feature comes in handy: it allows prompting the speech transcription with phrases or combinations of words that are expected to appear in the audio, which increases the chances of correctly transcribed words and thus the overall transcription accuracy.
The intended application of this feature is mainly voicebots, where the question asked at a given point of the dialog is known, so the possible answers are nicely predictable.
But it can help in other applications too; for example, when transcribing domain-specific audio, the frequently used domain-specific phrases can be boosted.
How preferred phrases work
The picture below shows a simplified schema of the standard speech transcription process: the spectrum of the digitized speech signal is analyzed by the neural network acoustic model (which describes the pronunciations of a given language) and the result goes into a decoder.
The decoder combines the information from the acoustic model with information from the language model recognition network (which describes statistics about word groupings and sentences of a given language) and produces the transcription output.
To get more details about speech transcription, see the Speech To Text article.
When preferred phrases are used, an additional language model is built from the specified phrases and interpolated in real time with the generic language model. The preferred words and phrases are thus favored, while the existing accuracy on common text is retained:
P(word | history) = P_generic(word | history) + α · P_preferred(word | history)
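The effect of the interpolation can be illustrated with a toy single-word model. This is only a sketch of the idea: the numbers are made up, and the real decoder works with n-gram probabilities conditioned on the word history, not with a flat dictionary like this.

```python
# Toy illustration of boosting: interpolate a generic language model
# with a small model built from the preferred phrases list.
# All probabilities below are invented for the example.

alpha = 0.5  # boosting weight (illustrative value only)

# P_generic(word | history) for the ambiguous "My ... stopped working"
p_generic = {"wifi": 0.010, "hifi": 0.009}  # nearly a tie

# P_preferred built only from the phrase list ["WiFi"]
p_preferred = {"wifi": 1.0, "hifi": 0.0}

# P(word | history) = P_generic + alpha * P_preferred
p_boosted = {w: p_generic[w] + alpha * p_preferred.get(w, 0.0)
             for w in p_generic}

best = max(p_boosted, key=p_boosted.get)
print(best)  # the preferred word now wins clearly
```

Without the boost the two hypotheses are nearly tied; after adding the weighted preferred-phrases term, the preferred word dominates while words absent from the phrase list keep their generic probabilities.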
- Single-word phrases are preferred only in single-word utterances… i.e. only in responses like “yes“, “no“, “drink“, “food“, etc., but not inside longer sentences like “I’d like to have a cold drink and some hot food“
- Multi-word phrases are preferred anywhere in an utterance… i.e. phrases like “hot drink” or “good food” would be preferred in utterances like “A hot drink comes in handy on cold winter days“, “I’d like to have a hot drink“, or “I’d do anything for a hot drink, some good food and a warm bed“
- It makes sense to define phrases of at most 5 words, since longer combinations are not supported by the current STT generation
- Preferred phrases must contain only known words, i.e. words “present in the dictionary”. Phrases containing unknown words are logged as warning messages in the SPE log and ignored. You can add the missing words to the language model first (see the STT Language Model Customization tutorial) and then use the preferred phrases with the customized model
- Only the latest, 5th generation of STT supports preferred phrases
- The number of preferred phrases is not limited… but from a practical point of view, the benefit of using thousands of phrases is questionable
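The constraints above can also be checked on the client side before a phrase list is sent to the engine. The sketch below is an illustration only: the `known_words` set is a stand-in for the model dictionary (which in reality lives inside the deployed language model), and the authoritative check is always done by SPE itself, which logs and ignores offending phrases.

```python
MAX_PHRASE_WORDS = 5  # longer phrases are not supported by the current STT generation

def validate_phrases(phrases, known_words):
    """Split a candidate phrase list into accepted and rejected entries,
    mirroring the documented rules: at most 5 words per phrase, and
    every word must be known ("present in the dictionary")."""
    accepted, rejected = [], []
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) > MAX_PHRASE_WORDS:
            rejected.append((phrase, "too long"))
        elif not all(w in known_words for w in words):
            rejected.append((phrase, "unknown word"))
        else:
            accepted.append(phrase)
    return accepted, rejected

# Toy dictionary for the example (assumption; the real dictionary
# comes from the language model in use)
known = {"my", "wifi", "stopped", "working", "hot", "drink"}
ok, bad = validate_phrases(["hot drink", "my wifi", "frobnicate now"], known)
print(ok)   # ['hot drink', 'my wifi']
print(bad)  # [('frobnicate now', 'unknown word')]
```

Pre-filtering like this avoids silently losing phrases to the server-side warn-and-ignore behavior, and makes it obvious which words still need to be added via language model customization.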
And which phrases to put in the list?
The answer is: those which will help the system recognize the right terms or phrases… i.e. those which are expected in the audio but get transcribed incorrectly.
In a voicebot implementation, good candidates are phrases taken from utterances for which the intent detection failed (i.e. the intent was identified incorrectly, or was not detected at all). Ideally, such “intent detection failed” utterances should be analyzed manually: the transcription received by the intent detector should be compared with the source audio recording of the dialog, and the actual problematic part of the utterance should be identified. The problematic phrase can then be included in the list of preferred phrases used to start the transcription in that particular part of the dialog.
Using (relevant parts of) some utterances from the intent detector’s training set as preferred phrases can also be a good starting point.
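The candidate selection described above can be partially automated once the failed-intent utterances have been manually corrected against the audio. The sketch below is one possible approach, not a prescribed workflow: it takes (ASR transcript, human-corrected text) pairs and collects corrected word n-grams that contain at least one word the ASR got wrong, ranked by frequency.

```python
from collections import Counter

def candidate_phrases(corrections, max_words=5):
    """From (asr_transcript, corrected_transcript) pairs taken from
    utterances where intent detection failed, collect corrected
    n-grams (2..max_words words) containing at least one word that
    is missing from the ASR output, ranked by frequency."""
    counts = Counter()
    for asr, corrected in corrections:
        asr_words = set(asr.lower().split())
        words = corrected.lower().split()
        for n in range(2, max_words + 1):
            for i in range(len(words) - n + 1):
                ngram = words[i:i + n]
                # keep only n-grams the ASR actually got wrong
                if any(w not in asr_words for w in ngram):
                    counts[" ".join(ngram)] += 1
    return [phrase for phrase, _ in counts.most_common()]

# Illustrative pairs (invented data)
pairs = [
    ("my hifi stopped working", "my wifi stopped working"),
    ("the hifi is down again", "the wifi is down again"),
]
print(candidate_phrases(pairs)[:3])
```

The resulting list still needs human review (and trimming, since thousands of phrases bring questionable benefit), but it focuses the manual analysis on the n-grams that actually went wrong in production dialogs.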
You may also find this article about fine-tuning the end-of-utterance detection parameters useful.