
Speech to Text (STT)

About STT

Phonexia Speech to Text (STT) converts speech in audio signals into plain text.

The technology works with acoustics as well as a dictionary of words, an acoustic model, and pronunciation. This makes it dependent on the language and the dictionary: only a fixed set of words can be transcribed. The input is an audio file or stream, together with a selection of the language model to be used for transcription. The output is the transcription in one of the supported formats. The technology extracts features from the voice and then, using the acoustic and language models together with pronunciation in a recognition network, creates hypotheses of the transcribed words and "decodes" the most probable transcription. Depending on the requested output type, one or more transcribed texts are returned with a score and time frame.

Application areas:

  • Maintain fast reaction times by routing calls with specific content/topic to human operators
  • Search for specific information in large call archives
  • Data-mine audio content and index it for search
  • Gain additional value from advanced topic/content analysis


Technology overview

  • Trained with emphasis on spontaneous telephony conversation
  • Based on state-of-the-art techniques for acoustic modeling, including discriminative training and neural network-based features
  • Output
    • One-best transcription – i.e. a file with a time-aligned speech transcript (start and end time of each word)
    • Variants for transcriptions – i.e. hypotheses for words at each moment (confusion network) or hypotheses for utterances at each slot (n-best transcription)
  • Processing speed – several versions are available, starting from 8x faster than real time on 1 CPU core (e.g., a standard 8-core server running 8 instances of STT can process 1010 hours of audio in 1 day of computing time, assuming flat load; the exact figure depends on the technology model)
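
The processing-speed figures can be checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes one STT instance per core under a perfectly flat load. Under that ideal assumption, 8 cores at 8x real time give an upper bound higher than the quoted 1010 hours, which is consistent with the note that the actual figure depends on the technology model.

```python
def audio_hours_per_day(cores: int, speed_factor: float, wall_hours: float = 24.0) -> float:
    """Ideal upper bound: every core runs one instance at full speed all day."""
    return cores * speed_factor * wall_hours


def effective_speed_factor(audio_hours: float, cores: int, wall_hours: float = 24.0) -> float:
    """Per-core speed factor implied by an observed daily throughput."""
    return audio_hours / (cores * wall_hours)


upper_bound = audio_hours_per_day(cores=8, speed_factor=8.0)
implied = effective_speed_factor(audio_hours=1010, cores=8)

print(upper_bound)        # 1536.0 hours of audio per day (ideal case)
print(round(implied, 2))  # 5.26x real time per core for the quoted 1010 hours
```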

Supported languages:

Acoustic models

An acoustic model is created by training on training data. It captures the voice characteristics of the set of speakers in the training set.

Acoustic models can be created for different languages, such as Czech, English, or French, or for separate dialects, such as Gulf Arabic or Levant Arabic. From the technology's point of view, the difference between languages is the same as the difference between dialects: each model is best suited to people who speak the same way as the speakers in its training data.

As an example for English the following acoustic models can be trained:

  • US English – to be used with US speakers
  • British English – to be used with UK speakers

Language models

A language model contains a list of words. This is a limitation of the technology, as only words from this list can appear in the transcription.

Together with the list of words, the model also contains n-grams of words. N-grams are used during decoding to make decisions: the technology takes into account the word sequences learned during training to "decide" which of the possible transcriptions is most probable.

Different language models can be used with the same acoustic model. They can include different words and different weights for n-grams. This lets the user adjust a language model to a specific domain to get better results.
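
To illustrate how n-grams help choose between competing transcriptions, here is a minimal sketch of a bigram model with add-one smoothing. It is not Phonexia's implementation: the tiny corpus, the function names, and the smoothing choice are all assumptions made for the example. Both hypotheses use only words from the word list, and the one whose word sequences were seen more often in training scores higher.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams


def log_score(sentence, contexts, bigrams, vocab_size, alpha=1.0):
    """Log-probability of a sentence under an add-alpha smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (contexts[prev] + alpha * vocab_size)
        total += math.log(p)
    return total


# Toy training text (illustrative only).
corpus = [
    "i want to check my balance",
    "please check my account balance",
    "i lost my cheque book",
]
vocabulary = {w for s in corpus for w in s.split()} | {"</s>"}
contexts, bigrams = train_bigram_counts(corpus)

# Two acoustically similar hypotheses; both contain only in-vocabulary words.
hyp_a = "check my balance"
hyp_b = "cheque my balance"
score_a = log_score(hyp_a, contexts, bigrams, len(vocabulary))
score_b = log_score(hyp_b, contexts, bigrams, len(vocabulary))
best = hyp_a if score_a > score_b else hyp_b
print(best)  # check my balance
```

A domain-specific language model changes these counts, which is why the same audio can be transcribed differently with different language models.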

Result types

During the process of transcribing the speech there are always several alternatives for a given speech segment. The technology can provide one or more results.

The 1-best result type provides only the result with the highest score. Speech is returned in segments that each contain exactly one word. Every segment provides its start and end time, the transcribed word, and a score.

The n-best result type provides several alternatives for sentences or larger segments of speech, each with its score. It can be useful for analytics programs that can take multiple inputs and work with them. It also helps when a speaker does not pronounce a word correctly, or when the result the technology scores highest does not match what was really said.

The confusion network result type provides output similar to n-best, except that alternatives are returned word by word. A confusion network is used in the same way as n-best.
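
The three result types can be pictured as simple data structures. The sketch below is a hypothetical model, not the actual Phonexia output format: the class and field names, the use of seconds for times, and the 0 to 1 score scale are assumptions, and the scores are invented for illustration. It also shows that picking the top-scoring word in each confusion network slot yields a 1-best style result.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One word segment, as in the 1-best result (times assumed in seconds)."""
    start: float
    end: float
    text: str
    score: float


@dataclass
class Hypothesis:
    """One n-best alternative for a sentence or larger segment."""
    start: float
    end: float
    text: str
    score: float


@dataclass
class Slot:
    """One confusion network slot: competing words for the same time span."""
    start: float
    end: float
    alternatives: List[Word]


def one_best_from_confusion_network(slots: List[Slot]) -> List[Word]:
    """Keep only the highest-scoring word in every slot."""
    return [max(slot.alternatives, key=lambda w: w.score) for slot in slots]


# n-best: whole-utterance alternatives, each with its own score.
n_best = [
    Hypothesis(0.00, 1.10, "check my balance", 0.71),
    Hypothesis(0.00, 1.10, "cheque my balance", 0.17),
]

# Confusion network: alternatives word by word.
network = [
    Slot(0.00, 0.42, [Word(0.00, 0.42, "check", 0.81), Word(0.00, 0.42, "cheque", 0.19)]),
    Slot(0.42, 0.60, [Word(0.42, 0.60, "my", 0.97), Word(0.42, 0.60, "by", 0.03)]),
    Slot(0.60, 1.10, [Word(0.60, 1.10, "balance", 0.90), Word(0.60, 1.10, "ballots", 0.10)]),
]

one_best = one_best_from_confusion_network(network)
print(" ".join(w.text for w in one_best))  # check my balance
```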

Training of new models

To create a new STT model, about 100 hours of annotated speech (actual speech, not total recording length) are needed. The speech source needs to match the intended usage of the created model. Phonexia specializes in telephony speech (landline, GSM, VoIP).

When creating the language model, the annotations of the training recordings are not the only input; additional text data are used to make the model more robust. The best data resemble the intended usage of the resulting model, which is usually spontaneous speech. However, because it is difficult to obtain enough data of this type, other sources are also used.


The technology can be adapted at two levels: the Acoustic Model or the Language Model.

Adapting the Acoustic Model to speakers from a specific region, or to speakers using a specific dialect, actually means creating a new acoustic model. If there are not enough data to train a completely new model, the available new data can be added to the data used to train the existing model. Based on these two data sets, a new, more robust model can be created. However, this training requires a lot of time.

The Language Model can be adapted more easily. This is especially useful when some words are missing from the language model (such words can never show up in the transcription), for example words from a specific business domain, or when unwanted words appear in the output, e.g., informal spellings or offensive words in place of similar-sounding correct words.

The language model can be adapted in two ways. The first, faster but less precise, is to adapt individual words. The second is to provide texts corresponding to 20 hours of audio from the target domain that contain all the words to be added to the language model. The second option also provides information about the usual contexts in which the words occur.
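
Before adapting a language model, it can help to find which domain words are missing from its word list. The following is a minimal sketch under stated assumptions: the word list is available as a plain Python set, the tokenization is a simple regular expression, and the function name is hypothetical. It does not use any Phonexia tool or API.

```python
import re
from collections import Counter


def out_of_vocabulary(domain_text: str, vocabulary: set) -> Counter:
    """Count words in the domain text that the language model cannot output."""
    # Letters only (any alphabet), optionally with an inner apostrophe.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", domain_text.lower())
    return Counter(t for t in tokens if t not in vocabulary)


# Toy word list and domain text (illustrative only).
vocabulary = {"please", "transfer", "the", "to", "my", "account"}
text = "Please transfer the overdraft to my escrow account. The escrow is new."

missing = out_of_vocabulary(text, vocabulary)
print(missing.most_common(1))  # [('escrow', 2)]
```

Frequent missing words are natural candidates for the word-level adaptation, while the surrounding text is the kind of material the second, context-aware option relies on.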

Since the 5th generation of STT, we have provided a tool that allows customers to customize language models by adding words specific to their domain or use case.


To measure the accuracy of Phonexia Speech to Text, the following points should be taken into account:

  • Reason for the accuracy measurement
    What is the business value of measuring the accuracy? What type of output will be used in the use case for which accuracy is measured? It may be that only nouns, verbs, and adjectives matter for machine understanding of the speech context, or that all words matter because the output text is intended for human processing.
  • Data quality
    Every metric requires comparing the automatic transcription to some baseline transcription, usually in the form of annotated data. The quality of these annotated data is crucial as it can impact the result of the measurement. Annotation of data might be done differently by different companies, for example:

    1. Half-automated annotation – auto-transcription checked by human annotators
    2. Annotation by two individual people

    Also, every annotation is done with some aim. It might be done for training a new model, or only for measuring quality.

  • Data processing before measurement
    The output of speech transcription, before any NLP post-processing is applied, contains the pure transcription of the recording. The annotation, on the other hand, can contain symbols that a speech transcription system will never output.

    1. Numbers: the transcription always contains "thirteen" rather than "13", which can occur in the annotation
    2. Parentheses: the transcription contains the word "parentheses", the annotation contains "( )"
    3. Characters of national alphabets: the transcription uses only a limited alphabet, the annotation can contain "ěščřžů,…"

    The annotation data need to be processed so that they contain only characters allowed in the transcription; otherwise the measured quality is distorted.
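
The effect of annotation preprocessing can be sketched as follows. The example uses word error rate (WER), a common metric that this text does not name, so treat it as one possible choice. The number mapping is a toy table, dropping parentheses and stripping diacritics are assumptions about what the target model's alphabet allows, and all function names are hypothetical.

```python
import re
import unicodedata

# Toy mapping; a real setup needs a full number-to-words converter.
NUMBER_WORDS = {"13": "thirteen"}


def normalize_annotation(text: str, strip_diacritics: bool = True) -> list:
    """Bring an annotation closer to what the transcription can contain."""
    text = text.lower()
    # Spell out known numbers; unknown ones are left as digits.
    text = re.sub(r"\d+", lambda m: NUMBER_WORDS.get(m.group(), m.group()), text)
    if strip_diacritics:
        # Assumption: the model's alphabet has no accented letters.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop parentheses and other punctuation, keep apostrophes.
    text = re.sub(r"[^\w\s']|_", " ", text)
    return text.split()


def word_error_rate(reference: list, hypothesis: list) -> float:
    """Word-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / len(reference)


annotation = "He paid 13 euros (cash)."
transcription = "he paid thirteen euro cash".split()

raw_wer = word_error_rate(annotation.split(), transcription)
clean_wer = word_error_rate(normalize_annotation(annotation), transcription)
print(raw_wer, clean_wer)  # 0.8 0.2
```

Without preprocessing, four of the five reference words count as errors even though only one word was actually misrecognized.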
