
Understand SPE benchmark

The SPE benchmark feature is a great tool for quick and simple evaluation of processing speed directly on your hardware, optionally with your own audio files: simply call the .../benchmark endpoint corresponding to the technology you want to benchmark and wait for the result. The benchmark result summarizes the length of the processed speech, the processing time, and the resulting Faster-than-Realtime (FtRT) processing speed.

You can run this benchmark on machines with different CPUs to compare the performance of various Phonexia technologies on them, e.g. to see the difference between Intel processors (for which our technologies are optimized) and AMD processors. You can use it to check whether a planned HW upgrade will deliver the expected performance gain. You can also use it to compare the performance of a new SPE version, a new technology generation, or a different technology model on the same HW configuration, and so on.

Running benchmark

The benchmark can be run in two ways:

  • by calling the .../benchmark endpoint as documented
  • by calling the .../benchmark endpoint as documented, with the additional path parameter

The first option uses the set of audio files supplied with SPE in the {SPE}/data/benchmark directory.
The second option uses a single audio file of your choice, uploaded to the SPE storage and specified by the path parameter.

The set of audio files supplied with SPE contains recordings of various lengths (from 30 seconds to 5 minutes) and with various speech/non-speech ratios. This accounts for the fact that both the length of the audio and the amount of actual speech in it affect the processing speed, because the non-speech parts are stripped from the audio before processing. The processing speed is then calculated as follows:
FtRT = sum_of_speech_lengths_in_all_recordings ÷ sum_of_processing_times_of_all_recordings
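
As a worked illustration of the formula above, here is a minimal Python sketch. The function name and the sample numbers are made up for illustration; they are not SPE output and not part of the SPE API. The only thing taken from the article is the formula itself: total speech length divided by total processing time.

```python
def ftrt(speech_seconds, processing_seconds):
    """Faster-than-Realtime speed: total speech length / total processing time."""
    total_speech = sum(speech_seconds)
    total_processing = sum(processing_seconds)
    if total_processing <= 0:
        raise ValueError("total processing time must be positive")
    return total_speech / total_processing


# Hypothetical numbers for three recordings (seconds of speech, seconds of processing).
speech = [20.0, 45.0, 80.0]
processing = [0.5, 1.0, 1.5]

# 145 s of speech processed in 3 s, i.e. roughly 48x faster than real time.
print(f"FtRT = {ftrt(speech, processing):.1f}")
```

Note that the numerator is the length of speech, not the length of the whole recording, which matches the fact that non-speech parts are stripped before processing.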

When using the option with your own file, only that single recording is used. To account for various audio lengths and speech/non-speech ratios, it is recommended to run the benchmark with multiple different audio files and calculate the average FtRT processing speed yourself.
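
If you collect results from several single-file runs, there are two reasonable ways to combine them, and they generally give different numbers. The sketch below is only an illustration: the record layout (speech length, processing time) and the sample values are assumptions, not the actual format of the benchmark response. The pooled value follows the same sum-over-sum formula the built-in set uses, while the plain mean treats every file equally regardless of its length.

```python
from statistics import mean


def summarize(runs):
    """Combine per-file benchmark results.

    `runs` is a list of (speech_seconds, processing_seconds) pairs,
    one pair per single-file benchmark run (hypothetical layout).
    """
    per_file = [speech / processing for speech, processing in runs]
    pooled = sum(s for s, _ in runs) / sum(p for _, p in runs)
    return {
        "mean_of_per_file_ftrt": mean(per_file),  # every file weighs the same
        "pooled_ftrt": pooled,                    # sum of speech / sum of time
        "min_ftrt": min(per_file),
        "max_ftrt": max(per_file),
    }


# Made-up results for a short and a long recording.
results = summarize([(30.0, 1.0), (240.0, 4.0)])
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

With these made-up values the short file runs at 30x and the long one at 60x; the plain mean is 45x, while the pooled figure is 54x because the long recording dominates the totals.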

Alternatively, you can tune (or hack) SPE and prepare your own set of benchmarking recordings, or replace the default one – see further below…

Benchmark recordings sets

The default sets of audio files supplied with SPE are as follows (the version number 1.0 is present only for historical reasons and is ignored):

└── 1.0
    ├── default
    │   ├── 030.wav
    │   ├── 060.wav
    │   ├── 090.wav
    │   ├── 120.wav
    │   ├── 150.wav
    │   ├── 180.wav
    │   ├── 210.wav
    │   ├── 240.wav
    │   ├── 270.wav
    │   └── 300.wav
    └── czech
        ├── 030.wav
        ├── 060.wav
        ├── 090.wav
        ├── 120.wav
        ├── 150.wav
        ├── 180.wav
        ├── 210.wav
        ├── 240.wav
        ├── 270.wav
        └── 300.wav

For the majority of technologies, the content of the default directory is used for benchmarking.

Benchmarking of the language-specific technologies – STT (Speech To Text) and PHNREC (Phoneme Recognizer) – first tries to find a directory whose name matches the start of the benchmarked model name. If such a directory is found, the audio files from that directory are used (the audio is expected to contain speech in the corresponding language). If not, the benchmark falls back to the default directory.
The reason for language-specific data is that processing audio in a language other than the one the model was trained for distorts the measured processing speed (basically, the processing ‘slides’ through the file very quickly, as it cannot find anything known in it).

The czech directory is included as an example (well… the STT/PHNREC models were renamed some time ago to cs_cz*, so the name czech does not actually match anymore…).

Tuning the recordings sets

You can tune – or hack, if you want – the sets provided with SPE by

  • replacing the content of the default directory with your own audio files
  • creating a directory named according to the name-matching rule (see above) and putting audio files in the corresponding language into it
    For example:

    • a directory named es would be matched for the es_6 and es_es_5 models, but not for the old spanish_american model
    • a directory named cs_cz_fin would be matched only for the old cs_cz_fin model, but not for the new cs_cz_5 or cs_cz_6 models
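
The name-matching rule can be sketched in a few lines of Python. This is an illustration of the rule as described in this article, not SPE's actual implementation: the function name is hypothetical, the comparison is assumed to be case-sensitive, and picking the longest matching directory when several match is purely an assumption (the article does not say what SPE does in that case).

```python
def pick_benchmark_dir(model_name, available_dirs, fallback="default"):
    """Return the recordings directory a language-specific model would use.

    A directory matches when its name equals the start of the model name.
    If nothing matches, the fallback directory is used.
    """
    candidates = [
        d for d in available_dirs
        if d != fallback and model_name.startswith(d)
    ]
    if not candidates:
        return fallback
    # Assumption: prefer the most specific (longest) matching name.
    return max(candidates, key=len)


dirs = ["default", "czech", "es", "cs_cz_fin"]
for model in ["es_6", "es_es_5", "spanish_american",
              "cs_cz_fin", "cs_cz_5", "cs_cz_6"]:
    print(f"{model} -> {pick_benchmark_dir(model, dirs)}")
```

Running it reproduces the examples above: es_6 and es_es_5 pick es, cs_cz_fin picks cs_cz_fin, and spanish_american, cs_cz_5 and cs_cz_6 all fall back to default (the czech directory matches none of them).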

By carefully preparing the directory and audio file structure, you can quickly get a basic picture of how the speech technologies perform on different hardware configurations.


Huge thanks to Vojta for the exhaustive info! The article author basically just translated it ;-)…

