Understand SPE benchmark
The SPE benchmark feature is a great tool for quick and simple evaluation of processing speed directly on your hardware and using your audio files: simply call the .../benchmark endpoint corresponding to the technology you want to benchmark and wait for the result. The benchmark result summarizes the length of the processed speech, the processing time, and the resulting Faster-than-Realtime (FtRT) processing speed.
You can run this benchmark on machines with different CPUs to compare the performance of various Phonexia technologies on them, e.g. to see the difference between Intel processors (for which our technologies are optimized) and AMD processors. You can use the benchmark to check whether a planned HW upgrade will get you the expected performance gain. You can also use it to compare the performance of a new SPE version, a new technology generation, or a different technology model on the same HW configuration, and so on.
Running benchmark
The benchmark can be run in two ways:
- by calling the .../benchmark endpoint as documented
- by calling the .../benchmark endpoint as documented, with an additional path parameter
The first option uses the set of audio files supplied with SPE in the {SPE}/data/benchmark directory.
The second option uses a single audio file of your choice, uploaded to SPE storage and specified by the path parameter.
The set of audio files supplied with SPE contains recordings of various lengths (from 30 seconds to 5 minutes) and with various speech/non-speech ratios. This accounts for the fact that both the length of the audio and the amount of actual speech in it affect the processing speed, because the non-speech parts are stripped from the audio before processing. The processing speed is then calculated as follows:
FtRT = sum_of_speech_lengths_in_all_recordings ÷ sum_of_processing_times_of_all_recordings
When using the option with your own specified file, only that single recording is used. To account for various audio lengths and speech/non-speech ratios, it is therefore recommended to run the benchmark with multiple different audio files and calculate the average FtRT processing speed yourself.
Alternatively, you can tune (or hack) SPE and prepare your own set of benchmarking recordings, or replace the default one; see further below.
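The FtRT formula above, applied to results collected from several single-file benchmark runs, can be sketched as follows. This is only an illustration: the BenchmarkRun type, the function names, and the numbers are hypothetical, and the values would have to be copied by hand from the benchmark results (length of processed speech and processing time). The sketch shows both the sum-based ratio, which mirrors the formula used for the default set, and a plain mean of per-file FtRT values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BenchmarkRun:
    """One single-file benchmark result (hypothetical container)."""
    speech_seconds: float      # length of processed speech
    processing_seconds: float  # processing time


def aggregate_ftrt(runs: Sequence[BenchmarkRun]) -> float:
    """Sum of speech lengths divided by sum of processing times."""
    total_speech = sum(r.speech_seconds for r in runs)
    total_processing = sum(r.processing_seconds for r in runs)
    if total_processing <= 0:
        raise ValueError("total processing time must be positive")
    return total_speech / total_processing


def mean_ftrt(runs: Sequence[BenchmarkRun]) -> float:
    """Unweighted mean of the per-file FtRT values."""
    if not runs:
        raise ValueError("at least one run is required")
    return sum(r.speech_seconds / r.processing_seconds for r in runs) / len(runs)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    runs = [BenchmarkRun(20.0, 0.5), BenchmarkRun(100.0, 2.0)]
    print(aggregate_ftrt(runs))  # 120 s of speech / 2.5 s of processing
    print(mean_ftrt(runs))       # mean of 40x and 50x
```

The two numbers differ whenever the recordings differ in length: the sum-based ratio gives longer recordings more weight, while the plain mean treats every file equally.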
Benchmark recordings sets
The default sets of audio files supplied with SPE are as follows (the version number 1.0 is present only for historical reasons and is ignored):
benchmark
└── 1.0
    ├── default
    │   ├── 030.wav
    │   ├── 060.wav
    │   ├── 090.wav
    │   ├── 120.wav
    │   ├── 150.wav
    │   ├── 180.wav
    │   ├── 210.wav
    │   ├── 240.wav
    │   ├── 270.wav
    │   └── 300.wav
    └── czech
        ├── 030.wav
        ├── 060.wav
        ├── 090.wav
        ├── 120.wav
        ├── 150.wav
        ├── 180.wav
        ├── 210.wav
        ├── 240.wav
        ├── 270.wav
        └── 300.wav
For the majority of technologies, the content of the default directory is used for benchmarking.
Benchmarking of the language-specific technologies, STT (Speech To Text) and PHNREC (Phoneme Recognizer), first tries to find a directory whose name matches the start of the benchmarked model name. If such a directory is found, audio files from that directory are used (expecting that the audio contains speech in the corresponding language). If not, benchmarking falls back to the default directory.
The reason for language-specific data is that processing audio in a different language than the one for which the model was trained negatively affects the processing speed (basically, the processing ‘slides’ through the file very quickly as it cannot find anything known in it).
The czech directory is included as an example (although the STT/PHNREC models were renamed some time ago to cs_cz*, so the name czech no longer actually matches).
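The directory selection described in this section can be sketched as follows. This is not SPE's actual implementation, only a minimal illustration of the rule as stated: a directory is used when its name matches the start of the model name, otherwise default is used. The function name and parameters are hypothetical, and two details are assumptions because the text does not specify them: matching is treated as case-sensitive, and the longest matching directory name wins if several match.

```python
from typing import Iterable


def pick_benchmark_dir(model_name: str,
                       available_dirs: Iterable[str],
                       language_specific: bool) -> str:
    """Return the name of the recordings directory to benchmark with.

    language_specific should be True for STT and PHNREC models only.
    """
    if language_specific:
        # A directory matches when its name is a prefix of the model name.
        matches = [d for d in available_dirs
                   if d != "default" and model_name.startswith(d)]
        if matches:
            # Assumption: prefer the most specific (longest) match.
            return max(matches, key=len)
    # All other technologies, and unmatched models, use the default set.
    return "default"


if __name__ == "__main__":
    dirs = ["default", "czech"]
    # The shipped czech directory no longer matches the renamed models.
    print(pick_benchmark_dir("cs_cz_6", dirs, language_specific=True))
    # A directory named cs_cz would match.
    print(pick_benchmark_dir("cs_cz_6", dirs + ["cs_cz"], language_specific=True))
```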
Tuning the recordings sets
You can tune (or hack, if you want) the sets provided with SPE by
- replacing the content of the default directory with your own audio files
- creating a directory named according to the name-matching rule (see above) and putting audio files in the corresponding language into it
For example:
- a directory named es would be matched for the es_6 and es_es_5 models, but not for the old spanish_american model
- a directory named cs_cz_fin would be matched only for the old cs_cz_fin model, but not for the new cs_cz_5 or cs_cz_6 models
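Preparing such a directory can be scripted. The sketch below is a hypothetical helper, not part of SPE: it assumes the layout shown earlier ({SPE}/data/benchmark/1.0/<directory name>) and simply copies your audio files there. Whether SPE needs a restart to pick up the new files, and which audio formats it accepts, is not covered here.

```python
import shutil
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def install_benchmark_set(spe_root: PathLike,
                          set_name: str,
                          audio_files: Iterable[PathLike],
                          clear: bool = False) -> Path:
    """Copy audio files into {spe_root}/data/benchmark/1.0/{set_name}.

    set_name is "default" to replace the default set, or a prefix of the
    target model name (for example "es") for a language-specific set.
    With clear=True, existing .wav files in the target are removed first.
    """
    target = Path(spe_root) / "data" / "benchmark" / "1.0" / set_name
    target.mkdir(parents=True, exist_ok=True)
    if clear:
        for old in target.glob("*.wav"):
            old.unlink()
    for src in audio_files:
        src = Path(src)
        shutil.copy2(src, target / src.name)
    return target
```

For example, install_benchmark_set("/opt/spe", "es", ["030.wav", "060.wav"]) would create a set intended for Spanish models, where /opt/spe stands in for your own SPE installation directory.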
By carefully preparing the structure of directories and audio files, you can create an effective way to quickly get a basic picture of the speech technologies' performance on different hardware configurations.
CREDITS:
Huge thanks to Vojta for the exhaustive info! The article author basically just translated it ;-)…