The supported audio formats are:
- WAVE (*.wav) container including any of:
  - unsigned 8-bit PCM (u8)
  - signed 16-bit PCM (s16le)
  - IEEE float 32-bit (f32le)
  - A-law (alaw)
  - µ-law (mulaw)
- FLAC codec inside FLAC (*.flac) container
- OPUS codec inside OGG (*.opus) container
Other audio formats must be converted using external tools. The SPE server can be configured to convert audio automatically in the background; see the corresponding SPE configuration settings.
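To check whether a WAV file already uses one of the encodings listed above, the `fmt ` chunk of the file can be inspected before upload. The following is a minimal sketch, not part of SPE: the function names are made up for illustration, the format tags (1 = PCM, 3 = IEEE float, 6 = A-law, 7 = µ-law) come from the RIFF/WAVE format, and files using the extensible format tag are simply reported as unsupported here.

```python
import struct

# (format tag, bits per sample) pairs matching the WAVE encodings listed above.
# Format tags: 1 = PCM, 3 = IEEE float, 6 = A-law, 7 = mu-law.
SUPPORTED_WAVE_ENCODINGS = {(1, 8), (1, 16), (3, 32), (6, 8), (7, 8)}


def read_wave_encoding(path):
    """Return (format_tag, bits_per_sample) read from the 'fmt ' chunk."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                fmt = f.read(chunk_size)
                if len(fmt) < 16:
                    raise ValueError("truncated 'fmt ' chunk")
                # Layout: tag, channels, sample rate, byte rate, block align, bits.
                fields = struct.unpack("<HHIIHH", fmt[:16])
                return fields[0], fields[5]
            # Skip other chunks; chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)


def is_supported_wave(path):
    """True if the file's encoding is one of the WAVE encodings listed above."""
    return read_wave_encoding(path) in SUPPORTED_WAVE_ENCODINGS
```

A file for which `is_supported_wave` returns `False` is a candidate for conversion with an external tool.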
Good tools for converting unsupported formats to supported ones are FFmpeg (http://www.ffmpeg.org) and SoX (http://sox.sourceforge.net/). Both are multiplatform tools available for MS Windows, Linux and Apple OS X. Example of usage:
ffmpeg -i <source_audio_file_name> <output_audio_base_name>.wav
This converts an audio file in any format or codec supported by FFmpeg to WAV with 16-bit little-endian PCM, which is the default. For more parameters, please check the manual pages.
sox <source_audio_file_name> -b 16 <output_audio_base_name>.wav
The number of bits must be specified with the -b parameter.
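When many files need converting, the FFmpeg call shown above can be wrapped in a small script. The sketch below is only an illustration: the function names and the directory layout are assumptions, and it requests 16-bit PCM explicitly with `-c:a pcm_s16le` and overwrites existing output with `-y`. Files with a natively supported extension are skipped.

```python
import shutil
import subprocess
from pathlib import Path

# Extensions of the natively supported containers; these are not converted.
NATIVE_EXTENSIONS = {".wav", ".flac", ".opus"}


def build_ffmpeg_command(source, output_dir):
    """Build the ffmpeg argument list that converts one file to 16-bit PCM WAV."""
    source = Path(source)
    target = Path(output_dir) / (source.stem + ".wav")
    return ["ffmpeg", "-y", "-i", str(source), "-c:a", "pcm_s16le", str(target)]


def convert_all(source_dir, output_dir):
    """Convert every non-native audio file in source_dir into output_dir."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(source_dir).iterdir()):
        if source.is_file() and source.suffix.lower() not in NATIVE_EXTENSIONS:
            # check=True raises CalledProcessError if ffmpeg reports a failure.
            subprocess.run(build_ffmpeg_command(source, output_dir), check=True)
```

Note that a .wav file holding an unsupported codec is skipped by this sketch and would still need to be converted separately.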
The Phonexia Browser application may return the error “1007: Unsupported audio format” while uploading an audio file. Please check that your audio files are in one of the formats listed under “Q: What are the supported audio formats?”.
If you need to use audio recordings in other formats as input, you can configure SPE to convert audio automatically. As a prerequisite, install an external tool for audio conversion. We recommend the ffmpeg utility, which is powerful and well documented. You can find the package for your distribution at http://ffmpeg.org
Then continue as described below:
Using Phonexia Browser with embedded SPE
Open the Browser configuration dialog by clicking the “Settings” button located in the tool ribbon. Select the “Speech Engine” tab and configure SPE as described in the documentation. Don’t forget to select the “Enable audio converter” checkbox.
Using SPE as service/daemon
Open settings\phxspe.properties in a standard text editor. Then change the following line in “phxspe.properties” to enable background conversion:
audio_converter.enabled = false # change it to 'true'
Please check that the conversion tool configured below this line in phxspe.properties is set up properly. Here is an example configuration for FFmpeg:
# Set converter command
# %1 is for input file
# %2 is for output file
# ffmpeg example:
audio_converter.command = ffmpeg -loglevel warning -y -i %1 %2
# sox example:
# audio_converter.command = sox %1 %2
Important note: By design, and to save computing resources, the audio converter is not used if the input file name ends with the .wav extension. In that case, you must pre-process the audio recording yourself before uploading it to Phonexia SPE or using it in Phonexia Browser.
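The change to phxspe.properties described above can also be scripted, for example when preparing several installations. The sketch below is an assumption-laden illustration: the function name is made up, and it assumes that the file contains an uncommented `audio_converter.enabled = ...` line, as in the example above. Any trailing comment on that line is dropped.

```python
import re

# Matches an uncommented "audio_converter.enabled = <value>" line,
# including any trailing comment on the same line.
_ENABLED_LINE = re.compile(
    r"^([ \t]*audio_converter\.enabled[ \t]*=[ \t]*)\S+.*$",
    re.MULTILINE,
)


def enable_audio_converter(properties_text):
    """Return the properties text with audio_converter.enabled set to true."""
    new_text, count = _ENABLED_LINE.subn(r"\g<1>true", properties_text, count=1)
    if count == 0:
        raise ValueError("audio_converter.enabled directive not found")
    return new_text
```

To apply it, read the contents of phxspe.properties, pass them through `enable_audio_converter`, and write the result back after making a backup copy of the file.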
It depends on the technology.
Phonexia Language Identification (LID) is pre-trained for 60+ languages.
Phonexia Keyword Spotting (KWS) and Phonexia Speech Transcription (STT) are available for 20+ languages, including English, French, German, Russian, Spanish and many more.
A: Please check the SPE subdirectory ./settings for configuration files.
- If only phxspe.browser.properties exists, then your Browser uses SPE as an embedded component and sets this directive inside the file:
server.enable_authentication_token = false
In that case you can still use SPE with Basic HTTP authentication, as described in the documentation, section “Basic authentication”.
- If you would like to play with a “pure” daemon installation, then the phxspe.properties file should exist in the ./settings subdirectory. The phxspe.properties file is created by the phxadmin utility, or it can be created from the ./data/phxspe.properties.default template file:
- Copy template file to ./settings directory
- Rename it to phxspe.properties
- Check the server.enable_authentication_token directive and set it as needed.
Basic installation steps are described in the ./doc/INSTALL.html document.
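The check described above can be sketched as a short script that looks into the ./settings directory and reports which configuration file is present, together with the value of `server.enable_authentication_token`. This is an illustrative sketch only: the function names and the returned labels are made up, and it assumes that `#` starts a comment in the properties files.

```python
from pathlib import Path

TOKEN_DIRECTIVE = "server.enable_authentication_token"


def read_directive(properties_path, name):
    """Return the value of a 'name = value' directive, or None if absent."""
    text = Path(properties_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        # Assumption: '#' starts a comment, as in the configuration examples.
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == name:
                return value.strip()
    return None


def describe_spe_setup(settings_dir):
    """Return (mode, token_setting) for the given ./settings directory."""
    settings = Path(settings_dir)
    daemon_file = settings / "phxspe.properties"
    browser_file = settings / "phxspe.browser.properties"
    if daemon_file.exists():
        return "daemon", read_directive(daemon_file, TOKEN_DIRECTIVE)
    if browser_file.exists():
        return "embedded", read_directive(browser_file, TOKEN_DIRECTIVE)
    return "unconfigured", None
```

For example, `describe_spe_setup("./settings")` returning `("embedded", "false")` corresponds to the first case above, where only Basic HTTP authentication is available.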
To evaluate Phonexia Speaker Identification technology in a real-life scenario, the system needs to be calibrated with an SID dataset.
SID dataset (minimum requirements):
- 500 speakers
- >5 individual recordings per speaker*
- >30s per recording (>20s speech on each recording)
- speaker labels
- 1 speaker per channel
- phone or mobile phone source
- spontaneous dialogue (better than scripted or read text)
- wav, opus, flac audio format – for best results use only the natively supported audio formats – see list of supported audio formats
- diversity of age, gender, time of the day
*Note: Splitting a single recording into multiple shorter recordings in order to reach at least 5 recordings per speaker is not the right way to proceed. It adds no new information: you are essentially analyzing the details of a single recording five times. In contrast, 5 unique recordings coming from different audio environments, or even from different times of the day, provide additional details to analyze and lead to better results.
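Before calibration, the measurable requirements above can be checked automatically. The sketch below assumes that you already have, for each recording, a speaker label, the total duration and the amount of speech in seconds; how those values are obtained is outside its scope. The function name is made up, and the thresholds are treated as inclusive minimums, which is an assumption.

```python
from collections import defaultdict

# Minimum requirements taken from the list above (treated as inclusive).
MIN_SPEAKERS = 500
MIN_RECORDINGS_PER_SPEAKER = 5
MIN_RECORDING_SECONDS = 30.0
MIN_SPEECH_SECONDS = 20.0


def check_sid_dataset(recordings):
    """Check (speaker_label, total_seconds, speech_seconds) tuples.

    Returns a list of human-readable problems; an empty list means the
    measurable minimum requirements are met.
    """
    issues = []
    recordings_per_speaker = defaultdict(int)
    for index, (speaker, total_seconds, speech_seconds) in enumerate(recordings):
        recordings_per_speaker[speaker] += 1
        if total_seconds < MIN_RECORDING_SECONDS:
            issues.append(f"recording {index} ({speaker}): shorter than 30 s")
        if speech_seconds < MIN_SPEECH_SECONDS:
            issues.append(f"recording {index} ({speaker}): less than 20 s of speech")
    if len(recordings_per_speaker) < MIN_SPEAKERS:
        issues.append(f"only {len(recordings_per_speaker)} speakers, 500 required")
    for speaker, count in sorted(recordings_per_speaker.items()):
        if count < MIN_RECORDINGS_PER_SPEAKER:
            issues.append(f"speaker {speaker}: only {count} recordings")
    return issues
```

The check cannot detect recordings that were produced by splitting one longer recording, so the note above still has to be verified by hand.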
- Open a terminal in the folder where PhxBrowser.exe is located (hold Shift, right-click in free space in Windows Explorer and select “Open command window here”)
- Run PhxBrowser software with command:
PhxBrowser.exe /spe-debug /spe-output
- PhxBrowser software will start with “SPE output” tab which shows the debug output of SPE
- Run PhxBrowser software in terminal with command:
./PhxBrowser --spe-debug --spe-output
- PhxBrowser software will start with “SPE output” tab which shows the debug output of SPE
A: The score threshold isn’t set up correctly. Adjust the speaker score sharpness value to calibrate the recalculation.
Please see Calibration in technology documentation.
A: These abbreviations mean the following:
- LR – likelihood ratio, the result of a statistical test comparing two models. It expresses how many times more likely the data are under one model than under the other. LR takes values in the interval <0;+inf).
- LLR – log-likelihood ratio, the logarithm of LR. LLR takes values in the interval (-inf;+inf).
- Percentage (normalised) score – a commonly used mathematical transformation of the LLR to a percentage. It is easier for humans to read, but may raise doubts if the LLR values are too high (typically in non-adapted installations). It takes values in the interval <0;100> in % (or sometimes <0;1>). The higher the score, the better the match.
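The relationship between the three values can be illustrated with a few lines of code. This is a sketch under stated assumptions: the logistic function is used as one common way of mapping an LLR to a percentage, and the exact transformation applied by the technology is not specified here. The `sharpness` parameter is a hypothetical scaling factor added for illustration.

```python
import math


def lr_to_llr(lr):
    """Natural logarithm of a likelihood ratio from the interval <0;+inf)."""
    if lr < 0:
        raise ValueError("LR must be non-negative")
    return -math.inf if lr == 0 else math.log(lr)


def llr_to_percent(llr, sharpness=1.0):
    """Map an LLR to <0;100> with a logistic function (an assumed mapping)."""
    x = sharpness * llr
    # Two algebraically equal forms are used to avoid overflow in exp().
    if x >= 0:
        return 100.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return 100.0 * e / (1.0 + e)
```

With this mapping, LR = 1 (both models equally likely) gives LLR = 0 and a score of 50 %, while very large LLR values all saturate close to 100 %, which is why high LLR values become hard to tell apart on the percentage scale.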
I always get the same error messages:
- unable to connect to the SPE
- unable to start the localhost: giving up and kill the localhost.
A: This error may happen if the initialization of SPE takes too long. Phonexia Browser treats this as an initialization failure and kills the server.
You can fix this by doing the following:
- Increase timeout in Settings > Speech Engine tab > First connection timeout
- Use fewer instances of technologies, so that the Speech Engine starts faster
- Use smaller models of technologies
A: We don’t provide USB dongles without memory storage. Possible solutions are:
- establish security directives for working with the USB dongle (authorized persons, in/out memory scan checks),
- use HW based licensing,
- use license server.