Phonexia Speech Engine
Phonexia Speech Engine (SPE) is the main part of the Phonexia Speech Platform.
SPE is a server application for 64-bit Linux or Windows that provides a REST API to the entire portfolio of Phonexia speech technologies.
SPE capabilities overview:
- Audio files and stream processing

  Technology                         Audio files   RTP / HTTP streams
  Speaker Identification (SID)       ✓             ✓
  Speech To Text (STT)               ✓             ✓
  Keyword Spotting (KWS)             ✓             ✓
  Voice Activity Detection (VAD)     ✓             ✓
  Time Analysis Extraction (TAE)     ✓             ✓
  Speech Quality Estimation (SQE)    ✓             ✓
  Language Identification (LID)      ✓
  Gender Identification (GID)        ✓
  Age Estimation (AGE)               ✓
  Speaker Diarization (DIAR)         ✓

- Results caching
Processing results can optionally be stored in a results cache database to speed up re-processing of the same recording by the same technology: results are then returned immediately from the cache instead of re-processing the whole audio file.
- Own persistent data storage
SPE keeps uploaded audio files in its own persistent storage space, so the original source files can be archived or deleted after upload.
- Data privacy
SPE keeps information about an audio file or stream only as long as the file or stream exists. Once the recording is deleted from SPE storage, or the stream has ended, SPE removes all information, metadata and technology results from the database.
- Basic user management
SPE allows you to define multiple users with different user roles and user rights. Each SPE user has access only to their own data storage, files, metadata and processing results.
- Load management
SPE manages its own queue of incoming REST requests and serves them according to the available capacity of the current installation. This means that the application layer can submit any number of requests and then just wait until they are processed.
- Processing priority management
To allow off-queue high-priority or low-priority processing, SPE also allows you to set a priority for individual REST requests.
- Basic audio manipulation
SPE has built-in basic audio file manipulation functionality, such as separating individual channels from stereo recordings, cutting one recording into several files, saving audio from an incoming stream to a file, and others.
- Stream audio player
To support voicebot scenarios, SPE can play audio files directly to an output RTP stream.
- External Text-to-speech (TTS) integration
Easy integration with external TTS providers via a simple plugin-like connector interface.
- Flexible integration
SPE can provide results in JSON or XML format. Results can be obtained by polling, via websockets, or via webhooks (callbacks).
- Status information
SPE can provide various status information to the application layer, e.g. license status, configuration info, current overall load, pending operations status, …
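The load management and priority items above can be pictured with a small toy model. The following is only an illustrative sketch, not SPE's actual implementation: the class RequestQueue, its methods, and the request labels are hypothetical names invented for this example. It assumes that a higher priority value is served first and that requests with equal priority are served in arrival order.

```python
import heapq
import itertools


class RequestQueue:
    """Toy model of a priority-aware request queue (not SPE's implementation)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so that requests with the
        # same priority are served in the order they were submitted.
        self._counter = itertools.count()

    def submit(self, request, priority=0):
        # heapq is a min-heap, so the priority is negated:
        # a higher priority value is popped earlier.
        heapq.heappush(self._heap, (-priority, next(self._counter), request))

    def next_request(self):
        # Return the highest-priority, oldest pending request, or None if empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = RequestQueue()
queue.submit("stt: call_001.wav")               # default priority
queue.submit("sid: call_002.wav")               # default priority
queue.submit("kws: live_stream", priority=10)   # high priority, jumps ahead
queue.submit("lid: archive.wav", priority=-5)   # low priority, served last

served = []
while True:
    request = queue.next_request()
    if request is None:
        break
    served.append(request)

print(served)
```

In this model the application layer only submits requests and later collects results; the order in which work is picked up is decided entirely by the queue.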
Quick start
The following tutorial describes the first steps with Speech Engine, after obtaining a license file from Phonexia and downloading the Speech Engine package using a link provided by Phonexia.
In short, these are the steps as described in the tutorial:
1. Unzip the package to a directory
2. Copy the license file into the same directory
3. Run phxadmin --configure-tech in console to configure technologies
4. Edit the settings/phxspe.properties configuration file to configure the server and optionally the database
5. Optionally, run phxadmin --add-user in console to configure user account(s) to access the REST API (or use the pre-configured user admin)
6. Finally, run phxspe in console to start Speech Engine
Now your SPE server is running and you can access the REST API via the IP address and port set in the properties file (settings/phxspe.properties).
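Before sending the first REST request, a client script may want to confirm that the server is actually accepting connections. The sketch below is one possible way to do that with a plain TCP check from the Python standard library. The function name wait_for_port is hypothetical, and the host and port in the commented usage line are placeholders: use the values from your own settings/phxspe.properties. The check only proves that something is listening on the port, not that the REST API is fully initialized.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (placeholder address and port, not taken from the SPE documentation):
# wait_for_port("127.0.0.1", 8600, timeout=10.0)
```

A script that starts phxspe and then immediately calls the API can use such a check to avoid failing on the first request while the server is still starting.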
Details for steps 3 to 5 are described in doc/INSTALL.html, included in the distribution package.
REST API documentation is in the doc/api_reference.html file and is also available online at https://download.phonexia.com/docs/spe/.
Speech Engine is actively developed and continually improved; check the SPE changelog for the latest news.
Architecture and components
SPE is an application run from the command line or as a service. Apart from the main binary itself, SPE requires a database, which can be SQLite (delivered inside the Phonexia package) or MySQL. No other components are needed.
Structure of Technologies and technology models
From a technical point of view, every technology can work with different technology modules, for example different languages for STT (CS_CZ4, EN_US4) or different sizes for SID (L3, XL3). A technology can work with a single module or with any number of installed modules.
Inside SPE there is the core application together with the technologies and technology modules. All technologies are located in the ./bsapi sub-folder. Every technology has a separate sub-folder named after the technology shortcut and can include one or more modules. Modules are stored further down in the directory structure.
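To see which technologies are present in an installation, the bsapi sub-folder can simply be listed. The following is a minimal sketch under stated assumptions: the function name list_technologies is hypothetical, and it reports only the direct sub-folders of each technology folder, because the exact depth at which modules are stored is not specified here. The folder names in the demo are made up and do not reflect a real installation.

```python
import tempfile
from pathlib import Path


def list_technologies(spe_dir):
    """Map each technology sub-folder under <spe_dir>/bsapi to its direct sub-folders."""
    bsapi = Path(spe_dir) / "bsapi"
    layout = {}
    for tech_dir in sorted(p for p in bsapi.iterdir() if p.is_dir()):
        layout[tech_dir.name] = sorted(
            child.name for child in tech_dir.iterdir() if child.is_dir()
        )
    return layout


# Demo on a throwaway directory tree; the layout below is invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("bsapi/sid/demo_module_a", "bsapi/stt/demo_module_b"):
        (Path(tmp) / rel).mkdir(parents=True)
    demo_layout = list_technologies(tmp)

print(demo_layout)
```

On a real installation, pass the directory where the SPE package was unzipped and inspect the returned sub-folders further to locate the individual modules.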