Understand SPE connectors for external TTS

SPE can be connected to external Text-To-Speech (TTS) services using a simple connector system. This article describes the principles and how-tos; by following these instructions you can create your own connector and use a custom 3rd-party TTS service via SPE.

The TTS connector should be a command line (CLI) application or script that communicates with the external TTS service via the service's native API, and with SPE via standard input (stdin) and standard output (stdout).
The connector should behave as follows:

- If the connector is started with the --info parameter, it outputs the TTS service capabilities information in JSON format to stdout.
- If the connector is started without parameters, it:
  - reads input JSON data from stdin
  - outputs raw PCM signed 16-bit little-endian mono audio data to stdout
    - SPE 3.46+: with the sampling frequency given by the naturalSampleRateHertz value returned in the capabilities
    - SPE up to 3.45: with a fixed sampling frequency of 8000 Hz
- (Optional) If started with the --help or -h parameter, the connector outputs basic usage text to stdout.
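
The behavior above can be sketched as a small Python dispatcher. This is a minimal sketch under stated assumptions, not a reference implementation: get_capabilities and synthesize are hypothetical placeholders for calls to the real TTS service, and the values they return are made up for illustration.

```python
import json


def get_capabilities():
    # Hypothetical placeholder: a real connector would query the TTS service.
    return {
        "apiVersion": 2,
        "vendor": "example-vendor",
        "author": "example-author",
        "version": "0.1",
        "voices": [
            {"name": "example-voice", "languageCodes": ["en-US"],
             "naturalSampleRateHertz": 8000}
        ],
    }


def synthesize(request):
    # Hypothetical placeholder: a real connector would call the TTS service
    # and return raw PCM (signed 16-bit little-endian mono) bytes.
    return b"\x00\x00" * 80  # 80 samples of silence


def run(argv, stdin, stdout):
    """Dispatch on the command-line arguments.

    stdin is a text stream, stdout is a binary stream.
    Returns the process exit code.
    """
    if "--help" in argv or "-h" in argv:
        stdout.write(b"Usage: connector [--info | --help]\n")
        return 0
    if "--info" in argv:
        stdout.write(json.dumps(get_capabilities()).encode("utf-8"))
        return 0
    request = json.load(stdin)         # input JSON from stdin
    stdout.write(synthesize(request))  # raw PCM to stdout
    return 0
```

In an actual connector script, the entry point could call run(sys.argv[1:], sys.stdin, sys.stdout.buffer) and pass the result to sys.exit.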

Details of the connector behavior are listed below.

TTS service capabilities information

Launching the connector with the --info parameter is expected to provide information about the actual TTS service capabilities: the list of voice names, supported languages and audio quality (sampling frequencies).
This info is used:

- during the SPE startup sequence: TTS connectors enabled in the SPE configuration file are started with the --info parameter and SPE reads the connector output. Connectors that fail to provide the info won’t be available for use with SPE.
- when the /external/technologies/tts/info endpoint is called: all successfully initialized TTS connectors (see above) are asked to provide the capabilities information. This is intended to refresh the info from the TTS service.

NOTE: The capabilities info data (voice names, language codes, sampling frequencies) should be obtained from the actual TTS service. Returning hardcoded info instead of the real capabilities of the TTS service is discouraged: it can become incorrect over time, leading to obscure issues in applications that rely on the info.

Required capabilities information JSON structure (SPE 3.46 and newer):

{
  "apiVersion": 2,
  "vendor": string,
  "author": string,
  "version": string,
  "voices": [
    {
      "name": string,
      "languageCodes": [string, string, ...],
      "naturalSampleRateHertz": number
    },
    .
    .
    .
  ]
}

For SPE 3.45 and older, the structure has no apiVersion and no naturalSampleRateHertz properties:

{
  "vendor": string,
  "author": string,
  "version": string,
  "voices": [
    {
      "name": string,
      "languageCodes": [string, string, ...]
    },
    .
    .
    .
  ]
}

Where:

- apiVersion denotes the version of the capabilities structure/API:
  - 2: SPE 3.46 and newer
  - the apiVersion property is not present at all for SPE 3.45 and older
- vendor is the name of the TTS provider. This name is then used in the POST /external/technologies/tts parameter.
- author and version are intended for the connector author's internal description and versioning
- voices array should list the available TTS voices, each with:
  - voice name
  - list of languageCodes supported by that voice
  - SPE 3.46 and newer only: naturalSampleRateHertz, the default natural sampling rate of the audio
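
As a sketch of how a connector might assemble the capabilities structure, the Python function below maps a voice list, as it might be returned by a TTS service, onto the apiVersion 2 format. The input shape (id, locales and sample_rate keys) and the vendor, author and version values are assumptions made for illustration; a real service API will differ.

```python
import json


def build_info(service_voices, vendor, author, version):
    """Map a (hypothetical) service voice list to the capabilities JSON."""
    voices = []
    for v in service_voices:
        voices.append({
            "name": v["id"],
            "languageCodes": list(v["locales"]),
            "naturalSampleRateHertz": int(v["sample_rate"]),
        })
    info = {
        "apiVersion": 2,  # SPE 3.46 and newer
        "vendor": vendor,
        "author": author,
        "version": version,
        "voices": voices,
    }
    return json.dumps(info, indent=2)


# Example input, made up for illustration only.
example_voices = [
    {"id": "alice", "locales": ["en-US", "en-GB"], "sample_rate": 22050},
    {"id": "bob", "locales": ["cs-CZ"], "sample_rate": 16000},
]
info_json = build_info(example_voices, "example-vendor", "example-author", "1.0")
```

Building the structure from the service response on every --info call keeps the reported voices in line with what the service actually offers.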

Connector input

The TTS connector must accept the following input JSON from stdin:

{
  "text": string,
  "voice": {
    "name": string,
    "languageCode": string
  }
}

Where:

- text is the text to be synthesized
- name is the voice name to be used for synthesis (one of the voice names provided in the connector “info” data)
- languageCode is the language code defining the language to be used for synthesis (see the connector “info” data)

The connector is responsible for passing the input data to the actual TTS service as needed using the service's native API, retrieving the synthesized audio data from the TTS service, and writing the audio to stdout (see the Connector output section below).
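
A minimal Python sketch of reading and validating the input JSON could look as follows. The function name parse_request and the choice to raise ValueError on malformed input are assumptions of this sketch; how a connector should report errors to SPE is not covered here.

```python
import json


def parse_request(raw):
    """Parse the connector input JSON.

    Returns a (text, voice_name, language_code) tuple.
    Raises ValueError when the JSON is invalid, or when a required
    property is missing or is not a string.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    try:
        text = data["text"]
        name = data["voice"]["name"]
        language_code = data["voice"]["languageCode"]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"missing property in input JSON: {exc}") from exc
    for value in (text, name, language_code):
        if not isinstance(value, str):
            raise ValueError("text, voice.name and voice.languageCode must be strings")
    return text, name, language_code
```

A connector would typically call parse_request(sys.stdin.read()) and pass the three values on to the TTS service.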

TIP:
The connector can also be used for playing ‘static’ messages from audio files. For example, the text property can carry the name of the file to be played, and the audio files can be organized in directories whose names are passed to the connector in the voice name property.
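
One possible way to implement the static-message idea is sketched below in Python. The directory layout (base_dir/voice name/file name), the function names and the path check are assumptions of this sketch, and the files are assumed to already contain raw PCM in the format SPE expects.

```python
from pathlib import Path


def resolve_message_file(base_dir, voice_name, text):
    """Map voice name -> directory and text -> file name under base_dir.

    Raises ValueError if the resulting path would escape base_dir.
    """
    base = Path(base_dir).resolve()
    candidate = (base / voice_name / text).resolve()
    if base not in candidate.parents:
        raise ValueError("requested file is outside the messages directory")
    return candidate


def read_message(base_dir, voice_name, text):
    # Return the file content unchanged; no format conversion is done here.
    return resolve_message_file(base_dir, voice_name, text).read_bytes()
```

The path check is there because the text and voice name come from the request, so values such as ".." should not be able to reach files outside the messages directory.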

Connector output

The output obtained from the TTS service should be written by the connector to stdout as raw PCM signed 16-bit little-endian mono audio data.

In SPE 3.46 and newer, the audio sampling frequency must be set to the naturalSampleRateHertz value provided in the TTS service capabilities information.
In SPE 3.45 and older, the audio sampling frequency must be fixed to 8000 Hz.
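
To illustrate the expected sample format, the Python sketch below encodes floating-point samples as signed 16-bit little-endian PCM. The helper names and the test tone are illustrative only; a real connector would usually request or convert the audio from the TTS service in this format rather than generate it.

```python
import math
import struct


def floats_to_pcm16le(samples):
    """Encode floats in [-1.0, 1.0] as signed 16-bit little-endian PCM."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        out += struct.pack("<h", int(round(s * 32767)))
    return bytes(out)


def sine_tone(sample_rate_hz, freq_hz=440.0, duration_s=0.01):
    """Generate a short mono test tone at the given sampling frequency."""
    n = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


# 10 ms of audio at 8000 Hz is 80 samples, i.e. 160 bytes of mono 16-bit PCM.
pcm = floats_to_pcm16le(sine_tone(8000))
```

Each mono sample takes two bytes, so one second of audio is 2 × sampling frequency bytes (16000 bytes at 8000 Hz).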

SPE then reads the audio and writes it either to a file or to an output realtime stream, according to the original request; see the Text To Speech section of the REST API documentation.

SPE reads the connector output continuously, i.e. the connector can stream the audio data to stdout as soon as it’s received from the TTS service (if the service supports streaming of the synthesized audio). This can reduce unwanted delays, especially for longer texts, which take longer to synthesize.
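
Streaming can be sketched in Python as a loop that forwards each chunk as soon as it arrives. The generator fake_service_stream is a hypothetical stand-in for a streaming response from a TTS service; in a real connector the output stream would be sys.stdout.buffer.

```python
import io


def stream_audio(chunks, out):
    """Write each audio chunk to `out` as soon as it arrives.

    `chunks` is any iterable of bytes objects; `out` is a binary stream.
    Returns the number of bytes written.
    """
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        out.write(chunk)
        out.flush()  # do not hold audio back in the connector's own buffer
        total += len(chunk)
    return total


def fake_service_stream():
    # Hypothetical stand-in for a streaming TTS response.
    yield b"\x00\x00" * 4
    yield b""
    yield b"\x01\x00" * 4


buffer = io.BytesIO()
written = stream_audio(fake_service_stream(), buffer)
```

Flushing after every chunk trades a little throughput for lower latency, which is the point of streaming here.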

Connector naming, location, configuration

TTS connectors should be placed in the {SPE_installation_directory}/external/technologies/tts directory, each connector in a separate subdirectory.

To enable a connector, add its subdirectory name to the external.technologies.tts_connectors setting in the SPE configuration file.

The connector executable file must be named connector (i.e. without a file extension).
Connector configuration, such as the TTS service address, access credentials, API token, etc., should ideally be kept in a separate configuration file, preferably named connector.properties and using a .properties-like format (to be consistent with the SPE configuration file format).
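
A connector written in Python could read such a file with a few lines of code. The sketch below handles only a simple subset of the .properties format (key=value lines and comment lines starting with # or !; no line continuations or escapes), and the keys service.url and service.api_token are hypothetical names chosen for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines; '#' and '!' start comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines without '='
        props[key.strip()] = value.strip()
    return props


# Hypothetical keys and values, for illustration only.
example = """
# TTS service connection
service.url = http://localhost:8080/tts
service.api_token=secret
"""
config = parse_properties(example)
```

In a real connector the text would come from the connector.properties file next to the executable, for example via pathlib.Path(__file__).with_name("connector.properties").read_text().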

If all is set and configured properly, SPE should log a successful TTS connector initialization:

TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: External connector 'acapela' from ......./external/technologies/tts/acapela has been registered.

If an error occurs, SPE logs the problem:

TTSSubsystem: Retrieving external connector info from ......./external/technologies/tts/acapela
TTSSubsystem: Cannot retrieve external connector info! ERROR: Loading configuration from "......./external/technologies/tts/acapela/connector.properties";Error: acapela server is not running or address and ports are misconfigured;
