
Speech To Text for AWS

Speech To Text (STT) for Amazon Web Services is powered by the well-known Phonexia Speech Engine (SPE).
Available as an AMI template directly from the AWS Marketplace, it provides a simple RESTful API with the following features:

  • File transcription (WAV, FLAC or OPUS format)
  • Realtime stream transcription (raw audio via HTTP streams or WebSockets)
  • Improving transcription accuracy via preferred phrases
  • Adding words to language model (e.g. product- or company names, foreign words, etc.)
  • Tuning words pronunciations (e.g. to adapt to regional specifics, mispronunciations, etc.)

NOTE: For the sake of accuracy and consistency, some descriptions refer to the detailed SPE API documentation.

 


API version

API endpoints are versioned using the URI path: /api/v1/...
API responses contain version property with API version number.

Current API version is 1.

 


File transcription

Send a file for transcription, optionally specifying a list of phrases to be preferred in the transcription and/or words to be added to the language model (optionally with their pronunciations) before the transcription starts – and get the transcription results in the response.

POST /api/v1/stt/file

Request parameters

none

Request body

Either only the audio file, or both the audio file and the preferred phrases JSON data, sent as the multipart/form-data media type according to RFC 2388 – the audio file as the file field, the JSON data as the preferred_phrases field.

The optional preferred phrases JSON data must contain at least one of:

  • phrases – an array with one or more phrases to be preferred during the transcription
  • dictionary – an array with one or more words to be added to the STT vocabulary before the transcription starts.
    Each word definition can optionally have one or more pronunciations, defined using phonemes allowed in the transcription language. If a word definition has no explicitly specified pronunciation, the system automatically generates a default one based on the word's graphemes (letters), following the transcription language's rules.
    The automatically generated pronunciation may be incorrect or odd – especially for foreign or non-native words – due to differences in how certain graphemes are pronounced in the transcription language. It is therefore recommended to define pronunciations explicitly for such words.
    If a word definition uses letters from a different alphabet (e.g. a German word like “grüßen” in a Czech transcription), or even a different writing script such as Cyrillic or Japanese Kana, it MUST be accompanied by a pronunciation definition. The pronunciation must use only phonemes supported by the Speech To Text model.
    See also the generic functions for listing graphemes (letters), phonemes, and word classes, and for checking the dictionary.

(Optional) preferred phrases JSON data example:

{
  "preferred_phrases": {
    "phrases": [
      {
        "phrase": "disentu"
      }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [
          {
            "phonemes": "d Z a m i"
          }
        ]
      },
      {
        "word": "disentu",
        "pronunciations": [
          {
            "phonemes": "d i z e n t u"
          }
        ]
      }
    ]
  }
}

Example

Audio file only:

# WAV file:
curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "[email protected];type=audio/wav"

# OPUS file:
curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "[email protected];type=audio/ogg"

Audio file + preferred phrases JSON from file:

curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "[email protected];type=audio/wav" -F "[email protected]_phr.json;type=application/json"
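For scripted clients, the multipart body can also be built by hand. Below is a minimal sketch using only the Python standard library; the build_multipart helper and its default file names are illustrative, not part of the API – only the file and preferred_phrases field names come from the description above:

```python
import io
import json
import uuid


def build_multipart(audio_bytes, preferred_phrases=None,
                    audio_name="test.wav", audio_type="audio/wav"):
    """Build a multipart/form-data body (RFC 2388) with the audio as the
    'file' field and, optionally, JSON as the 'preferred_phrases' field.
    Returns (content_type_header_value, body_bytes)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()

    def add_part(name, filename, content_type, payload):
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'
            f'Content-Type: {content_type}\r\n\r\n'
        )
        buf.write(header.encode())
        buf.write(payload)
        buf.write(b"\r\n")

    add_part("file", audio_name, audio_type, audio_bytes)
    if preferred_phrases is not None:
        add_part("preferred_phrases", "pref_phr.json", "application/json",
                 json.dumps(preferred_phrases).encode())
    buf.write(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", buf.getvalue()
```

The returned body can then be POSTed to /api/v1/stt/file with the returned value as the Content-Type header, e.g. via urllib.request.Request(url, data=body, headers={"Content-Type": ctype}, method="POST").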

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudSpeechRecognitionResult",
    "file": "/test.wav",
    "model": "CS_CZ_6",
    "transcription": [
      {
        "channel_id": 0,
        "text": "Dalším bodem programu byla tisková konference."
      }
    ],
    "confidence": 1,
    "segmentation": [
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 0,
        "end": 0,
        "word": "<silence/>"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 0,
        "end": 0,
        "word": "<segment>"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 0,
        "end": 3900000,
        "word": "dalším"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 3900000,
        "end": 6900000,
        "word": "bodem"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 6900000,
        "end": 11400000,
        "word": "programu"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 11400000,
        "end": 11700000,
        "word": "<silence/>"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 11700000,
        "end": 13800000,
        "word": "byla"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 13800000,
        "end": 18600000,
        "word": "tisková"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 18600000,
        "end": 27000000,
        "word": "konference"
      },
      {
        "channel_id": 0,
        "score": 0,
        "confidence": 1,
        "start": 27000000,
        "end": 27000000,
        "word": "</segment>"
      }
    ],
    "phrases": [
      {
        "phrase": "disentu"
      }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [
          {
            "phonemes": "d Z a m i",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      },
      {
        "word": "disentu",
        "pronunciations": [
          {
            "phonemes": "d i s e n t u",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          },
          {
            "phonemes": "d i z e n t u",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      }
    ]
  }
}
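The start and end values in the segmentation are large integers; assuming they are expressed in 100-nanosecond units (so that 3900000 corresponds to 0.39 s), a small helper converts them to seconds:

```python
def segmentation_time_to_seconds(value: int) -> float:
    """Convert a segmentation start/end value to seconds,
    assuming 100-nanosecond time units (an assumption, see lead-in)."""
    return value / 10_000_000
```

Under that assumption, the word "dalším" in the example above spans 0.0 s to 0.39 s.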
Error responses

422 Unprocessable entity

500 Internal server error

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "(1007) Unsupported audio format"
  }
}

 


Stream transcription (HTTP)

Start transcription

Start transcription of audio streamed via HTTP, optionally specifying a list of phrases to be preferred in transcription, and/or words (and optionally also pronunciations) to be added to language model before starting transcription.

POST /api/v1/stt/stream

Request parameters

frequency – query parameter specifying the audio sampling frequency in Hz.
If omitted, the default value is 8000.

Request body

Either empty, or preferred phrases JSON data (see File transcription for details about preferred phrases).

(Optional) preferred phrases JSON data example:

{
  "preferred_phrases": {
    "phrases": [
      {
        "phrase": "disentu"
      }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [
          {
            "phonemes": "d Z a m i"
          }
        ]
      },
      {
        "word": "disentu",
        "pronunciations": [
          {
            "phonemes": "d i z e n t u"
          }
        ]
      }
    ]
  }
}

Example

Start transcription only:

curl -X POST "http://{address}/api/v1/stt/stream?frequency=8000"

Start transcription with preferred phrases JSON from file:

curl -X POST "http://{address}/api/v1/stt/stream" -H "Content-Type: application/json" --data-binary @pref_phr.json

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudHttpInputStreamTaskResult",
    "stream_id": "ec563083-3d9b-457d-a0ac-24b197bc222f"
  }
}
Error responses

422 Unprocessable entity

500 Internal server error

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


Send audio data

Send audio data to the HTTP stream using HTTP chunked encoding. The audio data must be raw mono 16-bit linear PCM (s16le) with the sampling frequency specified by the frequency parameter when the HTTP stream transcription was started.
NOTE: Start sending audio data within 30 seconds of starting the transcription and obtaining the stream ID, otherwise the stream listener times out and sending the audio data fails.

PUT /api/v1/stt/stream/{stream_id}

Request parameters

stream_id – path parameter, specifying ID of the HTTP stream (obtained when HTTP stream transcription was started)

Request body

Raw binary audio data

Example

# Stream from audio file:
ffmpeg -hide_banner -re -i "%AUDIOFILE%" -ac %AUDIO_CHANNELS% -ar %AUDIO_SAMPLERATE% -f s16le -method PUT -multiple_requests 1 http://{address}/api/v1/stt/stream/%STREAM_ID%

# Stream from audio device (e.g. microphone) on Windows:
ffmpeg -hide_banner -re -f dshow -i audio="%AUDIOSOURCE%" -ac %AUDIO_CHANNELS% -ar %AUDIO_SAMPLERATE% -f s16le -method PUT -multiple_requests 1 http://{address}/api/v1/stt/stream/%STREAM_ID%

where:
%AUDIO_CHANNELS% is the number of audio channels (must be 1!)
%AUDIO_SAMPLERATE% is the sampling frequency (must match the frequency parameter, see above)
%STREAM_ID% is the stream ID (must match the stream_id parameter, see above)
%AUDIOFILE% is the path to the audio file
%AUDIOSOURCE% is the source audio device name (use e.g. ffmpeg -list_devices 1 -f dshow -i dummy on Windows to list audio devices)

The -re FFMPEG option is very important: it makes the audio stream at realtime speed only (i.e. at normal playback speed), simulating live audio conditions even when the source is an audio file. Without this option, the entire audio file content would be pushed through as fast as possible, overloading the receiving side and causing various undesired effects (garbled or completely missing transcription).
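The same realtime pacing can be reproduced in Python. The sketch below is illustrative (the pcm_chunks helper name and the 250 ms chunk size are assumptions, not part of the API); it yields raw s16le PCM at playback speed, similar to ffmpeg's -re:

```python
import time


def pcm_chunks(raw_pcm: bytes, sample_rate: int, chunk_ms: int = 250,
               realtime: bool = True):
    """Yield raw mono s16le PCM in chunk_ms-sized pieces, sleeping between
    chunks so the data flows at playback speed (like ffmpeg's -re)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes/sample, mono
    for i in range(0, len(raw_pcm), bytes_per_chunk):
        yield raw_pcm[i:i + bytes_per_chunk]
        if realtime:
            time.sleep(chunk_ms / 1000)
```

An HTTP client that accepts a generator as the request body will then send the data with chunked transfer encoding, e.g. with the requests library: requests.put(f"http://{address}/api/v1/stt/stream/{stream_id}", data=pcm_chunks(pcm, 8000)).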

Request responses

200 OK

Error responses

422 Unprocessable entity

500 Internal server error

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


Get transcription result

GET /api/v1/stt/stream/{stream_id}

Request parameters

stream_id – path parameter, specifying ID of the HTTP stream (obtained when HTTP stream transcription was started)

Request body

empty

Example

curl "http://{address}/api/v1/stt/stream/ec563083-3d9b-457d-a0ac-24b197bc222f"
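A result-polling loop might be sketched as follows. The poll_results helper, the poll interval, and the fetch hook (which exists only so the loop can run without a live server) are illustrative, not part of the API; the loop repeats the GET above until a result arrives with is_last set to true:

```python
import json
import time
import urllib.request


def poll_results(address, stream_id, interval=2.0, fetch=None):
    """Yield each transcription result from GET /api/v1/stt/stream/{stream_id}
    until a result with is_last == true arrives."""
    url = f"http://{address}/api/v1/stt/stream/{stream_id}"
    if fetch is None:
        fetch = lambda: json.load(urllib.request.urlopen(url))
    while True:
        result = fetch()["result"]
        yield result
        if result.get("is_last"):
            return
        time.sleep(interval)
```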

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudSpeechRecognitionStreamResult",
    "task_id": "cfa616cb-b2cd-46e8-8783-533a7719e95a",
    "stream_id": "0b644140-5ef0-4d02-a62e-ffd291618401",
    "model": "CS_CZ_6",
    "transcription": [
      {
        "channel_id": 0,
        "text": "Dalším bodem programu byla tisková konference."
      }
    ],
    "segmentation": [
      {
        "channel_id": 0,
        "score": -5.3532896,
        "confidence": 0,
        "start": 0,
        "end": 0,
        "word": "<segment>"
      },
      {
        "channel_id": 0,
        "score": -46.05657,
        "confidence": 0,
        "start": 0,
        "end": 3900000,
        "word": "dalším"
      },
      {
        "channel_id": 0,
        "score": -43.449814,
        "confidence": 0,
        "start": 3900000,
        "end": 6900000,
        "word": "bodem"
      },
      {
        "channel_id": 0,
        "score": -66.04119,
        "confidence": 0,
        "start": 6900000,
        "end": 11400000,
        "word": "programu"
      },
      {
        "channel_id": 0,
        "score": -2.3911438,
        "confidence": 0,
        "start": 11400000,
        "end": 11700000,
        "word": "<silence/>"
      },
      {
        "channel_id": 0,
        "score": -27.298859,
        "confidence": 0,
        "start": 11700000,
        "end": 13800000,
        "word": "byla"
      },
      {
        "channel_id": 0,
        "score": -72.11034,
        "confidence": 0,
        "start": 13800000,
        "end": 18600000,
        "word": "tisková"
      },
      {
        "channel_id": 0,
        "score": -82.402466,
        "confidence": 0,
        "start": 18600000,
        "end": 27300000,
        "word": "konference"
      },
      {
        "channel_id": 0,
        "score": -11.498138,
        "confidence": 0,
        "start": 27300000,
        "end": 27900000,
        "word": "</segment>"
      }
    ],
    "phrases": [
      {
        "phrase": "disentu"
      }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [
          {
            "phonemes": "d Z a m i",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      },
      {
        "word": "disentu",
        "pronunciations": [
          {
            "phonemes": "d i s e n t u",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          },
          {
            "phonemes": "d i z e n t u",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      }
    ],
    "is_last": true
  }
}
Error responses

422 Unprocessable entity

500 Internal server error

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


Stop transcription

Explicitly stop the HTTP stream transcription immediately.
(Alternatively, you can simply stop sending audio data to the stream; it will then stop automatically after a 30-second timeout.)

DELETE /api/v1/stt/stream/{stream_id}

Request parameters

stream_id – path parameter, specifying ID of the HTTP stream (obtained when HTTP stream transcription was started)

Request body

empty

Example

curl -X DELETE "http://{address}/api/v1/stt/stream/ec563083-3d9b-457d-a0ac-24b197bc222f"

Request responses

200 OK

Error responses

422 Unprocessable entity

500 Internal server error

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


Stream transcription (WebSockets)

Open WebSockets connection

Connect to ws://{address}/api/v1/websocket/stt, optionally with the frequency query parameter specifying the audio sampling frequency in Hz (if omitted, the default value is 8000). The raw audio data chunks sent later must use this sample rate.

On successful connection, a “connected” response is sent back:

{
  "status": "connected",
  "error": "",
  "result": {}
}

On unsuccessful connection or other error, an “error” response is sent back:

{
  "status": "error",
  "error": "Error description message",
  "result": {}
}

Send preferred phrases definition (optional)

After successful WebSockets connection you can optionally send a message with preferred phrases JSON definition – see the File transcription section for details about preferred phrases.

Send audio data

After a successful WebSockets connection – and optionally sending the preferred phrases definition – start sending raw mono 16-bit linear PCM (s16le) audio data chunks in binary WebSocket frames. The audio sample rate must match the value of the frequency parameter specified when opening the WebSocket connection; if they don't match, various issues may occur (typically, the transcription is either completely missing or garbled).
No confirmation messages are sent for successfully received audio data; only error messages are sent back.

After sending all audio data, send a special message to notify the system that the audio has finished:

{
  "event": "stop"
}

Receive transcription result

“result” responses are sent through the WebSocket periodically; a new message is sent with each transcription update.
The result always contains the complete transcription, from the beginning up to the current timepoint.

Example

{
  "status": "result",
  "error": "",
  "result": {
    "version": 1,
    "name": "CloudSpeechRecognitionStreamResult",
    "task_id": "cfa616cb-b2cd-46e8-8783-533a7719e95a",
    "stream_id": "0b644140-5ef0-4d02-a62e-ffd291618401",
    "model": "CS_CZ_6",
    "is_last": false,
    "transcription": [
      {
        "channel_id": 0,
        "text": "dalším bodem"
      }
    ],
    "segmentation": [
      {
        "channel_id": 0,
        "score": -5.3532896,
        "confidence": 0,
        "start": 0,
        "end": 0,
        "word": "<segment>"
      },
      {
        "channel_id": 0,
        "score": -46.05657,
        "confidence": 0,
        "start": 0,
        "end": 3900000,
        "word": "dalším"
      },
      {
        "channel_id": 0,
        "score": -43.449814,
        "confidence": 0,
        "start": 3900000,
        "end": 6900000,
        "word": "bodem"
      }
    ],
    "phrases": [
      {
        "phrase": "disentu"
      }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [
          {
            "phonemes": "d Z a m i",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      },
      {
        "word": "disentu",
        "pronunciations": [
          {
            "phonemes": "d i s e n t u",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          },
          {
            "phonemes": "d i z e n t u",
            "out_of_vocabulary": true,
            "class": "",
            "warning_message": ""
          }
        ]
      }
    ]
  }
}

 

Full example

import asyncio
import json
from concurrent.futures import ALL_COMPLETED

import websockets
from pydub import AudioSegment
from websockets.exceptions import ConnectionClosed


async def send_raw_chunks(websocket):
    # Stream the audio file in 1-second chunks at realtime speed
    data = AudioSegment.from_file("test.wav", "wav")
    for chunk in data[::1000]:
        await websocket.send(chunk.raw_data)
        await asyncio.sleep(1)
    # Tell the server that the audio has finished
    await websocket.send(json.dumps({"event": "stop"}))


async def consume(websocket):
    # Print every result message until the server closes the connection
    while True:
        try:
            data = await websocket.recv()
            print(data)
        except ConnectionClosed:
            break
        await asyncio.sleep(0.05)


async def example():
    async with websockets.connect(
        "ws://ADDRESS/api/v1/websocket/stt?frequency=8000"
    ) as websocket:
        msg = {
            "data": {
                "preferred_phrases": {
                    "phrases": [
                        {
                            "phrase": "disentu"
                        }
                    ],
                    "dictionary": [
                        {
                            "word": "jummy",
                            "pronunciations": [
                                {
                                    "phonemes": "d Z a m i"
                                }
                            ]
                        }, {
                            "word": "disentu",
                            "pronunciations": [
                                {
                                    "phonemes": "d i z e n t u"
                                }
                            ]
                        }
                    ]
                }
            }
        }
        await websocket.send(json.dumps(msg))
        await asyncio.wait(
            [
                asyncio.create_task(consume(websocket)),
                asyncio.create_task(send_raw_chunks(websocket))
            ],
            return_when=ALL_COMPLETED
        )


asyncio.run(example())

 


Generic functions

List supported graphemes

Returns the list of graphemes (letters) supported by the Speech To Text model.
In general, these are the graphemes to be used in word definitions in the dictionary part of preferred phrases. If a word definition uses graphemes not in this list, it MUST be accompanied by a pronunciation definition (which must use only phonemes supported by the Speech To Text model).

GET /api/v1/stt/graphemes

Request parameters

none

Request body

empty

Example

curl "http://{address}/api/v1/stt/graphemes"

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudGraphemesListResult",
    "graphemes": [
      "a",
      "á",
      "b",
      "c",
      ...
    ],
    "word_separator": "+"
  }
}
Error responses

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


List supported phonemes

Returns the list of phonemes supported by the Speech To Text model.
These are the only phonemes allowed in pronunciation definitions in the dictionary part of preferred phrases.

GET /api/v1/stt/phonemes

Request parameters

none

Request body

empty

Example

curl "http://{address}/api/v1/stt/phonemes"

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudPhonemesListResult",
    "phonemes": [
      "a",
      "b",
      "@", 
      "e",
      ...
    ]
  }
}
Error responses

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


List supported word classes

Returns the list of class names supported by the Speech To Text model.
An empty list means that the model does not support classes.

GET /api/v1/stt/classes

The class names can be used as generic placeholders in preferred phrases – use the class name prefixed with $ as a class token. Example of a preferred phrase containing class tokens:
“my name is $female_first_name_nominative and i live in $municipality”.

Request parameters

none

Request body

empty

Example

curl "http://{address}/api/v1/stt/classes"

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudSpeechRecognitionClassesResult",
    "classes": [
      {
        "name": "female_first_name_nominative"
      },
      {
        "name": "female_first_name_dative"
      },
      {
        "name": "female_first_name_instrumental"
      },
      {
        "name": "female_surname_nominative"
      },
      {
        "name": "female_surname_dative"
      },
      {
        "name": "female_surname_instrumental"
      },
      {
        "name": "male_first_name_nominative"
      },
      {
        "name": "male_first_name_instrumental"
      },
      {
        "name": "male_surname_nominative"
      },
      {
        "name": "male_surname_instrumental"
      },
      {
        "name": "municipality"
      },
      {
        "name": "street"
      }
    ]
  }
}
Error responses

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


Check dictionary for issues

Checks the “additional words” dictionary for potential issues such as invalid phonemes or graphemes.

POST /api/v1/stt/checkdictionary

The input dictionary array must contain one or more word definitions. Each word definition can optionally have one or more pronunciations, defined using phonemes allowed in the transcription language. If a word definition has no explicitly specified pronunciation, the system generates a default one based on the word's graphemes (letters), following the transcription language's rules.
This may result – especially with foreign or non-native words – in incorrect or odd pronunciations, due to differences in how certain graphemes are pronounced in the transcription language.

A word definition may use letters from a different alphabet (e.g. German words like “grüßen” in a Czech transcription), or even a different writing script such as Cyrillic or Japanese Kana. In such cases, the pronunciation definition is mandatory; a missing pronunciation definition results in an error and the word is ignored.

The output contains the dictionary provided as input, enriched with:

  • pronunciations – either internally generated (for words not in the internal vocabulary) or defined in the language model (for existing words)
  • an indication of whether the word exists in the internal vocabulary
  • the name of the class the word belongs to
  • a warning message, if any, with details about a detected issue with the word/pronunciation

You can review the automatically generated pronunciations and possibly use them as a starting point for further fine-tuning of pronunciation variants.
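As a sketch of such a review, the helper below (words_needing_attention is a hypothetical name, not part of the API) collects pronunciations from a checkdictionary result that are out-of-vocabulary or carry a warning:

```python
def words_needing_attention(check_result):
    """From a checkdictionary result, collect (word, phonemes, warning)
    tuples for pronunciations that are out-of-vocabulary or flagged."""
    flagged = []
    for entry in check_result["dictionary"]:
        for pron in entry.get("pronunciations", []):
            if pron.get("out_of_vocabulary") or pron.get("warning_message"):
                flagged.append((entry["word"], pron["phonemes"],
                                pron.get("warning_message", "")))
    return flagged
```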

Request parameters

none

Request body

JSON with “additional words” definition

{
  "dictionary": [
    {
      "word": "preferred"
    },
    {
      "word": "phrase",
      "pronunciations": [
        {
          "phonemes": "f r eI z"
        }
      ]
    }
  ]
}

Example

curl -X POST "http://{address}/api/v1/stt/checkdictionary" -H "Content-Type: application/json" --data-binary @pref_phr.json

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudSpeechToTextCheckDictionaryResult",
    "dictionary": [
      {
        "word": "preferred",
        "pronunciations": [
          {
            "phonemes": "p r I f @r d",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          }
        ]
      },
      {
        "word": "phrase",
        "pronunciations": [
          {
            "phonemes": "f r eI z",
            "out_of_vocabulary": false,
            "class": "",
            "warning_message": ""
          }
        ]
      }
    ]
  }
}
Error responses

422 Unprocessable entity

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}

 


List available technologies

Returns the list of available Speech To Text model names and versions (useful e.g. when contacting support).
Also returns useful status information – the total number of available instances (i.e. the maximum number of concurrently running transcriptions) and the number of busy instances (i.e. the number of transcriptions currently running).

GET /api/v1/technologies

Request parameters

none

Request body

empty

Example

curl "http://{address}/api/v1/technologies"

Request responses

200 OK

{
  "result": {
    "version": 1,
    "name": "CloudTechnologiesResult",
    "technologies": [
      {
        "name": "Speech To Text",
        "models": [
          {
            "name": "CS_CZ_6",
            "version": "6.5.0",
            "n_total_instancies": 1,
            "n_busy_instancies": 0
          }
        ]
      },
      {
        "name": "Speech To Text Input Stream",
        "models": [
          {
            "name": "CS_CZ_6",
            "version": "6.5.0",
            "n_total_instancies": 1,
            "n_busy_instancies": 0
          }
        ]
      }
    ]
  }
}
Error responses

503 Service unavailable

{
  "result" : {
    "version" : 1,
    "name" : "CloudErrorResult",
    "message" : "Connection failed"
  }
}