Speech To Text for AWS
Speech To Text (STT) for Amazon Web Services is powered by the well-known Phonexia Speech Engine (SPE).
Available as an AMI template directly from the AWS Marketplace, it provides a simple RESTful API with the following features:
- File transcription (WAV, FLAC or OPUS format)
- Realtime stream transcription (raw audio via HTTP streams or WebSockets)
- Improving transcription accuracy via preferred phrases
- Adding words to language model (e.g. product- or company names, foreign words, etc.)
- Tuning words pronunciations (e.g. to adapt to regional specifics, mispronunciations, etc.)
NOTE: For the sake of information accuracy and consistency, some descriptions refer to the detailed SPE API documentation.
API endpoints are versioned using the URI path: /api/v1/...
API responses contain a "version" property with the API version number.
v2:
- Added “N-best” and “Confusion Network” outputs (see Speech To Text results explained article for details about STT output types)
v1:
- Initial release, providing simple plain-text transcription plus a “1-best” output with detailed timing and confidence information for each individual word
File transcription
Send a file for transcription and get the transcription results in the response. Optionally, you can also specify a list of phrases to be preferred in the transcription, and/or words (optionally with pronunciations) to be added to the language model before the transcription starts.
POST /api/v2/stt/file
POST /api/v1/stt/file
Request parameters
none
Request body
Either only the audio file, or both the audio file and the preferred phrases JSON data, as the multipart/form-data media type according to RFC 2388 – the audio file as the "file" field, the JSON data as the "preferred_phrases" field.
The optional preferred phrases JSON data must contain at least one of:
- a "phrases" array with one or more "phrase" entries to be preferred during the transcription
- a "dictionary" array with one or more "word" entries to be added to the STT vocabulary before starting the transcription
Each word definition can optionally have one or more "pronunciations" defined using "phonemes" allowed in the transcription language. If a word definition does not have any pronunciation specified explicitly, the system automatically generates a default pronunciation based on the word graphemes (letters), following the rules of the transcription language.
⚠ The automatically generated pronunciation may be incorrect or odd – especially for foreign or non-native words – due to differences in how certain graphemes are pronounced in the transcription language. It is therefore recommended to define pronunciations explicitly for such words.
ⓘ If a word definition uses letters from a different alphabet (e.g. a German word like “grüßen” in a Czech transcription), or even a different writing script like Cyrillic or Japanese Kana, it MUST be accompanied by a pronunciation definition. The pronunciation must use only phonemes supported by the Speech To Text model.
See also the generic functions for listing graphemes (letters), phonemes, word classes, or checking the dictionary.
(Optional) preferred phrases JSON data example:
{
  "preferred_phrases": {
    "phrases": [
      { "phrase": "disentu" }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [ { "phonemes": "d Z a m i" } ]
      },
      {
        "word": "disentu",
        "pronunciations": [ { "phonemes": "d i z e n t u" } ]
      }
    ]
  }
}
Example
Audio file only:
# WAV file:
curl -X POST "http://{address}/api/v2/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.wav;type=audio/wav"
# OPUS file:
curl -X POST "http://{address}/api/v2/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.opus;type=audio/ogg"
# WAV file:
curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.wav;type=audio/wav"
# OPUS file:
curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.opus;type=audio/ogg"
Audio file + preferred phrases JSON from file:
curl -X POST "http://{address}/api/v2/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.wav;type=audio/wav" -F "preferred_phrases=@pref_phr.json;type=application/json"
curl -X POST "http://{address}/api/v1/stt/file" -H "Content-Type: multipart/form-data" -F "file=@test.wav;type=audio/wav" -F "preferred_phrases=@pref_phr.json;type=application/json"
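If curl is not available, the same multipart/form-data request can be assembled with the Python standard library alone. This is an illustrative sketch, not part of the official API client; the field names "file" and "preferred_phrases" come from the description above, and the helper itself is an assumption:

```python
import json
import uuid

def build_multipart(audio_name, audio_bytes, preferred_phrases=None):
    """Assemble a multipart/form-data body with a 'file' part and an
    optional 'preferred_phrases' JSON part. Returns (content_type, body)."""
    boundary = uuid.uuid4().hex
    parts = [
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="file"; filename="{audio_name}"\r\n'
            'Content-Type: audio/wav\r\n'
            '\r\n'
        ).encode() + audio_bytes + b'\r\n'
    ]
    if preferred_phrases is not None:
        parts.append(
            (
                f'--{boundary}\r\n'
                'Content-Disposition: form-data; name="preferred_phrases"\r\n'
                'Content-Type: application/json\r\n'
                '\r\n'
            ).encode() + json.dumps(preferred_phrases).encode() + b'\r\n'
        )
    parts.append(f'--{boundary}--\r\n'.encode())
    return f'multipart/form-data; boundary={boundary}', b''.join(parts)

content_type, body = build_multipart(
    "test.wav", b"RIFF...",  # placeholder bytes; use real WAV content
    {"preferred_phrases": {"phrases": [{"phrase": "disentu"}]}},
)
# The pair can then be sent with urllib.request.Request(url, data=body,
# headers={"Content-Type": content_type}, method="POST").
```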
Request responses
200 OK
{ "result": { "version": 2, "name": "CloudSpeechRecognitionResult", "file": "/test.wav", "model": "CS_CZ_6", "transcription": [ { "channel_id": 0, "text": "dalším bodem programu byla tisková konference" } ], "one_best_result": { "confidence": 1, "segmentation": [ { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 0, "word": "<silence/>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 3900000, "end": 6900000, "word": "bodem" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 6900000, "end": 11400000, "word": "programu" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 11400000, "end": 11700000, "word": "<silence/>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 11700000, "end": 13500000, "word": "byla" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 13500000, "end": 13800000, "word": "<silence/>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 13800000, "end": 18600000, "word": "tisková" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 18600000, "end": 27000000, "word": "konference" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 27000000, "end": 27000000, "word": "</segment>" } ] }, "n_best_result": { "phrase_variants": [ { "variant": [ { "phrase": "dalším bodem programu byla tisková konference", "channel": 0, "score": 0, "confidence": 1, "start": 0, "end": 27000000 } ] }, { "variant": [] }, { "variant": [] }, { "variant": [] } ] }, "confusion_network_result": [ { "time_slot": 0, "start_time": 0, "end_time": 0, "word": "<segment>", "posterior_probability": 1, "channel": 0 }, { "time_slot": 1, "start_time": 0, "end_time": 3900000, "word": "dalším", "posterior_probability": 1, "channel": 0 }, { "time_slot": 2, "start_time": 3900000, "end_time": 6900000, "word": "bodem", 
"posterior_probability": 1, "channel": 0 }, { "time_slot": 3, "start_time": 6900000, "end_time": 11400000, "word": "programu", "posterior_probability": 1, "channel": 0 }, { "time_slot": 4, "start_time": 11400000, "end_time": 11700000, "word": "<silence/>", "posterior_probability": 1, "channel": 0 }, { "time_slot": 5, "start_time": 11700000, "end_time": 13500000, "word": "byla", "posterior_probability": 1, "channel": 0 }, { "time_slot": 6, "start_time": 13500000, "end_time": 13800000, "word": "<silence/>", "posterior_probability": 1, "channel": 0 }, { "time_slot": 7, "start_time": 13800000, "end_time": 18600000, "word": "tisková", "posterior_probability": 1, "channel": 0 }, { "time_slot": 8, "start_time": 18600000, "end_time": 26980000, "word": "konference", "posterior_probability": 1, "channel": 0 }, { "time_slot": 9, "start_time": 26700000, "end_time": 27000000, "word": "<null/>", "posterior_probability": 0.9375350475311279, "channel": 0 }, { "time_slot": 9, "start_time": 26700000, "end_time": 27000000, "word": "<silence/>", "posterior_probability": 0.06246494874358177, "channel": 0 }, { "time_slot": 10, "start_time": 27000000, "end_time": 27300000, "word": "</segment>", "posterior_probability": 1, "channel": 0 } ], "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "class": "", "out_of_vocabulary": true, "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { "phonemes": "d i s e n t u", "class": "", "out_of_vocabulary": false, "warning_message": "" }, { "phonemes": "d i z e n t u", "class": "", "out_of_vocabulary": true, "warning_message": "" } ] } ] } }
{ "result": { "version": 1, "name": "CloudSpeechRecognitionResult", "file": "/test.wav", "model": "CS_CZ_6", "transcription": [ { "channel_id": 0, "text": "Dalším bodem programu byla tisková konference." } ], "confidence": 1, "segmentation": [ { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 0, "word": "<silence/>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 3900000, "end": 6900000, "word": "bodem" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 6900000, "end": 11400000, "word": "programu" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 11400000, "end": 11700000, "word": "<silence/>" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 11700000, "end": 13800000, "word": "byla" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 13800000, "end": 18600000, "word": "tisková" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 18600000, "end": 27000000, "word": "konference" }, { "channel_id": 0, "score": 0, "confidence": 1, "start": 27000000, "end": 27000000, "word": "</segment>" } ], "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { "phonemes": "d i s e n t u", "out_of_vocabulary": false, "class": "", "warning_message": "" }, { "phonemes": "d i z e n t u", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] } ] } }
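The "start"/"end" values in the responses above appear to be in 100-nanosecond units (3900000 corresponding to 0.39 s); this is an observation from the examples, not a statement from the API specification. A small sketch for pulling the per-word timing out of a v2 result, skipping the markup-like tokens such as <silence/> and <segment>:

```python
UNITS_PER_SECOND = 10_000_000  # assumption: start/end are 100 ns ticks

def words_with_times(result):
    """Yield (word, start_s, end_s) from a v2 one_best_result,
    skipping <silence/>, <segment> and similar markers."""
    for seg in result["one_best_result"]["segmentation"]:
        if seg["word"].startswith("<"):
            continue
        yield (seg["word"],
               seg["start"] / UNITS_PER_SECOND,
               seg["end"] / UNITS_PER_SECOND)

sample = {
    "one_best_result": {"segmentation": [
        {"word": "<segment>", "start": 0, "end": 0},
        {"word": "dalším", "start": 0, "end": 3900000},
        {"word": "bodem", "start": 3900000, "end": 6900000},
    ]}
}
print(list(words_with_times(sample)))
# → [('dalším', 0.0, 0.39), ('bodem', 0.39, 0.69)]
```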
Error responses
422 Unprocessable entity
500 Internal server error
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "(1007) Unsupported audio format" } }
Stream transcription (HTTP)
Start transcription
Start transcription of audio streamed via HTTP, optionally specifying a list of phrases to be preferred in the transcription, and/or words (optionally with pronunciations) to be added to the language model before the transcription starts.
POST /api/v1/stt/stream
Request parameters
"frequency" – query parameter specifying the audio sampling frequency in Hz. If omitted, the default value is 8000.
Request body
Either empty, or preferred phrases JSON data (see File transcription for details about preferred phrases).
(Optional) preferred phrases JSON data example:
{
  "preferred_phrases": {
    "phrases": [
      { "phrase": "disentu" }
    ],
    "dictionary": [
      {
        "word": "jummy",
        "pronunciations": [ { "phonemes": "d Z a m i" } ]
      },
      {
        "word": "disentu",
        "pronunciations": [ { "phonemes": "d i z e n t u" } ]
      }
    ]
  }
}
Example
Start transcription only:
curl -X POST "http://{address}/api/v1/stt/stream?frequency=8000"
Start transcription with preferred phrases JSON from file:
curl -X POST "http://{address}/api/v1/stt/stream" -H "Content-Type: application/json" --data-binary @pref_phr.json
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudHttpInputStreamTaskResult", "stream_id": "ec563083-3d9b-457d-a0ac-24b197bc222f" } }
Error responses
422 Unprocessable entity
500 Internal server error
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
Send audio data
Send audio data to the HTTP stream using HTTP chunked encoding. Audio data must be raw mono 16-bit linear PCM (s16le) audio with the sampling frequency specified by the "frequency" parameter when the HTTP stream transcription was started.
NOTE: Start sending audio data within 30 seconds of starting the transcription and obtaining the stream ID, otherwise the stream listener times out and sending the audio data fails.
PUT /api/v1/stt/stream/{stream_id}
Request parameters
"stream_id" – path parameter specifying the ID of the HTTP stream (obtained when the HTTP stream transcription was started)
Request body
Raw binary audio data
Example
# Stream from audio file:
ffmpeg -hide_banner -re -i "%AUDIOFILE%" -ac %AUDIO_CHANNELS% -ar %AUDIO_SAMPLERATE% -f s16le -method PUT -multiple_requests 1 http://{address}/api/v1/stt/stream/%STREAM_ID%
# Stream from audio device (e.g. microphone) on Windows:
ffmpeg -hide_banner -re -f dshow -i audio="%AUDIOSOURCE%" -ac %AUDIO_CHANNELS% -ar %AUDIO_SAMPLERATE% -f s16le -method PUT -multiple_requests 1 http://{address}/api/v1/stt/stream/%STREAM_ID%
where:
- %AUDIO_CHANNELS% is the number of audio channels (must be 1!)
- %AUDIO_SAMPLERATE% is the sampling frequency (must match the "frequency" parameter, see above)
- %STREAM_ID% is the stream ID (must match the "stream_id" parameter, see above)
- %AUDIOFILE% is the path to the audio file
- %AUDIOSOURCE% is the source audio device name (use e.g. ffmpeg -list_devices 1 -f dshow -i dummy on Windows to list audio devices)
The -re FFmpeg option is essential: it makes the audio stream at realtime speed only (i.e. at normal playback speed), simulating live audio conditions even when an audio file is used as the source. Without this option, the entire audio file would be pushed through as fast as possible, overloading the receiving side and causing various undesired effects (garbled or completely missing transcription).
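The same realtime pacing can be reproduced without FFmpeg by splitting the raw PCM into fixed-duration chunks and sleeping for that duration between sends. A sketch under stated assumptions (the helpers are illustrative; 2 bytes per sample because the stream expects 16-bit mono PCM; the send callable stands in for the chunked-encoding PUT):

```python
import time

BYTES_PER_SAMPLE = 2  # 16-bit mono linear PCM (s16le)

def pcm_chunks(pcm, sample_rate, chunk_ms=250):
    """Yield raw PCM chunks, each covering chunk_ms of audio."""
    chunk_bytes = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]

def stream_realtime(pcm, sample_rate, send, chunk_ms=250):
    """Push chunks at playback speed, like FFmpeg's -re option."""
    for chunk in pcm_chunks(pcm, sample_rate, chunk_ms):
        send(chunk)  # e.g. one chunk of a chunked-encoding PUT to the stream URL
        time.sleep(chunk_ms / 1000)

# One second of 8000 Hz silence splits into four 250 ms chunks:
print(len(list(pcm_chunks(b"\x00" * 16000, 8000))))
# → 4
```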
Request responses
200 OK
Error responses
422 Unprocessable entity
500 Internal server error
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
Get transcription result
GET /api/v2/stt/stream/{stream_id}
GET /api/v1/stt/stream/{stream_id}
Request parameters
"stream_id" – path parameter specifying the ID of the HTTP stream (obtained when the HTTP stream transcription was started)
Request body
empty
Example
curl "http://{address}/api/v2/stt/stream/ec563083-3d9b-457d-a0ac-24b197bc222f"
curl "http://{address}/api/v1/stt/stream/ec563083-3d9b-457d-a0ac-24b197bc222f"
Request responses
200 OK
{ "result": { "version": 2, "name": "CloudSpeechRecognitionStreamResult", "is_last": true, "task_id": "9792ed71-fac5-478c-bf64-90fc816c7743", "stream_id": "0810746b-5df8-401d-931f-52006d845fe9", "model": "CS_CZ_6", "transcription": [ { "channel_id": 0, "text": "dalším bodem programu byla tisková konference" } ], "one_best_result": { "segmentation": [ { "channel_id": 0, "score": -5.591746, "confidence": 0, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": -45.757103, "confidence": 0, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": -41.768444, "confidence": 0, "start": 3900000, "end": 6900000, "word": "bodem" }, { "channel_id": 0, "score": -65.39638, "confidence": 0, "start": 6900000, "end": 11400000, "word": "programu" }, { "channel_id": 0, "score": -2.708908, "confidence": 0, "start": 11400000, "end": 11700000, "word": "<silence/>" }, { "channel_id": 0, "score": -28.787598, "confidence": 0, "start": 11700000, "end": 13500000, "word": "byla" }, { "channel_id": 0, "score": -2.1874542, "confidence": 0, "start": 13500000, "end": 13800000, "word": "<silence/>" }, { "channel_id": 0, "score": -67.11163, "confidence": 0, "start": 13800000, "end": 18600000, "word": "tisková" }, { "channel_id": 0, "score": -66.78635, "confidence": 0, "start": 18600000, "end": 25800000, "word": "konference" }, { "channel_id": 0, "score": -2.6362, "confidence": 0, "start": 25800000, "end": 26100000, "word": "</segment>" } ], "sentence_info": [ { "confidence": 1 } ] }, "n_best_result": { "phrase_variants": [ { "variant": [ { "phrase": "dalším bodem programu byla tisková konference", "channel": 0, "score": 0, "confidence": 1, "start": 0, "end": 25800000 } ] }, { "variant": [] } ] }, "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "class": "", "out_of_vocabulary": true, "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { "phonemes": "d i s e n t u", 
"class": "", "out_of_vocabulary": false, "warning_message": "" }, { "phonemes": "d i z e n t u", "class": "", "out_of_vocabulary": true, "warning_message": "" } ] } ] } }
{ "result": { "version": 1, "name": "CloudSpeechRecognitionStreamResult", "is_last": true, "task_id": "cfa616cb-b2cd-46e8-8783-533a7719e95a", "stream_id": "0b644140-5ef0-4d02-a62e-ffd291618401", "model": "CS_CZ_6", "transcription": [ { "channel_id": 0, "text": "Dalším bodem programu byla tisková konference." } ], "segmentation": [ { "channel_id": 0, "score": -5.3532896, "confidence": 0, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": -46.05657, "confidence": 0, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": -43.449814, "confidence": 0, "start": 3900000, "end": 6900000, "word": "bodem" }, { "channel_id": 0, "score": -66.04119, "confidence": 0, "start": 6900000, "end": 11400000, "word": "programu" }, { "channel_id": 0, "score": -2.3911438, "confidence": 0, "start": 11400000, "end": 11700000, "word": "<silence/>" }, { "channel_id": 0, "score": -27.298859, "confidence": 0, "start": 11700000, "end": 13800000, "word": "byla" }, { "channel_id": 0, "score": -72.11034, "confidence": 0, "start": 13800000, "end": 18600000, "word": "tisková" }, { "channel_id": 0, "score": -82.402466, "confidence": 0, "start": 18600000, "end": 27300000, "word": "konference" }, { "channel_id": 0, "score": -11.498138, "confidence": 0, "start": 27300000, "end": 27900000, "word": "</segment>" } ], "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { "phonemes": "d i s e n t u", "out_of_vocabulary": false, "class": "", "warning_message": "" }, { "phonemes": "d i z e n t u", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] } ] } }
Error responses
422 Unprocessable entity
500 Internal server error
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
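Results can be collected by polling this endpoint until the "is_last" flag in the response (shown in the result examples above) marks the final message. A minimal polling sketch, with the fetch callable standing in for the actual HTTP GET:

```python
import time

def poll_transcription(fetch, interval_s=1.0):
    """Call fetch() until the returned result has is_last set,
    then return that final result."""
    while True:
        result = fetch()["result"]
        if result.get("is_last"):
            return result
        time.sleep(interval_s)

# Stubbed usage: one partial result, then the final one.
responses = iter([
    {"result": {"is_last": False, "transcription": [{"text": "dalším"}]}},
    {"result": {"is_last": True, "transcription": [{"text": "dalším bodem"}]}},
])
final = poll_transcription(lambda: next(responses), interval_s=0)
print(final["transcription"][0]["text"])
# → dalším bodem
```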
Stop transcription
Explicitly stop the HTTP stream transcription immediately.
(Alternatively, you can simply stop sending audio data to the stream; it will then stop automatically after a 30-second timeout.)
DELETE /api/v1/stt/stream/{stream_id}
Request parameters
"stream_id" – path parameter specifying the ID of the HTTP stream (obtained when the HTTP stream transcription was started)
Request body
empty
Example
curl -X DELETE "http://{address}/api/v1/stt/stream/ec563083-3d9b-457d-a0ac-24b197bc222f"
Request responses
200 OK
Error responses
422 Unprocessable entity
500 Internal server error
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
Stream transcription (WebSockets)
Open WebSockets connection
Connect to ws://{address}/api/v{version}/websocket/stt, optionally with the "frequency" query parameter specifying the audio sampling frequency in Hz (if omitted, the default value is 8000). The raw audio data chunks sent later must use this sample rate.
On successful connection, a “connected” response is sent back:
{ "status": "connected", "error": "", "result": {} }
On unsuccessful connection or other error, an “error” response is sent back:
{ "status": "error", "error": "Error description message", "result": {} }
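Every message on the socket uses the status/error/result envelope shown above, so a client can branch on the "status" field. A minimal dispatch sketch (the return values are illustrative, not prescribed by the API):

```python
def handle_message(msg):
    """Route a WebSocket message by its status field."""
    status = msg.get("status")
    if status == "connected":
        return "ready"                # connection acknowledged
    if status == "error":
        raise RuntimeError(msg.get("error", "unknown error"))
    if status == "result":
        return msg["result"]          # transcription update payload
    raise ValueError(f"unexpected status: {status}")

print(handle_message({"status": "connected", "error": "", "result": {}}))
# → ready
```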
Send preferred phrases definition (optional)
After successful WebSockets connection you can optionally send a message with preferred phrases JSON definition – see the File transcription section for details about preferred phrases.
Send audio data
After a successful WebSockets connection – and optionally sending the preferred phrases definition – start sending raw mono 16-bit linear PCM (s16le) audio data chunks in binary WebSocket frames. The audio sample rate must match the "frequency" parameter specified when opening the WebSocket connection; if they don't match, various issues can occur (typically, the transcription is either completely missing or garbled).
No confirmation messages are sent for successfully received audio data; only error messages are sent back.
After sending all audio data, send a special message to notify the system that the audio has finished:
{ "event": "stop" }
Receive transcription result
“Result” responses are sent through the WebSocket periodically; a new message is sent with each transcription update.
Each result always contains the complete transcription, from the beginning up to the current timepoint.
Example
{ "status": "result", "error": "", "result": { "version": 2, "name": "CloudSpeechRecognitionStreamResult", "is_last": true, "task_id": "f83fe880-fe1f-4b17-a4cb-36ed9b6d38df", "stream_id": "3ff350cf-dc5d-4c6a-a71e-d6e165a5b12f", "model": "CS_CZ_6", "transcription": [ { "channel_id": 0, "text": "dalším bodem programu byla tisková konference" } ], "one_best_result": { "segmentation": [ { "channel_id": 0, "score": -5.591746, "confidence": 0, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": -45.757103, "confidence": 0, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": -41.768444, "confidence": 0, "start": 3900000, "end": 6900000, "word": "bodem" }, { "channel_id": 0, "score": -65.39638, "confidence": 0, "start": 6900000, "end": 11400000, "word": "programu" }, { "channel_id": 0, "score": -2.708908, "confidence": 0, "start": 11400000, "end": 11700000, "word": "<silence/>" }, { "channel_id": 0, "score": -28.787598, "confidence": 0, "start": 11700000, "end": 13500000, "word": "byla" }, { "channel_id": 0, "score": -2.1874542, "confidence": 0, "start": 13500000, "end": 13800000, "word": "<silence/>" }, { "channel_id": 0, "score": -67.11163, "confidence": 0, "start": 13800000, "end": 18600000, "word": "tisková" }, { "channel_id": 0, "score": -66.78635, "confidence": 0, "start": 18600000, "end": 25800000, "word": "konference" }, { "channel_id": 0, "score": -2.6362, "confidence": 0, "start": 25800000, "end": 26100000, "word": "</segment>" } ], "sentence_info": [ { "confidence": 1 } ] }, "n_best_result": { "phrase_variants": [ { "variant": [ { "phrase": "dalším bodem programu byla tisková konference", "channel": 0, "score": 0, "confidence": 1, "start": 0, "end": 25800000 } ] }, { "variant": [] } ] }, "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { 
"phonemes": "d i s e n t u", "out_of_vocabulary": false, "class": "", "warning_message": "" }, { "phonemes": "d i z e n t u", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] } ] } }
{ "status": "result", "error": "", "result": { "version": 1, "name": "CloudSpeechRecognitionStreamResult", "task_id": "cfa616cb-b2cd-46e8-8783-533a7719e95a", "stream_id": "0b644140-5ef0-4d02-a62e-ffd291618401", "model": "CS_CZ_6", "is_last": false, "transcription": [ { "channel_id": 0, "text": "dalším bodem" } ], "segmentation": [ { "channel_id": 0, "score": -5.3532896, "confidence": 0, "start": 0, "end": 0, "word": "<segment>" }, { "channel_id": 0, "score": -46.05657, "confidence": 0, "start": 0, "end": 3900000, "word": "dalším" }, { "channel_id": 0, "score": -43.449814, "confidence": 0, "start": 3900000, "end": 6900000, "word": "bodem" } ], "phrases": [ { "phrase": "disentu" } ], "dictionary": [ { "word": "jummy", "pronunciations": [ { "phonemes": "d Z a m i", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] }, { "word": "disentu", "pronunciations": [ { "phonemes": "d i s e n t u", "out_of_vocabulary": false, "class": "", "warning_message": "" }, { "phonemes": "d i z e n t u", "out_of_vocabulary": true, "class": "", "warning_message": "" } ] } ] } }
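Since each result message carries the complete transcription so far, a client that wants only the newly recognized part can diff consecutive messages. An illustrative sketch, assuming the earlier text usually stays a prefix of the later one (which may not hold if the engine revises earlier words – in that case the whole current text is returned):

```python
def transcription_delta(previous, current):
    """Return the newly added tail if previous is a prefix of current;
    otherwise return the whole current text (earlier words were revised)."""
    if current.startswith(previous):
        return current[len(previous):].lstrip()
    return current

print(transcription_delta("dalším bodem", "dalším bodem programu byla"))
# → programu byla
```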
Full example
import json
import asyncio
from concurrent.futures import ALL_COMPLETED

from pydub import AudioSegment
import websockets
from websockets.exceptions import ConnectionClosed


async def send_raw_chunks(websocket):
    # Send the audio in 1-second chunks, paced at realtime speed
    data = AudioSegment.from_file("test.wav", "wav")
    for chunk in data[::1000]:
        await websocket.send(chunk.raw_data)
        await asyncio.sleep(1)
    # Notify the server that the audio has finished
    await websocket.send(json.dumps({"event": "stop"}))


async def consume(websocket):
    # Print every result message until the server closes the connection
    while True:
        try:
            data = await websocket.recv()
            print(data)
        except ConnectionClosed:
            break
        await asyncio.sleep(0.05)


async def example():
    async with websockets.connect(
        "ws://ADDRESS/api/v2/websocket/stt?frequency=8000"
    ) as websocket:
        # Optional: send the preferred phrases definition first
        msg = {
            "data": {
                "preferred_phrases": {
                    "phrases": [{"phrase": "disentu"}],
                    "dictionary": [
                        {"word": "jummy",
                         "pronunciations": [{"phonemes": "d Z a m i"}]},
                        {"word": "disentu",
                         "pronunciations": [{"phonemes": "d i z e n t u"}]},
                    ],
                }
            }
        }
        await websocket.send(json.dumps(msg))
        await asyncio.wait(
            [
                asyncio.create_task(consume(websocket)),
                asyncio.create_task(send_raw_chunks(websocket)),
            ],
            return_when=ALL_COMPLETED,
        )


asyncio.run(example())
Generic functions
List supported graphemes
Returns the list of graphemes (letters) supported by the Speech To Text model.
In general, these are to be used in word definitions in the "dictionary" part of preferred phrases. If a word definition uses graphemes not in this list, the definition MUST be accompanied by a pronunciation definition (which must use only phonemes supported by the Speech To Text model).
GET /api/v1/stt/graphemes
Request parameters
none
Request body
empty
Example
curl "http://{address}/api/v1/stt/graphemes"
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudGraphemesListResult", "graphemes": [ "a", "á", "b", "c", ... ], "word_separator": "+" } }
Error responses
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
List supported phonemes
Returns the list of phonemes supported by the Speech To Text model.
These are the only phonemes allowed in pronunciation definitions in the "dictionary" part of preferred phrases.
GET /api/v1/stt/phonemes
Request parameters
none
Request body
empty
Example
curl "http://{address}/api/v1/stt/phonemes"
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudPhonemesListResult", "phonemes": [ "a", "b", "@", "e", ... ] } }
Error responses
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
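The returned phoneme list can be used to validate custom pronunciations before submitting them: each space-separated symbol in a "phonemes" string must appear in the list. An illustrative sketch (the supported set below is a made-up subset, not the output of a real model):

```python
def invalid_phonemes(pronunciation, supported):
    """Return the symbols in a space-separated pronunciation string
    that are not in the supported phoneme set."""
    return [p for p in pronunciation.split() if p not in supported]

supported = {"a", "b", "d", "e", "i", "m", "n", "t", "u", "z", "Z"}
print(invalid_phonemes("d i z e n t u", supported))
# → []
print(invalid_phonemes("d Z a m y", supported))
# → ['y']
```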
List supported word classes
Returns the list of class names supported by the Speech To Text model.
An empty list means that the model does not support classes.
GET /api/v1/stt/classes
The class names can be used as generic placeholders in preferred phrases – use the class name prefixed with "$" as a class token. Example preferred phrase containing class tokens: “my name is $female_first_name_nominative and i live in $municipality”.
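A class token is plain text inside the phrase string, so building such a preferred phrase is ordinary JSON work. A sketch (the helper is illustrative; the class names are taken from the example above and may differ per model):

```python
import json

def class_token(name):
    """Format a word-class name as a class token for preferred phrases."""
    return "$" + name

phrase = (f"my name is {class_token('female_first_name_nominative')} "
          f"and i live in {class_token('municipality')}")
payload = {"preferred_phrases": {"phrases": [{"phrase": phrase}]}}
print(phrase)
# → my name is $female_first_name_nominative and i live in $municipality
```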
Request parameters
none
Request body
empty
Example
curl "http://{address}/api/v1/stt/classes"
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudSpeechRecognitionClassesResult", "classes": [ { "name": "female_first_name_nominative" }, { "name": "female_first_name_dative" }, { "name": "female_first_name_instrumental" }, { "name": "female_surname_nominative" }, { "name": "female_surname_dative" }, { "name": "female_surname_instrumental" }, { "name": "male_first_name_nominative" }, { "name": "male_first_name_instrumental" }, { "name": "male_surname_nominative" }, { "name": "male_surname_instrumental" }, { "name": "municipality" }, { "name": "street" } ] } }
Error responses
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
Check dictionary for issues
Checks the “additional words” dictionary for potential issues such as invalid phonemes or graphemes.
POST /api/v1/stt/checkdictionary
The input "dictionary" array must contain one or more "word" definitions. Each word definition can optionally have one or more "pronunciations" defined using "phonemes" allowed in the transcription language. If a word definition does not have any pronunciation specified explicitly, the system generates a default pronunciation based on the word graphemes (letters), following the rules of the transcription language.
⚠ This may result in incorrect or odd pronunciations – especially with foreign or non-native words – due to differences in how certain graphemes are pronounced in the transcription language.
ⓘ A word definition may use letters from a different alphabet (e.g. German words like “grüßen” in a Czech transcription), or even a different writing script like Cyrillic or Japanese Kana. In that case, the pronunciation definition is mandatory; a missing pronunciation definition results in an error and the word is ignored.
The output contains the dictionary provided as input, enriched with:
- pronunciations – either internally generated (for words not present in the internal vocabulary) or defined in the language model (for existing words)
- an indication whether the word exists in the internal vocabulary or not
- the name of the class the word belongs to
- a warning message, if any, with details about detected issues with the word/pronunciation
You can review the automatically generated pronunciations and possibly use them as starting point for further fine-tuning of pronunciation variants.
Request parameters
none
Request body
JSON with “additional words” definition
{
  "dictionary": [
    { "word": "preferred" },
    {
      "word": "phrase",
      "pronunciations": [ { "phonemes": "f r eI z" } ]
    }
  ]
}
Example
curl -X POST "http://{address}/api/v1/stt/checkdictionary" -H "Content-Type: application/json" --data-binary @pref_phr.json
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudSpeechToTextCheckDictionaryResult", "dictionary": [ { "word": "preferred", "pronunciations": [ { "phonemes": "p r I f @r d", "out_of_vocabulary": false, "class": "", "warning_message": "" } ] }, { "word": "phrase", "pronunciations": [ { "phonemes": "f r eI z", "out_of_vocabulary": false, "class": "", "warning_message": "" } ] } ] } }
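The response can be scanned programmatically for pronunciations worth reviewing: out-of-vocabulary entries or entries with a non-empty warning message. An illustrative sketch over the response shape shown above:

```python
def flagged_pronunciations(result):
    """Yield (word, phonemes, reason) for pronunciations needing review."""
    for entry in result["dictionary"]:
        for pron in entry["pronunciations"]:
            if pron["out_of_vocabulary"]:
                yield entry["word"], pron["phonemes"], "out of vocabulary"
            elif pron["warning_message"]:
                yield entry["word"], pron["phonemes"], pron["warning_message"]

sample = {"dictionary": [
    {"word": "jummy", "pronunciations": [
        {"phonemes": "d Z a m i", "out_of_vocabulary": True, "warning_message": ""}]},
    {"word": "phrase", "pronunciations": [
        {"phonemes": "f r eI z", "out_of_vocabulary": False, "warning_message": ""}]},
]}
print(list(flagged_pronunciations(sample)))
# → [('jummy', 'd Z a m i', 'out of vocabulary')]
```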
Error responses
422 Unprocessable entity
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }
List available technologies
Returns the list of available Speech To Text model names and versions (useful e.g. when contacting support).
Also returns useful status information: the total number of available instances (i.e. the maximum number of concurrently running transcriptions) and the number of busy instances (i.e. the number of transcriptions currently running).
GET /api/v1/technologies
Request parameters
none
Request body
empty
Example
curl "http://{address}/api/v1/technologies"
Request responses
200 OK
{ "result": { "version": 1, "name": "CloudTechnologiesResult", "technologies": [ { "name": "Speech To Text", "models": [ { "name": "CS_CZ_6", "version": "6.5.0", "n_total_instancies": 1, "n_busy_instancies": 0 } ] }, { "name": "Speech To Text Input Stream", "models": [ { "name": "CS_CZ_6", "version": "6.5.0", "n_total_instancies": 1, "n_busy_instancies": 0 } ] } ] } }
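The instance counters make it easy to check remaining transcription capacity before submitting work. A sketch over the response shape above (the helper is illustrative; note the field names follow the API's own spelling, "n_total_instancies"/"n_busy_instancies"):

```python
def free_instances(result, technology="Speech To Text"):
    """Sum free instances across the models of one technology."""
    total = 0
    for tech in result["technologies"]:
        if tech["name"] == technology:
            for model in tech["models"]:
                total += model["n_total_instancies"] - model["n_busy_instancies"]
    return total

sample = {"technologies": [
    {"name": "Speech To Text",
     "models": [{"name": "CS_CZ_6", "version": "6.5.0",
                 "n_total_instancies": 1, "n_busy_instancies": 0}]},
]}
print(free_instances(sample))
# → 1
```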
Error responses
503 Service unavailable
{ "result" : { "version" : 1, "name" : "CloudErrorResult", "message" : "Connection failed" } }