Search Results for: STT performance

Results 1 - 10 of 42 Page 1 of 5

SPE3 – Releases and Changelogs

Relevance: 100%      Posted on: 2020-09-12

Speech Engine (SPE) is developed as a RESTful API on top of Phonexia BSAPI. SPE was formerly known as BSAPI-rest (up to v2.x) or as Phonexia Server (up to v3.2.x). This page lists changes in SPE releases. Releases Changelogs Speech Engine 3.30.13 (09/11/2020) - DB v1401, BSAPI 3.30.13 Public release New: Updated STT and KWS model AR_XL to version 5.1.0 Speech Engine 3.32.0 (08/28/2020) - DB v1500, BSAPI 3.32.0 Non-public Feature Preview release New: Added support for Webhooks and WebSockets in stream processing New: Added support for preferred phrases in the 5th generation of STT (see POST /technologies/stt or POST /technologies/stt/input_stream) New:…

Performance of the Speaker Identification 4th generation (SID4): Intel® Xeon® Platinum 8124M

Relevance: 65%      Posted on: 2019-10-30

Benchmark goals Find realistic performance using total recording length Find FTRT based exactly on net_speech (engineering sizing data) Find system performance using all physical cores Find system performance using all logical cores Infrastructure setup Intel® Xeon® Platinum 8124M is used in a virtual machine with 8 physical cores reserved exclusively for this VM, Hyper-Threading is enabled [16 logical cores available], 32GB RAM, 30GB SSD-based storage, 1000 I/O.s-1 reserved per core Benchmark data setup Data set statistics: Number of files: 32 [300 seconds each] RAW recordings length ∑: 9600 [sec] Net speech length ∑: 4224.77 [sec] In the data set…

STT Language Model Customization tutorial

Relevance: 57%      Posted on: 2019-04-24

The Language Model Customization tool (LMC) provides a way to improve Speech To Text performance by creating a customized language model. The language model is an important part of Phonexia Speech To Text. In a simplified way, it can be imagined as a large dictionary with multiple statistics. The Speech To Text technology uses this dictionary and statistical model to convert audio signals into the proper text equivalents. Due to the general diversity of spoken speech, the default generic language model may not acknowledge the importance of certain words over other words in certain situations. Language model customization is a way to inform the…
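The "large dictionary with multiple statistics" idea from the excerpt above can be illustrated with a toy unigram model. This is only a conceptual sketch, not how LMC actually works internally (real language models use far richer n-gram statistics); the corpora and word counts below are made up for illustration.

```python
# Toy illustration: customization (adding domain text) raises the
# probability of domain-specific words that a generic model misses.
from collections import Counter

def unigram_probs(corpus):
    """Relative frequency of each word in a whitespace-split corpus."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

generic = "call me back about the meeting please call back soon"
domain = "phonexia speech engine speech transcription speech analytics"

base = unigram_probs(generic)                 # generic model only
custom = unigram_probs(generic + " " + domain)  # generic + domain text

# "speech" is unseen in the generic model but gains probability
# after the domain text is added:
print(base.get("speech", 0.0))   # → 0.0
print(round(custom["speech"], 3))
```

The same intuition carries over to real customization: domain text shifts probability mass toward the vocabulary that matters for your recordings.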

How to convert STT confusion network results to one-best

Relevance: 30%      Posted on: 2020-04-06

Confusion Network output is the most detailed Speech Engine STT output, as it provides multiple word alternatives for individual timeslots of the processed speech signal. Therefore, many applications want to use it as the main source of speech transcription and perform eventual conversion to less verbose output formats internally. This article provides the recommended way to do the conversion. Time slots and word alternatives: The recommended algorithm for converting Confusion Network (CN) to One-best is as follows: loop through all CN timeslots from start to end; in each timeslot, get the input alternative with the highest score and if it's not <null/> or…
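The loop described in the excerpt can be sketched as follows. The input structure (a list of timeslots, each holding scored word alternatives) is an assumption modeled on typical confusion network output; the field names and sample data are illustrative, not the actual Speech Engine JSON schema.

```python
# Minimal CN-to-one-best sketch: per timeslot, keep the highest-scoring
# alternative unless it is the empty alternative <null/>.
def cn_to_one_best(timeslots):
    words = []
    for slot in timeslots:  # loop through all timeslots, start to end
        best = max(slot["alternatives"], key=lambda a: a["score"])
        if best["word"] != "<null/>":  # drop timeslots won by <null/>
            words.append(best["word"])
    return " ".join(words)

cn = [
    {"alternatives": [{"word": "hello", "score": 0.9},
                      {"word": "<null/>", "score": 0.1}]},
    {"alternatives": [{"word": "<null/>", "score": 0.7},
                      {"word": "uh", "score": 0.3}]},
    {"alternatives": [{"word": "world", "score": 0.8},
                      {"word": "word", "score": 0.2}]},
]
print(cn_to_one_best(cn))  # → "hello world"
```

Note how the middle timeslot is dropped entirely because its best alternative is `<null/>`, which is exactly the behavior the article's algorithm calls for.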

How to configure STT realtime stream word detection parameters

Relevance: 27%      Posted on: 2020-03-28

One of the improvements implemented since Speech Engine 3.24 is neural-network-based VAD, used for word and segment detection. This article describes the segmenter configuration parameters and how they affect realtime stream STT results. The default segmenter parameters are as shown below: [vad.online_segmenter:SOnlineVoiceActivitySegmenterI] backward_extensions_length_ms=150 forward_extensions_length_ms=750 speech_threshold=0.5 Backward and forward extensions are intervals in milliseconds which extend the part of the signal going to the decoder. The decoder is a component which determines what a particular part of the signal contains (speech, silence, etc.). Based on that, the decoder also decides whether a segment has finished or not. Unlike in file processing…
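The effect of the two extension parameters can be sketched numerically. This is only an illustration of the interval arithmetic implied by the excerpt, assuming the extensions simply widen a detected speech segment before it is handed to the decoder; the segment boundaries below are made up.

```python
# Default values quoted in the segmenter configuration above.
BACKWARD_EXT_MS = 150   # prepended before the detected speech start
FORWARD_EXT_MS = 750    # appended after the detected speech end

def extend_segment(start_ms, end_ms):
    """Widened interval sent to the decoder (start clamped at 0)."""
    return max(0, start_ms - BACKWARD_EXT_MS), end_ms + FORWARD_EXT_MS

# A segment detected at 1.0-2.5 s grows to 0.85-3.25 s:
print(extend_segment(1000, 2500))  # → (850, 3250)
```

Larger extensions give the decoder more acoustic context around each detected word at the cost of more audio to process per segment.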

Difference between on-the-fly and off-line type of transcription (STT)

Relevance: 26%      Posted on: 2017-12-11

Similarly to a human, the ASR (STT) engine adapts to the acoustic channel, environment, and speaker. The ASR (STT) engine also learns more information about the content over time, which is used to improve recognition. The dictation engine, also known as on-the-fly transcription, does not look into the future and has information about just a few seconds of speech at the beginning of a recording. As the output is requested immediately during processing of the audio, the engine can't predict what will come in the next seconds of speech. When access to the whole recording is granted during off-line transcription…


Relevance: 24%      Posted on: 2018-02-01

Phonexia Speech To Text, sometimes also known as Speech Transcription Technology (LVCSR-based ASR technology)

Measuring software processing speed – what is FtRT (Faster than Real Time)

Relevance: 15%      Posted on: 2019-10-30

Faster Than Real Time (FTRT) is a metric developed to define a software performance reference point. Using this metric, you can collect "benchmark" data on the real processing speed of the reviewed software, which should be found - and reproduced - on exactly defined HW. Then, by comparing various benchmark results, you can compare the performance of the specified software and its parts on different HW configurations. And vice versa - using the same metric, you can compare software from different vendors on the same HW configuration and for the same processing task. We recognize two measurable metrics: Recording-based FTRT is calculated from real…
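The two FTRT variants named in the excerpts above (recording-based and net-speech-based) can be sketched as a simple ratio. The audio-length figures below reuse the dataset totals from the SID4 benchmark result on this page; the wall-clock processing time is a made-up placeholder, not a measured value.

```python
# FTRT: seconds of audio processed per second of wall-clock time.
def ftrt(audio_length_s, processing_time_s):
    return audio_length_s / processing_time_s

processing_time_s = 640.0  # hypothetical wall-clock processing time

recording_ftrt = ftrt(9600.0, processing_time_s)    # total recording length
net_speech_ftrt = ftrt(4224.77, processing_time_s)  # net speech only

print(f"recording-based FTRT:  {recording_ftrt:.2f}")   # → 15.00
print(f"net-speech-based FTRT: {net_speech_ftrt:.2f}")  # → 6.60
```

The recording-based figure answers "how fast do my files get processed", while the net-speech-based figure is the engineering sizing number, since silence costs far less to process than speech.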