Search: channel

21 results

STT: Results explained

…: 0.1612993, "channel" : }, { "time_slot" : 11, "start_time" : 17650000, "end_time" : 21850000, "word" : "tong", "posterior_probability" : 0.1344358, "channel" : }, { "time_slot" : 11, "start_time" : 17650000, "end_time" : 21850000, "word" : "talk", "posterior_probability" : 0.0215998, "channel" : }, { "time_slot" : 11, "start_time" : 17650000, "end_time" : 18110000, "word" : "<silence\/>", "posterior_probability" : 0.3033622, "channel"…
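The entries in this excerpt share time_slot 11 and carry competing word hypotheses, each with a posterior probability. A minimal Python sketch of picking the most probable word per slot follows. The three entries are copied from the excerpt (the channel field is left out because its value is not visible there). The helper names and the 100 ns time unit are assumptions made for illustration, not something the excerpt states.

```python
# Hypotheses copied from the excerpt; "channel" is omitted because its
# value is not visible in the search snippet.
hypotheses = [
    {"time_slot": 11, "start_time": 17650000, "end_time": 21850000,
     "word": "tong", "posterior_probability": 0.1344358},
    {"time_slot": 11, "start_time": 17650000, "end_time": 21850000,
     "word": "talk", "posterior_probability": 0.0215998},
    {"time_slot": 11, "start_time": 17650000, "end_time": 18110000,
     "word": "<silence/>", "posterior_probability": 0.3033622},
]

# ASSUMPTION: time values are counted in 100 ns units (10,000,000 per second).
TICKS_PER_SECOND = 10_000_000


def to_seconds(ticks):
    """Convert a raw time value to seconds under the 100 ns assumption."""
    return ticks / TICKS_PER_SECOND


def best_per_slot(items):
    """Return, for each time slot, the hypothesis with the highest posterior."""
    best = {}
    for hyp in items:
        slot = hyp["time_slot"]
        current = best.get(slot)
        if current is None or hyp["posterior_probability"] > current["posterior_probability"]:
            best[slot] = hyp
    return best


best = best_per_slot(hypotheses)
winner = best[11]
print(winner["word"], to_seconds(winner["start_time"]), to_seconds(winner["end_time"]))
```

Among the three visible entries, the silence token has the highest posterior. The excerpt is truncated, so the full slot may contain further hypotheses.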

Releases and Changelogs (SPE)

…redirected) New: SID/SID4 stream now allows gradually getting voiceprint from the stream (see /technologies/speakerid4/stream/voiceprint) New: Unicode characters in file names are now supported on Windows platform New: Added LLR score to GID result (as score_llr value, see /technologies/genderid) New: Added 'per_channel' parameter to Diarization for processing multi-channel recordings New: Added configuration option to not start SPE if some technology doesn't…

Time Analysis Extraction (TAE)

…operator speaks on one channel and caller on another. TAE can also process mono-channel recordings, but it provides a limited set of results for dialogue statistics. When the technology is applied to a stream, the results are created and returned on every request, even during an ongoing stream. Output: As with the whole SPE, results are provided in the form of JSON…

Speaker Identification (SID)

…signal captured in a recording are also more or less unique, thus the technology can be language-, accent-, text-, and channel-independent. Automatic speaker recognition systems are based on the extraction of the unique features from voices and their comparison. The systems thus usually comprise two distinct steps: Voiceprint Extraction (Speaker enrollment) and Voiceprint comparison. The processing speed depends on the…

Input audio quality

…MPEG 2.5 Layer 3 (MP3) with bitrates only 16 or even 12 kbit/s per channel really cripple the audio way too much. If you really have to use MP3, refrain from using joint-stereo encoding for 2-channel audio, use full stereo instead. NOTE: If the audio was already heavily compressed, converting it to one of the "okay formats" really does NOT…

Speaker Diarization (DIAR)

Speaker Diarization labels segments of the same voice(s) in one mono-channel audio recording based on the individual speaker's voice. It is a language-, domain- and channel-independent technology. It performs not only the segmentation of speakers but of technical signals and silence as well. The outputs of the technology can be both log files with labels and/or split audio files/one new…

Speech Quality Estimation (SQE)

…channels. The statistics of all channels include the numbers for many aspects of recording quality, and the overall global score. Technology: The technology is language-, accent-, text-, and channel-independent. Compatibility with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc. Input: Input format for processing: WAV or RAW (8 or 16 bits…

Age Estimation (AGE)

Phonexia Age Estimation (AGE) estimates the age of a speaker from an audio recording or voiceprint. Technology: Trained with emphasis on spontaneous telephony conversation. The technology is language-, accent-, text-, and channel-independent. Compatibility with the widest range of audio sources possible (applies channel compensation techniques): GSM/CDMA, 3G, VoIP, landlines, etc. Input: Audio: WAV or RAW (8 or 16 bits linear…

KWS: Results explained

…threshold. … { "channel_id": 0, "score": 4.5108547, "confidence": 0.9891304, "start": 171400000, "end": 175900000, "word": "sale_0" }, { "channel_id": 0, "score": -1.5344038, "confidence": 0.17735027, "start": 246900000, "end": 251700000, "word": "sale_1" }, { "channel_id": 0, "score": 2.1896133, "confidence": 0.89931285, "start": 284100000, "end": 291000000, "word": "brazil_0" }, { "channel_id": 0, "score": 0.9341812, "confidence": 0.7179228, "start": 294900000, "end": 300600000, "word": "machine_0" } … …
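The four detections in this excerpt can be used to illustrate thresholding. The sketch below, in Python, keeps only detections whose confidence reaches a cut-off. It also checks an observation about these four entries: each confidence value matches the logistic function of its score to the precision shown. That relationship is inferred from the numbers in the excerpt only and is not stated there. The 0.5 threshold and the function names are assumptions chosen for illustration.

```python
import math

# Detections copied from the excerpt.
detections = [
    {"channel_id": 0, "score": 4.5108547, "confidence": 0.9891304,
     "start": 171400000, "end": 175900000, "word": "sale_0"},
    {"channel_id": 0, "score": -1.5344038, "confidence": 0.17735027,
     "start": 246900000, "end": 251700000, "word": "sale_1"},
    {"channel_id": 0, "score": 2.1896133, "confidence": 0.89931285,
     "start": 284100000, "end": 291000000, "word": "brazil_0"},
    {"channel_id": 0, "score": 0.9341812, "confidence": 0.7179228,
     "start": 294900000, "end": 300600000, "word": "machine_0"},
]


def logistic(score):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-score))


def filter_by_confidence(items, threshold):
    """Keep detections whose confidence is at or above the threshold."""
    return [d for d in items if d["confidence"] >= threshold]


# ASSUMPTION: 0.5 is an arbitrary illustrative cut-off, not a recommended value.
accepted = filter_by_confidence(detections, threshold=0.5)
accepted_words = [d["word"] for d in accepted]
print(accepted_words)

# Largest gap between the reported confidence and logistic(score) in this sample.
max_gap = max(abs(logistic(d["score"]) - d["confidence"]) for d in detections)
print(max_gap)
```

With this cut-off, the detection with the negative score (sale_1) is rejected and the other three are kept.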

SID: Speaker Identification: Results Enhancement

Speaker Identification (SID) Results Enhancement is a process that adjusts the score threshold for detecting/rejecting speakers by removing the effect of speech length and audio quality. This is achieved by using Audio Source Profiles, which represent as closely as possible the source of the speech recording (device, acoustic channel, distance from microphone, language, gender, etc.). Although the out-of-the-box system…

Releases and Changelogs (Browser)

…path to temporary directory contains certain accented characters Fixed: Licensing errors not visible before exiting application Phonexia Browser 3.18.0, BSAPI 3.22.0 (2019-10-03) New: Waveform editor can now process stereo file by Diarization in per-channel mode New: Added Gender balance and Score sharpness in Settings -> Scoring New: Multiple columns in Result pane can be turned on/off at once using context…

LID: Terminology and adaptation

…to train a language using just a few long audio files (like 5 files, 1 hour each) Acoustic channels should be as close as possible to the channel of the intended deployment Adaptation using REST API (SPE 3.38 or newer) SPE 3.38 and newer include LID adaptation tasks in the REST API, which makes the adaptation significantly easier than in previous versions…

Phonexia Speech Engine

…audio manipulation: SPE has built-in basic audio file manipulation functionality, like separating individual channels from stereo recordings, cutting one audio file into several files, saving audio from an incoming stream to a file, and others. Stream audio player: To support voicebot scenarios, SPE has the ability to play audio files directly to an output RTP stream. External Text-to-speech (TTS) integration: Easy integration with external TTS…

Q: What is the difference between on-the-fly and off-line types of speech-to-text transcription (STT)?

A: Like a human, the ASR (STT) engine adapts to the acoustic channel, environment, and speaker. The engine also learns more about the content over time, which is used to improve recognition. The dictate engine, also known as on-the-fly transcription, does not look into the future and has information about just a few…

Gender Identification (GID)

Gender Identification is a language-, domain- and channel-independent technology that uses the acoustic characteristics of the recording to determine the gender of the speaker in question. This technology is able to distinguish between two genders: Male (M) and Female (F). Minimum of speech signal for identification: 7+ sec recommended (with XL4 and L4 model (9+ sec for previous generation of…