Speech Quality Estimation (SQE)

Phonexia’s Speech Quality Estimation (SQE) quantifies the acoustic quality of recordings, so the user can quickly determine whether a recording is good enough for processing with other speech technologies. In response to an SQE request, the SPE returns a JSON or XML file. The file contains general information about the technology and statistics for each channel (one or two). The per-channel statistics cover many aspects of recording quality, together with an overall global score.

Technology

The technology is language-, accent-, text-, and channel-independent.
Compatible with the widest possible range of audio sources (channel compensation techniques are applied): GSM/CDMA, 3G, VoIP, landlines, etc.

Input

Input format for processing: WAV or RAW (8- or 16-bit linear coding), A-law or Mu-law, PCM, sampling rate of 8 kHz or higher
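
The input requirements above can be checked before a file is submitted. The following is a minimal sketch, not part of the SPE: the function name check_wav_input and its parameters are hypothetical, and it relies on Python's standard wave module, which reads linear PCM WAV files only (A-law, Mu-law and RAW input would need a different reader).

```python
import wave


def check_wav_input(path, min_rate_hz=8000, allowed_sample_bytes=(1, 2)):
    """Check a PCM WAV file against the documented input requirements.

    Hypothetical helper: returns (ok, reasons), where reasons lists every
    requirement the file does not meet.
    """
    reasons = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()      # sampling rate in Hz
        width = wav.getsampwidth()     # bytes per sample (1 = 8 bit, 2 = 16 bit)
        channels = wav.getnchannels()  # SQE reports one or two channels

    if rate < min_rate_hz:
        reasons.append(f"sampling rate {rate} Hz is below {min_rate_hz} Hz")
    if width not in allowed_sample_bytes:
        reasons.append(f"sample width of {8 * width} bits is not 8 or 16 bits")
    if channels not in (1, 2):
        reasons.append(f"{channels} channels; one or two are expected")
    return (not reasons), reasons
```

A call such as check_wav_input("call.wav") returns (True, []) for a file that meets all three conditions; the file name is only an illustration.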

Output

global score – audio quality expressed as a percentage (range 0 to 100). By default, the global score is calculated from the waveform_n_bits and waveform_snr variables.
pesq – a value inspired by PESQ (Perceptual Evaluation of Speech Quality). It ranges from -0.5 to 4.5; the higher the rating, the better the quality of the recording.

Other important statistics (output)

name – the name of the statistic
value – the measured value of the statistic
min_limit and max_limit – the lowest and highest values the statistic can take in the recording, given its encoding, frequency and bitrate
string – not used, reserved for future use
is_valid – indicates whether the statistic was calculated correctly (value = true) or not; e.g. for an empty recording the SNR calculation would divide by zero, so is_valid would be false
waveform_snr – the signal-to-noise ratio (SNR), i.e. the ratio of the useful signal to the noise
- measured in dB
- calculated from the waveform distribution (silence has a Gaussian distribution, voice has a Gamma distribution)
- SNR = 20 * log₁₀(S/N)
- technical signal is usually considered to be a useful signal
- the higher the better
- SNR > 15 dB usually indicates a signal of good quality
- SNR = 0 dB means that speech and noise are at the same level
waveform_max_abs_value – the maximum absolute amplitude; indicates the spread of the signal
- dimensionless (no unit)
- the usual 16-bit encoding ranges from -32,768 to +32,767; ideally the signal uses the whole range
- if less than 5000, the signal does not use enough of the available range
- in the case of a quiet signal, significant numerical errors will appear
waveform_min_abs_value – the minimum amplitude
- dimensionless (no unit)
- values below 1000 mean that the recording is too quiet
- in the case of a quiet signal, significant numerical errors will appear
waveform_clipped_length – the total length of the portions of the recording (converted from 25 ms frames) that contain speech clipping
- length given in seconds
- clipping is the removal of the part of the signal that exceeds the pre-set threshold
- may occur when the original speech is too loud for the recording
waveform_n_bits – the number of bits used by the waveform
- absolute value
- if less than 8, the signal has insufficient quality
wfilter_technical_signal_length – the length of technical signals (tones, wide-band noise, etc.), measured in seconds
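
The SNR formula and the thresholds listed above can be sketched in a few lines. This is an illustrative example only, not the SPE implementation: the function names snr_db and assess are hypothetical, S and N are assumed to be amplitude levels (which the factor 20 suggests), and each statistic is assumed to arrive as a record with the name, value and is_valid fields described above. The exact layout of the SPE's JSON/XML output is not reproduced here.

```python
import math


def snr_db(signal_level, noise_level):
    """SNR = 20 * log10(S/N) in dB; None when the ratio is undefined."""
    if signal_level <= 0 or noise_level <= 0:
        return None  # e.g. an empty recording, where is_valid would be false
    return 20.0 * math.log10(signal_level / noise_level)


def assess(stats):
    """Return warnings for statistics outside the documented guidelines.

    stats is a list of dicts with the keys "name", "value" and "is_valid"
    (an assumed in-memory form of the per-channel statistics).
    """
    warnings = []
    for stat in stats:
        name, value = stat["name"], stat["value"]
        if not stat.get("is_valid", False):
            warnings.append(f"{name}: calculation is not valid")
        elif name == "waveform_snr" and value <= 15:
            warnings.append(f"{name}: {value} dB is not above 15 dB")
        elif name == "waveform_max_abs_value" and value < 5000:
            warnings.append(f"{name}: {value} is below 5000")
        elif name == "waveform_min_abs_value" and value < 1000:
            warnings.append(f"{name}: {value} is below 1000")
        elif name == "waveform_n_bits" and value < 8:
            warnings.append(f"{name}: {value} bits is fewer than 8")
        elif name == "waveform_clipped_length" and value > 0:
            warnings.append(f"{name}: {value} s of clipped speech")
    return warnings


# Illustrative values only, not measured output.
example = [
    {"name": "waveform_snr", "value": 9.0, "is_valid": True},
    {"name": "waveform_n_bits", "value": 16, "is_valid": True},
]
for line in assess(example):
    print(line)
```

An empty list from assess means that none of the documented guidelines was violated; it does not replace the global score.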

Processing speed

Approx. 2,000x faster than real time on 1 CPU core
e.g. a standard server with 8 CPU cores processes 384,000 hours of audio in 1 day of computing time
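
The daily throughput figure follows from simple arithmetic. The sketch below reproduces it under the assumption that processing scales linearly with the number of cores; the function name and parameters are hypothetical.

```python
def audio_hours_per_day(speedup=2000, cores=8, compute_hours=24):
    """Hours of audio processed, assuming linear scaling across cores."""
    return speedup * cores * compute_hours


print(audio_hours_per_day())  # 384000
```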
