
Speech Quality Estimation (SQE)

Phonexia’s Speech Quality Estimation (SQE) quantifies the acoustic quality of recordings, so that the user can quickly determine whether a recording is good enough to be processed by other speech technologies. In response to an SQE request, the SPE returns a JSON or XML file. The file contains general information about the technology and statistics for each channel (one or two). The per-channel statistics cover many aspects of recording quality, along with an overall global score.

Technology

  • The technology is language-, accent-, text-, and channel-independent
  • Compatible with the widest possible range of audio sources (channel compensation techniques are applied): GSM/CDMA, 3G, VoIP, landlines, etc.

Input

  • Input format for processing: WAV or RAW (8- or 16-bit linear coding),
    A-law or Mu-law, PCM, sampling rate of 8 kHz or higher
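
The input requirements above can be pre-checked before a file is submitted. The following is a minimal sketch, not part of the Phonexia API: the function names `check_wav_input` and `make_test_wav` are hypothetical, and Python's standard `wave` module reads only linear PCM WAV files, so A-law, Mu-law, and RAW inputs are outside its scope.

```python
import io
import wave


def check_wav_input(source):
    """Check a linear PCM WAV (path or binary file object) against the
    listed input requirements: 8 or 16 bits, 8 kHz or more, 1 or 2 channels."""
    with wave.open(source, "rb") as wav:
        bits = wav.getsampwidth() * 8
        rate_hz = wav.getframerate()
        channels = wav.getnchannels()
    ok = bits in (8, 16) and rate_hz >= 8000 and channels in (1, 2)
    return ok, {"bits": bits, "rate_hz": rate_hz, "channels": channels}


def make_test_wav(rate_hz=8000, sample_width_bytes=2, channels=1):
    """Build a short silent WAV in memory (demo helper only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        # 80 frames of zero-valued samples
        wav.writeframes(bytes(sample_width_bytes * channels * 80))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_wav_input(make_test_wav()))
```

A file that fails this check is not necessarily unusable; the sketch covers only the subset of the listed formats that the `wave` module can parse.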

Output

  • global score – audio quality expressed as a percentage (range 0 to 100). By default, the global score is calculated from the waveform_n_bits and waveform_snr variables.
  • pesq – a value inspired by PESQ (Perceptual Evaluation of Speech Quality). It ranges from -0.5 to 4.5; the higher the rating, the better the quality of the recording.

Other important statistics (output)

  • name – the name of the statistic
  • value – the measured value of the statistic
  • min_limit and max_limit – the lowest and highest values the statistic can take in the recording, given its encoding, frequency, and bitrate
  • string – not used, reserved for future use
  • is_valid – indicates whether the statistic was calculated correctly (value=true) or not; e.g. for an empty recording the SNR calculation would divide by zero, so is_valid would be false
  • waveform_snr – the signal to noise ratio (SNR) describes the ratio of the useful signal to the noise signal
    • it is measured in dB
    • calculated from the waveform distribution (silence has a Gaussian distribution, voice has a Gamma distribution)
    • SNR = 20 * log10(S/N)
    • a technical signal is usually counted as useful signal
    • the higher the better
    • SNR > 15 dB usually indicates a signal of good quality
    • SNR = 0 dB means that speech and noise are at the same level
  • waveform_max_abs_value – the maximum amplitude; the spread of the signal
    • dimensionless (no unit)
    • the usual 16-bit encoding ranges from -32,768 to +32,767; ideally the signal uses the whole range
    • if less than 5000, the signal does not use enough of the available range
    • in the case of a quiet signal, significant numerical errors will appear
  • waveform_min_abs_value – the minimum amplitude
    • dimensionless (no unit)
    • values below 1000 mean that the recording is too quiet
    • in the case of a quiet signal, significant numerical errors will appear
  • waveform_clipped_length – the length of the portion of the recording (converted from 25 ms frames) that contains clipped speech
    • length given in seconds
    • clipping is the cutting off of signal that exceeds a pre-set threshold
    • may occur when the original speech is too loud at the time of recording
  • waveform_n_bits – the number of bits used by the waveform
    • absolute value
    • if less than 8, the signal has insufficient quality
  • wfilter_technical_signal_length – the length of technical signals (tones, wide-band noise, etc.), measured in seconds
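
The SNR formula and the rule-of-thumb thresholds in the list above can be put together in a short sketch. It is only an illustration: the dictionary layout, the constant names, and the functions `snr_db` and `flag_issues` are assumptions and do not reflect the actual structure of the SPE response. Only the statistic names and the threshold values come from the list.

```python
import math

# Threshold values quoted in the list above (constant names are hypothetical).
GOOD_SNR_DB = 15.0
MIN_MAX_ABS_VALUE = 5000
MIN_MIN_ABS_VALUE = 1000
MIN_N_BITS = 8


def snr_db(signal_amplitude, noise_amplitude):
    """SNR = 20 * log10(S/N). Returns None when the ratio is undefined,
    which is comparable to is_valid being false."""
    if signal_amplitude <= 0 or noise_amplitude <= 0:
        return None
    return 20.0 * math.log10(signal_amplitude / noise_amplitude)


def flag_issues(stats):
    """Return warnings for one channel, given a dict of statistic values."""
    issues = []
    snr = stats.get("waveform_snr")
    if snr is None:
        issues.append("waveform_snr is not valid")
    elif snr <= GOOD_SNR_DB:
        issues.append("waveform_snr is not above 15 dB")
    if stats["waveform_max_abs_value"] < MIN_MAX_ABS_VALUE:
        issues.append("waveform_max_abs_value is below 5000")
    if stats["waveform_min_abs_value"] < MIN_MIN_ABS_VALUE:
        issues.append("waveform_min_abs_value is below 1000")
    if stats["waveform_n_bits"] < MIN_N_BITS:
        issues.append("waveform_n_bits is below 8")
    return issues


if __name__ == "__main__":
    channel = {
        "waveform_snr": snr_db(10.0, 1.0),  # amplitude ratio of 10 gives 20 dB
        "waveform_max_abs_value": 20000,
        "waveform_min_abs_value": 1500,
        "waveform_n_bits": 16,
    }
    print(flag_issues(channel))
```

An amplitude ratio of 1 gives 0 dB, which matches the statement that SNR = 0 corresponds to equal speech and noise levels.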

Processing speed

Approx. 2,000x faster than real time on 1 CPU core,
i.e. a standard 8-core server processes 384,000 hours of audio in 1 day of computing time (2,000 x 8 cores x 24 hours)
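
The throughput figure follows from simple arithmetic, sketched here under the assumption that processing scales linearly with the number of cores (which the quoted figure implies). The function name `audio_hours_per_day` is hypothetical.

```python
def audio_hours_per_day(speedup=2000, cores=8, hours_of_computing=24):
    """Hours of audio processed, assuming linear scaling across CPU cores."""
    return speedup * cores * hours_of_computing


if __name__ == "__main__":
    print(audio_hours_per_day())  # 2000 * 8 * 24 = 384000
```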
