Keyword Spotting results explained

This article gives more detail about Keyword Spotting outputs and offers hints on how to tailor Keyword Spotting to best suit your needs.

Scoring and results explanation

Keyword Spotting works by calculating two likelihoods for a given spot in the audio: the likelihood that a keyword occurs there, and the likelihood that any other speech occurs there. It then compares those two likelihoods.

The following scheme shows a Background model for anything before the keyword (1), the Keyword model (2), and a Background model of any speech running in parallel with the Keyword model (3).
Models 2 and 3 produce two likelihoods – Lkw and Lbg (any speech = background).

Raw score is calculated as a log-likelihood ratio: score = ln(Lkw / Lbg)

Confidence is calculated from the raw score using a sigmoid function:
confidence = sigmoid(sharpness × (score + shift))
where:
shift offsets the score so that it is 0 at the ideal decision point
sharpness scales the score, controlling how steeply the confidence changes around the decision point
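
As a minimal sketch of the two formulas above, the following Python snippet computes the raw score and the confidence. The function names and the default values shift = 0 and sharpness = 1 are assumptions made for illustration only; the values used by an actual Keyword Spotting setup may differ.

```python
import math


def raw_score(l_kw: float, l_bg: float) -> float:
    """Log-likelihood ratio of the keyword model vs. the background model."""
    return math.log(l_kw / l_bg)


def confidence(score: float, shift: float = 0.0, sharpness: float = 1.0) -> float:
    """Map a raw score to a confidence using a sigmoid.

    shift and sharpness are illustrative defaults (assumptions),
    not documented product settings.
    """
    x = sharpness * (score + shift)
    # Numerically stable sigmoid: avoids overflow for large negative x.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

With these assumed defaults, confidence(4.5108547) evaluates to roughly 0.989 and confidence(-1.5344038) to roughly 0.177, which agrees with the score and confidence pairs in the sample results later in this article.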

Keyword Spotting results contain a list of detected keywords. Each keyword comes with the start and end time of the time slot where it was detected, plus a score and a confidence.

Each keyword is listed in the results with a numeric suffix. This number is the 0-based index of the detected pronunciation.
Start and end times are in HTK units. 1 HTK unit is 100 nanoseconds, so dividing a time by 10,000 gives the number of milliseconds.
Score is a log-likelihood ratio from the (-inf, +inf) interval.
Confidence is a probability in the interval from 0 to 1. To convert it to a percentage, multiply the confidence value by 100.
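
The conversions described above can be sketched in Python as follows. The helper names are hypothetical, and the suffix parsing assumes every result label ends with an underscore followed by the pronunciation index, as in the example below.

```python
HTK_UNITS_PER_MS = 10_000          # 1 HTK unit = 100 ns, so 10,000 units = 1 ms
HTK_UNITS_PER_SECOND = 10_000_000  # 10,000,000 units = 1 s


def htk_to_ms(t: int) -> float:
    """Convert a time in HTK units to milliseconds."""
    return t / HTK_UNITS_PER_MS


def htk_to_seconds(t: int) -> float:
    """Convert a time in HTK units to seconds."""
    return t / HTK_UNITS_PER_SECOND


def split_word(label: str) -> tuple[str, int]:
    """Split a result label such as 'sale_1' into (keyword, pronunciation index)."""
    keyword, _, index = label.rpartition("_")
    return keyword, int(index)


def confidence_percent(confidence: float) -> float:
    """Express a confidence in the 0 to 1 range as a percentage."""
    return confidence * 100
```

For instance, a start time of 171400000 HTK units corresponds to 17140 ms, i.e. 17.14 seconds into the recording.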

Example

This example of Keyword Spotting results shows:

  • word sale, detected at 2 places, pronounced differently (hence the _0 and _1 suffixes pointing to two different pronunciations defined in the keyword list)
    While the system is pretty sure about the first occurrence (high score and confidence values), the second occurrence is probably a false detection (low score and confidence).
  • words Brazil and machine, both detected with rather high scores and confidences

If the word sale were defined in the keyword list with a threshold of, for example, 0.20, the second occurrence would not appear in the results at all, since its confidence is lower than the threshold.

...
{
  "channel_id": 0,
  "score": 4.5108547,
  "confidence": 0.9891304,
  "start": 171400000,
  "end": 175900000,
  "word": "sale_0"
},
{
  "channel_id": 0,
  "score": -1.5344038,
  "confidence": 0.17735027,
  "start": 246900000,
  "end": 251700000,
  "word": "sale_1"
},
{
  "channel_id": 0,
  "score": 2.1896133,
  "confidence": 0.89931285,
  "start": 284100000,
  "end": 291000000,
  "word": "brazil_0"
},
{
  "channel_id": 0,
  "score": 0.9341812,
  "confidence": 0.7179228,
  "start": 294900000,
  "end": 300600000,
  "word": "machine_0"
}
...
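
The effect of a threshold on the results above can be illustrated with a small post-filter. This is only a sketch: in practice the threshold is defined in the keyword list and applied by the system itself, and the function name and the thresholds dictionary here are hypothetical. Only the word and confidence fields from the sample results are used.

```python
def apply_thresholds(detections, thresholds):
    """Keep detections whose confidence is not lower than the keyword's threshold.

    thresholds maps a keyword (without the pronunciation suffix) to a
    confidence threshold; keywords without an entry are kept unconditionally.
    """
    kept = []
    for det in detections:
        keyword = det["word"].rpartition("_")[0]
        if det["confidence"] >= thresholds.get(keyword, 0.0):
            kept.append(det)
    return kept


# Values copied from the sample results above.
detections = [
    {"word": "sale_0", "confidence": 0.9891304},
    {"word": "sale_1", "confidence": 0.17735027},
    {"word": "brazil_0", "confidence": 0.89931285},
    {"word": "machine_0", "confidence": 0.7179228},
]

filtered = apply_thresholds(detections, {"sale": 0.20})
print([d["word"] for d in filtered])
```

With a threshold of 0.20 for sale, the sale_1 detection (confidence 0.17735027) is dropped while the other three detections remain.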

 
