STT: How to properly convert Confusion Network results to One-best
Confusion Network output is the most detailed Speech Engine STT output, as it provides multiple word alternatives for individual timeslots of the processed speech signal. Many applications therefore want to use it as the main source of speech transcription and convert it to less verbose output formats internally when needed. This article describes the recommended way to do the conversion.
Time slots and word alternatives:
The recommended algorithm for converting Confusion Network (CN) to One-best is as follows:
- loop through all CN timeslots from start to end
  - in each timeslot, get the input alternative with the highest score, and if it’s not <null/> or _DELETE_
    - add the input alternative at the end of your output
- then, loop through all alternatives in your output
  - for each alternative, amend its end time to match the start time of the following alternative
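The two-pass procedure above can be sketched in Python roughly as follows. This is a minimal illustration, not the Speech Engine API: the `Alternative` class, its field names (`word`, `score`, `start`, `end`) and the representation of a Confusion Network as a list of timeslots, each holding a list of alternatives, are assumptions made for the example. Map them to whatever structure your application parses the CN result into.

```python
from dataclasses import dataclass, replace
from typing import List

# Words that mark "no word in this timeslot" and must not reach the output.
SKIP_WORDS = {"<null/>", "_DELETE_"}


@dataclass
class Alternative:
    """Hypothetical model of one word alternative (field names are assumptions)."""
    word: str
    score: float
    start: float
    end: float


def cn_to_one_best(timeslots: List[List[Alternative]]) -> List[Alternative]:
    """Convert a Confusion Network (list of timeslots) to One-best in two passes."""
    output: List[Alternative] = []

    # Pass 1: keep the highest-scoring alternative of each timeslot,
    # unless it is <null/> or _DELETE_.
    for slot in timeslots:
        if not slot:
            continue
        best = max(slot, key=lambda alt: alt.score)
        if best.word not in SKIP_WORDS:
            output.append(replace(best))  # copy, so the input stays untouched

    # Pass 2: stretch each word's end time to the start of the following word.
    for current, following in zip(output, output[1:]):
        current.end = following.start

    return output
```

The last word in the output has no following word, so its end time is left as it was in the input.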
Alternatively, the second step can be done right away when building the result:
- loop through all CN timeslots from start to end
  - in each timeslot, get the input alternative with the highest score, and if it’s not <null/> or _DELETE_
    - set the end time of the last alternative in your output to the start time of the input alternative
    - add the input alternative at the end of your output
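The single-pass variant could look roughly like the Python sketch below. As before, the `Alternative` class, its field names and the list-of-timeslots representation are assumptions chosen for illustration, not the Speech Engine data format. The sketch also assumes that the end-time adjustment is simply skipped while the output is still empty.

```python
from dataclasses import dataclass, replace
from typing import List

# Words that mark "no word in this timeslot" and must not reach the output.
SKIP_WORDS = {"<null/>", "_DELETE_"}


@dataclass
class Alternative:
    """Hypothetical model of one word alternative (field names are assumptions)."""
    word: str
    score: float
    start: float
    end: float


def cn_to_one_best_single_pass(timeslots: List[List[Alternative]]) -> List[Alternative]:
    """Build the One-best output and fix end times in one loop over the timeslots."""
    output: List[Alternative] = []
    for slot in timeslots:
        if not slot:
            continue
        best = max(slot, key=lambda alt: alt.score)
        if best.word in SKIP_WORDS:
            continue
        if output:
            # Close the previous word exactly where the new word starts.
            output[-1].end = best.start
        output.append(replace(best))  # copy, so the input stays untouched
    return output
```

Both variants should produce the same result; the single-pass form just avoids a second loop over the output.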
Example: