Language Identification (LID)
Phonexia Language Identification (LID) helps you determine which language or dialect is spoken in a recording. It enables your system to automatically route valuable calls to experts in the given language, or to send them to other software for analysis.
Application areas
- Preselecting multilingual sources and routing audio files to language-dependent technologies (transcribing, indexing, etc.)
- Analyzing network traffic media (language statistics)
- Routing particular calls (languages) to human operators (language experts)
Scoring and results
The LID language pack defines the set of recognizable languages, each represented by a language model.
When identifying the language of an audio recording (or a languageprint), LID does the following:
- creates a languageprint of the recording (if the input is an audio recording)
- compares that languageprint with each language model in the language pack
- calculates, for each language model, the probability that the languageprint and the model represent the same language
For an explanation of the terms languageprint, language model, and language pack, refer to the LID: Terminology and adaptation article.
The final scores are returned as natural logarithms of these individual probabilities, i.e. as values from the (-inf, 0] interval, one for each language in the language pack.
(To convert a raw LID score to a percentage, use the formula e^score * 100.)
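The conversion above can be sketched in a few lines of Python. The function name `score_to_percentage` is illustrative only and is not part of any Phonexia API; the sample scores are taken from the Turkish example table later in this article.

```python
import math

def score_to_percentage(raw_score: float) -> float:
    """Convert a raw LID score (natural log of a probability) to a percentage."""
    return math.exp(raw_score) * 100.0

# Raw scores copied from the Turkish example in this article.
for raw in (-0.258, -0.069):
    print(f"{raw:+.3f} -> {score_to_percentage(raw):.2f} %")
```

Because the raw scores in the table are rounded to three decimals, the converted values may differ from the listed percentages in the second decimal place.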
LID adaptation (custom language packs)
The scoring principle described above implies that the score is distributed among all languages in a language pack.
This means that every language receives a non-zero score, so the scores may get diluted as they are spread across many languages.
Additionally, if the language pack contains many unequally trained languages (i.e. trained on very different amounts of source audio), the entire system can be affected and generate low scores even for matching languages.
Therefore, it is a good idea to create a language pack containing only a limited number of languages, e.g. by excluding very exotic ones, or by keeping only the few languages actually expected in your use case.
This process of tailoring the language pack to particular needs is called language pack adaptation and is described in the LID: Terminology and adaptation article.
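The dilution effect can be illustrated with a toy model. This is only a sketch and not Phonexia's actual scoring or adaptation procedure: it assumes that the per-language probabilities are normalized to sum to 1, and the similarity values and the helper `normalized_log_scores` are made up for illustration.

```python
import math
from typing import Dict

def normalized_log_scores(similarities: Dict[str, float]) -> Dict[str, float]:
    """Toy model: share each language's similarity over the total, return log shares."""
    total = sum(similarities.values())
    return {lang: math.log(value / total) for lang, value in similarities.items()}

# Hypothetical, made-up similarity values (not real LID output).
similarities = {
    "Turkish": 8.0,
    "Uzbek": 1.0,
    "Azerbaijani": 0.5,
    "Swedish": 0.3,
    "Albanian": 0.2,
}

# Score against all five languages, then against a reduced set of three.
full = normalized_log_scores(similarities)
limited = normalized_log_scores(
    {lang: similarities[lang] for lang in ("Turkish", "Swedish", "Albanian")}
)

print(f"full pack:    Turkish {math.exp(full['Turkish']) * 100:.1f} %")
print(f"limited pack: Turkish {math.exp(limited['Turkish']) * 100:.1f} %")
```

In this toy model, removing languages from the set leaves a larger share for the best-matching one. Real language pack adaptation is described in the LID: Terminology and adaptation article.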
Example usages of custom language packs
- A law enforcement agency monitoring a criminal network that uses only a particular set of languages can keep only the languages expected to appear in the traffic.
  This can reduce the number of scored languages to as few as 3 to 5.
- A multilingual call center serving the European market can exclude languages that will surely not appear in its traffic, such as African ones (Afan, Hausa, …) or Asian ones (Chinese, Japanese, …), while still keeping languages that are unlikely but still possible.
  This can reduce the number of scored languages from ~80 (included in the default out-of-the-box language pack) to 20 or even fewer.
In both cases, limiting the number of languages in a language pack means the scores are distributed among fewer languages. The score values get higher, with a clearer distinction between languages and a wider gap between the best-scoring language and the rest.
Here is an example of identifying the language of a Turkish phone call.
Note the much sharper score when using a language pack with only relevant languages (77.3% vs. 93.3%):
| Using default language pack with 60+ languages | | | Using limited language pack with 20 European languages | | |
|---|---|---|---|---|---|
| Language | Raw score | Percentage | Language | Raw score | Percentage |
| Turkish | -0.258 | 77.270 | Turkish | -0.069 | 93.326 |
| Uzbek | -2.436 | 8.753 | Albanian | -4.347 | 1.294 |
| Azerbaijani | -3.027 | 4.845 | Hungarian | -4.657 | 0.949 |
| Dari | -4.432 | 1.190 | Ukrainian | -5.037 | 0.649 |
| Albanian | -5.139 | 0.586 | Swedish | -5.088 | 0.617 |
| Tibetan | -5.270 | 0.515 | French | -5.168 | 0.570 |
| Georgian | -5.277 | 0.511 | English_British | -5.316 | 0.491 |
| Swedish | -5.384 | 0.459 | Macedonian | -5.443 | 0.433 |
| Farsi | -5.737 | 0.323 | Greek | -5.698 | 0.335 |
| Hungarian | -5.777 | 0.310 | Serbian | -6.002 | 0.247 |
| … | | | … | | |
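The gap between the best-scoring language and the runner-up can be checked with a short sketch like the one below. It uses the three top raw scores from each half of the table; the helper `best_with_margin` is a hypothetical name, not part of any Phonexia API, and expressing the gap in percentage points is just one possible choice.

```python
import math
from typing import Dict, Tuple

def best_with_margin(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the top language and its lead over the runner-up in percentage points."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_lang, best_score), (_, second_score) = ranked[0], ranked[1]
    margin = (math.exp(best_score) - math.exp(second_score)) * 100.0
    return best_lang, margin

# Top three raw scores from each half of the table above.
default_pack = {"Turkish": -0.258, "Uzbek": -2.436, "Azerbaijani": -3.027}
limited_pack = {"Turkish": -0.069, "Albanian": -4.347, "Hungarian": -4.657}

for name, scores in (("default", default_pack), ("limited", limited_pack)):
    lang, margin = best_with_margin(scores)
    print(f"{name} pack: {lang}, lead over runner-up {margin:.1f} percentage points")
```

A routing rule could, for instance, accept the top language only when this lead exceeds a threshold chosen for the use case; the right threshold is an assumption to be tuned on your own data.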