Arabic dialects in Phonexia LID and STT


The Arabic language has (a) one standardised variety and (b) many non-standard varieties (dialects).
In this article, our linguistic team explains the differences between Modern Standard Arabic and the Arabic dialects in the context of Phonexia Arabic models.

Standard variety: Modern Standard Arabic (MSA)

All Arabs learn it at school (not from their parents, so it cannot be called their native variety)
It is the lingua franca (common language) of the Arab world, much like English for Europeans; however, Arabs speak it much better, since they are schooled in MSA from an early age
MSA is closer to some dialects (e.g. Levantine) than to others (e.g. Moroccan Arabic), which differ from it substantially. If you speak only MSA, you will therefore understand Levantine a bit, but you won’t understand Moroccan

Data acquisition

AUDIO (used for LID and STT training)
MSA is used in formal speaking situations such as sermons, lectures, news broadcasts, and speeches

As a result, recordings of spontaneous phone conversations in MSA are very difficult, if not impossible, to find
Available MSA recordings usually come from broadcasts or from rather formal scripted speeches, both recorded with a microphone

TEXT (used for STT language model training)
MSA is used in all formal writing such as official correspondence, literature, newspapers, and webpages

As a result, it is easy to accumulate large amounts of text, but that text is more formal and far from spontaneous speech

Support for MSA in Phonexia products

Name	LID L4	STT	Description
Arabic (MSA)	`arb`	—	Modern Standard Arabic, official language in most countries of the Arab world

Non-standard varieties: Arabic dialects

Dialects are the native regional varieties of the Arab world
They are used in all spontaneous contexts (friends, family) and also in more formal environments (work, business). Because dialects cover all daily-life situations, our customers are likely to be more interested in dialects than in MSA
Depending on the criteria chosen for dividing dialects, you can count 10 to 30 dialects of Arabic:

Arabic dialects (image source: Wikipedia)
There is no standard norm for writing dialects, so speakers have two options:
- use MSA for writing something – more formal
- use dialect – more personal, Facebook, Twitter, forums

Data acquisition

AUDIO (used for LID and STT acoustic model training)

Collecting spontaneous speech in dialect is not difficult
Creating annotations for STT training can be tricky: dialect speakers write words down as they hear them, and because there is no writing standard, different speakers may spell the same word differently. Annotations in dialect therefore need to be double-checked and unified
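
The unification step described above can be partly supported by a mechanical pre-pass. The following is a minimal sketch, assuming a simple orthographic normalization (stripping short-vowel marks and tatweel, folding alef variants) to flag spellings that collapse to the same form. The function names `normalize` and `group_variants` are hypothetical, and this is not a description of how Phonexia processes annotations. It only catches surface-level variation; genuinely different dialect spellings still need manual review.

```python
import re
from collections import defaultdict

# Arabic short-vowel and related marks (U+064B..U+0652) plus tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0640]")

# Fold alef with hamza above/below and alef with madda into bare alef (U+0627).
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",
    "\u0625": "\u0627",
    "\u0622": "\u0627",
})


def normalize(word: str) -> str:
    """Return a simplified spelling used only as a grouping key (an assumption)."""
    return _MARKS.sub("", word).translate(_ALEF_VARIANTS)


def group_variants(words):
    """Group words whose normalized forms coincide; keep only real conflicts."""
    groups = defaultdict(set)
    for word in words:
        groups[normalize(word)].add(word)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}
```

Each group returned by `group_variants` is a candidate for an annotator to review and reduce to one agreed spelling.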

TEXT (used for STT language model training)
In writing, dialects are used for more personal communication: Facebook, Twitter, forums

Not much material is available, since most written texts are in MSA
Texts from Facebook, Twitter, and forums can be used, but they need to be classified, corrected, and unified manually; at Phonexia we do not do this

These are the reasons for the limited out-of-the-box support of Arabic dialects in STT.

Support for Arabic dialects in Phonexia products

Name	LID L4	STT	Description
Arabic (Egypt)	`ar-EG`	—	Arabic variety spoken in Egypt
Arabic (Gulf, Kuwait)	`ar-KW`	`AR_KW_6`	Arabic variety as spoken in Kuwait; also spoken in Bahrain, Al Hasa, Qatif, the United Arab Emirates, Qatar, southern Iraq, and northern Oman
Arabic (Iraq)	`ar-IQ`	—	Arabic variety as spoken in Iraq; also spoken in Syria, Turkey, Iran, Kuwait, Jordan, and parts of northern and eastern Arabia
Arabic (Levantine)	`ar-XL`^*	`AR_XL_6`^* `AR_XL_5`^*	Arabic variety spoken in Lebanon, Jordan, Syria, Palestine, Israel, and Turkey
Arabic (Maghrebi, Morocco)	`ar-MA`	—	Arabic variety spoken in Morocco, Algeria, Tunisia, Libya, Western Sahara, Mauritania

^* There are two separate varieties of Levantine Arabic, each with its own language code: North Levantine (apc) and South Levantine (ajp).
Our models were trained on data from both varieties, so we followed RFC 5646, section 2.2.4, and created the custom language code ar-XL, where XL means “cross-Levantine” 😉
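
RFC 5646, section 2.2.4, reserves the region subtags AA, QM to QZ, XA to XZ, and ZZ for private use, which is why XL is available for a custom meaning. The following minimal sketch shows how an integration might recognise such a tag. The function name `private_use_region` is hypothetical, and the parser is deliberately simplified: it only handles tags of the form language[-script][-region] and is not a full RFC 5646 parser.

```python
import re

# Private-use region subtags per RFC 5646, section 2.2.4: AA, QM-QZ, XA-XZ, ZZ.
_PRIVATE_REGION = re.compile(r"^(AA|Q[M-Z]|X[A-Z]|ZZ)$")


def private_use_region(tag: str):
    """Return the region subtag if it is a private-use one, otherwise None."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence; stop here.
            break
        if len(sub) == 2 and sub.isascii() and sub.isalpha():
            region = sub.upper()
            return region if _PRIVATE_REGION.match(region) else None
    return None
```

With this check, `ar-XL` is reported as carrying the private-use region `XL`, while tags such as `ar-EG` or `ar-KW` use ordinary country codes and return `None`.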

NOTE: To get the best STT results, use the model that corresponds to the given dialect.
The AR_XL_* models are best suited to Levantine dialect recordings.
When an AR_XL_* model is used for a neighboring dialect, e.g. Iraqi, the results will be much worse, and for a more distant dialect, e.g. Maghrebi, the results will most probably be completely unusable.
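
The advice in the note can be applied by routing each recording to an STT model according to the language code that LID reports for it. The sketch below is illustrative only: the mapping is copied from the support tables in this article, while the function name `pick_stt_model` and the choice to prefer the first model listed for a code are assumptions, not part of any Phonexia API.

```python
from typing import Optional

# LID code -> STT models, as listed in the support tables in this article.
# Codes with no out-of-the-box STT model map to an empty list.
STT_MODELS_BY_LID_CODE = {
    "arb": [],
    "ar-EG": [],
    "ar-KW": ["AR_KW_6"],
    "ar-IQ": [],
    "ar-XL": ["AR_XL_6", "AR_XL_5"],
    "ar-MA": [],
}


def pick_stt_model(lid_code: str) -> Optional[str]:
    """Return the first listed STT model for a LID code, or None if there is none.

    Assumption: the first model listed in the table is the preferred one.
    """
    models = STT_MODELS_BY_LID_CODE.get(lid_code, [])
    return models[0] if models else None
```

Returning `None` for codes such as `ar-IQ` or `ar-MA` makes the gap explicit, so a caller can skip transcription instead of silently falling back to a mismatched dialect model.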
