Arabic dialects in Phonexia LID and STT
The Arabic language has (a) one standardised variety and (b) many non-standard varieties (dialects).
In this article, our linguistic team explains the differences between Modern Standard Arabic and the Arabic dialects in the context of Phonexia Arabic models.
Standard variety: Modern Standard Arabic (MSA)
- All Arabs learn it at school (not from their parents, so it cannot be considered their native variety)
- It is the lingua franca (common language) of the Arab world, much as English is for Europeans; however, Arabs speak it much better, since they are schooled in MSA from an early age
- MSA is closer to some dialects (e.g. Levantine) and very different from others (e.g. Moroccan Arabic). If you speak only MSA, you will understand Levantine a little, but you will not understand Moroccan
Data acquisition
AUDIO (used for LID and STT training)
MSA is used in formal speaking situations such as sermons, lectures, news broadcasts, and speeches
- so it is difficult, if not impossible, to find recordings of spontaneous phone conversations in MSA
- available MSA recordings usually come from broadcasts or from rather formal scripted speeches (both recorded with a microphone)
TEXT (used for STT language model training)
MSA is used in all formal writing such as official correspondence, literature, newspapers, and webpages
- so it is easy to accumulate large amounts of text, but that text is more formal and far from spontaneous speech
Support for MSA in Phonexia products
Name | LID L4 | STT | Description |
---|---|---|---|
Arabic (MSA) | arb | — | Modern Standard Arabic, official language in most countries of the Arab world |
Non-standard varieties: Arabic dialects
- Dialects are native regional varieties in the Arab world
- They are used in all spontaneous contexts (friends, family) and also in more formal environments (work, companies). In other words, dialects are used in all daily-life situations, so our customers are likely to be more interested in dialects
- Depending on the criteria chosen for dividing dialects, you can count 10 to 30 dialects of Arabic
- There is no standard written norm, so speakers have two options:
  - use MSA for writing (more formal)
  - use the dialect (more personal: Facebook, Twitter, forums)
Data acquisition
AUDIO (used for LID and STT acoustic model training)
- There is no difficulty in collecting spontaneous speech in dialect
- Creating annotations for STT training can be tricky: dialect speakers write words down as they hear them and, because there is no written standard, different speakers may spell the same word differently. Annotations in dialect therefore need to be double-checked and unified
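The unification step can be partly supported by tooling. Below is a minimal Python sketch that groups transcribed words whose spellings collapse to the same normalised form, so that an annotator can review them. The normalisation rules (alef variants, alef maksura, tatweel, optional diacritics) and the function names are assumptions chosen for illustration; they are not Phonexia's annotation guidelines.

```python
import unicodedata
from collections import defaultdict

# Assumed normalisation rules, for illustration only.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0640": None,      # tatweel (elongation)  -> removed
})


def normalise(word):
    """Map a word to a simplified spelling used only for grouping."""
    mapped = word.translate(_LETTER_MAP)
    # Drop combining marks (category Mn), which covers the short-vowel diacritics.
    return "".join(ch for ch in mapped if unicodedata.category(ch) != "Mn")


def spelling_variants(words):
    """Return {normalised form: sorted distinct spellings} for forms with 2+ spellings."""
    groups = defaultdict(set)
    for word in words:
        groups[normalise(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

Such a grouping only flags candidates; deciding which spelling to keep still requires a human annotator who knows the dialect.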
TEXT (used for STT language model training)
In writing, dialects are used for more personal communication (Facebook, Twitter, forums)
- There is not much material available, since most written texts are in MSA
- Facebook, Twitter, and forum posts can be used, but they need to be classified, corrected, and unified manually; at Phonexia we do not do this
These are the reasons for the limited out-of-the-box support for Arabic dialects in STT.
Support for Arabic dialects in Phonexia products
Name | LID L4 | STT | Description |
---|---|---|---|
Arabic (Egypt) | ar-EG | — | Arabic variety spoken in Egypt |
Arabic (Gulf, Kuwait) | ar-KW | AR_KW_6 | Arabic variety as spoken in Kuwait. Also spoken in Bahrain, Al Hasa, Qatif, United Arab Emirates, Qatar, Southern Iraq, and Northern Oman |
Arabic (Iraq) | ar-IQ | — | Arabic variety as spoken in Iraq. Also spoken in Syria, Turkey, Iran, Kuwait, Jordan, parts of northern and eastern Arabia |
Arabic (Levantine) | ar-XL * | AR_XL_6 *, AR_XL_5 * | Arabic variety spoken in Lebanon, Jordan, Syria, Palestine, Israel, and Turkey |
Arabic (Maghrebi, Morocco) | ar-MA | — | Arabic variety spoken in Morocco, Algeria, Tunisia, Libya, Western Sahara, Mauritania |
* There are two separate varieties of Levantine Arabic, each with its own language code: North Levantine (apc) and South Levantine (ajp).
Our models were trained using data from both varieties, so we followed RFC 5646, section 2.2.4, and created the custom language code ar-XL, where XL means “cross-Levantine” 😉
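RFC 5646, section 2.2.4, reserves the region subtags AA, QM to QZ, XA to XZ, and ZZ for private use, which is why XL can be used without colliding with a real country code. The following Python sketch checks whether a tag's region falls in that range. The function names are hypothetical, and the parser is deliberately simplified: it handles only two-letter region subtags, not three-digit UN M.49 regions.

```python
def region_subtag(tag):
    """Return the two-letter region subtag of a tag such as 'ar-XL', or None."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use section; stop here.
            break
        if len(sub) == 2 and sub.isascii() and sub.isalpha():
            return sub.upper()
    return None


def is_private_use_region(region):
    """True if the region code is reserved for private use (RFC 5646, 2.2.4)."""
    r = region.upper()
    if len(r) != 2 or not r.isascii() or not r.isalpha():
        return False
    return r in ("AA", "ZZ") or "QM" <= r <= "QZ" or "XA" <= r <= "XZ"
```

With these helpers, `is_private_use_region(region_subtag("ar-XL"))` is true, while the same check on `ar-EG` is false because EG is the assigned code for Egypt.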
NOTE: To get the best STT results, use the model that corresponds to the given dialect.
The AR_XL_* model is best suited to Levantine dialect recordings.
When an AR_XL_* model is used for a neighbouring dialect, e.g. Iraqi, the results will be much worse, and for e.g. Maghrebi they will most probably be completely unusable.
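One way to apply this advice in a pipeline is to choose the STT model from the dialect that LID detected, and to skip transcription when no matching model exists. The Python sketch below encodes the two support tables in this article as a lookup. The dictionary and function names are hypothetical and are not part of any Phonexia API; the order of the two Levantine models simply follows the table and implies no preference.

```python
# Hypothetical lookup built from the support tables in this article:
# LID language code -> names of the STT models trained for that variety.
STT_MODELS_BY_LID_CODE = {
    "arb": (),                        # MSA: LID only
    "ar-EG": (),                      # Egypt: LID only
    "ar-KW": ("AR_KW_6",),            # Gulf, Kuwait
    "ar-IQ": (),                      # Iraq: LID only
    "ar-XL": ("AR_XL_6", "AR_XL_5"),  # Levantine
    "ar-MA": (),                      # Maghrebi, Morocco: LID only
}


def stt_models_for(lid_code):
    """Return the STT models matching a detected dialect, or () if there are none."""
    return STT_MODELS_BY_LID_CODE.get(lid_code, ())
```

An empty result is a signal not to fall back silently to another dialect's model, since the note above warns that such results degrade sharply.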