Adapting AI to identify Arabic dialects

A novel AI approach, known as parameter-efficient learning, identifies Arabic dialects using limited data and computing power.

Arabic, one of the world's richest and most complex languages, is spoken across many countries in Asia and Africa and is an official language in more than 22 countries. However, standard written Arabic differs significantly from the local colloquial dialects spoken in each region.

Dialect identification is more difficult for speech recognition models than language identification, explains KAUST research scientist Sumeer Khan. "Compared with separate languages, dialects share similar acoustic and linguistic characteristics. Tiny differences in pronunciation and accent are used as cues to identify dialects."

This was the challenge for Srijith Radhakrishnan, an intern in Jesper Tegner's Living Systems Laboratory, who led the project to develop a model for Arabic dialect identification. The project also involved collaboration with Chao-Han Huck Yang, a former KAUST intern then at Amazon Research and now at NVIDIA Research.

The researchers took Whisper, a large open-access speech recognition model, as their baseline and fine-tuned it on an online dataset comprising 17 Arabic dialects. They adapted the model to resource-constrained, data-limited conditions by training small additive modules embedded into the "frozen" pretrained model.

“We made some technical adjustments in the architecture and investigated different designs to incorporate trainable features into the frozen model,” says Khan. “This means less training time and resources are required to fine-tune it.”   
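
As a rough sketch of this idea, the code below freezes a pretrained Whisper encoder (loaded via the open-source transformers library) and inserts small trainable bottleneck "adapter" modules after each encoder layer, with a lightweight dialect classifier on top. The adapter design, bottleneck width, learning rate and pooling choice are illustrative assumptions, not the exact architecture reported by the researchers.

```python
# Illustrative adapter tuning on a frozen Whisper encoder for dialect
# classification. Design choices here (bottleneck size, pooling, learning
# rate) are assumptions for the sketch, not the study's configuration.
import torch
import torch.nn as nn
from transformers import WhisperModel

NUM_DIALECTS = 17  # the article's dataset covers 17 Arabic dialects

class Adapter(nn.Module):
    """Small residual bottleneck: the only new weights added per layer."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

class LayerWithAdapter(nn.Module):
    """Runs a frozen encoder layer, then a trainable adapter on its output."""
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        self.adapter = Adapter(dim)

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        return (self.adapter(outputs[0]),) + outputs[1:]

model = WhisperModel.from_pretrained("openai/whisper-base")
for p in model.parameters():  # "freeze" every pretrained weight
    p.requires_grad = False

encoder = model.get_encoder()
dim = model.config.d_model
for i, layer in enumerate(encoder.layers):  # embed adapters into the stack
    encoder.layers[i] = LayerWithAdapter(layer, dim)

classifier = nn.Linear(dim, NUM_DIALECTS)  # trainable dialect head

# Only the adapters and the classifier are optimized.
params = [p for p in encoder.parameters() if p.requires_grad]
params += list(classifier.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)

# One training step on stand-in data: Whisper expects log-mel features
# of shape (batch, 80 mel bins, 3000 frames).
features = torch.randn(2, 80, 3000)
labels = torch.tensor([0, 5])  # dummy dialect labels
hidden = encoder(features).last_hidden_state
logits = classifier(hidden.mean(dim=1))  # mean-pool over time
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```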

They achieved high accuracy in dialect identification on the dataset using only 2.5% of the large model's parameters and 30% of the training data.

This novel approach, known as parameter-efficient learning, aims to reduce the computational cost of machine learning by adapting large pretrained models to downstream tasks while training only a small subset of parameters.
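
To make the parameter savings concrete, a generic PyTorch helper like the one below (a sketch, not code from the study) reports what fraction of a model's weights remain trainable after freezing:

```python
import torch.nn as nn

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will actually be updated in training."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total

# After freezing a pretrained model and adding small adapters, this
# fraction lands at a few percent -- the regime the article describes.
```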

“The main limitation of the large models is that they require a lot of resources: computing power and large datasets. Our goal was to reconfigure and reprogram the architecture so we can use it in these resource-constrained scenarios,” says Khan. “This type of efficient learning is badly needed in many fields, as not everyone has the resources to train these models.”

Radhakrishnan’s two papers have already been published and presented at premier language and speech conferences: one at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP) and one as an oral presentation at Interspeech 2023.

While dialect identification could be a pathway to building an Arabic speech-recognition system, Tegner believes the research has broader applications. 

“In a broader context, if you can convert spoken language to text, then you can think of text mapping to speech, connect to images and make stories. You build an ecosystem and the spoken language is part of it. This project is the first step. We can combine different forms of information — speech, text and images,” he explains.  

Tegner thinks the work could have applications in fields like health and medicine, for example, where speech, text and images could be combined to enhance patient understanding. 

“Such an integrated approach would ensure clear communication, improve accessibility and engage patients more effectively, leading to better understanding and adherence to medical advice,” he says. 

The project has also opened doors for former intern Radhakrishnan. “He came to KAUST as an undergrad and proved to be excellent; he’s now been motivated to apply for a master’s and Ph.D. and has applied to several top universities in the US,” says Tegner, his former supervisor.