Google and a consortium of African research institutions have launched the WAXAL dataset, a new effort to address one of artificial intelligence’s (AI) biggest shortcomings on the continent: its inability to interpret and understand most African languages.
The project delivers a large, open speech dataset spanning 21 Sub-Saharan African languages and aims to bring voice technology to more than 100 million people excluded from the AI economy.
The WAXAL dataset is the product of a three-year collaboration funded by Google and led by local universities and community groups.
It includes 1,250 hours of transcribed, natural speech and more than 20 hours of studio-grade recordings aimed at building high-fidelity synthetic voices. It targets languages such as Hausa, Yoruba, Luganda, Igbo and Acholi, many of which are spoken by tens of millions but remain largely invisible to commercial speech systems.
For all the talk of global AI, voice technologies still lean heavily towards English and a narrow handful of European and Asian languages. Africa, home to over 2,000 languages, has been left on the margins.
That gap is not academic; it shapes who can use digital services, who can access education and healthcare tools, and who gets to build companies on top of modern AI platforms. Google framed the work as a step toward narrowing a long-standing data gap that has kept many African languages off voice assistants and other tools.
Beyond addressing this imbalance directly, how the project was run matters as much as the data itself.
Unlike earlier initiatives where African speech data was extracted and owned elsewhere, WAXAL was led on the ground by African institutions. Makerere University in Uganda, the University of Ghana, and Digital Umuganda in Rwanda oversaw data collection, community engagement, and language stewardship, with technical support from Google Research Africa.
Crucially, those institutions retain ownership of the data. That is a notable shift in a field often criticised for reproducing extractive dynamics under the banner of openness.
According to Aisha Walcott-Bryant, Head of Google Research Africa, “The ultimate impact of WAXAL is the empowerment of people in Africa. This dataset provides the critical foundation for students, researchers, and entrepreneurs to build technology on their own terms, in their own languages, finally reaching over 100 million people.”
“We look forward to seeing African innovators use this data to create everything from new educational tools to voice-enabled services that create tangible economic opportunities across the continent”, she added.
That framing is echoed by the universities involved. Joyce Nakatumba-Nabende, a senior lecturer at Makerere University, said:
“For AI to have a real impact in Africa, it must speak our languages and understand our contexts. The WAXAL dataset gives our researchers the high-quality data they need to build speech technologies that reflect our unique communities. In Uganda, it has already strengthened our local research capacity and supported new student- and faculty-led projects.”
At the University of Ghana, Associate Professor Isaac Wiafe pointed to the scale of public engagement:
“For us at the University of Ghana, WAXAL’s impact goes beyond the data itself. It has empowered us to build our own language resources and train a new generation of AI researchers. Over 7,000 volunteers joined us because they wanted their voices and languages to belong in the digital future. Today, that collective effort has sparked an ecosystem of innovation in fields like health, education, and agriculture. This proves that when the data exists, possibility expands everywhere.”
There is reason for cautious optimism. Open speech datasets can lower barriers for local startups and researchers who lack the resources to collect data at scale. They can also reduce reliance on foreign APIs that rarely support African languages well, if at all.
Still, datasets do not guarantee outcomes; building reliable voice systems requires sustained investment, local deployment, and commercial pathways that keep value in-country. Google’s role as funder and convenor will invite scrutiny, particularly around how WAXAL data is used by global companies in the future.
For now, the release of the WAXAL dataset marks a concrete step towards a more linguistically inclusive AI ecosystem. It does not solve Africa’s AI challenges, but it addresses a foundational one. Voice is often the most natural interface with technology. Making sure AI can hear Africa speak, in all its diversity, is long overdue.
The post Google to train AI in 21 African languages, including Yoruba, Hausa and Igbo first appeared on Technext.


