Scaling Multilingual Large Language Models: How Data Balance and Coverage Drive Performance
- Mark Chomiczewski
- 5 December 2025
- 9 Comments
Most multilingual AI models today are broken, not because they're poorly built, but because they're trained on badly balanced data. If you feed a language model 90% English and 1% Swahili, it won't magically learn Swahili well. It'll just be bad at Swahili. And that's not a minor flaw; it's a systemic problem in how we scale AI across languages. The truth is, throwing more data at the problem doesn't fix imbalance. You need data balance: not more data, but smarter data.
Why Proportional Sampling Fails
Early multilingual models like mT5 and mBART used proportional sampling: if English had 100 billion tokens and Swahili had 1 billion, Swahili got 1% of the training data. Sounds fair, right? It's not. That approach created massive performance gaps. High-resource languages like English, Chinese, or Spanish performed at 95%+ accuracy on translation and comprehension tasks. Low-resource languages like Bengali, Swahili, or Guarani? Often below 50%. That's not a small difference; it's a chasm.
Why does this happen? Because language isn't just about volume. It's about structure, context, and exposure. A language with 10 million tokens might have richer grammar, more varied syntax, and deeper cultural nuance than a language with 100 million tokens that's just scraped from low-quality web pages. Proportional sampling ignores all of that. It treats data like a bucket of sand: more sand, better model. But you're not building a sandcastle. You're teaching a brain to understand human communication.
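To see just how lopsided that gets, here is a minimal sketch of proportional sampling in Python. The corpus sizes are illustrative stand-ins, not measurements from any real dataset:

```python
# Minimal sketch of proportional sampling: each language's share of the
# training mix is simply its share of the raw token pool.
# Corpus sizes below are illustrative, not real measurements.

corpus_tokens = {
    "english": 100_000_000_000,  # high-resource
    "spanish": 20_000_000_000,
    "swahili": 1_000_000_000,    # low-resource
    "guarani": 1_000_000,        # extremely low-resource
}

total = sum(corpus_tokens.values())

# Proportional sampling: weight_i = tokens_i / total_tokens
weights = {lang: n / total for lang, n in corpus_tokens.items()}

for lang, w in weights.items():
    print(f"{lang:>8}: {w:.5%} of every training batch")

# English ends up with roughly 83% of the mix and Guarani with less than
# 0.001%, which is why the low-resource end of the model barely learns.
```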
The Scaling Law Breakthrough
In 2024, researchers from Meta and Google published a landmark study that changed everything. They trained over 100 models across 23 languages, from 85 million to 1.2 billion parameters, and found something surprising: the optimal way to balance data doesn't depend on the model size. The same sampling ratios that worked for a small 85M model also worked for a 1.2B model. That's huge. It means you don't need to train a giant model just to figure out the right mix.
Here's how it works: instead of sampling by raw token count, you sample by what the model actually needs to learn. For languages with around 1 billion tokens of clean data, the sweet spot was 0.7% of total training tokens. For English, which had over 100 billion tokens, it was only 0.3%. That's right: English got less of the pie, even though it had 100x more data. Why? Because it was already overrepresented. The model didn't need more English. It needed more Swahili, more Tamil, more Quechua.
The result? Low-resource languages saw a 15-22% drop in cross-entropy loss, the standard measure of prediction error. That's not a minor improvement. It's the difference between a chatbot that fails 4 out of 10 times in Swahili and one that fails only once. And crucially, the overall model efficiency stayed at 98% of the original. No slowdown. No bloated compute. Just better performance where it mattered most.
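The study's full recipe isn't reproduced in this post, but the arithmetic behind those example shares is easy to sketch. In the snippet below, only the 0.3% and 0.7% targets come from the numbers above; the 500B-token training budget and the raw corpus sizes are assumptions for illustration:

```python
# What the example shares imply in practice. Only the 0.3% and 0.7% targets
# come from the article; the 500B-token training budget and raw corpus sizes
# are assumed for illustration.

corpus_tokens = {
    "english": 100_000_000_000,  # >100B raw tokens, target share 0.3%
    "swahili": 1_000_000_000,    # ~1B clean tokens, target share 0.7%
}
target_share = {"english": 0.003, "swahili": 0.007}

total_training_tokens = 500_000_000_000  # assumed overall budget; the rest
                                         # goes to the other languages in
                                         # the mix

for lang, share in target_share.items():
    budget = share * total_training_tokens
    passes = budget / corpus_tokens[lang]
    print(f"{lang:>8}: {budget / 1e9:.1f}B training tokens "
          f"(~{passes:.2f} passes over its corpus)")

# Swahili's corpus gets repeated about 3.5 times, while only ~1.5% of the
# English corpus is seen at all: the mix is driven by what the model still
# needs to learn, not by how much raw text exists.
```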
What About Cross-Lingual Transfer?
Here's the twist: a big chunk of the gains in low-resource languages didn't come from direct training data. They came from transfer. Up to 45% of the improvement in Swahili performance, for example, came from the model learning patterns from related languages like Kinyarwanda or Zulu. That's the power of linguistic families.
This is why grouping languages by family matters. Indo-European, Sino-Tibetan, Japonic, Koreanic, Dravidian: these aren't just academic labels. They're clusters of languages that share grammar, syntax, or vocabulary roots. When you train a model on Hindi and Bengali together, it learns patterns that help with related languages like Marathi and Punjabi. But if you treat each language in isolation, you're wasting that potential.
The 2024 study found that when you group languages correctly and apply the optimal sampling ratios, low-resource languages hit 92-95% of the performance of high-resource ones. That's not perfect, but it's close enough to be usable in real applications. Customer service bots, translation tools, content moderation systems: all of them suddenly become viable for languages that were previously ignored.
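At its simplest, family grouping is just a mapping from language to family that lets you pool corpora before deciding shares. Here is a small illustrative sketch: the family labels follow standard classifications, while the token counts and the sum-pooling rule are assumptions for illustration, not the study's method:

```python
# Illustrative sketch of family-aware grouping: pool related languages so a
# low-resource member can be reasoned about together with the relatives it
# transfers from. Family labels follow standard classifications; the token
# counts are made up, and pooling by simple summation is an assumption.

from collections import defaultdict

language_family = {
    "hindi": "indo-european", "bengali": "indo-european",
    "tamil": "dravidian", "telugu": "dravidian",
    "swahili": "niger-congo", "zulu": "niger-congo",
    "kinyarwanda": "niger-congo",
}

corpus_tokens = {
    "hindi": 8_000_000_000, "bengali": 3_000_000_000,
    "tamil": 2_000_000_000, "telugu": 2_000_000_000,
    "swahili": 1_000_000_000, "zulu": 400_000_000,
    "kinyarwanda": 150_000_000,
}

family_tokens = defaultdict(int)
family_members = defaultdict(list)
for lang, family in language_family.items():
    family_tokens[family] += corpus_tokens[lang]
    family_members[family].append(lang)

for family, tokens in family_tokens.items():
    members = ", ".join(family_members[family])
    print(f"{family:>13}: {tokens / 1e9:.2f}B pooled tokens ({members})")
```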
Why Other Methods Fall Short
You might have heard of temperature-based sampling (used in Meta's NLLB) or language-adaptive layers (Google's approach). Both help, but at a cost. Temperature sampling boosts low-resource language performance by 18-25%, but it slows down training by 12-15%. That's expensive. Language-adaptive layers improve Swahili accuracy by 22-30%, but they add 15-20% latency during inference. For a real-time chatbot, that's noticeable. And then there's PaLM 2's strategy: just make the model 340B parameters and hope proportional sampling works out. It closes the gap to 25-30%, but it costs 3.5x more in compute. Most companies can't afford that.
The optimal sampling approach doesn't need fancy architectures or massive models. It just needs the right ratios. And those ratios can be calculated from a small model in a few days. That's why companies like Meta are now releasing Llama-Multilingual variants with these ratios baked in. They're not just bigger; they're smarter.
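For reference, temperature-based sampling reshapes the raw proportions with an exponent rather than setting explicit per-language targets. A minimal sketch, with illustrative corpus sizes and an arbitrary temperature value:

```python
# Minimal sketch of temperature-based sampling: raise each language's raw
# proportion to the power 1/T and renormalize. T = 1 is plain proportional
# sampling; T > 1 flattens the distribution and boosts low-resource
# languages. Corpus sizes and the temperature value are illustrative.

corpus_tokens = {
    "english": 100_000_000_000,
    "hindi": 8_000_000_000,
    "swahili": 1_000_000_000,
}

def temperature_sampling(tokens: dict[str, int], T: float) -> dict[str, float]:
    total = sum(tokens.values())
    reweighted = {lang: (n / total) ** (1.0 / T) for lang, n in tokens.items()}
    norm = sum(reweighted.values())
    return {lang: w / norm for lang, w in reweighted.items()}

print(temperature_sampling(corpus_tokens, T=1.0))  # proportional baseline
print(temperature_sampling(corpus_tokens, T=3.3))  # flattened: Swahili's share
                                                   # rises from under 1% to
                                                   # roughly 15%
```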
Practical Limits and Real-World Gaps
But here's the hard truth: scaling laws have limits. Languages with fewer than 50 million tokens show diminishing returns. No matter how much you sample them, the model can't learn enough. Guarani, with less than 1 million tokens, still underperforms by 35-40% even with optimal sampling. That's not a failure of the method; it's a failure of data.
And then there's code-switching. In real life, people don't speak one language at a time. They mix. A Nigerian user might type in English, switch to Yoruba mid-sentence, then drop in Pidgin. Most training data doesn't capture this. The 2024 study didn't include code-switched data. And that's a problem. Google's Sebastian Ruder pointed out that code-switching makes up 15-20% of natural multilingual communication. If your model can't handle it, it'll fail in the real world.
Also, tokenization varies. Turkish, with its complex agglutination, needs 25-30% more raw tokens than English to cover the same vocabulary. If you don't account for that, you're underestimating how much data you actually need.
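Accounting for tokenizer overhead can be as simple as dividing each language's nominal budget by a fertility factor. In the rough sketch below, only the Turkish factor reflects the 25-30% figure above; everything else is a placeholder assumption:

```python
# Rough sketch: correct per-language token budgets for tokenizer "fertility",
# i.e. how many subword tokens a language needs relative to English for the
# same content. The Turkish factor reflects the article's 25-30% figure;
# everything else is a placeholder assumption.

fertility_vs_english = {
    "english": 1.00,
    "turkish": 1.28,   # agglutinative morphology: ~25-30% more tokens
    "swahili": 1.10,   # placeholder value
}

nominal_budget = {   # tokens allocated to each language by the sampling mix
    "english": 1_500_000_000,
    "turkish": 1_500_000_000,
    "swahili": 3_500_000_000,
}

# Effective budget: how much content each language actually sees once
# tokenizer overhead is divided out.
for lang, tokens in nominal_budget.items():
    effective = tokens / fertility_vs_english[lang]
    print(f"{lang:>8}: {effective / 1e9:.2f}B English-equivalent tokens")

# Turkish's nominal 1.5B tokens buy only ~1.17B English-equivalent tokens,
# so its raw allocation needs to be scaled up to reach parity.
```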
Who's Using This Right Now?
Enterprise adoption is accelerating. Financial services, e-commerce, and government agencies are the biggest users. Why? Because they serve customers in multiple languages, and regulators are watching. The EU's AI Act, effective February 2025, requires companies to prove their AI systems treat all supported languages fairly. No more "English works great, the rest is a bonus." You need evidence. And optimal sampling gives you that. It's mathematically grounded. It's reproducible. It's auditable.
On the ground, companies are seeing results. One AWS customer reduced language-specific chatbot failures from 22% to 8% across 15 languages. Another cut training costs by $1.5 million per model iteration. And developers on Hugging Face reported 27 BLEU point gains in Swahili translation with just 15% more training time.
The market is shifting. In 2022, only 22% of enterprises used multilingual AI. By 2024, it was 57%. And 73% of them are now reducing their training data volume because they learned: more data isn't the answer. Better data is.
What You Need to Implement This
If you want to apply this today, here's what you need (a rough pipeline sketch follows the list):
- A way to identify languages accurately (LIUM toolkit or similar, with >99.5% accuracy)
- A classification of languages into families (World Atlas of Language Structures is the standard)
- A small-scale model (85M parameters) to test sampling ratios before scaling up
- Tools to handle code-switched data (you'll need custom preprocessing)
- Access to the multilingual-scaling-tools GitHub repo (1,247 stars, maintained by the original researchers)
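Putting that checklist together, here is the rough pipeline sketch promised above: calibrate candidate sampling mixes on a small proxy model, then carry the winning ratios over to the full-size run. Every function body is a placeholder for your own tooling (language ID and code-switching preprocessing would sit upstream of this), and nothing here reflects the actual multilingual-scaling-tools API:

```python
# Rough pipeline skeleton: calibrate sampling ratios on a small proxy model,
# then reuse them for the full-size run. All function bodies are placeholders
# for your own training and evaluation code; no real library API is used.

import random

def candidate_mixes(languages: list[str]) -> list[dict[str, float]]:
    """A few candidate sampling mixes to compare (each sums to 1)."""
    n = len(languages)
    uniform = {lang: 1.0 / n for lang in languages}
    skewed = {lang: 0.7 if lang == "english" else 0.3 / (n - 1)
              for lang in languages}
    return [uniform, skewed]

def proxy_loss(mix: dict[str, float]) -> float:
    """Placeholder: train an ~85M-parameter proxy model on `mix` and return
    its average per-language cross-entropy. Random stand-in value here."""
    return random.random()

def calibrate(languages: list[str]) -> dict[str, float]:
    """Pick the mix whose proxy model scores best; per the scaling-law
    result, the same ratios should carry over to the larger model."""
    return min(candidate_mixes(languages), key=proxy_loss)

if __name__ == "__main__":
    langs = ["english", "swahili", "hindi", "tamil", "turkish"]
    print(calibrate(langs))
```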
Comments
Megan Blakeman
This just made me cry. I grew up hearing my grandma speak Tagalog, and now I can't even get my phone to understand her when she tries to use voice search. It's not about tech, it's about love. Thank you for saying this out loud.
More data isn't the answer. More care is.
December 25, 2025 AT 11:04
Akhil Bellam
Oh wow. Another "woke AI" manifesto. Of course the solution is to magically balance data like we're distributing cookies at a kindergarten party.
Let me guess: next you'll tell us that Swahili should get equal GPU time to English because "equity." Meanwhile, my Bengali-speaking clients can't get a single accurate invoice translated because your "optimal sampling" doesn't account for actual business logic. This isn't science, it's performative virtue signaling wrapped in jargon. You're not fixing AI. You're just making it slower, more expensive, and even less useful for real people.
December 27, 2025 AT 02:42
Amber Swartz
OK BUT DID YOU KNOW THAT GUARANI ISN'T EVEN ON GPT-4'S RADAR?
I just watched a video of a Paraguayan grandmother trying to use a medical chatbot in her native tongue and it responded with "I don't understand." She cried. I cried. The whole world cried.
This isn't a "technical problem." It's a spiritual crisis. AI is supposed to serve humanity, not just the 5% who speak English fluently. If you're not crying right now, you're not paying attention.
December 28, 2025 AT 02:58
Robert Byrne
You're right. But you're also missing something huge: tokenization. Turkish, Finnish, and Hungarian aren't just "low-resource"; they're morphologically dense. You can't treat them like English with fewer tokens. If you're using BPE on agglutinative languages without language-specific subword segmentation, you're literally training on garbage.
And don't get me started on code-switching. If your dataset doesn't include "I'm so tired, yaar, kya karu?" ("man, what do I do?") from Mumbai millennials, your model is a toy. Real multilingual AI needs linguistic anthropology, not just math. This paper? Groundbreaking. But the implementation? Still in its infancy. Fix tokenization first.
December 28, 2025 AT 03:27
Tia Muzdalifah
so like… i live in LA and my abuela only speaks mixteco and the only app that even tried to help her was google translate and it just said "hello" over and over
but this? this actually made me hopeful. like… maybe one day she'll be able to talk to a bot about her medicine without me having to translate every word. that's kinda beautiful.
also who made the github repo? i wanna star it. 1247 stars is legit.
December 28, 2025 AT 18:42
Zoe Hill
I just tried this on my little Hugging Face project and OMG it worked. I trained a tiny model with 0.7% Quechua and 0.3% English and… it actually understood "Ama silla" (don't lie) in context. I cried. Not because it's perfect, but because it tried.
Thank you for not giving up on the languages the world forgot. We need more of this. Not more data. More heart.
December 28, 2025 AT 23:22
Albert Navat
Let's be real: this is just "data balance" dressed up as ML innovation. You're talking about sampling ratios like they're sacred texts, but the real bottleneck isn't the algorithm, it's the data pipeline. Who's cleaning 1 million tokens of Guarani? Who's verifying it? Who's paying for it? No one. Because it's not profitable.
This isn't about equity. It's about optics. Companies will use this to check a box for the EU AI Act, then go back to training on English-only Reddit threads. The math is elegant. The execution? A fantasy. Unless you're Google or Meta, you're not doing this. And you know it.
December 30, 2025 AT 00:51
King Medoo
Let me tell you something you don't want to hear.
This isn't about language. It's about power. The people who control the data (Silicon Valley, Western universities, Big Tech) are the same people who decided which languages "matter." They didn't ignore Swahili because they were lazy. They ignored it because they didn't need it. They don't need your grandmother's voice. They need your credit card number.
And now you're giving them a moral excuse to pretend they care? No. This isn't justice. It's capitalism with a conscience. The model will still fail. The profit margins won't. The 95% of the world's languages? Still invisible. Still silent. Still exploited.
I'm not mad. I'm just… tired.
But hey, at least we got a GitHub repo with 1,247 stars.
December 30, 2025 AT 16:50
Rae Blackburn
Wait… this is a trap. The EU AI Act is just the first step. Next they'll force every AI to support every language, or be banned. And who's going to pay for it? YOU. Your taxes. Your job. Your privacy.
They're not trying to help Guarani speakers. They're trying to control you. Once they standardize "optimal sampling," they'll control what words are allowed, what dialects are "valid," and who gets to speak. This isn't equity. It's linguistic fascism.
They'll call it "fairness." But it's just control with a smile.
They're watching. They always are.
December 31, 2025 AT 03:17