A Silicon Valley AI lab released a multilingual chatbot trained on data covering 97 languages. English performance was state of the art. French and Spanish were solid. Swahili, Yoruba, and Bengali produced responses that were grammatically plausible but factually disconnected from the prompt — fluent hallucination in languages where the model had seen too little quality data to ground its outputs. The lab knew the problem. Their training data for these languages came primarily from web crawls: news sites, social media, government pages. The corpus was large in byte count but thin in quality, and it was missing the human-annotated supervision data that teaches a model what a good response looks like.
This is the core problem of multilingual data annotation for machine learning: the languages where models perform worst are the same languages where annotated training data is hardest to produce. Annotation talent for low-resource languages is scarce, quality standards for those languages are less established, and the commercial incentive to invest in annotation for languages with smaller user populations is weaker. The result is a vicious cycle: poor annotation leads to poor model performance, poor performance reduces adoption, reduced adoption reduces the incentive to invest in better annotation, and the gap between high-resource and low-resource languages keeps widening.
Data annotation for ML isn’t one thing. It’s a category that covers several distinct tasks, each with different quality requirements and different challenges when you scale across languages.
Text classification annotation assigns labels to text segments: sentiment, topic, intent, toxicity. This is the most common annotation task and the one where cross-language quality variance is most visible. A sentiment annotation guideline written for English doesn’t translate cleanly to Japanese, where indirectness and context-dependent politeness levels mean that a sentence that reads as neutral to an English speaker in literal translation might express strong negative sentiment to a Japanese annotator. The annotation guidelines need to be adapted for each language’s communicative norms, not just translated. If you skip this adaptation, your multilingual sentiment model learns different definitions of sentiment in different languages, and your cross-language analytics stop being comparable.
Named entity recognition annotation identifies and categorizes entities in text: people, organizations, locations, dates, products. The challenge in multilingual NER is that entity types and boundaries vary across languages and cultures. Chinese person names have a different structure than English names, and because Chinese text is written without spaces between words, the segmentation rules for identifying name boundaries don’t correspond to English name tokenization. Arabic organization names often include geographic or religious references that English annotators wouldn’t classify as part of the organization name. Hindi location names are followed by postpositions, and the name itself can change form before them, so annotators need explicit rules for where the entity span ends. If annotators apply English NER categories mechanically to other languages, the resulting dataset teaches the model to find English-style entities in non-English text, which means it misses entities that don’t fit the English pattern.
RLHF (reinforcement learning from human feedback) is the annotation task that matters most for large language models, and it’s where the multilingual quality gap has the most direct impact on user experience. RLHF works by presenting human annotators with multiple model outputs for the same prompt and asking them to rank the responses from best to worst. Those rankings are typically used to train a reward model, and the language model is then fine-tuned to produce more outputs like the highly ranked ones and fewer like the poorly ranked ones. This is how you teach a model to be helpful, harmless, and honest instead of just fluent.
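As a minimal sketch of the ranking step described above, the snippet below expands one annotator's ranking into pairwise preference records, a form commonly used when training a reward model. The function name `ranking_to_pairs` and the field names `prompt`, `chosen`, and `rejected` are illustrative assumptions, not the format of any particular training library.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand one ranking (best response first) into pairwise preferences.

    Every response is paired with each response ranked below it, so a
    ranking of n responses yields n * (n - 1) / 2 preference records.
    """
    pairs = []
    # combinations() preserves input order, so `better` always precedes
    # `worse` in the annotator's ranking.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical example: one annotator ranked three candidate replies.
pairs = ranking_to_pairs(
    "How do I reset my password?",
    ["step-by-step reply", "vague reply", "off-topic reply"],
)
print(len(pairs))  # 3
```

One consequence of this expansion is that a single inconsistent ranking contaminates several preference pairs at once, which is part of why annotator consistency matters so much for the reward signal.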
The multilingual RLHF problem is straightforward and expensive: you need annotators who are native speakers of the target language, who understand the cultural context well enough to judge whether a response is appropriate, and who can articulate ranking rationales in a format the training pipeline can process. For English, this talent pool is large. For Swahili, Tagalog, or Vietnamese, the pool is much smaller, and the annotators who exist often have different dialectal backgrounds, educational levels, and cultural assumptions that produce inconsistent rankings. Inconsistent rankings mean noisy reward signals, and noisy reward signals mean the RLHF fine-tuning doesn’t converge as well, which means the model in that language is less aligned and less reliable.
I talked to a data science team at a Bay Area startup that was fine-tuning a multilingual customer service chatbot for Southeast Asian markets. They had RLHF data for English, Mandarin, and Bahasa Indonesia but couldn’t find qualified annotators for Thai and Vietnamese at the volume they needed. The available annotators were generalist translators, not domain experts in customer service, and their rankings reflected translation quality rather than response utility. The model learned to produce grammatically correct Thai that didn’t actually answer the customer’s question, because the annotators had been ranking fluency over helpfulness. The team eventually built a custom annotation pipeline that included domain-specific training for annotators and a calibration phase where annotators ranked a shared set of examples to establish inter-annotator agreement before starting production work. The calibration phase alone took three weeks. The production annotation took another six. The total cost for Thai and Vietnamese RLHF data was 4x the per-language cost for English, which is a ratio that most AI labs budget for but many startups don’t anticipate.
The inter-annotator agreement problem is more severe for low-resource languages than for English, and it’s the reason that multilingual annotation projects need a more rigorous quality control framework than English-only projects. For English RLHF, typical inter-annotator agreement (measured by Cohen’s kappa) ranges from 0.6 to 0.8 depending on the task complexity. For languages where annotators have less shared cultural context and less exposure to the annotation conventions used in AI training, kappa values of 0.3 to 0.5 are common in early rounds. Because kappa measures agreement beyond what chance alone would produce, values in that range mean the annotators agree only modestly more often than chance would predict. If you fine-tune a model on data with that level of disagreement, the reward signal is very noisy, and the fine-tuning may not improve the model meaningfully.
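For readers who want to see what the kappa numbers measure, here is a small from-scratch sketch of Cohen's kappa for two annotators labeling the same items. Kappa is (observed agreement minus chance agreement) divided by (1 minus chance agreement). The labels and data are made up for illustration; a production pipeline would more likely use an existing statistics library.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labeled at random according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative data: 20 items, annotators agree on 14 of them (70%).
annotator_1 = ["pos"] * 10 + ["neg"] * 10
annotator_2 = ["pos"] * 7 + ["neg"] * 3 + ["neg"] * 7 + ["pos"] * 3
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.4
```

In this example the two annotators agree on 70% of items, but with two equally frequent labels half of that agreement is expected by chance, so kappa comes out at 0.4.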
The practical solution is a multi-stage quality pipeline: calibration rounds to establish shared understanding of the annotation guidelines, ongoing inter-annotator agreement monitoring with re-calibration when agreement drops below threshold, senior annotator review of borderline cases, and statistical sampling of final outputs for quality audit. This adds cost and time, but the alternative — shipping a model that performs well on English benchmarks and badly in the languages your users actually speak — is more expensive in the long run, because users who get bad results don’t come back.
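The stages above can be sketched as a simple per-batch decision. This is a rough illustration under stated assumptions, not a description of any specific team's pipeline: the function name `plan_next_step`, the 0.6 kappa threshold, the 5% audit rate, and the item format (a dict with `id` and `labels`) are all hypothetical choices.

```python
import random


def plan_next_step(batch_kappa, items, kappa_threshold=0.6,
                   audit_rate=0.05, seed=0):
    """Decide what happens to one annotated batch.

    items: list of dicts like {"id": ..., "labels": [label per annotator]}.
    The threshold and audit rate are assumed values; tune them per project.
    """
    # Stage 1: agreement below threshold sends the team back to calibration.
    if batch_kappa < kappa_threshold:
        return {"action": "recalibrate", "senior_review": [], "audit_sample": []}

    # Stage 2: items where annotators disagree go to a senior annotator.
    disputed = [it["id"] for it in items if len(set(it["labels"])) > 1]
    agreed = [it["id"] for it in items if len(set(it["labels"])) == 1]

    # Stage 3: a random sample of the agreed items is audited anyway,
    # since annotators can agree with each other and still be wrong.
    sample_size = min(len(agreed), max(1, round(len(agreed) * audit_rate)))
    audit_sample = random.Random(seed).sample(agreed, sample_size)

    return {"action": "continue", "senior_review": disputed,
            "audit_sample": audit_sample}


# Hypothetical batch: 40 unanimous items and 2 disputed ones.
batch = [{"id": i, "labels": ["A", "A"]} for i in range(40)]
batch += [{"id": 40, "labels": ["A", "B"]}, {"id": 41, "labels": ["B", "A"]}]
print(plan_next_step(0.35, batch)["action"])  # recalibrate
print(plan_next_step(0.72, batch)["senior_review"])  # [40, 41]
```

Fixing the random seed keeps the audit sample reproducible, which makes it easier to compare audit results across annotation rounds.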
There’s also a data architecture question that doesn’t get enough attention. Most multilingual annotation projects store annotations in a flat, language-agnostic schema: prompt, response, label, annotator ID. This schema doesn’t capture the language-specific metadata that’s critical for quality analysis: annotator dialect, annotation round, calibration set membership, inter-annotator agreement score. Without this metadata, you can’t diagnose why a model performs differently in two languages that have similar amounts of training data, and you can’t improve the annotation process iteratively. The better approach is a schema that includes language-specific quality metadata from the start, even though it makes the data pipeline more complex.
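To make the schema point concrete, here is one possible shape for a record that carries the quality metadata listed above alongside the flat fields. The class name, field names, and the helper `mean_agreement_by_dialect` are assumptions made for this sketch, and the sample values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class AnnotationRecord:
    # Fields from the flat, language-agnostic schema.
    prompt: str
    response: str
    label: str
    annotator_id: str
    # Language-specific quality metadata.
    language: str
    annotator_dialect: Optional[str]
    annotation_round: int
    in_calibration_set: bool
    agreement_score: Optional[float]  # e.g. kappa for the record's batch


def mean_agreement_by_dialect(records):
    """Average agreement score per (language, dialect) group."""
    buckets = defaultdict(list)
    for r in records:
        if r.agreement_score is not None:
            buckets[(r.language, r.annotator_dialect)].append(r.agreement_score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Invented sample data; the dialect labels are placeholders.
records = [
    AnnotationRecord("p1", "r1", "good", "ann-01", "th", "dialect-a", 1, True, 0.4),
    AnnotationRecord("p2", "r2", "bad", "ann-02", "th", "dialect-a", 1, False, 0.6),
    AnnotationRecord("p3", "r3", "good", "ann-03", "th", "dialect-b", 2, False, 0.7),
]
print(mean_agreement_by_dialect(records))
```

With the metadata stored per record, questions such as "is agreement lower for one dialect group?" or "did agreement improve after the second calibration round?" become simple group-by queries instead of forensic work.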
Artlangs Translation provides multilingual data annotation for machine learning across 230+ language pairs: text classification and NER annotation with language-adapted guidelines, RLHF annotation with native-speaker domain specialists, multi-stage quality pipelines with inter-annotator agreement monitoring, and data architecture consulting for multilingual training datasets. Because your model is only as good as the data it learns from, and in most languages, that data has never been good enough.
