Finally, we focus on the models' multilingual understanding of text in images and how to improve it. Unlike tasks based on natural images, text-in-image tasks cannot be translated trivially from English: even if the prompt and output text are translated, the text in the image remains English. We therefore test whether synthetic multilingual OCR data, which can be generated at scale in any number of languages, improves performance.
Training Setup: We generate synthetic OCR data using Synthdog: 500k samples for pre-training, 50k of which are reused during instruction tuning. We consider the following setups: 100%, 50%, and 1% English (with the remainder spread uniformly over the other languages). Additionally, we consider a Latin-down setup that halves the samples for all Latin-script languages (from 5k, as in the 1% English setup, to 2.5k) and doubles them for all other scripts (to 10k each). Importantly, the image encoder is now unfrozen and trained along with the rest of the model. All other pre-training and instruction-tuning data uses the L100 50% English setup.
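For concreteness, the following is a minimal Python sketch of the per-language sample counts these setups imply. The constants follow the numbers above; the helper names are ours, and we assume L100 means English plus 99 other languages:

```python
# Sketch of the Synthdog pre-training allocations described above.
TOTAL = 500_000  # synthetic OCR samples for pre-training
N_LANGS = 100    # L100: English plus 99 other languages (assumed)

def allocation(english_share: float) -> dict:
    """A fixed English share; the remainder is spread uniformly."""
    english = int(TOTAL * english_share)
    per_other = (TOTAL - english) // (N_LANGS - 1)
    return {"en": english, "per_other_lang": per_other}

print(allocation(1.00))  # {'en': 500000, 'per_other_lang': 0}
print(allocation(0.50))  # {'en': 250000, 'per_other_lang': 2525}
print(allocation(0.01))  # {'en': 5000, 'per_other_lang': 5000}

# Latin-down: start from the 1% English split (5k per language), then
# halve Latin-script languages (to 2.5k) and double the rest (to 10k).
def latin_down(base_per_lang: int, is_latin_script: bool) -> int:
    return base_per_lang // 2 if is_latin_script else base_per_lang * 2

print(latin_down(5_000, True), latin_down(5_000, False))  # 2500 10000
```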
Evaluation with SMPQA: Multilingual text-in-image evaluation data is limited. To close this gap, we propose SMPQA (Synthetic Multilingual Plot QA), which enables evaluation in different languages. SMPQA generates synthetic plots in diverse languages (here: 5 Latin-script languages from different resource tiers and 6 major non-Latin-script languages) with corresponding questions. There are two sub-tasks: grounding text given in the input prompt to the image (yes/no questions) and reading text from the image.
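To make the two sub-tasks concrete, here is a sketch of how an SMPQA-style sample could be generated with matplotlib. This is an illustration under our own assumptions, not the actual SMPQA pipeline; the function name, word list, and question templates are hypothetical stand-ins:

```python
# Hypothetical SMPQA-style sample generator (illustrative only).
import random
import matplotlib.pyplot as plt

def make_smpqa_sample(words: list[str], out_path: str = "plot.png") -> dict:
    """Render a bar plot labeled in the target language and derive one
    grounding (yes/no) and one reading question from it."""
    labels = random.sample(words, 4)
    values = [random.randint(1, 10) for _ in labels]
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    fig.savefig(out_path)
    plt.close(fig)

    biggest = labels[values.index(max(values))]
    present = labels[0]
    absent = next(w for w in words if w not in labels)
    return {
        "image": out_path,
        "grounding": [  # text is given in the prompt; answer is yes/no
            (f"Is '{present}' the label of a bar?", "yes"),
            (f"Is '{absent}' the label of a bar?", "no"),
        ],
        "reading": [  # text must be read from the image
            ("What is the label of the biggest bar?", biggest),
        ],
    }

# Usage with hypothetical German label words:
sample = make_smpqa_sample(["Hund", "Katze", "Baum", "Haus", "Brot"])
print(sample["grounding"], sample["reading"])
```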
Results: We make several observations:
Takeaway: Large-scale multilingual OCR data is key to improving multilingual capabilities on text-in-image tasks. Synthetic OCR data, as generated here, works well, but languages using non-Latin scripts in particular may need orders of magnitude more data.
We show the efficacy of our takeaways in practice by training Centurio, a family of massively multilingual LVLMs with state-of-the-art performance. We make the following design choices for the models:
Results:
On average, Centurio achieves the best results across the multilingual portions of 14 tasks and additionally performs strongly on English.
These results demonstrate the effectiveness of our training composition: we retain high English performance while maximizing the models' multilingual capabilities.
Grouping the results by language tier, we find that our models shine in the low-resource tiers T1 and T2 while remaining competitive for higher-resource languages.
Only on text-heavy tasks (primarily MTVQA and SMPQA) does Centurio fall behind.
While we show the importance of multilingual OCR training (Centurio succeeds at the SMPQA reading task in more languages than, for example, Pangea), its limited input resolution and orders of magnitude less OCR data compared to Qwen2-VL and others result in comparatively poor performance.
The authors would like to thank the Pangea team for their project webpage template.
@article{centurio2025,
author = {Gregor Geigle and
Florian Schneider and
Carolin Holtermann and
Chris Biemann and
Radu Timofte and
Anne Lauscher and
Goran Glava\v{s}},
title = {Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models},
journal = {arXiv},
volume = {abs/2501.05122},
year = {2025},
url = {https://arxiv.org/abs/2501.05122},
eprinttype = {arXiv},
eprint = {2501.05122},
}