Introduction to Multilingual OCR
Multilingual OCR models require large volumes of annotated image-text pairs to reach high accuracy. Nemotron OCR v1, a strong English OCR model, could not read documents accurately in other languages. Nemotron OCR v2, a production-ready, commercially usable successor, was trained on synthetic data and outperforms specialized per-language variants across all target languages.
Synthetic Data Generation
Synthetic data generation is central to building efficient multilingual OCR models. A key augmentation strategy is generating multilingual samples with complex reading orders, such as right-to-left (RTL) scripts, to ensure global coverage. This makes it possible to train a single unified model that handles all target languages simultaneously.
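One concrete wrinkle of RTL synthesis is that the visual (rendered) character order differs from the logical (label) order. The sketch below illustrates that split for single-direction lines; the function names and sample format are illustrative assumptions, not the actual pipeline, and a full implementation would use the Unicode Bidirectional Algorithm for mixed-direction text.

```python
def to_visual_order(text: str, rtl: bool) -> str:
    # For a pure RTL line, the visual (left-to-right pixel) order is the
    # reverse of the logical (reading) order. Mixed-direction lines would
    # need the full Unicode BiDi algorithm; this sketch skips that.
    return text[::-1] if rtl else text


def make_sample(text: str, rtl: bool = False) -> dict:
    # A synthetic OCR sample pairs the visual-order string (what gets
    # rendered into the image) with the logical-order ground-truth label
    # the model must learn to predict.
    return {"visual": to_visual_order(text, rtl), "label": text}
```

For example, a Hebrew line is stored with its logical-order label while the renderer draws its characters reversed, so the model learns to emit text in reading order regardless of pixel order.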
12M
synthetic data samples
680K
real-world images
Training and Evaluation
The Nemotron OCR v2 multilingual model, trained largely on synthetic data, outperforms specialized per-language variants across all target languages while substantially improving both accuracy and inference speed.
21.7%
relative improvement
2.6x
inference cost reduction
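The relative-improvement figure is a ratio of error reduction to baseline error. A quick sketch of the arithmetic, using hypothetical error rates chosen purely for illustration (the actual baseline and v2 numbers are not given here):

```python
def relative_improvement(baseline_err: float, new_err: float) -> float:
    # Fraction of the baseline error that the new model eliminates.
    return (baseline_err - new_err) / baseline_err


# Hypothetical numbers: a baseline error of 10% dropping to 8%
# is a 20% relative improvement, even though the absolute drop
# is only 2 percentage points.
print(relative_improvement(0.10, 0.08))
```

This is why relative improvements can look large even when both error rates are already small, as with near-zero NED scores.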

Real-World Applications
Real-world OCR involves reading receipts, IDs, multi-page contracts, and documents in dozens of languages. Nemotron OCR v2 is a single unified model that handles all five target languages simultaneously, achieving near-zero normalized edit distance (NED) on each. Because it is production-ready and commercially usable, it is well suited to these applications.
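NED divides the edit (Levenshtein) distance between the predicted and ground-truth strings by the longer string's length, so 0.0 means a perfect read and 1.0 means nothing matched. A minimal sketch of the metric (not the evaluation code used for these results):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over two rolling rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def ned(pred: str, target: str) -> float:
    # Normalized edit distance: 0.0 = exact match, 1.0 = fully wrong.
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))
```

For example, misreading one character in a five-character word gives a NED of 0.2; "near-zero NED across all target languages" means almost every character of almost every document is read correctly.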
Multilingual OCR Model Comparison
| Component | Open / This Approach | Proprietary Alternative |
|---|---|---|
| Model Architecture | Single unified model | Specialized variants |
| Language Support | 5 languages | Limited language support |
| Training Data | Synthetic + real-world data | Real-world images only |
🔑 Key Takeaway
Trained on synthetic data, the Nemotron OCR v2 multilingual model outperforms specialized per-language variants across all target languages, with near-zero NED scores. A single unified model that covers every target language simultaneously makes it a practical choice for real-world deployments.