Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition

Gagan Bhatia, El Moatez Billah Nagoudi, Fakhraddin Alwajih, Muhammad Abdul-Mageed. No Venue, 2024

Datasets · Efficiency · Image Text Integration · Model Architecture · Productivity Enhancement · Visual Contextualization

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam’s potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
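
The abstract describes a vision-encoder-decoder design that pairs a SwinV2 image encoder with a RoBERTa text decoder. Below is a minimal sketch, not the authors' released code, of how such a pairing can be wired together with Hugging Face's `VisionEncoderDecoderModel`; the checkpoint names (`microsoft/swinv2-tiny-patch4-window8-256`, `xlm-roberta-base` as a multilingual RoBERTa stand-in for an Arabic decoder) and the image file name are illustrative assumptions, and the untrained cross-attention weights would still need fine-tuning on image-text pairs before producing real transcriptions.

```python
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionEncoderDecoderModel,
)

# Assumed stand-in checkpoints; the paper's actual weights may differ.
encoder_name = "microsoft/swinv2-tiny-patch4-window8-256"
decoder_name = "xlm-roberta-base"  # multilingual RoBERTa that covers Arabic

image_processor = AutoImageProcessor.from_pretrained(encoder_name)
tokenizer = AutoTokenizer.from_pretrained(decoder_name)

# Wire the SwinV2 encoder to the RoBERTa decoder; the wrapper adds
# cross-attention layers to the decoder and a projection if hidden sizes differ.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_name, decoder_name
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Transcribe a single text-line image (hypothetical file name). With only
# pretrained-but-unaligned weights the output is noise until fine-tuning.
image = Image.open("line.png").convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values
with torch.no_grad():
    ids = model.generate(pixel_values, max_new_tokens=64)
print(tokenizer.batch_decode(ids, skip_special_tokens=True)[0])
```

In this setup the encoder's patch features serve as the cross-attention memory for the autoregressive decoder, which is the same general recipe the abstract's WER figures (0.80% HWR, 1.18% OCR) are reported for, though those numbers of course depend on the authors' training data and configuration rather than this sketch.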

Similar Work