Lorya

Improving the accuracy, consistency and reliability of document processing for non-standard orthographies and low-resource languages.

Past and Current Partners

United Nations Development Programme (UNDP), Government of the Republic of Serbia, Government of France, Government of Japan

Active Countries

The Republic of Serbia

Thematic area(s)

Inclusive Growth

Technology

AI/ML, SaaS, Open Web-based Application, Optical Character Recognition (OCR)

Organisation Name

UNDP Serbia

The Problem

AI language technology today is developed mostly for English and other high-resource languages, which risks excluding communities that speak under-represented and low-resource languages from the potential benefits of AI. Around the world, machine-readable data for low-resource languages has historically been limited, a gap compounded by today’s limited attention to adapting new technologies to non-standard documents. As a result, many cultural and historical materials tend to exist as images and scans, and traditional Optical Character Recognition (OCR) tools fail to process their complex scripts and older typographies. Overall, this limits access to public information, government services, and platforms in many of the most linguistically diverse regions of the world, thereby creating risks to human and political security.

The Solution

Lorya is an innovative, open, web-based application and machine learning pipeline designed to improve the accuracy, consistency and reliability of document processing for low-resource and orthographically complex languages. The tool transforms physical documents (such as manuscripts, newspapers, and books) into machine-readable text that can be used to train AI language models in local languages. By leveraging advanced OCR and machine learning techniques, Lorya empowers more than 11 million Serbian speakers to unlock, digitize and use historical documents, expanding their ability to participate in the global AI revolution. With a language- and orthography-agnostic architecture, intuitive design and easy-to-use data validation interface, Lorya is designed to support language digitization efforts led by low-resource language communities across the globe.

How it works

  • Step 1: Local communities compile physical scans or photographs of documents for processing.
  • Step 2: Countries and language communities gain access to the open, web-based Lorya application, which is designed to support diverse languages and orthographies through a universal user interface.
  • Step 3: Local teams configure the application for their specific language(s) and writing systems. Lorya's language- and orthography-agnostic design makes adapting to new scripts simple, including complex typographies, mixed orthographies, or unique alphabet systems.
  • Step 4: Users upload images and scans of printed materials. Lorya’s advanced machine learning pipeline automatically converts these into machine-readable text.
  • Step 5: Lorya allows human reviewers to check its results and flag any potential inaccuracies, which feeds iterative improvement.
  • Step 6: As documents are processed, the platform’s machine learning engine iteratively learns from new data, enhancing accuracy across languages.
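
The review loop in Steps 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Lorya's actual implementation: the `Page` and `ReviewQueue` classes, the `confidence` score and the 0.90 threshold are all hypothetical names and values chosen for the example. It shows how OCR output below a confidence cut-off might be routed to a human reviewer, and how each correction could be stored as a training pair for later model updates.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    """One uploaded scan plus the text an OCR step produced for it (hypothetical)."""
    image_name: str
    ocr_text: str
    confidence: float                     # assumed 0..1 score reported by the OCR step
    corrected_text: Optional[str] = None  # filled in by a human reviewer


@dataclass
class ReviewQueue:
    """Routes low-confidence pages to reviewers and keeps their corrections."""
    threshold: float = 0.90               # assumed cut-off, not a Lorya setting
    pending: list = field(default_factory=list)
    training_pairs: list = field(default_factory=list)

    def submit(self, page: Page) -> bool:
        """Step 5: queue the page for review if its confidence is below the threshold."""
        needs_review = page.confidence < self.threshold
        if needs_review:
            self.pending.append(page)
        return needs_review

    def correct(self, page: Page, corrected_text: str) -> None:
        """Step 6: store the correction as an (OCR output, corrected text) pair."""
        page.corrected_text = corrected_text
        self.pending.remove(page)
        self.training_pairs.append((page.ocr_text, corrected_text))


queue = ReviewQueue()
clean = Page("page_001.png", "Народна библиотека", confidence=0.97)
noisy = Page("page_002.png", "Hародна биб1иотека", confidence=0.62)

queue.submit(clean)   # above the threshold, so it is not queued
queue.submit(noisy)   # below the threshold, so it waits for a reviewer
queue.correct(noisy, "Народна библиотека")
```

After this run, `queue.training_pairs` holds one pair that a retraining job could consume. How Lorya actually selects pages for review and feeds corrections back into its models is not described here, so the thresholding rule above should be read as one plausible design among several.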

Bridging the digital divide

Lorya is helping to address challenges faced by many linguistically marginalized communities by integrating state-of-the-art OCR models into an interface adapted for deployment in low-resource contexts. Its open platform and flexible approach empower local teams across countries to adapt and deploy scalable solutions that increase the availability of cultural heritage materials in digital form for researchers, students and the public.

Impact and highlights

Lorya has enabled the digitization and AI-based processing of cultural and historical materials for under-represented languages, significantly improving access for researchers, students, and the public. The technology has delivered marked advancements in OCR accuracy and efficiency across scripts and orthographies that were previously out of reach for traditional tools. Lorya has been successfully piloted with the historical collections of the National Library of Serbia, processing 16,000 archival periodicals into machine-readable text with significantly improved results compared to off-the-shelf software. A new interface, improved image segmentation and post-OCR improvement steps have been developed to adapt the pipeline flexibly to a complex linguistic context.
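
OCR accuracy comparisons of this kind are commonly expressed as a character error rate (CER): the edit distance between the OCR output and a human-verified transcription, divided by the length of that transcription. The sketch below is a generic illustration of the metric and does not reproduce how the Lorya pilot was evaluated; the function names are chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the verified reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

A lower CER means fewer character-level corrections are needed. For scripts with combining marks, both strings would typically be Unicode-normalized (for example with `unicodedata.normalize`) before comparison so that visually identical text is not counted as an error.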

Plans for expansion

Further development of Lorya involves moving beyond a national innovation towards a reusable open-source tool for communities around the world. Following successful piloting in Serbia, Lorya is preparing deployments in Iraq and Nepal in 2026, demonstrating its adaptability across distinct linguistic and institutional environments. As part of an effort to further engage the Arabic-speaking community, Lorya is being showcased at the Conference of the European Chapter of the Association for Computational Linguistics 2026 in Morocco, enabled by Digital X 3.0. These efforts are expected to support future scaling in Arabic-speaking countries across the Middle East and Africa.