Saturday, 6 June 2026

MuRIL: Multilingual Representations for Indian Languages

The paper “MuRIL: Multilingual Representations for Indian Languages” introduces MuRIL, a multilingual language model built specifically for Indian languages. The motivation is simple but important: India is one of the most multilingual societies in the world, yet many general-purpose multilingual models do not perform well enough on Indian-language tasks.

MuRIL stands for Multilingual Representations for Indian Languages. It is designed to handle Indian-language text written in native scripts as well as transliterated text written in Latin script. This matters because people often write Hindi, Telugu, Tamil, Bengali, Kannada, Marathi, Urdu, and other Indian languages in Latin (English) letters in informal digital spaces such as chats, comments, and social media.

1. Problem Addressed by the Paper

India has a very large number of languages and dialects. The paper notes that India has 1369 rationalized languages and dialects, 22 scheduled languages, and 121 languages with more than 10,000 speakers. Despite this linguistic richness and India’s large digital footprint, many existing multilingual language models perform poorly on Indian languages.

A major reason is that multilingual models such as mBERT are trained on more than 100 languages at the same time. This means Indian languages receive limited representation in training data and vocabulary. As a result, the model may not learn Indian-language grammar, morphology, vocabulary, and usage patterns deeply enough.

Core problem: General multilingual models do not adequately represent Indian languages, especially low-resource languages and transliterated Indian-language text.

2. Why MuRIL Was Needed

The paper argues that Indian languages need a language model that is trained with focused attention on Indian linguistic realities. These realities include multiple scripts, uneven digital resources, code-mixing with English, and transliteration into Latin script.

For example, a Hindi sentence may appear in Devanagari script, but the same sentence may also appear as Roman Hindi in everyday typing. A general model trained mostly on native-script text may not understand the Romanized version properly.

MuRIL addresses this by training on three important forms of data:

  • monolingual Indian-language text,
  • translated Indian-language and English document pairs, and
  • transliterated native-script and Latin-script document pairs.

This makes MuRIL more suitable for Indian-language understanding than a broad multilingual model that treats Indian languages as only a small part of a very large multilingual pool.

3. Languages Supported by MuRIL

MuRIL supports 17 languages in total: 16 Indian languages and English. The Indian languages covered in the paper are:

| Language | Code | Script / Context |
| --- | --- | --- |
| Assamese | as | Eastern Indo-Aryan language |
| Bengali | bn | Bengali script |
| Gujarati | gu | Gujarati script |
| Hindi | hi | Devanagari script |
| Kannada | kn | Kannada script |
| Kashmiri | ks | Low-resource Indian language in the dataset context |
| Malayalam | ml | Malayalam script |
| Marathi | mr | Devanagari script |
| Nepali | ne | Devanagari script |
| Oriya / Odia | or | Odia script |
| Punjabi | pa | Gurmukhi context in Indian NLP use |
| Sanskrit | sa | Classical Indian language |
| Sindhi | sd | Indian-language context |
| Tamil | ta | Tamil script |
| Telugu | te | Telugu script |
| Urdu | ur | Perso-Arabic script |
| English | en | Used for cross-lingual transfer and translation alignment |

4. Training Data Used

The paper uses several types of data to train MuRIL. This is one of the most important strengths of the model.

| Data Type | Source | Purpose |
| --- | --- | --- |
| Monolingual data | Common Crawl OSCAR corpus and Wikipedia | Helps the model learn language structure, vocabulary, and usage. |
| Translated data | PMINDIA parallel corpus and machine-translated documents | Helps the model align Indian-language text with English. |
| Transliterated data | Dakshina dataset and indic-trans transliteration | Helps the model understand Indian-language text written in Latin script. |

The use of translated and transliterated document pairs is especially important because it provides supervised cross-lingual signals during training.

5. Training Objectives: MLM and TLM

MuRIL is trained using two language-modeling objectives: Masked Language Modeling and Translation Language Modeling.

Masked Language Modeling

Masked Language Modeling, or MLM, is the standard BERT-style training objective. Some tokens in a sentence are masked, and the model learns to predict the missing tokens from context.

Conceptually:

\[ \text{Input: } \text{The saree is [MASK].} \]

\[ \text{Model learns to predict: beautiful, red, traditional, etc.} \]

MLM uses monolingual text and helps the model learn the structure of each language.
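The masking step can be sketched in a few lines of Python. This is a simplified illustration, not the paper's pipeline: the 15% default rate and the choice to always substitute `[MASK]` are assumptions borrowed from common BERT practice, and `mask_tokens` is a hypothetical helper that works on whitespace-split words rather than WordPiece tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked,
    and None where the token was left untouched.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model is trained to recover this
        else:
            masked.append(tok)
            labels.append(None)   # no prediction target at this position
    return masked, labels

tokens = "The saree is beautiful .".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During training, the loss would be computed only at positions where the label is not `None`.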

Translation Language Modeling

Translation Language Modeling, or TLM, uses parallel text pairs. These may be Indian-language and English pairs, or native-script and transliterated pairs. The model sees both sides together and learns cross-lingual alignment.

A simplified view is:

\[ \text{Hindi sentence} + \text{English translation} \rightarrow \text{shared contextual representation} \]

For transliteration:

\[ \text{Native script sentence} + \text{Latin transliteration} \rightarrow \text{shared contextual representation} \]

This is important because it helps the model connect meaning across scripts and languages.
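A rough sketch of how a TLM input could be assembled is shown below. The names (`build_tlm_example`, the `[CLS]`/`[SEP]` layout, the segment ids) and the toy word-level sentences are illustrative assumptions in the style of BERT inputs, not details taken from the paper. The key idea is that both sides of the pair sit in one sequence and masking is applied across both, so a masked word on one side can be recovered from the other side.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def build_tlm_example(side_a, side_b, mask_prob=0.15, seed=None):
    """Concatenate a parallel pair and mask tokens on both sides.

    Returns (inputs, segments, labels). Special tokens are never masked.
    """
    rng = random.Random(seed)
    tokens = [CLS] + side_a + [SEP] + side_b + [SEP]
    # Segment 0 covers [CLS] + side_a + [SEP]; segment 1 covers side_b + [SEP].
    segments = [0] * (len(side_a) + 2) + [1] * (len(side_b) + 1)
    inputs, labels = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, segments, labels

# Toy pairs (illustrative only): a translation pair and a transliteration pair.
hindi = ["साड़ी", "सुंदर", "है"]
english = ["the", "saree", "is", "beautiful"]
romanized = ["saree", "sundar", "hai"]

print(build_tlm_example(hindi, english, mask_prob=0.3, seed=0))
print(build_tlm_example(hindi, romanized, mask_prob=0.3, seed=0))
```

The same function serves both pair types, which mirrors the point above: translation pairs and transliteration pairs are handled by one objective.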

6. Why Transliteration Matters

Indian digital communication often uses transliteration. For example, someone may write Hindi, Telugu, Kannada, Bengali, or Tamil words using English letters. This is common in WhatsApp messages, social media posts, search queries, comments, product reviews, and informal customer feedback.

A model that only understands native scripts may fail on such data. MuRIL explicitly includes transliterated training examples, making it better suited for real Indian digital text.

Simple intuition: MuRIL is trained not only to understand “भारत” but also forms such as “bharat.” This makes it much more useful for Indian-language digital applications.

7. Upsampling Low-Resource Languages

The training corpus has uneven representation across languages. Some languages have much more available text than others. If the model is trained directly on the raw distribution, high-resource languages dominate and low-resource languages receive less learning attention.

To address this, the authors upsample low-resource languages using the following multiplier:

\[ m_i = \left( \frac{\max_{j \in L} n_j}{n_i} \right)^{1-\alpha} \]

Here, \(m_i\) is the multiplier for language \(i\), \(n_i\) is the token count for language \(i\), \(L\) is the set of languages, and \(\alpha\) is set to \(0.3\).

The upsampled token count becomes:

\[ m_i \times n_i \]

This gives smaller languages more representation during training while still preserving the overall multilingual structure.
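To make the multiplier concrete, here is a small worked example in Python. The token counts are hypothetical round numbers chosen for easy arithmetic, not the paper's corpus statistics; only the formula and \(\alpha = 0.3\) come from the text above.

```python
def upsampling_multipliers(token_counts, alpha=0.3):
    """m_i = (max_j n_j / n_i) ** (1 - alpha) for every language i."""
    n_max = max(token_counts.values())
    return {lang: (n_max / n) ** (1 - alpha) for lang, n in token_counts.items()}

# Hypothetical token counts (NOT from the paper).
counts = {"hi": 1_000_000, "ta": 100_000, "ks": 1_000}

mult = upsampling_multipliers(counts)
upsampled = {lang: mult[lang] * counts[lang] for lang in counts}

for lang in counts:
    print(lang, round(mult[lang], 2), round(upsampled[lang]))
```

With these toy counts the largest language keeps a multiplier of 1, the language with ten times less data gets a multiplier of roughly 5, and the language with a thousand times less data gets roughly 126. Algebraically, \(m_i \times n_i = (\max_j n_j)^{0.7} \, n_i^{0.3}\), so the original ordering of languages is preserved while the gap between them shrinks.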

8. Vocabulary and Tokenization

The paper places strong emphasis on vocabulary. MuRIL uses a cased WordPiece vocabulary learned from the upsampled pre-training data. The final vocabulary size is:

\[ 197,285 \]

This vocabulary is larger than mBERT’s and, because it is learned only from the 17 supported languages, it gives far more room to Indian scripts and their transliterated forms.

The paper uses the idea of fertility ratio, which means the average number of subwords into which a word is split. A higher fertility ratio means words are broken into more pieces, which can make it harder for the model to preserve each word’s meaning.

For example, if a language model breaks one Indian-language word into many awkward fragments, it may struggle to understand the word as a meaningful unit.

Why MuRIL tokenization helps: MuRIL’s vocabulary contains better representation for Indian scripts and transliterated forms, so Indian-language words are split into fewer and more meaningful pieces than in mBERT.
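The fertility ratio is easy to compute once a tokenizer is available. The sketch below uses a toy greedy longest-match tokenizer and two made-up vocabularies; it is not real WordPiece (no `##` prefixes, no unknown-token handling) and the vocabularies are not MuRIL’s or mBERT’s. It only illustrates how a vocabulary that contains whole words lowers fertility.

```python
def greedy_tokenizer(vocab):
    """Build a toy longest-match-first tokenizer over the given vocabulary."""
    def tokenize(word):
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is in the vocabulary.
            while end > start and word[start:end] not in vocab:
                end -= 1
            if end == start:      # nothing matched: fall back to one character
                end = start + 1
            pieces.append(word[start:end])
            start = end
        return pieces
    return tokenize

def fertility(words, tokenize):
    """Average number of subword pieces per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)

words = ["pattu", "saree", "banarasi"]
small_vocab = {"pa", "ttu", "sa", "ree", "ba", "na", "ra", "si"}   # fragments only
large_vocab = small_vocab | {"pattu", "saree", "banarasi"}          # whole words too

print(fertility(words, greedy_tokenizer(small_vocab)))
print(fertility(words, greedy_tokenizer(large_vocab)))
```

With the fragment-only vocabulary the three words split into 2, 2, and 4 pieces; with the larger vocabulary each word stays whole, so fertility drops to 1.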

9. Pre-training Details

MuRIL is trained as a BERT-base encoder model. The paper reports the following important pre-training details:

| Aspect | Reported Setting |
| --- | --- |
| Architecture | BERT-base encoder |
| Objectives | MLM and TLM |
| Maximum sequence length | 512 |
| Global batch size | 4096 |
| Training steps | 1 million steps |
| Warm-up steps | 50,000 |
| Optimizer | AdamW |
| Learning rate | \(5 \times 10^{-4}\) |
| Parameters | 236 million |
| Training tokens | Approximately 16 billion unique tokens |
| Vocabulary size | 197,285 |
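To show how the warm-up setting interacts with the learning rate, here is a minimal schedule sketch. The peak rate, warm-up length, and total steps come from the table above. The shape of the schedule (linear warm-up followed by linear decay to zero) is an assumption borrowed from the usual BERT recipe and is not stated in this summary, so treat `learning_rate` as a hypothetical illustration.

```python
PEAK_LR = 5e-4            # reported learning rate
WARMUP_STEPS = 50_000     # reported warm-up steps
TOTAL_STEPS = 1_000_000   # reported training steps

# Upper bound on tokens seen per step: batch size x maximum sequence length.
MAX_TOKENS_PER_STEP = 4096 * 512

def learning_rate(step):
    """Linear warm-up to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, remaining)

for step in (0, 25_000, 50_000, 525_000, 1_000_000):
    print(step, learning_rate(step))
print(MAX_TOKENS_PER_STEP)
```

Under this assumed shape, the rate climbs for the first 5% of training and falls for the remaining 95%.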

10. Evaluation Method

The authors evaluate MuRIL on the XTREME benchmark, focusing on Indian-language test sets. The evaluation is done in a zero-shot cross-lingual setting. This means the model is fine-tuned on English training data and then evaluated on Indian-language test data.

This setting is challenging because the model must transfer learning from English to Indian languages. Strong performance in this setting indicates better cross-lingual understanding.

The tasks include:

  • Named Entity Recognition, or PANX
  • Part-of-Speech tagging, or UDPOS
  • Natural Language Inference, or XNLI
  • Sentence retrieval, or Tatoeba
  • Question Answering using XQuAD, MLQA, and TyDiQA-GoldP

11. Results Compared with mBERT

MuRIL outperforms mBERT across all reported Indian-language XTREME tasks. The average score improves from 59.1 for mBERT to 68.6 for MuRIL on native-script Indian-language test sets.

| Task | mBERT | MuRIL | Interpretation |
| --- | --- | --- | --- |
| PANX NER F1 | 58.0 | 77.6 | Large improvement in named entity recognition. |
| UDPOS F1 | 71.2 | 75.0 | Improved syntactic tagging. |
| XNLI Accuracy | 66.8 | 74.1 | Improved cross-lingual reasoning. |
| Tatoeba Accuracy | 18.4 | 25.2 | Better sentence retrieval, though still challenging. |
| XQuAD F1 / EM | 71.2 / 58.2 | 79.1 / 65.6 | Improved question answering. |
| MLQA F1 / EM | 65.3 / 51.2 | 73.8 / 58.8 | Better multilingual QA performance. |
| TyDiQA-GoldP F1 / EM | 63.1 / 51.7 | 75.4 / 59.3 | Strong improvement on typologically diverse QA data. |
| Average | 59.1 | 68.6 | MuRIL performs better overall. |
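For readers unfamiliar with the two question-answering metrics, the sketch below shows the usual idea behind them: EM (exact match) checks whether the predicted answer string equals the reference, and F1 measures token overlap between the two. This is a simplified version that splits on whitespace; the official SQuAD-style evaluation scripts also normalize case, punctuation, and articles, which is omitted here. The example strings are invented.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the two answers are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

prediction = "Kanchipuram silk saree"
reference = "Kanchipuram silk"
print(exact_match(prediction, reference))
print(token_f1(prediction, reference))
```

Here the prediction contains the full reference plus one extra word, so EM is 0 while F1 is 0.8 (precision 2/3, recall 1). This is why F1 is always at least as high as EM in the table.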

Performance on Transliterated Indian-Language Test Sets

The improvement is even stronger on transliterated test sets. On Indian-language text transliterated into Latin script, the average score improves from 21.1 for mBERT to 48.9 for MuRIL.

| Task on Transliterated Test Sets | mBERT | MuRIL |
| --- | --- | --- |
| PANX F1 | 14.2 | 57.7 |
| UDPOS F1 | 28.2 | 62.1 |
| XNLI Accuracy | 39.2 | 64.7 |
| Tatoeba Accuracy | 2.7 | 11.0 |
| Average | 21.1 | 48.9 |

This result directly supports the paper’s argument: Indian-language models must handle transliteration because Indian users often type local languages using Latin script.
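As a quick sanity check, the reported averages can be reproduced from the per-task scores. The snippet assumes the question-answering rows contribute their F1 score rather than EM; that is an inference from the arithmetic, not something stated above.

```python
from statistics import mean

# (mBERT, MuRIL) scores copied from the tables; QA rows use F1 only (assumption).
native = {
    "PANX": (58.0, 77.6),
    "UDPOS": (71.2, 75.0),
    "XNLI": (66.8, 74.1),
    "Tatoeba": (18.4, 25.2),
    "XQuAD": (71.2, 79.1),
    "MLQA": (65.3, 73.8),
    "TyDiQA-GoldP": (63.1, 75.4),
}
transliterated = {
    "PANX": (14.2, 57.7),
    "UDPOS": (28.2, 62.1),
    "XNLI": (39.2, 64.7),
    "Tatoeba": (2.7, 11.0),
}

def averages(scores):
    """Return (mBERT average, MuRIL average), rounded to one decimal place."""
    mbert = mean(m for m, _ in scores.values())
    muril = mean(r for _, r in scores.values())
    return round(mbert, 1), round(muril, 1)

print(averages(native))
print(averages(transliterated))
```

The results match the Average rows (59.1 and 68.6 on native script, 21.1 and 48.9 on transliterated text), which supports reading the averages as unweighted means over tasks.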

12. Qualitative Examples

The paper includes qualitative examples showing that MuRIL handles context better than mBERT in several cases.

Named Entity Recognition

In one example, the phrase “Atlanta Falcons” refers to a football team. MuRIL predicts it as an organization, while mBERT incorrectly treats Atlanta as a location. This shows that MuRIL uses context more effectively.

In another example, “Shirdi’s Sai Baba” is correctly treated by MuRIL as a person, while mBERT incorrectly leans toward location because of the word “Shirdi.”

Sentiment Analysis

The paper also shows examples where MuRIL correctly handles mixed-language and transliterated sentences. For instance, a Hindi sentence containing an English word and a negation is correctly interpreted by MuRIL.

Question Answering

In question answering, MuRIL is shown to connect native-script and transliterated references better. For example, when a concept appears in Hindi in the context but in transliterated form in the question, MuRIL is able to infer the answer correctly.

13. Relevance for Indian Textile and Saree Research

At first glance, MuRIL is an NLP paper, not a textile paper. But it is very relevant for Indian textile and saree research because saree knowledge is multilingual. Saree names, craft clusters, motifs, weaving techniques, fabric types, and customer descriptions often appear in Indian languages, English, and mixed forms.

For example, the same textile concept may appear as:

  • “Kanjivaram saree”
  • “Kanchipuram pattu”
  • “kanjivaram pattu saree”
  • “कांजीवरम साड़ी”
  • “pattu saree”
  • “పట్టు చీర”

A model like MuRIL can help connect these language forms better than a general English-centric system.

| MuRIL Concept | Possible Textile / Saree Research Use |
| --- | --- |
| Indian-language representation | Understand saree descriptions in Hindi, Telugu, Tamil, Kannada, Bengali, Marathi, and other languages. |
| Transliteration handling | Process customer searches such as “pattu saree,” “banarasi,” “kanchi pattu,” or “pochampally ikat.” |
| Cross-lingual alignment | Map regional craft names across Indian languages and English. |
| Named entity recognition | Identify place names, craft clusters, textile types, motif names, and brand names from text. |
| Question answering | Build textile knowledge assistants that answer questions from multilingual documents. |
| Sentiment analysis | Analyze customer reviews written in mixed Indian languages and English. |

For saree provenance research, MuRIL can support the text side of a multimodal system. Image models can analyze motifs and fabric appearance, while MuRIL can process product descriptions, craft documentation, GI descriptions, artisan narratives, and customer queries.

14. Limitations and Future Scope

MuRIL is a strong contribution, but it has practical boundaries. The model supports 16 Indian languages plus English, not all Indian languages and dialects, and India’s linguistic diversity is far larger than that set.

Another limitation is that the paper focuses on language understanding benchmarks. It does not directly test domain-specific use cases such as textiles, legal documents, medical records, education, or e-commerce product search.

For textile research, MuRIL would need to be further fine-tuned or combined with textile-specific vocabulary, saree descriptions, catalog data, craft cluster knowledge, and regional terminology.

| Limitation | Suggested Future Direction |
| --- | --- |
| Limited language coverage | Extend to more Indian languages, dialects, and scripts. |
| Benchmark-focused evaluation | Evaluate on domain-specific tasks such as e-commerce, crafts, healthcare, law, or education. |
| Text-only model | Combine with image models for multimodal Indian-language applications. |
| General vocabulary | Fine-tune on textile, saree, craft, and cultural heritage corpora. |
| Transliteration variability | Handle multiple informal spellings of the same Indian-language word. |

15. Simple Summary

MuRIL is a multilingual language model created specifically for Indian languages. It addresses a major weakness of general multilingual models: Indian languages are often underrepresented in their training data and vocabulary.

The model is trained on monolingual, translated, and transliterated data. It supports 16 Indian languages plus English and uses both Masked Language Modeling and Translation Language Modeling. It has a vocabulary of 197,285 tokens, 236 million parameters, and is trained on approximately 16 billion unique tokens.

Compared with mBERT, MuRIL performs better on Indian-language XTREME tasks. The improvement is especially large on transliterated Indian-language text, where users write Indian languages using Latin script. This makes MuRIL highly useful for real-world Indian digital language applications.

For saree and textile research, MuRIL can help process multilingual product descriptions, customer reviews, craft documentation, regional terminology, and transliterated search queries. It can become the language component of a larger multimodal saree understanding system.

16. General Disclaimer

This article is an educational explanation of the research paper “MuRIL: Multilingual Representations for Indian Languages.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Readers interested in implementation details, model usage, full per-language results, and exact training configuration should refer to the original paper and the released MuRIL resources.
