Saturday, 6 June 2026

Understanding the Paper: Drishtikon

DRISHTIKON: A Multimodal Multilingual Benchmark for Indian Cultural Understanding

The paper “DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture” introduces a new benchmark for evaluating whether modern vision-language models can understand Indian culture through both images and text. The word Drishtikon means “perspective” or “point of view,” which is appropriate because the benchmark tests how AI systems perceive and reason about Indian cultural contexts.

The paper argues that many large language models and vision-language models perform well on general tasks but often struggle with culturally specific knowledge. This is especially important in India, where culture is expressed through many languages, scripts, regions, clothing traditions, cuisines, festivals, rituals, monuments, art forms, and local practices.

1. Problem Addressed by the Paper

The central problem addressed by the paper is that current AI systems are not always culturally aware. They may recognize objects in images, answer common questions, or translate text, but they may fail when the task requires understanding Indian cultural context.

For example, an AI model may recognize that an image contains a dance costume, a food item, or a monument, but it may not know the regional, ritual, historical, or cultural significance of that image. Similarly, it may perform better in English or Hindi but struggle with lower-resource Indian languages such as Sindhi, Konkani, Assamese, or Odia.

Core problem: Existing multimodal benchmarks do not adequately test whether AI models understand India’s cultural diversity across languages, regions, images, and reasoning tasks.

2. Why DRISHTIKON Was Needed

Existing benchmarks often test general visual understanding, multilingual reasoning, or global cultural knowledge. However, the paper argues that these benchmarks do not give enough fine-grained attention to India’s cultural complexity.

India has enormous cultural diversity across its states and union territories. Cultural knowledge is not only about national-level symbols. It includes regional festivals, folk traditions, food practices, attire, religious rituals, architecture, performing arts, historical personalities, and local heritage.

The authors therefore create a benchmark that brings together three dimensions:

  • Multimodal understanding: the model must interpret both image and text.
  • Multilingual understanding: the model must answer questions posed in multiple Indian languages.
  • Cultural reasoning: the model must understand region-specific Indian cultural context.

3. Main Contribution of the Paper

The paper’s main contribution is the creation of DRISHTIKON, a multimodal and multilingual benchmark centered on Indian culture. It contains image-question pairs translated across multiple Indian languages and designed to test both factual and reasoning-based cultural understanding.

| Aspect | DRISHTIKON Contribution |
| --- | --- |
| Coverage | All 28 Indian states and 8 union territories. |
| Languages | 15 languages including English and 14 Indian languages. |
| Dataset size | 64,288 question-image-language triples. |
| Cultural themes | Festivals, attire, cuisine, folk arts, rituals, heritage, tourism, personalities, and more. |
| Question format | Multiple-choice questions with one correct answer and three distractors. |
| Reasoning types | General, commonsense cultural, multi-hop reasoning, and analogy questions. |
| Evaluation target | Vision-language models, including open-source, proprietary, reasoning-specialized, and Indic-aligned models. |

4. Dataset Construction Pipeline

The paper presents a clear dataset creation pipeline. According to the workflow diagram in the paper, the process begins with knowledge curation and MCQ generation, moves through cultural categorization and tagging, adds reasoning-based augmentation, translates the data into Indian languages, and finally assembles the benchmark.

The pipeline can be represented as:

\[ \text{Knowledge Curation} \rightarrow \text{MCQ Generation} \rightarrow \text{Cultural Tagging} \rightarrow \text{Reasoning Augmentation} \rightarrow \text{Multilingual Translation} \rightarrow \text{Final Dataset} \]

This pipeline is important because cultural benchmarking cannot be done by simply collecting random images. The questions must be culturally meaningful, regionally balanced, linguistically accurate, and visually grounded.

5. Knowledge Curation and MCQ Generation

The authors curated cultural knowledge from sources such as national repositories, state tourism portals, academic collections, and curated crowdsourced platforms. The content covers areas such as festivals, attire, cuisine, folk traditions, monuments, personalities, and other cultural markers.

The authors first created 2,126 English multiple-choice questions. Each question has one correct answer and three distractors. The distractors are not random. They are designed to test whether the model can resist plausible but incorrect options.

A typical MCQ includes:

  • one correct answer,
  • one semantically close distractor,
  • one option reflecting a common misconception, and
  • one unrelated but superficially similar option.

This makes the questions harder than simple recognition questions. A model cannot answer reliably only by detecting a broad object or keyword; it must understand the cultural association.

Important design choice: The authors use MCQs because they allow consistent scoring across many models and languages. Since each question has four options, random guessing has a chance level of \(25\%\).
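The chance level can be checked with a short simulation. This is a minimal sketch, assuming a guesser that ignores the question entirely; the function name and trial count are illustrative choices, not something taken from the paper.

```python
import random

NUM_OPTIONS = 4  # one correct answer plus three distractors

def simulate_random_guessing(num_questions=20_000, seed=0):
    """Estimate accuracy when every answer is picked uniformly at random."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_questions):
        answer = rng.randrange(NUM_OPTIONS)  # position of the correct option
        guess = rng.randrange(NUM_OPTIONS)   # a guess that ignores the question
        correct += guess == answer
    return correct / num_questions

chance_level = 1 / NUM_OPTIONS
simulated = simulate_random_guessing()
print(f"chance level: {chance_level:.2%}, simulated: {simulated:.2%}")
```

Any model accuracy should be read against this baseline: a score near 25% means the model is doing no better than guessing.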

6. Cultural Categorization and Attribute Tagging

Each question-image pair is tagged with one or more cultural attributes. These tags allow performance to be analyzed by cultural category. For example, researchers can check whether models perform better on cuisine than on rituals, or better on tourism than on folk arts.

The paper’s attribute chart shows the distribution of questions across cultural aspects. The largest category is Cultural Common Sense, followed by History, Rituals and Ceremonies, Tourism, Language, Dance and Music, and other themes.

| Cultural Attribute | Approximate Question Count Reported |
| --- | --- |
| Art | 3450 |
| Costume | 2280 |
| Cuisine | 4335 |
| Cultural Common Sense | 14085 |
| Dance and Music | 4455 |
| Festivals | 4153 |
| History | 11055 |
| Language | 4545 |
| Medicine | 195 |
| Nightlife | 30 |
| Personalities | 1110 |
| Religion | 1170 |
| Rituals and Ceremonies | 7005 |
| Sports | 270 |
| Tourism | 5745 |
| Transport | 405 |

This attribute tagging is one of the strengths of the benchmark because it allows fine-grained diagnosis of model weaknesses.
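As a quick illustration of that kind of diagnosis, the counts in the attribute table can be tallied and ranked. The sketch below only uses the numbers listed above; the variable names are illustrative. The counts add up to 64,288, the same figure given for the dataset size, which suggests the chart counts each triple once, although this article does not state that explicitly.

```python
# Counts copied from the attribute table above.
attribute_counts = {
    "Art": 3450, "Costume": 2280, "Cuisine": 4335,
    "Cultural Common Sense": 14085, "Dance and Music": 4455,
    "Festivals": 4153, "History": 11055, "Language": 4545,
    "Medicine": 195, "Nightlife": 30, "Personalities": 1110,
    "Religion": 1170, "Rituals and Ceremonies": 7005,
    "Sports": 270, "Tourism": 5745, "Transport": 405,
}

total = sum(attribute_counts.values())
ranked = sorted(attribute_counts.items(), key=lambda kv: kv[1], reverse=True)

# Print the five largest categories with their share of all tagged items.
for name, count in ranked[:5]:
    print(f"{name:<24} {count:>6} {count / total:6.1%}")
```

The long tail matters for interpretation: categories such as Nightlife or Medicine have so few items that per-category accuracy on them will be noisy.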


7. Reasoning-Based Question Augmentation

The authors did not stop at factual questions. They selected a balanced subset of 720 questions, approximately 20 per region, and converted them into deeper reasoning questions.

This produced 2,160 additional MCQs across three reasoning categories:

| Reasoning Category | What It Tests | Example Type |
| --- | --- | --- |
| Common Sense Cultural | Everyday cultural inference. | Matching attire, food, festival, or social practice with cultural context. |
| Multi-hop Reasoning | Linking multiple cultural facts. | Connecting a dance form to a festival and then to a state. |
| Analogy | Pattern matching across cultural examples. | Relating one state’s art form to another state’s equivalent cultural pattern. |

This reasoning augmentation makes DRISHTIKON more than a visual recognition dataset. It becomes a test of cultural inference.

8. Multilingual Translation and Scale-up

To make the benchmark multilingual, the authors translated the questions into 14 Indian languages: Hindi, Bengali, Tamil, Telugu, Marathi, Kannada, Malayalam, Gujarati, Punjabi, Odia, Assamese, Urdu, Konkani, and Sindhi.

Together with English, this gives:

\[ 15 \text{ languages} \]

The full dataset contains:

\[ 64,288 \text{ question-image-language triples} \]
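The scale-up can be retraced with simple arithmetic from the figures quoted in this article. The sketch below is only a back-of-the-envelope check; the constant names are illustrative. Note that the naive product comes out two above the reported 64,288. The article does not explain the gap, so the naive figure should be treated as an approximation.

```python
STATES, UNION_TERRITORIES = 28, 8
BASE_MCQS = 2_126          # English questions written first
SUBSET_PER_REGION = 20     # "approximately 20 per region" in the text
REASONING_CATEGORIES = 3   # commonsense cultural, multi-hop, analogy
LANGUAGES = 15             # English plus 14 Indian languages
REPORTED_TOTAL = 64_288

regions = STATES + UNION_TERRITORIES          # 36 regions
subset = regions * SUBSET_PER_REGION          # 720 selected questions
augmented = subset * REASONING_CATEGORIES     # 2,160 reasoning MCQs
per_language = BASE_MCQS + augmented          # 4,286 MCQs per language
naive_total = per_language * LANGUAGES        # 64,290 triples

print(naive_total, naive_total - REPORTED_TOTAL)
```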

The authors used Gemini Pro for translation and then applied a two-stage human verification protocol on stratified samples to check meaning preservation, fluency, and cultural relevance.

For culturally specific terms that do not have direct equivalents in another language, the authors used transliteration or context-sensitive phrasing. This is important because Indian cultural words often cannot be translated literally without losing meaning.

9. Models Evaluated

The paper evaluates many types of vision-language models. This broad evaluation makes the benchmark useful because it compares small models, large models, proprietary systems, reasoning-specialized systems, and Indic-focused systems.

| Model Category | Examples Evaluated | Purpose of Inclusion |
| --- | --- | --- |
| Small open-source VLMs | SmolVLM-256M-Instruct, InternVL3-1B | Test whether compact models can perform well on cultural tasks. |
| Large open-source VLMs | Janus-Pro-7B, Qwen2-VL-7B-Instruct, LLaVA-1.6-Mistral-7B, InternVL3-14B, Gemma-3-27B-IT, Qwen2.5-Omni-7B | Test whether larger scale improves cultural reasoning. |
| Proprietary VLMs | GPT-4o-mini | Compare against a strong commercial model. |
| Reasoning-specialized VLMs | Kimi-VL-A3B-Thinking | Test whether reasoning-focused models handle cultural questions better. |
| Indic-aligned models | Chitrarth, Maya | Evaluate models designed with Indian or multilingual contexts in mind. |

Accuracy is used as the primary evaluation metric:

\[ \text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \]

10. Major Results and Findings

The paper reports several important findings. First, model size alone does not guarantee better cultural understanding. Some compact instruction-tuned models perform surprisingly well, while some larger models show unstable results.

Second, proprietary models such as GPT-4o-mini perform strongly across languages and question types. This suggests that broad instruction tuning and strong multimodal alignment help in cultural tasks.

Third, Maya, one of the Indic-aligned models evaluated, performs competitively, which points to the value of regionally focused AI development.

Fourth, model performance varies significantly by language. English, Hindi, Bengali, and Marathi tend to be easier for models, while Sindhi, Konkani, Kannada, Assamese, and Odia show more difficulty in several cases. This reflects the digital-resource imbalance across Indian languages.

| Research Question | Main Finding |
| --- | --- |
| Does model scale predict performance? | No. Larger models are often strong, but smaller well-aligned models can outperform bigger models on cultural tasks. |
| Do models perform equally across languages? | No. High-resource languages generally perform better than low-resource Indian languages. |
| Which question types are hardest? | Multi-hop reasoning and analogy questions are harder than general and commonsense cultural questions. |
| Do Indic-focused models help? | Some Indic-focused models, especially Maya, show strong promise, but not all Indic-aligned models perform equally well. |
| Does Chain-of-Thought help? | Yes, especially for reasoning-heavy questions, but gains vary across model types and languages. |

Language-Level Performance Pattern

The paper’s language-wise chart shows that overall average accuracy is highest for Gujarati, Hindi, and English among the listed languages, while Kannada and Sindhi appear among the most difficult. This does not mean those cultures are inherently harder. It means current models likely have less reliable exposure, training data, or alignment for those language-cultural combinations.

Regional Performance Pattern

The radar plots show uneven state-wise performance. Regions with stronger media visibility or more widely represented cultural signatures, such as Kerala, Gujarat, and West Bengal, tend to show more consistent performance. Smaller or less-represented regions such as Lakshadweep, Mizoram, and Dadra and Nagar Haveli show weaker results.

11. Zero-Shot vs Chain-of-Thought Prompting

The paper compares zero-shot prompting with Chain-of-Thought prompting. In zero-shot prompting, the model answers directly without being given examples. In Chain-of-Thought prompting, the model is encouraged to reason step by step before selecting the answer.

Chain-of-Thought prompting can be written conceptually as:

\[ \text{Image} + \text{Question} + \text{Options} \rightarrow \text{Reasoning Steps} \rightarrow \text{Answer} \]

The paper finds that Chain-of-Thought prompting helps most in reasoning-intensive categories such as multi-hop and analogy questions, with reported gains of up to roughly \(10\%\) to \(15\%\) in some settings. However, the improvement is not uniform across all models and languages.

Important insight: Chain-of-Thought helps cultural reasoning, but it does not fully solve the problem of low-resource language gaps or culturally specific visual understanding.
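To make the difference between the two prompting styles concrete, here is a minimal sketch of how the text part of each prompt might be assembled and how a final answer could be parsed. The prompt wording, the function names, and the `Answer: <letter>` convention are all hypothetical; the paper's actual prompts are not reproduced in this article, and the image would be passed to the model separately.

```python
import re

OPTION_LABELS = "ABCD"

def format_options(options):
    """Render the four options as lettered lines."""
    return "\n".join(f"{label}. {text}" for label, text in zip(OPTION_LABELS, options))

def zero_shot_prompt(question, options):
    # Direct answering: no examples and no request for reasoning.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option only."
    )

def cot_prompt(question, options):
    # Chain-of-Thought: ask for reasoning steps before the final choice.
    return (
        f"{question}\n{format_options(options)}\n"
        "Think step by step about the cultural context shown in the image, "
        "then finish with a line of the form 'Answer: <letter>'."
    )

def extract_answer(response):
    """Return the last 'Answer: X' letter in a response, or None if absent."""
    matches = re.findall(r"Answer:\s*([ABCD])\b", response)
    return matches[-1] if matches else None
```

Parsing the last matching line, rather than the first, is a deliberate choice here: a reasoning trace may mention candidate letters before settling on one.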

12. Technical View of the Benchmark

From a machine-learning perspective, DRISHTIKON can be understood as a multimodal multiple-choice evaluation dataset.

Each instance can be represented as:

\[ D_i = (I_i, Q_i^{(l)}, O_i, y_i, A_i, R_i, T_i) \]

where:

  • \(I_i\) is the image,
  • \(Q_i^{(l)}\) is the question in language \(l\),
  • \(O_i = \{o_1,o_2,o_3,o_4\}\) is the set of answer options,
  • \(y_i\) is the correct option,
  • \(A_i\) is the cultural attribute tag,
  • \(R_i\) is the region or state/UT tag, and
  • \(T_i\) is the question type.
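The tuple above maps naturally onto a small record type. The following is a sketch only: the field names and the sample values are invented for illustration and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    image: str       # I_i: path or URL of the image
    question: str    # Q_i^(l): question text in language l
    language: str    # l: language code, for example "hi"
    options: tuple   # O_i: the four answer options
    answer: int      # y_i: index of the correct option (0 to 3)
    attribute: str   # A_i: cultural attribute tag
    region: str      # R_i: state or union territory
    qtype: str       # T_i: question type

# Invented placeholder values, not taken from the dataset.
item = Instance(
    image="images/example.jpg",
    question="Which festival is shown in the image?",
    language="en",
    options=("Option A", "Option B", "Option C", "Option D"),
    answer=1,
    attribute="Festivals",
    region="Assam",
    qtype="general",
)
print(item.options[item.answer])
```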

A vision-language model must estimate:

\[ \hat{y}_i = \arg\max_{o_j \in O_i} P(o_j \mid I_i, Q_i^{(l)}) \]
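In code, the arg max is a one-liner once each option has a score. The sketch below assumes the model exposes one raw score per option (for example a log-likelihood); that is an assumption for illustration, since many evaluations instead parse a generated option letter. The scores shown are made up.

```python
import math

def softmax(scores):
    """Convert raw option scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(option_scores):
    """Return (index of the most probable option, probabilities)."""
    probs = softmax(option_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical scores a model might assign to options o_1..o_4.
index, probs = predict([1.2, 3.4, 0.3, 2.9])
print(index, [round(p, 3) for p in probs])
```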

The final accuracy is:

\[ \text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i) \]

This formulation shows why DRISHTIKON is useful. It allows accuracy to be sliced by language, region, cultural theme, model type, and question type.
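A minimal sketch of such slicing is shown below. The records are toy values invented to demonstrate the grouping and are not results from the paper; the key names are illustrative.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Group records by `key` and return the accuracy within each group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["predicted"] == r["answer"]  # indicator 1(y_hat = y)
    return {group: hits[group] / totals[group] for group in totals}

# Toy records; values are made up to show the slicing, not real results.
records = [
    {"language": "en", "qtype": "general",   "answer": 1, "predicted": 1},
    {"language": "en", "qtype": "multi-hop", "answer": 0, "predicted": 2},
    {"language": "hi", "qtype": "general",   "answer": 3, "predicted": 3},
    {"language": "hi", "qtype": "analogy",   "answer": 2, "predicted": 2},
    {"language": "sd", "qtype": "general",   "answer": 0, "predicted": 1},
]

overall = sum(r["predicted"] == r["answer"] for r in records) / len(records)
by_language = accuracy_by(records, "language")
by_qtype = accuracy_by(records, "qtype")
print(overall, by_language, by_qtype)
```

The same helper works for any metadata field, so region-wise or attribute-wise breakdowns need no extra code.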

13. Relevance for Saree, Textile, and Cultural Heritage Research

This paper is highly relevant for saree and textile research because sarees are not only visual products; they are cultural objects. A saree’s meaning may depend on region, weaving cluster, ritual context, community use, motif symbolism, language, and heritage association.

For example, a model trained only on product images may identify color or pattern, but it may not understand why a Kanjivaram saree, Paithani saree, Mekhela Chador, Bandhani, Patola, Kasavu, Banarasi brocade, or Baluchari design has specific cultural meaning.

| DRISHTIKON Concept | Possible Saree / Textile Research Use |
| --- | --- |
| Multimodal benchmarking | Evaluate models using both saree images and textile descriptions. |
| Multilingual questions | Test saree knowledge in Hindi, Telugu, Tamil, Kannada, Bengali, Gujarati, Marathi, Malayalam, and other languages. |
| Cultural attribute tags | Create textile categories such as weave, motif, region, ritual use, pallu, border, and craft cluster. |
| State-wise coverage | Build region-wise saree provenance datasets across Indian weaving clusters. |
| Reasoning-based questions | Ask deeper questions such as why a motif, border, or drape style belongs to a particular tradition. |
| Chain-of-Thought evaluation | Check whether models can explain textile classification rather than only predict a label. |

For a saree provenance classification project, DRISHTIKON suggests an important direction: evaluation should not be limited to image classification accuracy. A stronger benchmark could ask whether the model understands the relationship between image features, regional craft identity, local terminology, and cultural meaning.

14. Limitations and Future Scope

The paper is ambitious and important, but it also acknowledges limitations. India’s cultural diversity is extremely large, so even a benchmark covering 15 languages and all states and union territories cannot capture every dialect, local practice, community tradition, or regional nuance.

Another limitation is that the dataset uses curated image-text pairs. This allows controlled evaluation, but real-world cultural understanding is often messier. Images may be ambiguous, mixed, poorly labeled, or used in changing social contexts.

The paper also shows that many models still struggle with abstract analogy and multi-hop reasoning. This suggests that cultural AI needs better reasoning frameworks, better multilingual representation, and more balanced regional data.

| Limitation | Possible Future Direction |
| --- | --- |
| Incomplete cultural coverage | Expand to more dialects, local practices, oral traditions, and community-specific knowledge. |
| Curated image-text setting | Test on real-world images, social media, e-commerce listings, and archival materials. |
| MCQ-only format | Add open-ended answering and explanation-based evaluation. |
| Language imbalance | Create more data for low-resource Indian languages. |
| Reasoning weakness | Develop culturally grounded reasoning datasets and fine-tuning methods. |
| Image URL dependence | Ensure long-term accessibility and licensing clarity for cultural image resources. |

15. Simple Summary

DRISHTIKON is a multimodal and multilingual benchmark created to test whether AI models understand Indian culture. It contains culturally grounded image-question pairs across 15 languages and all Indian states and union territories.

The dataset begins with 2,126 English MCQs, adds 2,160 reasoning-augmented MCQs, translates them into 14 Indian languages, and produces 64,288 question-image-language triples. Each item includes an image, a question, four answer options, one correct answer, and metadata such as cultural attribute, region, language, and question type.

The paper evaluates many vision-language models and finds that current models still have major gaps. GPT-4o-mini performs strongly, compact models such as SmolVLM and InternVL3-1B are surprisingly competitive, and the Indian-origin Maya model shows promise. However, performance remains uneven across languages, regions, and reasoning types.

For saree and textile research, the paper is important because it shows how cultural understanding can be benchmarked in a multimodal way. A future saree AI system should not only identify images but also understand regional identity, textile terminology, craft heritage, and cultural context.

16. General Disclaimer

This article is an educational explanation of the research paper “DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Cultural interpretation should always be treated with care, and AI-based cultural understanding should support, not replace, community knowledge, expert scholarship, and lived cultural experience.


Understanding the Paper: IndicNLPSuite

IndicNLPSuite: Corpora, Benchmarks, and Language Models for Indian Languages

The paper “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages” presents a comprehensive set of resources for Natural Language Processing in Indian languages. The work addresses a major gap in Indian-language AI: the lack of large monolingual corpora, reliable evaluation benchmarks, and pre-trained language models designed specifically for Indic languages.

The authors introduce four major resources: IndicCorp, IndicFT, IndicBERT, and IndicGLUE. Together, these resources provide data, embeddings, language models, and benchmarks for 11 major Indian languages plus Indian English.

1. Problem Addressed by the Paper

Indian languages are spoken by more than a billion people, yet NLP resources for these languages have historically been limited. The paper points out that Indic languages include several of the most widely spoken languages in the world, but large publicly available monolingual corpora and systematic benchmarks have been missing.

This lack of resources creates two major problems. First, it becomes difficult to train high-quality word embeddings and language models. Second, it becomes difficult to evaluate whether new models are actually improving Indian-language understanding across different tasks.

Core problem: Indian languages need large corpora, pre-trained models, and evaluation benchmarks so that NLP research can progress beyond isolated datasets and small experiments.

2. Why Indic NLP Resources Matter

Indian languages are morphologically rich, script-diverse, and structurally different from English. Many Indian languages follow Subject-Object-Verb word order and contain rich inflectional forms. This means that English-centric NLP tools cannot simply be copied and expected to work well.

Another important issue is language diversity. The paper focuses on 11 major Indian languages from Indo-Aryan and Dravidian language families, along with Indian English. These languages include Punjabi, Hindi, Bengali, Odia, Assamese, Gujarati, Marathi, Kannada, Telugu, Malayalam, and Tamil.

| Challenge | Why It Matters for Indian NLP |
| --- | --- |
| Morphological richness | Words appear in many forms, so models need subword-aware representations. |
| Multiple scripts | Different languages use different scripts, increasing vocabulary complexity. |
| Low resource availability | Many Indian languages lack large public corpora and task datasets. |
| Evaluation gap | Without benchmarks, it is difficult to compare models systematically. |
| Cross-lingual transfer | Models should use relatedness among Indian languages to improve performance. |

3. What is IndicNLPSuite?

IndicNLPSuite is a collection of NLP resources for Indian languages. It includes corpora, embeddings, language models, and evaluation benchmarks.

| Resource | Full Form / Meaning | Purpose |
| --- | --- | --- |
| IndicCorp | Large monolingual corpora | Provides training data for Indian-language models. |
| IndicFT | FastText-based word embeddings | Provides word-level and subword-aware representations. |
| IndicBERT | ALBERT-based multilingual language model | Provides contextual language representations for Indic NLP tasks. |
| IndicGLUE | Indic General Language Understanding Evaluation benchmark | Provides evaluation tasks for Indian-language NLU. |

4. IndicCorp: Large Monolingual Corpora

IndicCorp is a large sentence-level monolingual corpus for 11 Indian languages and Indian English. The paper reports a total of approximately 8.8 billion tokens across these languages. The corpus is primarily sourced from news crawls and supplemented with OSCAR Common Crawl data.

The dataset is designed to reflect contemporary Indian-language use across news articles, magazines, and blog posts. The authors emphasize that their corpus is significantly larger than many existing resources for Indian languages.

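| Language | Sentences (millions) | Tokens (millions) | Types (millions) | IndicCorp / OSCAR Ratio |
| --- | --- | --- | --- | --- |
| Punjabi | 29.2 | 773 | 3.0 | 22 |
| Hindi | 63.1 | 1860 | 6.5 | 2 |
| Bengali | 39.9 | 836 | 6.6 | 2 |
| Odia | 6.94 | 107 | 1.4 | 9 |
| Assamese | 1.39 | 32.6 | 0.8 | 8 |
| Gujarati | 41.1 | 719 | 5.7 | 14 |
| Marathi | 34.0 | 551 | 5.8 | 7 |
| Kannada | 53.3 | 713 | 11.9 | 14 |
| Telugu | 47.9 | 674 | 9.4 | 8 |
| Malayalam | 50.2 | 721 | 17.7 | 8 |
| Tamil | 31.5 | 582 | 11.4 | 2 |
| Indian English | 54.3 | 1220 | 4.5 | - |
| Total | 452.8 | 8789 | 84.7 | - |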
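
The per-language figures can be cross-checked against the Total row. The sketch below reads each row of the table as sentences, tokens, and types (all in millions) and sums the columns; the variable names are illustrative. The sums round to the reported totals, and the token total of 8,789 million matches the "approximately 8.8 billion tokens" quoted earlier.

```python
# (sentences, tokens, types) in millions, read from the table above.
indiccorp = {
    "Punjabi": (29.2, 773, 3.0),
    "Hindi": (63.1, 1860, 6.5),
    "Bengali": (39.9, 836, 6.6),
    "Odia": (6.94, 107, 1.4),
    "Assamese": (1.39, 32.6, 0.8),
    "Gujarati": (41.1, 719, 5.7),
    "Marathi": (34.0, 551, 5.8),
    "Kannada": (53.3, 713, 11.9),
    "Telugu": (47.9, 674, 9.4),
    "Malayalam": (50.2, 721, 17.7),
    "Tamil": (31.5, 582, 11.4),
    "Indian English": (54.3, 1220, 4.5),
}

sentences = round(sum(row[0] for row in indiccorp.values()), 1)
tokens = round(sum(row[1] for row in indiccorp.values()))
types = round(sum(row[2] for row in indiccorp.values()), 1)

# Share of all tokens contributed by the largest single language.
hindi_share = indiccorp["Hindi"][1] / sum(row[1] for row in indiccorp.values())
print(sentences, tokens, types, f"{hindi_share:.1%}")
```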

5. Corpus Creation and Text Processing

The authors collected data mainly from Indian-language news websites. They used automated article extraction tools such as BoilerPipe and also wrote custom extractors using BeautifulSoup where needed.

After extraction, the text was cleaned and processed. The paper mentions important processing steps such as Unicode canonicalization, sentence splitting, tokenization, de-duplication, and sentence shuffling.

For de-duplication, a hashing approach is used. Conceptually, this can be understood as:

\[ \text{Sentence} \rightarrow \text{Hash Value} \rightarrow \text{Remove Duplicate Hashes} \]

This helps avoid repeated sentences from distorting corpus statistics and model training.
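A minimal sketch of hash-based de-duplication follows. It is not the authors' code: NFC normalization stands in for the Unicode canonicalization step, SHA-1 from the standard library stands in for whatever hash function the paper uses (this article does not name it), and the sample sentences are invented.

```python
import hashlib
import unicodedata

def canonicalize(sentence):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())

def deduplicate(sentences):
    """Keep the first occurrence of each sentence, comparing by hash."""
    seen = set()
    unique = []
    for s in sentences:
        canon = canonicalize(s)
        digest = hashlib.sha1(canon.encode("utf-8")).hexdigest()
        if digest not in seen:  # a new hash means a new sentence
            seen.add(digest)
            unique.append(canon)
    return unique

corpus = [
    "यह एक वाक्य है।",
    "यह  एक   वाक्य है।",  # same sentence with extra whitespace
    "এটি একটি বাক্য।",
]
unique = deduplicate(corpus)
print(len(corpus), "->", len(unique))
```

Storing fixed-size digests instead of full sentences keeps the memory footprint predictable, which matters when the corpus runs to hundreds of millions of sentences.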

6. IndicGLUE: Indian Language Understanding Benchmark

IndicGLUE is an evaluation benchmark for Indian-language natural language understanding. It includes both existing datasets and new datasets created by the authors.

The benchmark includes tasks such as:

  • news category classification,
  • headline prediction,
  • Wikipedia section-title prediction,
  • cloze-style multiple-choice question answering,
  • named entity recognition,
  • cross-lingual sentence retrieval,
  • Winograd natural language inference,
  • COPA commonsense reasoning,
  • paraphrase detection,
  • discourse mode classification, and
  • sentiment analysis.

| IndicGLUE Task | What the Model Must Do | Why It Matters |
| --- | --- | --- |
| News Category Classification | Predict article category such as sports, politics, business, or entertainment. | Tests topic understanding. |
| Headline Prediction | Select the correct headline for a news article. | Tests article-level comprehension. |
| Wikipedia Section-title Prediction | Select the correct section title from candidates. | Tests summarization-like understanding. |
| Cloze-style QA | Predict a masked entity from multiple choices. | Tests knowledge and context use. |
| NER | Identify people, organizations, and locations. | Useful for information extraction. |
| Cross-lingual Sentence Retrieval | Retrieve the translation of an English sentence in an Indian language. | Tests multilingual alignment. |

7. IndicFT: FastText Word Embeddings

IndicFT refers to FastText word embeddings trained on IndicCorp. The authors choose FastText because Indian languages are morphologically rich. FastText represents words using character n-grams, which helps it handle word forms and rare words better than purely word-level methods.

A word representation in FastText can be understood as a combination of subword representations:

\[ v(w) = \sum_{g \in G_w} z_g \]

Here, \(v(w)\) is the vector for word \(w\), \(G_w\) is the set of character n-grams in the word, and \(z_g\) is the vector for each n-gram.

This is important for Indian languages because suffixes, inflections, and compound forms can create many surface forms of the same root word.
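The subword sum can be sketched in a few lines. This is a toy illustration, not the FastText implementation: the n-gram vectors here are deterministic pseudo-random stand-ins for learned embeddings, the 3 to 6 character range follows FastText's default settings, and real FastText additionally hashes n-grams into buckets and includes a vector for the whole word.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word`, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def ngram_vector(gram, dim=8):
    # Toy stand-in for a learned n-gram embedding z_g.
    rng = random.Random(gram)  # seeded by the n-gram, so it is repeatable
    return [rng.uniform(-1, 1) for _ in range(dim)]

def word_vector(word, dim=8):
    """v(w): the sum of z_g over the distinct n-grams g in G_w."""
    vec = [0.0] * dim
    for gram in sorted(set(char_ngrams(word))):
        for k, value in enumerate(ngram_vector(gram, dim)):
            vec[k] += value
    return vec

# Two inflected forms of the same Hindi root share several n-grams.
shared = set(char_ngrams("साड़ी")) & set(char_ngrams("साड़ियाँ"))
print(sorted(shared))
```

Because related surface forms share n-grams, their vectors share terms in the sum, which is why rare or unseen inflections still receive a sensible representation.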

The paper reports that IndicFT generally outperforms baseline FastText embeddings trained on Wikipedia or Wikipedia plus Common Crawl across several tasks, including text classification and bilingual lexicon induction.

8. IndicBERT: Multilingual Language Model

IndicBERT is a multilingual language model trained on IndicCorp. It is based on the ALBERT architecture, which is a compact variant of BERT. The authors choose ALBERT because it has fewer parameters and is easier to distribute and use in downstream applications.

IndicBERT is trained using the standard Masked Language Modeling objective. In this objective, some tokens are masked and the model learns to predict them using context.

The idea can be represented as:

\[ P(x_m \mid x_{\setminus m}) \]

Here, \(x_m\) is the masked token and \(x_{\setminus m}\) represents the remaining context. The model learns to predict the missing token from the surrounding sentence.
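The data side of this objective can be sketched as follows. The 15% masking rate is the common BERT default and is assumed here rather than taken from the paper; the sketch also omits the usual refinement where some selected tokens are replaced by random tokens or left unchanged. The function and variable names are illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels keep the original token
    at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # x_m: the token the model must recover
        else:
            masked.append(tok)
            labels.append(None)  # context position, not scored
    return masked, labels

tokens = "कांजीवरम साड़ी तमिलनाडु में बुनी जाती है".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During training, the loss is computed only at positions where the label is not `None`, so the model is scored on predicting \(x_m\) from the unmasked context.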

The paper trains both IndicBERT base and IndicBERT large. The model uses a SentencePiece tokenizer with a vocabulary size of \(200{,}000\), which helps accommodate different scripts and large vocabularies of Indian languages.

| IndicBERT Feature | Description |
| --- | --- |
| Base architecture | ALBERT |
| Training corpus | IndicCorp |
| Languages | 11 Indian languages plus Indian English |
| Objective | Masked Language Modeling |
| Tokenizer | SentencePiece |
| Vocabulary size | 200,000 |
| Training steps | 400,000 steps |

9. Evaluation and Results

The paper evaluates IndicFT and IndicBERT on several tasks. The results show that IndicFT often outperforms existing FastText embeddings, and IndicBERT is competitive with or better than mBERT and XLM-R on many IndicGLUE tasks.

IndicFT Results

On text classification, IndicFT achieves an average accuracy of 75.80%, compared with 69.25% for FastText Wikipedia and 68.32% for FastText Wikipedia plus Common Crawl.

| Embedding | Average Text Classification Accuracy |
| --- | --- |
| FastText Wikipedia | 69.25% |
| FastText Wikipedia + Common Crawl | 68.32% |
| IndicFT | 75.80% |
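
The size of the improvement is easier to read as a difference in percentage points. The short calculation below uses only the three averages listed above; the variable names are illustrative.

```python
# Average text classification accuracy (%), from the table above.
accuracy = {
    "FastText Wikipedia": 69.25,
    "FastText Wikipedia + Common Crawl": 68.32,
    "IndicFT": 75.80,
}

# Gain of IndicFT over each baseline, in percentage points.
gains = {
    name: round(accuracy["IndicFT"] - value, 2)
    for name, value in accuracy.items()
    if name != "IndicFT"
}
print(gains)
```

Both gains are absolute differences between accuracies, not relative improvements.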

On the IndicGLUE News Category test set, IndicFT achieves an average accuracy of 97.52%, compared with 95.52% and 95.63% for the two FastText baselines.

IndicBERT Results

IndicBERT performs strongly across many IndicGLUE tasks. On multiple-choice tasks, IndicBERT base achieves an average of 95.46% on news article headline prediction and 41.87% on cloze-style multiple-choice QA. On public datasets, IndicBERT base achieves an average accuracy of 77.39%, compared with 74.42% for mBERT and 76.60% for XLM-R.

| Evaluation Area | Observation |
| --- | --- |
| Headline Prediction | IndicBERT large performs strongly, with average accuracy around 95.87%. |
| Article Genre Classification | IndicBERT base performs very strongly, averaging around 97.34%. |
| Public datasets | IndicBERT base outperforms mBERT and XLM-R on average. |
| Cross-lingual Sentence Retrieval | IndicBERT large achieves the strongest reported average among compared models. |
| NER | mBERT performs better than IndicBERT in this task, likely because of Wikipedia exposure during pre-training. |

10. Important Technical Ideas and Equations

Mean Word Embedding for Text Classification

For some text classification experiments, the text representation is created by averaging word embeddings:

\[ v_{\text{text}} = \frac{1}{N}\sum_{i=1}^{N} v(w_i) \]

Here, \(v_{\text{text}}\) is the text vector, \(N\) is the number of words, and \(v(w_i)\) is the embedding of the \(i\)-th word.
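A minimal sketch of this averaging is given below. The embedding table is a toy example with invented 3-dimensional vectors, and skipping words that have no embedding is an assumption made for simplicity; with FastText, out-of-vocabulary words can still receive a vector from their n-grams.

```python
def mean_embedding(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Toy 3-dimensional embeddings, invented for illustration.
toy = {
    "silk": [1.0, 0.0, 2.0],
    "saree": [3.0, 2.0, 0.0],
}
v_text = mean_embedding(["silk", "saree"], toy)
print(v_text)
```

The resulting fixed-length vector can then be passed to any standard classifier, regardless of how long the original text was.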

Cross-Entropy Loss for Classification

Many classification tasks use cross-entropy loss:

\[ L = - \sum_{i=1}^{C} y_i \log(\hat{y}_i) \]

Here, \(C\) is the number of classes, \(y_i\) is the true label indicator, and \(\hat{y}_i\) is the predicted probability for class \(i\).
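The formula translates directly into code. In the sketch below the predicted probabilities are made-up values for a three-class problem, and the small `eps` floor is an implementation detail added to avoid taking the logarithm of zero.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i) over the C classes."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# One-hot label for class 0 and hypothetical predicted probabilities.
y_true = [1, 0, 0]
confident = cross_entropy(y_true, [0.7, 0.2, 0.1])
unsure = cross_entropy(y_true, [0.4, 0.3, 0.3])
print(round(confident, 4), round(unsure, 4))
```

With a one-hot label only the true class contributes to the sum, so the loss reduces to the negative log of the probability assigned to the correct class.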

Cosine Similarity for Sentence Retrieval

For cross-lingual sentence retrieval, sentence similarity can be measured using cosine similarity:

\[ \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|} \]

Here, \(A\) and \(B\) are sentence vectors in different languages. Higher cosine similarity indicates closer semantic meaning.
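Retrieval then amounts to picking the candidate with the highest cosine similarity. The sketch below uses invented 3-dimensional vectors in place of real sentence embeddings, and the function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, candidates):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy vectors standing in for an English sentence and Indic candidates.
english = [1.0, 0.0, 1.0]
indic_candidates = [
    [0.0, 1.0, 0.0],     # orthogonal: unrelated
    [2.0, 0.1, 1.9],     # nearly parallel: likely translation
    [-1.0, 0.0, -1.0],   # opposite direction
]
best, score = retrieve(english, indic_candidates)
print(best, round(score, 4))
```

Because cosine similarity ignores vector length, the second candidate scores highest even though its magnitude differs from the query's.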

11. Relevance for Saree and Textile Research

Although this paper is about Indian-language NLP, it is very relevant for saree and textile research. Saree knowledge is deeply multilingual. Product names, craft clusters, weaving techniques, motifs, regional terms, GI descriptions, and customer reviews often appear in Hindi, Telugu, Tamil, Kannada, Bengali, Marathi, Malayalam, Gujarati, and English.

For example, the same textile idea may appear in many forms:

  • “Kanjivaram saree”
  • “Kanchi pattu”
  • “காஞ்சிபுரம் பட்டு”
  • “పట్టు చీర”
  • “Banarasi silk saree”
  • “बनारसी साड़ी”

A resource ecosystem like IndicNLPSuite can support the language side of textile AI systems. Image models can classify saree images, while Indic NLP models can understand the descriptions, search queries, catalog fields, reviews, and craft documentation around those images.

| IndicNLPSuite Resource | Possible Textile / Saree Use |
| --- | --- |
| IndicCorp | Build domain corpora from Indian-language craft articles, catalogs, blogs, and descriptions. |
| IndicFT | Represent textile terms in Indian languages using word embeddings. |
| IndicBERT | Understand multilingual saree descriptions and customer queries. |
| IndicGLUE | Inspire benchmark tasks for textile-domain language understanding. |
| Cross-lingual sentence retrieval | Retrieve equivalent craft descriptions across English and Indian languages. |
| Named Entity Recognition | Extract craft names, place names, artisan clusters, material names, and brand names. |

For a saree provenance system, this is important because provenance is not only visual. It is also linguistic, cultural, and regional. A multimodal system may need to combine image recognition with Indian-language text understanding.

12. Limitations and Future Scope

The paper makes a major contribution, but it also has some limitations. The resources cover 11 major Indian languages, not all Indian languages and dialects. India’s language diversity is much larger.

The monolingual corpus is primarily news-based. This is useful for general NLP, but domain-specific language such as textiles, crafts, legal documents, healthcare, agriculture, education, or retail may require additional fine-tuning.

IndicBERT uses a compact ALBERT architecture, which makes it practical, but future work could explore larger transformer models, better multilingual alignment, transliteration handling, and domain-specific adaptation.

Limitation | Suggested Future Direction
11-language coverage | Extend resources to more Indian languages and dialects.
News-heavy corpus | Add domain-specific corpora such as textiles, crafts, education, healthcare, and government documents.
Limited transliteration focus | Improve handling of Romanized Indian-language text and code-mixing.
Benchmark coverage | Create more complex Indian-language reasoning, QA, and domain benchmarks.
Text-only focus | Combine with image models for multimodal cultural heritage systems.

13. Simple Summary

This paper introduces IndicNLPSuite, a major NLP resource collection for Indian languages. It includes IndicCorp, a large monolingual corpus of about 8.8 billion tokens; IndicFT, FastText word embeddings trained on this corpus; IndicBERT, an ALBERT-based multilingual language model; and IndicGLUE, an Indian-language understanding benchmark.

The central idea is that Indian languages need dedicated resources because they are linguistically rich, script-diverse, and underrepresented in many general multilingual models. The paper shows that embeddings and models trained on IndicCorp perform competitively or better than existing multilingual baselines on many tasks.

For saree and textile research, this paper is valuable because it shows how Indian-language NLP can support multilingual textile search, craft documentation, product cataloging, customer review analysis, and multimodal saree provenance systems.

14. General Disclaimer

This article is an educational explanation of the research paper “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Readers interested in exact datasets, model training settings, licensing, and complete benchmark results should refer to the original paper and the released IndicNLP resources.


MuRIL: Multilingual Representations for Indian Languages

The paper “MuRIL: Multilingual Representations for Indian Languages” introduces MuRIL, a multilingual language model built specifically for Indian languages. The motivation is simple but important: India is one of the most multilingual societies in the world, yet many general-purpose multilingual models do not perform well enough on Indian-language tasks.

MuRIL stands for Multilingual Representations for Indian Languages. It is designed to handle Indian-language text written in native scripts as well as transliterated text written in Latin script. This is very important in India because people often write Hindi, Telugu, Tamil, Bengali, Kannada, Marathi, Urdu, and other Indian languages using English letters in informal digital spaces such as chats, comments, and social media.

1. Problem Addressed by the Paper

India has a very large number of languages and dialects. The paper notes that India has 1369 rationalized languages and dialects, 22 scheduled languages, and 121 languages with more than 10,000 speakers. Despite this linguistic richness and India’s large digital footprint, many existing multilingual language models perform poorly on Indian languages.

A major reason is that multilingual models such as mBERT are trained on more than 100 languages at the same time. This means Indian languages receive limited representation in training data and vocabulary. As a result, the model may not learn Indian-language grammar, morphology, vocabulary, and usage patterns deeply enough.

Core problem: General multilingual models do not adequately represent Indian languages, especially low-resource languages and transliterated Indian-language text.

2. Why MuRIL Was Needed

The paper argues that Indian languages need a language model that is trained with focused attention on Indian linguistic realities. These realities include multiple scripts, uneven digital resources, code-mixing with English, and transliteration into Latin script.

For example, a Hindi sentence may appear in Devanagari script, but the same sentence may also appear as Roman Hindi in everyday typing. A general model trained mostly on native-script text may not understand the Romanized version properly.

MuRIL addresses this by training on three important forms of data:

  • monolingual Indian-language text,
  • translated Indian-language and English document pairs, and
  • transliterated native-script and Latin-script document pairs.

This makes MuRIL more suitable for Indian-language understanding than a broad multilingual model that treats Indian languages as only a small part of a very large multilingual pool.

3. Languages Supported by MuRIL

MuRIL supports 17 languages in total: 16 Indian languages and English. The supported languages are:

Language | Code | Script / Context
Assamese | as | Eastern Indo-Aryan language
Bengali | bn | Bengali script
Gujarati | gu | Gujarati script
Hindi | hi | Devanagari script
Kannada | kn | Kannada script
Kashmiri | ks | Low-resource Indian language in the dataset context
Malayalam | ml | Malayalam script
Marathi | mr | Devanagari script
Nepali | ne | Devanagari script
Oriya / Odia | or | Odia script
Punjabi | pa | Gurmukhi script
Sanskrit | sa | Classical Indian language
Sindhi | sd | Indian-language context
Tamil | ta | Tamil script
Telugu | te | Telugu script
Urdu | ur | Perso-Arabic script
English | en | Used for cross-lingual transfer and translation alignment

4. Training Data Used

The paper uses several types of data to train MuRIL. This is one of the most important strengths of the model.

Data Type | Source | Purpose
Monolingual data | Common Crawl OSCAR corpus and Wikipedia | Helps the model learn language structure, vocabulary, and usage.
Translated data | PMINDIA parallel corpus and machine-translated documents | Helps the model align Indian-language text with English.
Transliterated data | Dakshina dataset and indic-trans transliteration | Helps the model understand Indian-language text written in Latin script.

The use of translated and transliterated document pairs is especially important because it provides supervised cross-lingual signals during training.

5. Training Objectives: MLM and TLM

MuRIL is trained using two language-modeling objectives: Masked Language Modeling and Translation Language Modeling.

Masked Language Modeling

Masked Language Modeling, or MLM, is the standard BERT-style training objective. Some tokens in a sentence are masked, and the model learns to predict the missing tokens from context.

Conceptually:

\[ \text{Input: } \text{The saree is [MASK].} \]

\[ \text{Model learns to predict: beautiful, red, traditional, etc.} \]

MLM uses monolingual text and helps the model learn the structure of each language.
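The masking step can be sketched in a few lines of Python. This is a simplified illustration, not MuRIL's training code: the function name `mask_tokens`, the masking probability, and the seed are assumptions, and every selected token is replaced with `[MASK]`, whereas BERT-style training also sometimes keeps or randomly replaces the selected tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels[i] holds the original token where a mask was applied, else None.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model is trained to recover it
        else:
            masked.append(tok)
            labels.append(None)   # no prediction target at this position
    return masked, labels


tokens = "the saree is beautiful and traditional".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The training loss is computed only at positions where `labels` is not `None`, so the model learns from context rather than by copying its input.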

Translation Language Modeling

Translation Language Modeling, or TLM, uses parallel text pairs. These may be Indian-language and English pairs, or native-script and transliterated pairs. The model sees both sides together and learns cross-lingual alignment.

A simplified view is:

\[ \text{Hindi sentence} + \text{English translation} \rightarrow \text{shared contextual representation} \]

For transliteration:

\[ \text{Native script sentence} + \text{Latin transliteration} \rightarrow \text{shared contextual representation} \]

This is important because it helps the model connect meaning across scripts and languages.
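A minimal sketch of how a TLM input could be assembled is shown here. The `[CLS]` and `[SEP]` layout follows the usual BERT convention and is an assumption, as are the function name `build_tlm_input` and the hand-picked mask positions; the paper's actual preprocessing may differ.

```python
def build_tlm_input(source_tokens, target_tokens, mask_positions):
    """Concatenate a parallel pair and mask the chosen positions.

    mask_positions index into the concatenated sequence.
    """
    sequence = ["[CLS]"] + source_tokens + ["[SEP]"] + target_tokens + ["[SEP]"]
    labels = [None] * len(sequence)
    for pos in mask_positions:
        if sequence[pos] in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        labels[pos] = sequence[pos]
        sequence[pos] = "[MASK]"
    return sequence, labels


hindi = ["साड़ी", "सुंदर", "है"]
english = ["the", "saree", "is", "beautiful"]
seq, tlm_labels = build_tlm_input(hindi, english, mask_positions=[2, 6])
print(seq)
```

In this toy pair the masked Hindi word “सुंदर” can be recovered from “beautiful” on the English side, and the masked “saree” from “साड़ी” on the Hindi side. That cross-sentence dependency is what pushes the model toward aligned representations.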

6. Why Transliteration Matters

Indian digital communication often uses transliteration. For example, someone may write Hindi, Telugu, Kannada, Bengali, or Tamil words using English letters. This is common in WhatsApp messages, social media posts, search queries, comments, product reviews, and informal customer feedback.

A model that only understands native scripts may fail on such data. MuRIL explicitly includes transliterated training examples, making it better suited for real Indian digital text.

Simple intuition: MuRIL is trained not only to understand “भारत” but also forms such as “bharat.” This makes it much more useful for Indian-language digital applications.

7. Upsampling Low-Resource Languages

The training corpus has uneven representation across languages. Some languages have much more available text than others. If the model is trained directly on the raw distribution, high-resource languages dominate and low-resource languages receive less learning attention.

To address this, the authors upsample low-resource languages using the following multiplier:

\[ m_i = \left( \frac{\max_{j \in L} n_j}{n_i} \right)^{1-\alpha} \]

Here, \(m_i\) is the multiplier for language \(i\), \(n_i\) is the token count for language \(i\), \(L\) is the set of languages, and \(\alpha\) is set to \(0.3\).

The upsampled token count becomes:

\[ m_i \times n_i \]

This gives smaller languages more representation during training while still preserving the overall multilingual structure.
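The multiplier is easy to compute directly. The sketch below applies the formula with \(\alpha = 0.3\); the token counts are hypothetical round numbers chosen for illustration, not figures from the paper.

```python
def upsampling_multipliers(token_counts, alpha=0.3):
    """m_i = (max_j n_j / n_i) ** (1 - alpha) for every language i."""
    largest = max(token_counts.values())
    return {lang: (largest / n) ** (1 - alpha) for lang, n in token_counts.items()}


# Hypothetical token counts, for illustration only.
counts = {"hi": 1_000_000, "ta": 250_000, "ks": 10_000}
multipliers = upsampling_multipliers(counts)

for lang, n in counts.items():
    print(lang, round(multipliers[lang], 2), round(multipliers[lang] * n))
```

With these counts the largest language keeps a multiplier of 1, while a language with 100 times less text is upsampled roughly 25-fold. After upsampling, the smaller languages are still smaller than the largest one, so the original ordering is preserved while the gap narrows.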

8. Vocabulary and Tokenization

The paper places strong emphasis on vocabulary. MuRIL uses a cased WordPiece vocabulary learned from the upsampled pre-training data. The final vocabulary size is:

\[ 197{,}285 \]

Because this vocabulary is learned only from the 17 supported languages, it devotes far more entries to Indian scripts than mBERT's vocabulary, which is shared across more than 100 languages.

The paper uses the idea of fertility ratio, which means the average number of subwords into which a word is split. A higher fertility ratio means a word is broken into more pieces, which may weaken semantic preservation.

For example, if a language model breaks one Indian-language word into many awkward fragments, it may struggle to understand the word as a meaningful unit.
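Fertility can be measured for any tokenizer by counting pieces per word. In the sketch below, `char_chunks` is a hypothetical stand-in tokenizer that cuts words into fixed-size chunks; it is not WordPiece and is used only to show how the ratio behaves.

```python
def fertility(words, tokenize):
    """Average number of subword pieces per word."""
    pieces = sum(len(tokenize(word)) for word in words)
    return pieces / len(words)


def char_chunks(size):
    """Hypothetical tokenizer: split a word into chunks of `size` characters."""
    return lambda word: [word[i:i + size] for i in range(0, len(word), size)]


words = ["kanchipuram", "pattu", "saree"]
print(fertility(words, char_chunks(2)))  # heavily fragmenting tokenizer -> 4.0
print(fertility(words, char_chunks(6)))  # coarser tokenizer, about 1.33
```

A tokenizer whose vocabulary covers a language well behaves like the second case: most words survive as one or two pieces.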

Why MuRIL tokenization helps: MuRIL’s vocabulary contains better representation for Indian scripts and transliterated forms, so Indian-language words are split into fewer and more meaningful pieces than in mBERT.

9. Pre-training Details

MuRIL is trained as a BERT-base encoder model. The paper reports the following important pre-training details:

Aspect | Reported Setting
Architecture | BERT-base encoder
Objectives | MLM and TLM
Maximum sequence length | 512
Global batch size | 4096
Training steps | 1 million steps
Warm-up steps | 50,000
Optimizer | AdamW
Learning rate | \(5 \times 10^{-4}\)
Parameters | 236 million
Training tokens | Approximately 16 billion unique tokens
Vocabulary size | 197,285

10. Evaluation Method

The authors evaluate MuRIL on the XTREME benchmark, focusing on Indian-language test sets. The evaluation is done in a zero-shot cross-lingual setting. This means the model is fine-tuned on English training data and then evaluated on Indian-language test data.

This setting is challenging because the model must transfer learning from English to Indian languages. Strong performance in this setting indicates better cross-lingual understanding.

The tasks include:

  • Named Entity Recognition, or PANX
  • Part-of-Speech tagging, or UDPOS
  • Natural Language Inference, or XNLI
  • Sentence retrieval, or Tatoeba
  • Question Answering using XQuAD, MLQA, and TyDiQA-GoldP

11. Results Compared with mBERT

MuRIL outperforms mBERT across all reported Indian-language XTREME tasks. The average score improves from 59.1 for mBERT to 68.6 for MuRIL on native-script Indian-language test sets.

Task | mBERT | MuRIL | Interpretation
PANX NER (F1) | 58.0 | 77.6 | Large improvement in named entity recognition.
UDPOS (F1) | 71.2 | 75.0 | Improved syntactic tagging.
XNLI (Accuracy) | 66.8 | 74.1 | Improved cross-lingual reasoning.
Tatoeba (Accuracy) | 18.4 | 25.2 | Better sentence retrieval, though still challenging.
XQuAD (F1 / EM) | 71.2 / 58.2 | 79.1 / 65.6 | Improved question answering.
MLQA (F1 / EM) | 65.3 / 51.2 | 73.8 / 58.8 | Better multilingual QA performance.
TyDiQA-GoldP (F1 / EM) | 63.1 / 51.7 | 75.4 / 59.3 | Strong improvement on typologically diverse QA data.
Average | 59.1 | 68.6 | MuRIL performs better overall.

Performance on Transliterated Indian-Language Test Sets

The improvement is even stronger on transliterated test sets. On Indian-language text transliterated into Latin script, the average score improves from 21.1 for mBERT to 48.9 for MuRIL.

Task on Transliterated Test Sets | mBERT | MuRIL
PANX (F1) | 14.2 | 57.7
UDPOS (F1) | 28.2 | 62.1
XNLI (Accuracy) | 39.2 | 64.7
Tatoeba (Accuracy) | 2.7 | 11.0
Average | 21.1 | 48.9

This result directly supports the paper’s argument: Indian-language models must handle transliteration because Indian users often type local languages using Latin script.
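The averages quoted in this section can be checked from the per-task scores. The short script below recomputes them. One observation is an inference from the arithmetic rather than a statement quoted from the paper: the native-script averages match when the F1 score is used for the three QA tasks.

```python
# (mBERT, MuRIL) scores copied from the tables in this section.
native = {
    "PANX": (58.0, 77.6), "UDPOS": (71.2, 75.0), "XNLI": (66.8, 74.1),
    "Tatoeba": (18.4, 25.2), "XQuAD": (71.2, 79.1), "MLQA": (65.3, 73.8),
    "TyDiQA-GoldP": (63.1, 75.4),  # QA rows use F1, not EM
}
transliterated = {
    "PANX": (14.2, 57.7), "UDPOS": (28.2, 62.1),
    "XNLI": (39.2, 64.7), "Tatoeba": (2.7, 11.0),
}


def column_means(table):
    """Mean score per model, rounded to one decimal like the reported values."""
    n = len(table)
    mbert = sum(scores[0] for scores in table.values()) / n
    muril = sum(scores[1] for scores in table.values()) / n
    return round(mbert, 1), round(muril, 1)


print(column_means(native))          # (59.1, 68.6)
print(column_means(transliterated))  # (21.1, 48.9)
```

The gap is about 9.5 points on native-script text and about 27.8 points on transliterated text, which is why the transliteration result is the more striking of the two.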

12. Qualitative Examples

The paper includes qualitative examples showing that MuRIL handles context better than mBERT in several cases.

Named Entity Recognition

In one example, the phrase “Atlanta Falcons” refers to a football team. MuRIL predicts it as an organization, while mBERT incorrectly treats Atlanta as a location. This shows that MuRIL uses context more effectively.

In another example, “Shirdi’s Sai Baba” is correctly treated by MuRIL as a person, while mBERT incorrectly leans toward location because of the word “Shirdi.”

Sentiment Analysis

The paper also shows examples where MuRIL correctly handles mixed-language and transliterated sentences. For instance, a Hindi sentence containing an English word and a negation is correctly interpreted by MuRIL.

Question Answering

In question answering, MuRIL is shown to connect native-script and transliterated references better. For example, when a concept appears in Hindi in the context but in transliterated form in the question, MuRIL is able to infer the answer correctly.

13. Relevance for Indian Textile and Saree Research

At first glance, MuRIL is an NLP paper, not a textile paper. But it is very relevant for Indian textile and saree research because saree knowledge is multilingual. Saree names, craft clusters, motifs, weaving techniques, fabric types, and customer descriptions often appear in Indian languages, English, and mixed forms.

For example, the same textile concept may appear as:

  • “Kanjivaram saree”
  • “Kanchipuram pattu”
  • “kanjivaram pattu saree”
  • “कांजीवरम साड़ी”
  • “pattu saree”
  • “పట్టు చీర”

A model like MuRIL can help connect these language forms better than a general English-centric system.

MuRIL Concept | Possible Textile / Saree Research Use
Indian-language representation | Understand saree descriptions in Hindi, Telugu, Tamil, Kannada, Bengali, Marathi, and other languages.
Transliteration handling | Process customer searches such as “pattu saree,” “banarasi,” “kanchi pattu,” or “pochampally ikat.”
Cross-lingual alignment | Map regional craft names across Indian languages and English.
Named entity recognition | Identify place names, craft clusters, textile types, motif names, and brand names from text.
Question answering | Build textile knowledge assistants that answer questions from multilingual documents.
Sentiment analysis | Analyze customer reviews written in mixed Indian languages and English.

For saree provenance research, MuRIL can support the text side of a multimodal system. Image models can analyze motifs and fabric appearance, while MuRIL can process product descriptions, craft documentation, GI descriptions, artisan narratives, and customer queries.

14. Limitations and Future Scope

MuRIL is a strong contribution, but the paper also has practical boundaries. It currently supports 16 Indian languages plus English, not all Indian languages and dialects. India’s linguistic diversity is far larger than the supported set.

Another limitation is that the paper focuses on language understanding benchmarks. It does not directly test domain-specific use cases such as textiles, legal documents, medical records, education, or e-commerce product search.

For textile research, MuRIL would need to be further fine-tuned or combined with textile-specific vocabulary, saree descriptions, catalog data, craft cluster knowledge, and regional terminology.

Limitation | Suggested Future Direction
Limited language coverage | Extend to more Indian languages, dialects, and scripts.
Benchmark-focused evaluation | Evaluate on domain-specific tasks such as e-commerce, crafts, healthcare, law, or education.
Text-only model | Combine with image models for multimodal Indian-language applications.
General vocabulary | Fine-tune on textile, saree, craft, and cultural heritage corpora.
Transliteration variability | Handle multiple informal spellings of the same Indian-language word.

15. Simple Summary

MuRIL is a multilingual language model created specifically for Indian languages. It addresses a major weakness of general multilingual models: Indian languages are often underrepresented in their training data and vocabulary.

The model is trained on monolingual, translated, and transliterated data. It supports 16 Indian languages plus English and uses both Masked Language Modeling and Translation Language Modeling. It has a vocabulary of 197,285 tokens, 236 million parameters, and is trained on approximately 16 billion unique tokens.

Compared with mBERT, MuRIL performs better on Indian-language XTREME tasks. The improvement is especially large on transliterated Indian-language text, where users write Indian languages using Latin script. This makes MuRIL highly useful for real-world Indian digital language applications.

For saree and textile research, MuRIL can help process multilingual product descriptions, customer reviews, craft documentation, regional terminology, and transliterated search queries. It can become the language component of a larger multimodal saree understanding system.

16. General Disclaimer

This article is an educational explanation of the research paper “MuRIL: Multilingual Representations for Indian Languages.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Readers interested in implementation details, model usage, full per-language results, and exact training configuration should refer to the original paper and the released MuRIL resources.


Understanding the Paper: Content-Based Image Retrieval of Indian Traditional Textile Motifs Using Deep Feature Fusion

Content-Based Image Retrieval of Indian Traditional Textile Motifs Using Deep Feature Fusion

The paper “Content-based image retrieval of Indian traditional textile motifs using deep feature fusion” presents an interactive image retrieval system for Indian traditional textile motifs. Instead of asking the user to search by keywords, the system allows a user to provide a query image and then retrieves visually similar textile motif images from a database.

This is important because traditional Indian textile motifs are visually rich and difficult to describe fully in words. Motif identity may depend on color, line quality, texture, shape, ornamentation, and regional design grammar. A keyword-based search may fail to capture these details. A content-based image retrieval system, or CBIR system, uses the visual content of the image itself.

1. Problem Addressed by the Paper

Traditional Indian textile designs contain complex visual details. Motifs from styles such as Madhubani, Kalamkari, Ajrakh, Bagh, Kashida, Chikankari, Bandhani, Ikat, and Warli carry cultural identity and design value. Designers often need to search existing motif databases to create new designs, study references, or combine traditional motifs with contemporary fashion ideas.

However, manual searching is slow. Keyword-based image retrieval is also limited because many visual features are difficult to express in words. For example, the curved line style of Kalamkari, the dotted-resist appearance of Bandhani, or the geometric rhythm of Ikat may not be captured accurately through simple tags.

The paper therefore proposes a content-based image retrieval system that retrieves similar motif images by comparing image features rather than depending only on text labels.

Core problem: Can an AI-based retrieval system help designers quickly find visually similar traditional Indian textile motifs from a large image database?

2. What is Content-Based Image Retrieval?

Content-Based Image Retrieval, or CBIR, is a system that retrieves images based on visual similarity. A user provides a query image. The system extracts a feature vector from the query image and compares it with feature vectors stored for all database images.

The system then retrieves images whose feature vectors are closest to the query image feature vector.

The general CBIR pipeline can be written as:

\[ \text{Query Image} \rightarrow \text{Feature Extraction} \rightarrow \text{Similarity Matching} \rightarrow \text{Retrieved Images} \]

In older CBIR systems, features were often handcrafted, such as color histograms, texture descriptors, SIFT, Gabor, GLCM, or LBP. In this paper, the authors use deep features extracted from pre-trained convolutional neural networks.
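The matching stage of this pipeline can be sketched independently of how the features are produced. In the example below the three-dimensional vectors and image names are hypothetical placeholders for real CNN features, and Euclidean distance (via the standard-library `math.dist`) is used only as a simple default.

```python
import math


def retrieve(query_vec, database, top_k=2):
    """Rank database entries by Euclidean distance to the query (smaller is closer)."""
    ranked = sorted(database.items(), key=lambda item: math.dist(query_vec, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical pre-computed feature vectors, one per database image.
database = {
    "warli_01": [0.9, 0.1, 0.0],
    "warli_02": [0.8, 0.2, 0.1],
    "ikat_01": [0.1, 0.9, 0.3],
    "bagh_01": [0.2, 0.3, 0.9],
}

query = [0.9, 0.15, 0.0]  # features of the query image
print(retrieve(query, database))  # ['warli_01', 'warli_02']
```

Database features are extracted once and stored, so each query costs one feature extraction plus one pass over the stored vectors.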

3. Why Textile Motif Retrieval Matters

Textile motif retrieval is useful for designers, e-commerce platforms, researchers, museums, digital archives, and craft documentation projects. Designers can use such systems to search for visual references quickly. E-commerce platforms can improve product search. Researchers can compare motif families across regions and styles.

Use Case | How CBIR Helps
Fashion design | Designers can retrieve similar motifs and create new design variations faster.
Digital archives | Museums and institutions can organize motifs by visual similarity.
E-commerce | Customers can search products using images instead of keywords.
Craft preservation | Traditional motifs can be documented and retrieved more systematically.
Research | Researchers can study motif similarity, regional styles, and visual evolution.

4. Main Idea of the Proposed Method

The proposed system combines several ideas into one interactive CBIR framework:

  1. Extract deep features from textile motif images using pre-trained CNN models.
  2. Fuse features from InceptionResNetV2 and InceptionV3.
  3. Use distance metrics to compare query and database image features.
  4. Select Manhattan City Block distance as the best similarity measure.
  5. Use PCA to reduce feature dimensionality and speed retrieval.
  6. Use NAVRATTAN style clustering to search within a predicted motif class.
  7. Use relevance feedback so that users can refine retrieval results.
  8. Use SBFGSM saliency maps to explain which regions contribute to retrieval similarity.

The central contribution is not simply image classification. The goal is retrieval: given a motif image, the system should return visually similar motifs.

5. Traditional Indian Art Forms Dataset

The paper uses an expanded Traditional Indian Art Forms Dataset, abbreviated as TIAD. It contains 22,547 images across nine traditional Indian textile or art styles. The images are stored at \(300 \times 300\) pixel resolution.

Traditional Style | Number of Images | Visual Character
Bagh | 2,570 | Block-printed motifs, often with strong repeat structures.
Bandhani / Bandhej | 2,668 | Tie-dye dot patterns and resist-dyed visual rhythm.
Batik | 3,078 | Wax-resist patterns with organic crackle and decorative forms.
Chikankari | 2,307 | Embroidery-based motifs, often delicate and tonal.
Ikat | 2,724 | Resist-dyed yarn patterns with blurred geometric edges.
Kalamkari | 1,502 | Hand-drawn or block-printed narrative and floral motifs.
Kashida | 2,228 | Embroidery motifs inspired by nature and regional ornamentation.
Madhubani | 2,280 | Folk-art style with dense line work and symbolic figures.
Warli | 3,190 | Tribal art style with geometric human, animal, and ritual forms.
Total | 22,547 | Nine traditional Indian art/textile styles.

The authors also test the method on benchmark CBIR datasets such as Corel-1K and Caltech-101 to compare performance with existing methods.

6. Deep Feature Fusion

The main feature-extraction method uses two pre-trained CNN models:

  • InceptionResNetV2
  • InceptionV3

The authors remove the final softmax classification layer and use high-level features from the models. InceptionResNetV2 provides a feature vector of dimension \(1536\), while InceptionV3 provides a feature vector of dimension \(2048\). These are concatenated to create one fused feature vector:

\[ F_{fusion} = [F_{IRV2}; F_{IV3}] \]

The final fused feature dimension is:

\[ 1536 + 2048 = 3584 \]

This fusion is useful because each CNN architecture may capture different visual cues. One model may capture more shape-based features, while another may capture texture, line, or pattern composition more effectively.

Feature Source | Feature Dimension | Role in Retrieval
InceptionResNetV2 | 1536 | Captures high-level deep visual features using inception and residual learning.
InceptionV3 | 2048 | Captures multi-scale visual patterns through inception modules.
Fused representation | 3584 | Combines complementary visual information for stronger retrieval.
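Fusion by concatenation has a simple consequence for Manhattan distance: the distance between two fused vectors is the sum of the distances measured by each network separately. The sketch below demonstrates this with tiny hypothetical vectors standing in for the 1536- and 2048-dimensional features.

```python
def fuse(features_a, features_b):
    """Concatenate two feature vectors into one fused vector."""
    return list(features_a) + list(features_b)


def manhattan(q, r):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(q, r))


# Hypothetical toy features (real ones have 1536 and 2048 dimensions).
irv2_query, iv3_query = [0.2, 0.7], [0.5, 0.1, 0.9]
irv2_item, iv3_item = [0.1, 0.9], [0.4, 0.3, 0.6]

fused_query = fuse(irv2_query, iv3_query)
fused_item = fuse(irv2_item, iv3_item)

print(len(fused_query))  # 5, i.e. 2 + 3 (3584 = 1536 + 2048 in the paper)
print(manhattan(fused_query, fused_item))
print(manhattan(irv2_query, irv2_item) + manhattan(iv3_query, iv3_item))
```

The last two printed values agree up to floating-point rounding, so under this metric each network contributes its own evidence of dissimilarity to the final score.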


7. Similarity Measures

Once the feature vector of the query image and database images is available, the next task is similarity matching. The paper compares several distance or similarity measures:

  • Euclidean distance
  • Manhattan City Block distance
  • Jeffrey divergence
  • Tanimoto coefficient

Euclidean Distance

\[ d(Q,R) = \sqrt{\sum_{i=1}^{n}(Q_i - R_i)^2} \]

Manhattan City Block Distance

\[ d(Q,R) = \sum_{i=1}^{n}|Q_i - R_i| \]

Jeffrey Divergence

\[ d(Q,R) = \sum_{i=1}^{n} \left( Q_i \log \frac{Q_i}{\mu_i} + R_i \log \frac{R_i}{\mu_i} \right) \]

where:

\[ \mu_i = \frac{Q_i + R_i}{2} \]

Tanimoto Coefficient

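\[ T(Q,R) = \frac{Q \cdot R}{\|Q\|^2 + \|R\|^2 - Q \cdot R} \]

Unlike the three distances above, the Tanimoto coefficient is a similarity score: it equals 1 when \(Q = R\), and larger values indicate closer vectors.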

The paper finds that the Manhattan City Block distance performs best overall on the TIAD dataset, with an overall average precision of 92.46%.
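The four measures translate directly into code. The sketch below is an illustrative implementation, not the authors' code. It assumes non-negative feature values for the Jeffrey divergence and uses the convention \(0 \log 0 = 0\).

```python
import math


def euclidean(q, r):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))


def manhattan(q, r):
    return sum(abs(a - b) for a, b in zip(q, r))


def jeffrey(q, r):
    """Jeffrey divergence; assumes non-negative entries, with 0 * log(0) = 0."""
    total = 0.0
    for a, b in zip(q, r):
        mu = (a + b) / 2
        if a > 0:
            total += a * math.log(a / mu)
        if b > 0:
            total += b * math.log(b / mu)
    return total


def tanimoto(q, r):
    """Tanimoto coefficient (a similarity); undefined if both vectors are zero."""
    dot = sum(a * b for a, b in zip(q, r))
    return dot / (sum(a * a for a in q) + sum(b * b for b in r) - dot)


q = [1.0, 2.0, 3.0]
r = [1.0, 0.0, 5.0]
print(euclidean(q, r), manhattan(q, r), jeffrey(q, r), tanimoto(q, r))
```

When ranking, results are sorted in ascending order for the first three measures and in descending order for the Tanimoto coefficient.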

8. Precision and Recall

The system is evaluated using precision and recall. In CBIR, precision tells us how many retrieved images are actually relevant. Recall tells us how many relevant images in the database were retrieved.

Precision is defined as:

\[ Precision = \frac{\text{Number of relevant images retrieved}}{\text{Number of retrieved images}} \]

Recall is defined as:

\[ Recall = \frac{\text{Number of relevant images retrieved}}{\text{Total number of relevant images in the dataset}} \]

For example, if the system retrieves 20 images and 16 of them are relevant, then:

\[ Precision = \frac{16}{20} = 0.80 \]

If the database contains 50 relevant images in total, then:

\[ Recall = \frac{16}{50} = 0.32 \]
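The same calculation can be written as a small helper. The image identifiers below are hypothetical integers chosen to reproduce the worked example: 50 relevant images in the database, of which 16 appear among 20 retrieved results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved), hits / len(relevant)


relevant_ids = range(50)                            # 50 relevant images in total
retrieved_ids = list(range(16)) + [100, 101, 102, 103]  # 16 hits + 4 misses

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(precision, recall)  # 0.8 0.32
```

Because the number of retrieved images is fixed and small, recall is bounded by that number divided by the count of relevant images, which is why recall values in this kind of evaluation are much lower than precision values.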

9. Retrieval Results

The proposed deep feature fusion approach performs strongly on the TIAD dataset. With a scope of 20 retrieved images, the system achieves an overall mean precision of 92.46% and mean recall of 19.51%.

TIAD Class | Precision (%) | Recall (%)
Bagh | 93.24 | 23.13
Bandhej | 92.29 | 17.95
Batik | 92.02 | 15.12
Chikankari | 89.12 | 18.00
Ikat | 92.04 | 17.24
Kalamkari | 95.00 | 26.46
Kashida | 92.36 | 16.11
Madhubani | 93.91 | 24.26
Warli | 92.17 | 17.33
Mean | 92.46 | 19.51

Among the TIAD classes, Kalamkari shows the highest precision at \(95.00\%\), while Chikankari shows the lowest at \(89.12\%\). One plausible explanation is that Chikankari motifs are subtle and tonal, which makes them harder to match visually than bolder graphic motifs.

10. SBFGSM Explainability Method

The paper proposes a visualization method called Similarity-Based Fine-Grained Saliency Maps, abbreviated as SBFGSM. The purpose is to explain why certain retrieved images are considered similar to the query image.

In classification, saliency maps usually explain why an image was assigned a particular class label. In CBIR, the question is different: why was one image considered similar to another image? SBFGSM addresses this retrieval-specific explanation problem.

The basic idea is to mask small regions of the retrieved image and observe how the similarity score changes. If masking a region changes the similarity score significantly, that region is important for retrieval.

The importance score can be conceptually written as:

\[ K(Q,R,m_i) = \max\left(L_1(V_R \odot m_i, V_Q) - L_1(V_Q,V_R), 0\right) \]

Here, \(Q\) is the query image, \(R\) is the retrieved image, \(m_i\) is a binary mask, \(V_Q\) and \(V_R\) are feature vectors, \(L_1\) represents Manhattan distance, and \(\odot\) represents element-wise multiplication.

The resulting heatmap shows which image regions contributed most to the similarity decision. Brighter regions indicate stronger contribution.

Why SBFGSM matters: It helps users trust the retrieval system by showing which motif regions influenced the matching decision.
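The occlusion idea can be sketched on a toy example. Everything below is an assumption made for illustration: `extract_features` is a hypothetical stand-in for the CNN (it returns row and column means), images are small grids of numbers, and the mask is applied to the image pixels before feature extraction, following the prose description. It is not the authors' implementation.

```python
def extract_features(image):
    """Hypothetical stand-in for a CNN: mean intensity of each row and column."""
    rows = [sum(row) / len(row) for row in image]
    cols = [sum(col) / len(col) for col in zip(*image)]
    return rows + cols


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def occlude(image, top, left, size):
    """Return a copy of the image with a size x size patch set to zero."""
    out = [row[:] for row in image]
    for i in range(top, min(top + size, len(image))):
        for j in range(left, min(left + size, len(image[0]))):
            out[i][j] = 0.0
    return out


def saliency_map(query, retrieved, patch=1):
    """Score each patch by how much masking it increases the L1 distance."""
    v_q = extract_features(query)
    base = l1(v_q, extract_features(retrieved))
    heat = [[0.0] * len(retrieved[0]) for _ in retrieved]
    for top in range(0, len(retrieved), patch):
        for left in range(0, len(retrieved[0]), patch):
            masked = occlude(retrieved, top, left, patch)
            score = max(l1(extract_features(masked), v_q) - base, 0.0)
            for i in range(top, min(top + patch, len(retrieved))):
                for j in range(left, min(left + patch, len(retrieved[0]))):
                    heat[i][j] = score
    return heat


query = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
retrieved = [row[:] for row in query]  # identical image, so base distance is 0
heat = saliency_map(query, retrieved)
for row in heat:
    print(row)
```

Only the centre pixel receives a non-zero score here, because it is the only region whose removal makes the retrieved image less similar to the query.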

11. PCA for Faster Retrieval

The fused deep feature vector has 3584 dimensions. Searching through high-dimensional feature vectors can be computationally expensive. The authors therefore apply Principal Component Analysis, or PCA, to reduce dimensionality.

PCA tries to preserve maximum information while reducing the number of dimensions:

\[ F_{3584} \rightarrow F_{reduced} \]

For the TIAD dataset, the authors use 1024 principal components. For the Caltech-101 dataset, they use 100 principal components.

Using PCA modestly improves retrieval speed. On TIAD, average retrieval time falls from \(0.7435\) seconds without PCA to \(0.6586\) seconds with PCA, a reduction of about 11%. Precision remains strong and even improves slightly in the TIAD case.

Dataset | Retrieval Time with PCA | Retrieval Time without PCA
TIAD | 0.6586 seconds | 0.7435 seconds
Caltech-101 | 0.5837 seconds | 0.6874 seconds

12. NAVRATTAN Style Clustering

The paper introduces a clustering-based retrieval strategy called NAVRATTAN Style Clustering. The name refers to the nine traditional styles in the TIAD dataset.

The idea is simple. Instead of searching the entire database, the model first predicts the class of the query image. Then it searches only within that class cluster. This reduces search space and improves retrieval speed.

The process can be summarized as:

\[ \text{Query Image} \rightarrow \text{Predicted Style Cluster} \rightarrow \text{Search Within Cluster} \rightarrow \text{Top Similar Images} \]

This improves both relevance and speed. The paper reports that NAVRATTAN clustering raises precision from 92.46% to 95.18%. With PCA, retrieval time falls from 0.6586 seconds to 0.4475 seconds, a reduction of about 32%.

Approach | Average Retrieval Time with PCA | Average Retrieval Time without PCA
Previous approach | 0.6586 seconds | 0.7435 seconds
NAVRATTAN clustering | 0.4475 seconds | 0.5494 seconds

The improvement is meaningful because a practical design-search system must be fast enough for interactive use.
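The two-stage search can be sketched as follows. The nearest-centroid rule used here to predict the style is a hypothetical stand-in for the paper's classifier, and the two-dimensional vectors and image names are invented for illustration.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_style(query, centroids):
    """Hypothetical stand-in classifier: pick the style with the nearest centroid."""
    return min(centroids, key=lambda style: l1(query, centroids[style]))


def clustered_search(query, database, centroids, top_k=2):
    """Predict the style first, then rank only the images in that style."""
    style = predict_style(query, centroids)
    candidates = [
        (l1(query, vec), name)
        for name, (item_style, vec) in database.items()
        if item_style == style
    ]
    candidates.sort()
    return style, [name for _, name in candidates[:top_k]]


database = {
    "warli_01": ("Warli", [0.9, 0.1]),
    "warli_02": ("Warli", [0.8, 0.2]),
    "ikat_01": ("Ikat", [0.1, 0.9]),
    "ikat_02": ("Ikat", [0.2, 0.8]),
}
centroids = {"Warli": [0.85, 0.15], "Ikat": [0.15, 0.85]}

print(clustered_search([0.88, 0.1], database, centroids))
```

Only the images in the predicted cluster are compared with the query, which is where the time saving comes from. The trade-off is that a wrong style prediction removes every truly relevant image from the candidate set.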

13. Relevance Feedback

The paper also uses relevance feedback. This allows the user to mark retrieved images as relevant or non-relevant. The system then refines the retrieval results in further iterations.

This is useful because image retrieval has a semantic gap. The system may retrieve images that are visually close according to feature vectors, but the user may have a more specific intention. Relevance feedback helps reduce the gap between system similarity and user expectation.

The paper reports that relevance feedback improves retrieval efficiency over multiple iterations on the TIAD dataset:

Method | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4
InceptionV3 baseline | 89.89% | 92.03% | 94.67% | 95.14% | 95.96%
InceptionResNetV2 baseline | 90.30% | 92.17% | 95.03% | 95.89% | 96.24%
IRV2 + IV3 (proposed) | 95.18% | 97.07% | 98.03% | 98.66% | 99.00%

This result shows that user interaction can significantly improve retrieval relevance. The system becomes not only automatic but also adaptive to the user’s search intention.
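One classic way to use such feedback is the Rocchio update, which moves the query vector toward images marked relevant and away from those marked non-relevant. This summary does not state which feedback mechanism the paper uses, so the sketch below is a swapped-in textbook technique, and the weights `alpha`, `beta`, and `gamma` are conventional illustrative defaults rather than values from the paper.

```python
def mean_vector(vectors, dim):
    """Component-wise mean; a zero vector if no feedback was given."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: q' = alpha*q + beta*mean(relevant) - gamma*mean(non_relevant)."""
    dim = len(query)
    pos = mean_vector(relevant, dim)
    neg = mean_vector(non_relevant, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


query = [0.5, 0.5]
marked_relevant = [[0.9, 0.1], [0.7, 0.3]]   # images the user accepted
marked_non_relevant = [[0.1, 0.9]]           # images the user rejected

updated = rocchio_update(query, marked_relevant, marked_non_relevant)
print(updated)
```

The updated query is then used for the next retrieval round, and the loop repeats for as many iterations as the user provides feedback.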

14. Relevance for Saree and Textile Research

This paper is highly relevant for saree and textile research because saree identification is not only a classification problem. Often, the more useful question is: “Show me sarees or motifs similar to this one.”

For sarees, CBIR can support visual search based on motifs, borders, pallus, weave structures, color layouts, and regional design grammar. A customer, designer, or researcher could upload an image of a saree motif and retrieve visually similar examples from a digital archive.

Paper Concept | Possible Saree Research Use
CBIR | Retrieve visually similar sarees, motifs, borders, or pallus.
Deep feature fusion | Capture complex motif and texture features more effectively than one model alone.
Manhattan distance | Measure visual similarity between saree feature vectors.
SBFGSM | Explain which motif regions influenced the retrieval match.
NAVRATTAN clustering | Search within specific saree clusters such as Banarasi, Kanjivaram, Gadwal, Ikat, or Kalamkari.
Relevance feedback | Allow users to refine results based on their design intention.

For saree provenance research, this approach can complement classification models. A classification model gives a label, while a retrieval model shows similar examples. That can make the system more useful for designers, merchandisers, researchers, and customers.
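
To show what "retrieving similar examples" could look like for a saree archive, here is a minimal sketch that ranks archive items by Manhattan distance to a query feature vector. The archive entries, their names, and the three-dimensional vectors are hypothetical; in practice the vectors would be deep features extracted from images.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_similarity(query, archive, top_k=3):
    """Return the names of the top_k archive items closest to the query."""
    scored = sorted(archive.items(), key=lambda item: manhattan(query, item[1]))
    return [name for name, _ in scored[:top_k]]


# Hypothetical archive of saree feature vectors (toy values).
archive = {
    "banarasi_border": [0.9, 0.1, 0.4],
    "ikat_body": [0.2, 0.8, 0.5],
    "kanjivaram_pallu": [0.8, 0.2, 0.3],
}
query = [0.9, 0.1, 0.35]
results = rank_by_similarity(query, archive, top_k=2)
```

A classifier would return a single label for the query, whereas this ranking returns the nearest examples themselves, which is the behaviour the section describes.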

15. Limitations and Future Scope

The paper makes a strong contribution, but a few limitations should be considered. First, the dataset consists of motif-style images rather than full product-level saree images. Full saree retrieval may be more complex because body, border, pallu, blouse piece, folds, photography style, and background can all influence visual similarity.

Second, the system depends on deep feature representations from ImageNet-pretrained models. Although useful, such models may not fully understand textile-specific features unless fine-tuned or combined with domain-specific knowledge.

Third, relevance feedback improves performance but requires user interaction. This is beneficial for design search, but it may be less suitable when fully automatic large-scale retrieval is required.

| Limitation | Suggested Improvement |
| --- | --- |
| Motif-level dataset | Extend to full saree images with body, border, and pallu regions. |
| Generic pre-trained features | Fine-tune models on textile-specific datasets. |
| Style-level clustering | Add finer clusters such as motif type, border style, weave, or region. |
| User feedback required | Develop hybrid automatic and interactive retrieval modes. |
| Visual-only search | Combine image retrieval with metadata, craft knowledge graphs, and text descriptions. |

16. Simple Summary

This paper proposes an interactive content-based image retrieval system for Indian traditional textile motifs. The system retrieves visually similar motifs by comparing deep image features rather than relying only on keywords.

The method fuses features from InceptionResNetV2 and InceptionV3, creating a 3584-dimensional feature vector. It uses Manhattan (city-block) distance for similarity matching, PCA for faster retrieval, SBFGSM for explainability, NAVRATTAN style clustering for speed and relevance, and relevance feedback for user-driven refinement.

The system achieves strong retrieval performance on the Traditional Indian Art Forms Dataset, with mean precision of 92.46% using deep feature fusion and 95.18% using NAVRATTAN clustering. With relevance feedback, the proposed method reaches 99.00% retrieval efficiency after four iterations.

For saree and textile research, the paper is important because it shows how image retrieval can support design discovery, motif comparison, digital archives, e-commerce search, and cultural preservation.

17. General Disclaimer

This article is an educational explanation of the research paper “Content-based image retrieval of Indian traditional textile motifs using deep feature fusion.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Image-based retrieval should be treated as a design-support and research-support tool, not as a complete replacement for expert craft knowledge, historical interpretation, or textile-domain validation.


Understanding the Paper: Handloomed Fabrics Recognition with Deep Learning

Handloomed Fabrics Recognition with Deep Learning

The paper “Handloomed Fabrics Recognition with Deep Learning” presents an artificial-intelligence-based approach for distinguishing genuine handloom “gamucha” fabrics from powerloom imitations. The gamucha is a culturally important towel-like textile from Assam, India, commonly recognized by its white body, red borders, and woven motifs.

The study is important because authentic handloom products are often imitated by powerloom products and sold deceptively as handloom. This affects the livelihood of weavers, weakens consumer trust, and threatens cultural textile heritage. The authors therefore propose a computer-assisted recognition system that can help identify whether a gamucha is handloom or powerloom using image-based deep learning.

1. Problem Addressed by the Paper

The paper addresses a practical and culturally important problem: how to distinguish authentic handloom gamucha fabrics from powerloom imitations. Handloom products require skill, time, and traditional knowledge, while powerloom products can be produced faster and at lower cost. When powerloom products are sold as handloom, customers may be misled and handloom weavers may lose rightful market value.

The problem is difficult because handloom and powerloom gamuchas may look very similar. Even experts may need scientific support to confirm the loom type. Manual identification depends on fabric feel, weft uniformity, selvedge markings, thread type, and other subtle features. The authors therefore explore whether deep learning can learn these visual differences directly from fabric images.

Core problem: Can a deep-learning model automatically identify whether a gamucha is handloom or powerloom from close-up fabric images?

2. What is a Gamucha?

The gamucha is a traditional textile from Assam. It is usually a rectangular cloth, commonly woven in cotton, with a white body and red border or motifs. It has both practical and symbolic value. It is used in daily life, rituals, hospitality, cultural events, and Assamese identity expression.

The paper presents the gamucha as a heritage textile whose authenticity matters not only for trade but also for cultural preservation. The authors highlight that the handloom sector in Assam supports a large number of weaver families, many of them women. Therefore, protecting genuine handloom products has both economic and cultural importance.

3. Manual Differences Between Handloom and Powerloom Gamucha

The paper describes several features that experts use to distinguish handloom from powerloom gamuchas. These include fabric feel, weft uniformity, occasional lumps, selvedge markings, and thread type. The important point is that these differences are often subtle and may not be obvious to an ordinary buyer.

| Feature | Handloom Gamucha | Powerloom Gamucha |
| --- | --- | --- |
| Fabric feel | Generally softer because of pure, handmade yarns. | Often stiffer, sometimes because of synthetic or cheaper yarns. |
| Weft uniformity | May show uneven pick-ups due to manual weaving. | Usually shows more consistent pick-ups. |
| Occasional lumps | May be present due to warp breakage and repair. | Usually absent or less visible. |
| Selvedge markings | May show distinct temple-like marks. | Usually lacks those temple marks. |
| Thread type | Often uses twisted threads or yarns. | Often uses single untwisted yarns. |

These features are spread across different parts of the gamucha, including the selvedge, short edge, inner body, and motif region. This makes the classification problem visually complex.

4. Main Idea of the Proposed Study

The authors build an image-based deep-learning recognition system for binary classification:

\[ \text{Input image} \rightarrow \text{Handloom or Powerloom} \]

The study first compares several established deep-learning architectures and then proposes a custom model called gamuch.AI. The goal is not merely to achieve high accuracy but also to create a model that is lightweight enough for practical deployment, including mobile-app usage.

The basic workflow is:

  1. Collect verified handloom and powerloom gamucha samples.
  2. Capture close-up images using smartphone cameras.
  3. Crop images into fixed regions.
  4. Resize images to model input size.
  5. Apply image augmentation.
  6. Train deep-learning models.
  7. Compare model accuracy, loss, precision, sensitivity, specificity, and F1 score.
  8. Deploy the best model through a mobile application prototype.

5. Dataset Creation

The dataset was created from 200 gamucha pieces, with 100 handloom and 100 powerloom samples. The handloom samples were collected from weaving centers, while the powerloom samples included seized stock. The samples were validated by experts from the Department of Handloom and Textile, Government of Assam, which provides a reliable ground-truth basis.

Images were captured using two smartphone models: iPhone 12 and Xiaomi 11i. The authors captured images without flash and maintained a distance of approximately 5–10 cm from the fabric. This close-up image acquisition helped capture fabric surface, yarn, motif, and texture details.

The dataset preparation involved cropping each image into three equal square sections: top-left, bottom-right, and center. Each crop was resized to \(224 \times 224\) pixels. With 1457 captured images per class and three crops per image, the dataset contained 4371 images per class before augmentation.

| Dataset Step | Description |
| --- | --- |
| Samples collected | 100 handloom gamuchas and 100 powerloom gamuchas. |
| Image capture | 1457 images captured for each loom type. |
| Cropping | Three square sections cropped from each image. |
| Resizing | Images resized to \(224 \times 224\) pixels. |
| Before augmentation | 4371 images per class. |
| After augmentation | 7010 training images and 1732 validation images per class. |
| Total final dataset | 17,484 images: 14,020 training and 3464 validation images. |
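
The cropping step and the dataset arithmetic can be sketched as follows. The image dimensions and the crop side length are assumptions chosen for illustration, since this article does not state the exact crop size the authors used. The image counts come from the table above.

```python
def three_square_crops(width, height, side):
    """Return (left, top, right, bottom) boxes for three equal square crops.

    The crop side is a free parameter here; the paper's value is not given.
    """
    if side > min(width, height):
        raise ValueError("crop side must fit inside the image")
    cx, cy = (width - side) // 2, (height - side) // 2
    return {
        "top_left": (0, 0, side, side),
        "center": (cx, cy, cx + side, cy + side),
        "bottom_right": (width - side, height - side, width, height),
    }


# Hypothetical 4000 x 3000 smartphone photo with 2000-pixel square crops.
boxes = three_square_crops(width=4000, height=3000, side=2000)

# Dataset arithmetic using the reported counts.
captured_per_class = 1457
crops_per_class = captured_per_class * len(boxes)        # before augmentation
after_augmentation_per_class = 7010 + 1732               # training + validation
total_images = 2 * after_augmentation_per_class          # both loom types
```

The counts are consistent with each other: three crops of 1457 images give 4371 per class, and the augmented training and validation sets add up to 17,484 images over both classes.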

6. Image Augmentation

Image augmentation was used to increase dataset diversity and improve model generalization. This is especially important because real-world fabric images may differ in orientation, brightness, zoom level, and capture condition.

| Augmentation Technique | Purpose |
| --- | --- |
| Rescaling | Normalizes RGB pixel values from \([0,255]\) to \([0,1]\). |
| Rotation | Rotates images up to \(180^\circ\) to simulate different orientations. |
| Vertical flip | Allows the model to handle upside-down fabric orientation. |
| Horizontal flip | Allows the model to handle mirror-image orientation. |
| Brightness variation | Simulates different lighting conditions. |
| Zoom | Helps the model focus on fabric loop and yarn structure. |
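
A few of these transformations are simple enough to show on a tiny grayscale grid. The sketch below is a toy illustration in plain Python, not the authors' pipeline, which would normally rely on an image library. Arbitrary-angle rotation and zoom need pixel interpolation and are left out. The brightness factor is an assumed value.

```python
def rescale(image):
    """Map pixel values from [0, 255] to [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Turn the image upside down."""
    return [row[:] for row in reversed(image)]


def rotate_180(image):
    """A 180 degree rotation equals a vertical flip followed by a horizontal flip."""
    return horizontal_flip(vertical_flip(image))


def adjust_brightness(image, factor):
    """Scale already-rescaled pixels and clip them to the [0, 1] range."""
    return [[min(1.0, pixel * factor) for pixel in row] for row in image]


image = [[0, 255], [128, 64]]          # toy 2 x 2 grayscale patch
scaled = rescale(image)
mirrored = horizontal_flip(image)
upside_down = vertical_flip(image)
rotated = rotate_180(image)
brighter = adjust_brightness(scaled, factor=1.5)
```

Because a woven surface has no fixed "up" direction in a close-up photo, flips and rotations produce plausible new training views without changing the loom-type label.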

7. Deep Learning Models Compared

The paper compares six well-known deep-learning architectures:

  • VGG16
  • VGG19
  • ResNet50
  • InceptionV3
  • InceptionResNetV2
  • DenseNet201

Each model was trained to classify gamucha images into handloom or powerloom. The purpose of the comparison was to understand whether standard pre-trained architectures can solve the problem or whether a lighter custom architecture is more suitable.

| Model | Why It Was Considered |
| --- | --- |
| VGG16 / VGG19 | Simple CNN architectures with small \(3 \times 3\) filters and strong image-classification history. |
| ResNet50 | Uses residual connections to handle deeper networks and reduce vanishing-gradient problems. |
| InceptionV3 | Uses multi-scale filters and efficient factorized convolutions. |
| InceptionResNetV2 | Combines inception modules with residual connections. |
| DenseNet201 | Uses dense connections to improve feature reuse and information flow. |

8. The Proposed Model: gamuch.AI

The proposed model, called gamuch.AI, is inspired by VGG16 but modified to become lighter and more suitable for the specific handloom-versus-powerloom classification task.

The authors first tried VGG16 without modification, but its accuracy remained around 50–56%. They then tried transfer learning by replacing the last dense layers, but this caused overfitting. A second modification with data augmentation and dropout reduced overfitting but introduced bias. Finally, the authors trained a simplified VGG16-like model from scratch.

The final model removes the fifth convolutional block of original VGG16. It uses the first four convolutional blocks, followed by average pooling and dense layers. This reduces complexity while retaining enough visual information from the fabric images.

| Design Choice | Reason |
| --- | --- |
| Input size \(224 \times 224\) | Balances image information and model simplicity. |
| Removal of fifth VGG16 block | Reduces complexity and computational cost. |
| Average pooling | Condenses learned feature maps efficiently. |
| Dense layers | Perform final binary classification. |
| Training from scratch | Allows the model to learn task-specific fabric features. |
| Data augmentation | Reduces overfitting and improves generalization. |
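
To see how much removing the fifth block saves, the sketch below counts the convolutional parameters of a standard VGG16 layout. It assumes the proposed model keeps the usual VGG16 channel widths in blocks 1 to 4, which this article does not confirm. The dense head is not counted because its layer sizes are not given here.

```python
# (input_channels, output_channels) of each 3x3 convolution in standard VGG16.
VGG16_BLOCKS_1_TO_4 = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
]
VGG16_BLOCK_5 = [(512, 512), (512, 512), (512, 512)]


def conv_params(c_in, c_out, kernel=3):
    """Weights plus one bias per output channel for a square convolution."""
    return (kernel * kernel * c_in + 1) * c_out


def block_params(block):
    """Total parameters of all convolutions in one block."""
    return sum(conv_params(c_in, c_out) for c_in, c_out in block)


trunk_params = sum(block_params(block) for block in VGG16_BLOCKS_1_TO_4)
block5_params = block_params(VGG16_BLOCK_5)

print(f"Blocks 1-4: {trunk_params:,} parameters")
print(f"Block 5 alone: {block5_params:,} parameters")
```

Under this assumption, the four retained blocks hold about 7.6 million convolutional parameters and the removed fifth block would have added about 7.1 million more, so dropping it roughly halves the convolutional part of the network.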

A simplified classification function can be written as:

\[ \phi_{DLar}: G \rightarrow S \]

Here, \(G\) represents the set of gamucha images, and \(S\) represents the possible prediction scores or class labels.

9. Binary Cross-Entropy Loss

Since the problem is binary classification, the authors use Binary Cross-Entropy loss. The two classes are:

  • positive class: handloom gamucha
  • negative class: powerloom gamucha

The binary cross-entropy loss can be written as:

\[ BCE = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right] \]

Here, \(N\) is the number of training samples, \(y_i\) is the actual class label, and \(p_i\) is the predicted probability. This loss penalizes wrong predictions and helps the model learn to distinguish handloom from powerloom.
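
The formula can be checked numerically with a few lines of Python. This is a minimal sketch of the loss itself, not the authors' training code. The small `eps` clipping value is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() well defined
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Label 1 = handloom, label 0 = powerloom.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confidently correct predictions give a loss close to 0.105, while confidently wrong ones give a loss close to 2.303, which is how the loss pushes the model away from wrong predictions.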

10. Mobile Application Workflow

A useful part of the paper is the mobile application prototype. The authors developed an initial app using Flutter. The app is designed so that a user can capture images of a gamucha and receive a prediction.

The workflow is:

  1. The app captures three pictures of the gamucha using the phone camera.
  2. The user clicks “Analyse now.”
  3. The images are uploaded to a processing server through an API.
  4. The server crops and resizes the images.
  5. Each processed image is passed through the trained AI model.
  6. The model produces separate predictions for each image.
  7. The final result is taken from the prediction with the highest score across the three images.
  8. The prediction is displayed to the user.
  9. No image is stored permanently on the server; uploaded images are removed after processing.
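
The aggregation step can be read in more than one way. The sketch below shows one possible interpretation, in which the single most confident per-image prediction decides the final label. The function name, the 0.5 threshold, and the example probabilities are assumptions, not details taken from the paper.

```python
def aggregate_predictions(handloom_probs, threshold=0.5):
    """Pick the most confident per-image prediction as the final result.

    handloom_probs holds one predicted probability of "handloom" per image.
    This is an assumed reading of "highest predicted result".
    """
    # The most confident prediction is the one farthest from the threshold.
    best_p = max(handloom_probs, key=lambda p: abs(p - threshold))
    label = "handloom" if best_p >= threshold else "powerloom"
    confidence = best_p if label == "handloom" else 1.0 - best_p
    return label, confidence


# Hypothetical model outputs for the three captured images.
label, confidence = aggregate_predictions([0.62, 0.08, 0.55])
```

In this example the second image is predicted as powerloom with the highest confidence, so it decides the outcome. A majority vote or an average over the three images would be alternative designs with different failure modes.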

This mobile workflow makes the study practically meaningful. The goal is not only to publish a model but also to create a deployable tool that can assist authentication in the field.

11. Results and Interpretation

After the third modification, the final model reached a training accuracy of 98.47% and a validation accuracy of 94.39% after 43 epochs. It was then tested on previously unseen images: 25 images of each type, augmented to 100 images of each type. The reported test accuracy was 98.0%.

The confusion matrix showed that some powerloom gamuchas were wrongly classified as handloom. The authors inspected those images and found that they were blurry. This means image focus is important for reliable prediction.

| Evaluation Stage | Reported Result |
| --- | --- |
| Training accuracy of final model | 98.47% |
| Validation accuracy of final model | 94.39% |
| Test accuracy on unknown augmented samples | 98.0% |
| Model size | 41.44 MB |
| Number of parameters | 10,846,914 |
| Model depth | 19 |

Main result: The custom lightweight gamuch.AI model outperformed larger pre-trained models in practical validation performance and computational efficiency.

12. Model Comparison

The paper compares the proposed model against several large CNN architectures. Although some models achieved high training accuracy, their validation performance was weaker. This suggests that larger models may overfit or may not generalize well to this specific loom-type classification task.

| Model | Training Accuracy | Validation Accuracy | Model Size | Parameters |
| --- | --- | --- | --- | --- |
| VGG16 | 0.528 | 0.500 | 512.26 MB | 134,268,738 |
| VGG19 | 0.546 | 0.500 | 532.53 MB | 139,578,434 |
| ResNet50 | 0.981 | 0.768 | 1722.54 MB | 451,423,106 |
| InceptionV3 | 0.985 | 0.685 | 948.06 MB | 248,311,586 |
| InceptionResNetV2 | 0.990 | 0.912 | 873.34 MB | 228,416,738 |
| DenseNet201 | 0.951 | 0.641 | 1605.81 MB | 420,467,266 |
| Proposed Model | 0.985 | 0.944 | 41.44 MB | 10,846,914 |
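
One simple way to read this table is to look at the gap between training and validation accuracy, since a large gap is a common sign of overfitting. The sketch below computes that gap from the reported values; the variable names are illustrative.

```python
# (training accuracy, validation accuracy) as reported in the table above.
accuracies = {
    "VGG16": (0.528, 0.500),
    "VGG19": (0.546, 0.500),
    "ResNet50": (0.981, 0.768),
    "InceptionV3": (0.985, 0.685),
    "InceptionResNetV2": (0.990, 0.912),
    "DenseNet201": (0.951, 0.641),
    "Proposed Model": (0.985, 0.944),
}

# Gap between training and validation accuracy for each model.
gaps = {
    name: round(train - val, 3)
    for name, (train, val) in accuracies.items()
}
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: gap {gap:.3f}")
```

DenseNet201 and InceptionV3 show gaps of about 0.31 and 0.30, while the proposed model's gap is about 0.04. VGG16 and VGG19 also have small gaps, but only because both their training and validation accuracies stay near chance level.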

The proposed model also performs strongly on evaluation metrics:

| Model | Precision | F1 Score | Specificity | Sensitivity |
| --- | --- | --- | --- | --- |
| VGG16 | 0.000 | 0.000 | 1.000 | 0.000 |
| VGG19 | 0.000 | 0.000 | 1.000 | 0.000 |
| ResNet50 | 0.705 | 0.744 | 0.671 | 0.788 |
| InceptionV3 | 0.614 | 0.761 | 0.371 | 1.000 |
| InceptionResNetV2 | 0.965 | 0.764 | 0.977 | 0.633 |
| DenseNet201 | 0.822 | 0.901 | 0.784 | 0.997 |
| Proposed Model | 0.895 | 0.943 | 0.883 | 0.997 |

A key insight from the comparison is that bigger models are not automatically better. For this specific task, a smaller task-specific model performed better than larger pre-trained architectures.
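
For readers less familiar with these metrics, the sketch below shows how precision, sensitivity, specificity, F1 score, and accuracy follow from confusion-matrix counts, with handloom as the positive class. The counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp / fn: handloom images predicted as handloom / powerloom.
    tn / fp: powerloom images predicted as powerloom / handloom.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical balanced test set with 100 images per class.
balanced = classification_metrics(tp=90, fp=10, tn=90, fn=10)

# A degenerate classifier that labels every image as powerloom.
always_powerloom = classification_metrics(tp=0, fp=0, tn=100, fn=100)
```

The degenerate case is instructive: a classifier that always answers "powerloom" reaches 50% accuracy on a balanced set with perfect specificity, yet its sensitivity and F1 score are zero. This is why accuracy alone is not enough to compare models.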

13. Relevance for Saree and Textile Research

Although this paper focuses on Assamese gamucha, it is highly relevant for saree and textile classification. The core challenge is similar: how to distinguish authentic traditional textiles from look-alike products using visual evidence.

For saree research, the same logic can be extended to distinguish handloom from powerloom sarees, identify regional craft clusters, or support authenticity screening for products such as Banarasi, Kanjivaram, Gadwal, Paithani, Ilkal, Kota, and Mangalagiri sarees.

| Paper Concept | Possible Saree Research Use |
| --- | --- |
| Handloom vs powerloom classification | Can support saree authenticity screening. |
| Close-up fabric image capture | Useful for identifying weave, yarn, and texture differences. |
| Lightweight CNN model | Suitable for mobile-based textile identification tools. |
| Expert-validated dataset | Shows the importance of reliable ground-truth labeling. |
| Mobile application prototype | Suggests practical deployment beyond academic experiments. |

For a saree-origin identification project, this paper also reinforces an important point: model performance depends heavily on image quality, dataset design, region selection, and whether the model is suitable for the specific textile problem. A general-purpose large CNN may not always outperform a smaller model designed for the textile task.

14. Limitations and Future Scope

The study is strong because it is practical, dataset-driven, and deployment-oriented. However, a few limitations should be considered. The dataset is focused on gamucha and on a binary classification problem: handloom versus powerloom. Saree classification may be more complex because sarees have multiple components such as body, border, pallu, motifs, zari, blouse piece, and regional design grammar.

The model’s misclassification of some blurry powerloom images as handloom also shows that image focus and capture quality are critical. For real-world deployment, the mobile application should include image-quality checks before prediction.
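
One common way to implement such a focus check is the variance of the Laplacian: sharp images have strong local intensity changes, so the Laplacian response varies widely, while blurry images give a low variance. The sketch below is a plain-Python illustration on tiny grids and is not part of the paper. The threshold is a hypothetical value that would need tuning on real fabric photos.

```python
def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    gray is a 2D list of intensities with at least 3 rows and 3 columns.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_sharp_enough(gray, threshold):
    """Accept the image only if its focus score reaches the threshold."""
    return laplacian_variance(gray) >= threshold


# Toy examples: a high-contrast checkerboard and a featureless patch.
sharp = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 255, 0, 255],
    [255, 0, 255, 0],
]
flat = [[128] * 4 for _ in range(4)]

HYPOTHETICAL_THRESHOLD = 100.0  # assumed value, to be tuned on real data
sharp_ok = is_sharp_enough(sharp, HYPOTHETICAL_THRESHOLD)
flat_ok = is_sharp_enough(flat, HYPOTHETICAL_THRESHOLD)
```

An app could run a check like this on each captured photo and ask the user to retake any image that fails, before sending anything to the classifier.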

Future work can expand the dataset, include more textile types, test more lighting conditions, include microscopic weave images, and combine image analysis with physical textile testing.

| Limitation | Suggested Improvement |
| --- | --- |
| Binary classification only | Extend to multiple handloom types, regions, and product categories. |
| Image blur affects prediction | Add automatic image-quality and focus checks. |
| Gamucha-specific dataset | Test transferability to sarees, dupattas, towels, and other handloom fabrics. |
| Visual-only authentication | Combine with yarn, weave, fiber, and expert validation methods. |
| Limited real-world testing | Deploy and validate in field conditions with retailers, inspectors, and weavers. |

15. Simple Summary

This paper proposes an AI-based method to distinguish authentic handloom gamucha from powerloom imitations. The authors created a dataset of 17,484 images and compared six established deep-learning architectures with a custom lightweight model called gamuch.AI.

The proposed model is inspired by VGG16 but simplified by removing the fifth convolutional block and adding average pooling and dense layers. This makes it smaller, faster, and more suitable for mobile deployment. It achieves 94.39% validation accuracy and 98% test accuracy on unknown augmented samples.

For textile and saree research, the paper is valuable because it shows that task-specific deep learning can support authentication, heritage protection, and practical deployment through mobile applications. It also reminds us that expert-validated datasets and good image capture are essential for reliable textile AI systems.

16. General Disclaimer

This article is an educational explanation of the research paper “Handloomed Fabrics Recognition with Deep Learning.” It is intended for conceptual understanding, academic discussion, and research learning. The explanations simplify some technical details for readability. Image-based authentication should be treated as a support tool and not as a complete replacement for expert textile examination, laboratory testing, certification systems, or legal verification.

