DRISHTIKON: A Multimodal Multilingual Benchmark for Indian Cultural Understanding

The paper “DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture” introduces a new benchmark for evaluating whether modern vision-language models can understand Indian culture through both images and text. The word Drishtikon means “perspective” or “point of view,” which is appropriate because the benchmark tests how AI systems perceive and reason about Indian cultural contexts.

The paper argues that many large language models and vision-language models perform well on general tasks but often struggle with culturally specific knowledge. This is especially important in India, where culture is expressed through many languages, scripts, regions, clothing traditions, cuisines, festivals, rituals, monuments, art forms, and local practices.

Problem Addressed by the Paper
Why DRISHTIKON Was Needed
Main Contribution of the Paper
Dataset Construction Pipeline
Knowledge Curation and MCQ Generation
Cultural Categorization and Attribute Tagging
Reasoning-Based Question Augmentation
Multilingual Translation and Scale-up
Models Evaluated
Major Results and Findings
Zero-Shot vs Chain-of-Thought Prompting
Technical View of the Benchmark
Relevance for Saree, Textile, and Cultural Heritage Research
Limitations and Future Scope
Simple Summary
General Disclaimer

1. Problem Addressed by the Paper

The central problem addressed by the paper is that current AI systems are not always culturally aware. They may recognize objects in images, answer common questions, or translate text, but they may fail when the task requires understanding Indian cultural context.

For example, an AI model may recognize that an image contains a dance costume, a food item, or a monument, but it may not know the regional, ritual, historical, or cultural significance of that image. Similarly, it may perform better in English or Hindi but struggle with lower-resource Indian languages such as Sindhi, Konkani, Assamese, or Odia.

Core problem: Existing multimodal benchmarks do not adequately test whether AI models understand India’s cultural diversity across languages, regions, images, and reasoning tasks.

2. Why DRISHTIKON Was Needed

Existing benchmarks often test general visual understanding, multilingual reasoning, or global cultural knowledge. However, the paper argues that these benchmarks do not give enough fine-grained attention to India’s cultural complexity.

India has enormous cultural diversity across its states and union territories. Cultural knowledge is not only about national-level symbols. It includes regional festivals, folk traditions, food practices, attire, religious rituals, architecture, performing arts, historical personalities, and local heritage.

The authors therefore create a benchmark that brings together three dimensions:

Multimodal understanding: the model must interpret both image and text.
Multilingual understanding: the model must handle questions posed in English and in multiple Indian languages.
Cultural reasoning: the model must understand region-specific Indian cultural context.

3. Main Contribution of the Paper

The paper’s main contribution is the creation of DRISHTIKON, a multimodal and multilingual benchmark centered on Indian culture. It contains image-question pairs translated across multiple Indian languages and designed to test both factual and reasoning-based cultural understanding.

Aspect	DRISHTIKON Contribution
Coverage	All 28 Indian states and 8 union territories.
Languages	15 languages: English plus 14 Indian languages.
Dataset size	64,288 question-image-language triples.
Cultural themes	Festivals, attire, cuisine, folk arts, rituals, heritage, tourism, personalities, and more.
Question format	Multiple-choice questions with one correct answer and three distractors.
Reasoning types	General, commonsense cultural, multi-hop reasoning, and analogy questions.
Evaluation target	Vision-language models, including open-source, proprietary, reasoning-specialized, and Indic-aligned models.

4. Dataset Construction Pipeline

The paper presents a clear dataset creation pipeline. According to the workflow diagram in the paper, the process begins with knowledge curation and MCQ generation, moves through cultural categorization and tagging, adds reasoning-based augmentation, translates the data into Indian languages, and finally assembles the benchmark.

The pipeline can be represented as:

\[ \text{Knowledge Curation} \rightarrow \text{MCQ Generation} \rightarrow \text{Cultural Tagging} \rightarrow \text{Reasoning Augmentation} \rightarrow \text{Multilingual Translation} \rightarrow \text{Final Dataset} \]

This pipeline is important because cultural benchmarking cannot be done by simply collecting random images. The questions must be culturally meaningful, regionally balanced, linguistically accurate, and visually grounded.

5. Knowledge Curation and MCQ Generation

The authors curated cultural knowledge from sources such as national repositories, state tourism portals, academic collections, and curated crowdsourced platforms. The content covers areas such as festivals, attire, cuisine, folk traditions, monuments, personalities, and other cultural markers.

The authors first created 2,126 English multiple-choice questions. Each question has one correct answer and three distractors. The distractors are not random. They are designed to test whether the model can resist plausible but incorrect options.

A typical MCQ includes:

one correct answer,
one semantically close distractor,
one option reflecting a common misconception, and
one unrelated but superficially similar option.

This makes the questions harder than simple recognition questions. A model cannot answer reliably only by detecting a broad object or keyword; it must understand the cultural association.

Important design choice: The authors use MCQs because they allow consistent scoring across many models and languages. Since each question has four options, random guessing yields an expected accuracy of \(25\%\).
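The option structure described above can be sketched as a small data type. This is only an illustration: the `MCQ` class, the role names in `ROLES`, and the placeholder strings are assumptions made for this article, not the paper's actual data format.

```python
import random
from dataclasses import dataclass

# Hypothetical role names paraphrasing the four option types listed above.
ROLES = ("correct", "close_distractor", "misconception", "superficially_similar")


@dataclass
class MCQ:
    question: str
    options: dict  # maps a role name to the option text

    def shuffled(self, rng):
        """Return (option texts in random order, index of the correct option)."""
        items = list(self.options.items())
        rng.shuffle(items)
        texts = [text for _, text in items]
        answer = next(i for i, (role, _) in enumerate(items) if role == "correct")
        return texts, answer


def chance_level(num_options):
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_options


# Placeholder content, not an item from the benchmark.
item = MCQ(
    question="<question about the image>",
    options={role: f"<{role} option>" for role in ROLES},
)
texts, answer = item.shuffled(random.Random(0))
print(texts[answer])             # <correct option>
print(chance_level(len(texts)))  # 0.25
```

Shuffling the options per item is a common precaution against position bias; whether the authors did this is not stated in this article.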

6. Cultural Categorization and Attribute Tagging

Each question-image pair is tagged with one or more cultural attributes. These tags allow performance to be analyzed by cultural category. For example, researchers can check whether models perform better on cuisine than on rituals, or better on tourism than on folk arts.

The paper’s attribute chart shows the distribution of questions across cultural aspects. The largest category is Cultural Common Sense, followed by History, Rituals and Ceremonies, Tourism, Language, Dance and Music, and other themes.

Cultural Attribute	Approximate Question Count Reported
Art	3450
Costume	2280
Cuisine	4335
Cultural Common Sense	14085
Dance and Music	4455
Festivals	4153
History	11055
Language	4545
Medicine	195
Nightlife	30
Personalities	1110
Religion	1170
Rituals and Ceremonies	7005
Sports	270
Tourism	5745
Transport	405

This attribute tagging is one of the strengths of the benchmark because it allows fine-grained diagnosis of model weaknesses.
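As a small sketch of the kind of analysis the tags allow, the counts from the table above can be loaded into a dictionary and turned into shares. The variable and function names are illustrative; only the numbers come from the table.

```python
# Counts copied from the attribute table above.
ATTRIBUTE_COUNTS = {
    "Art": 3450,
    "Costume": 2280,
    "Cuisine": 4335,
    "Cultural Common Sense": 14085,
    "Dance and Music": 4455,
    "Festivals": 4153,
    "History": 11055,
    "Language": 4545,
    "Medicine": 195,
    "Nightlife": 30,
    "Personalities": 1110,
    "Religion": 1170,
    "Rituals and Ceremonies": 7005,
    "Sports": 270,
    "Tourism": 5745,
    "Transport": 405,
}


def shares(counts):
    """Return each category's fraction of the summed counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(ATTRIBUTE_COUNTS.values())
top_three = sorted(ATTRIBUTE_COUNTS, key=ATTRIBUTE_COUNTS.get, reverse=True)[:3]

print(total)      # 64288
print(top_three)  # ['Cultural Common Sense', 'History', 'Rituals and Ceremonies']
for name, share in sorted(shares(ATTRIBUTE_COUNTS).items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {share:6.1%}")
```

The listed counts add up to 64,288, the same figure given for the full dataset. The long tail is also visible: Nightlife, Medicine, and Sports together account for well under one percent of the counts, so per-category accuracy for those themes rests on few items.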

7. Reasoning-Based Question Augmentation

The authors did not stop at factual questions. They selected a balanced subset of 720 questions, approximately 20 per region, and converted them into deeper reasoning questions.

This produced 2,160 additional MCQs across three reasoning categories:

Reasoning Category	What It Tests	Example Type
Common Sense Cultural	Everyday cultural inference.	Matching attire, food, festival, or social practice with cultural context.
Multi-hop Reasoning	Linking multiple cultural facts.	Connecting a dance form to a festival and then to a state.
Analogy	Pattern matching across cultural examples.	Relating one state’s art form to another state’s equivalent cultural pattern.

This reasoning augmentation makes DRISHTIKON more than a visual recognition dataset. It becomes a test of cultural inference.
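The figures in this section are consistent with a simple count, assuming the 36 regions are the 28 states and 8 union territories, and assuming each selected question yields one new question per reasoning category (an assumption that matches the reported totals rather than a detail confirmed here):

\[ 36 \times 20 = 720, \qquad 720 \times 3 = 2{,}160 \]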

8. Multilingual Translation and Scale-up

To make the benchmark multilingual, the authors translated the questions into 14 Indian languages: Hindi, Bengali, Tamil, Telugu, Marathi, Kannada, Malayalam, Gujarati, Punjabi, Odia, Assamese, Urdu, Konkani, and Sindhi.

Together with English, this gives:

\[ 15 \text{ languages} \]

The full dataset contains:

\[ 64{,}288 \text{ question-image-language triples} \]

The authors used Gemini Pro for translation and then applied a two-stage human verification protocol on stratified samples to check meaning preservation, fluency, and cultural relevance.

For culturally specific terms that do not have direct equivalents in another language, the authors used transliteration or context-sensitive phrasing. This is important because Indian cultural words often cannot be translated literally without losing meaning.
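The scale-up arithmetic can be checked in a few lines. The constant names are illustrative; the numbers are the ones quoted in this article. Note that the naive product comes out two higher than the reported total, and this article does not explain the difference, so it is left as an open observation rather than corrected.

```python
BASE_MCQS = 2_126          # English MCQs created first
REASONING_MCQS = 2_160     # reasoning-augmented MCQs
REPORTED_TRIPLES = 64_288  # total reported for the full dataset

LANGUAGES = [
    "English", "Hindi", "Bengali", "Tamil", "Telugu", "Marathi", "Kannada",
    "Malayalam", "Gujarati", "Punjabi", "Odia", "Assamese", "Urdu",
    "Konkani", "Sindhi",
]

english_total = BASE_MCQS + REASONING_MCQS       # questions per language
naive_product = english_total * len(LANGUAGES)   # if every question exists in every language

print(len(LANGUAGES))                    # 15
print(english_total)                     # 4286
print(naive_product)                     # 64290
print(naive_product - REPORTED_TRIPLES)  # 2
```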

9. Models Evaluated

The paper evaluates many types of vision-language models. This broad evaluation makes the benchmark useful because it compares small models, large models, proprietary systems, reasoning-specialized systems, and Indic-focused systems.

Model Category	Examples Evaluated	Purpose of Inclusion
Small open-source VLMs	SmolVLM-256M-Instruct, InternVL3-1B	Test whether compact models can perform well on cultural tasks.
Large open-source VLMs	Janus-Pro-7B, Qwen2-VL-7B-Instruct, LLaVA-1.6-Mistral-7B, InternVL3-14B, Gemma-3-27B-IT, Qwen2.5-Omni-7B	Test whether larger scale improves cultural reasoning.
Proprietary VLMs	GPT-4o-mini	Compare against a strong commercial model.
Reasoning-specialized VLMs	Kimi-VL-A3B-Thinking	Test whether reasoning-focused models handle cultural questions better.
Indic-aligned models	Chitrarth, Maya	Evaluate models designed with Indian or multilingual contexts in mind.

Accuracy is used as the primary evaluation metric:

\[ \text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \]

10. Major Results and Findings

The paper reports several important findings. First, model size alone does not guarantee better cultural understanding. Some compact instruction-tuned models perform surprisingly well, while some larger models show unstable results.

Second, proprietary models such as GPT-4o-mini perform strongly across languages and question types. This suggests that broad instruction tuning and strong multimodal alignment help in cultural tasks.

Third, Maya, one of the Indic-aligned models evaluated, performs competitively, showing the value of regionally focused AI development.

Fourth, model performance varies significantly by language. English, Hindi, Bengali, and Marathi tend to be easier for models, while Sindhi, Konkani, Kannada, Assamese, and Odia show more difficulty in several cases. This reflects the digital-resource imbalance across Indian languages.

Research Question	Main Finding
Does model scale predict performance?	No. Larger models are often strong, but smaller well-aligned models can outperform bigger models on cultural tasks.
Do models perform equally across languages?	No. High-resource languages generally perform better than low-resource Indian languages.
Which question types are hardest?	Multi-hop reasoning and analogy questions are harder than general and commonsense cultural questions.
Do Indic-focused models help?	Some Indic-focused models, especially Maya, show strong promise, but not all Indic-aligned models perform equally well.
Does Chain-of-Thought help?	Yes, especially for reasoning-heavy questions, but gains vary across model types and languages.

Language-Level Performance Pattern

The paper’s language-wise chart shows that overall average accuracy is highest for Gujarati, Hindi, and English among the listed languages, while Kannada and Sindhi appear among the most difficult. This does not mean those cultures are inherently harder. It means current models likely have less reliable exposure, training data, or alignment for those language-cultural combinations.

Regional Performance Pattern

The radar plots show uneven state-wise performance. Regions with stronger media visibility or more widely represented cultural signatures, such as Kerala, Gujarat, and West Bengal, tend to show more consistent performance. Smaller or less-represented regions such as Lakshadweep, Mizoram, and Dadra and Nagar Haveli show weaker results.

11. Zero-Shot vs Chain-of-Thought Prompting

The paper compares zero-shot prompting with Chain-of-Thought prompting. In zero-shot prompting, the model answers directly without being given examples. In Chain-of-Thought prompting, the model is encouraged to reason step by step before selecting the answer.

Chain-of-Thought prompting can be written conceptually as:

\[ \text{Image} + \text{Question} + \text{Options} \rightarrow \text{Reasoning Steps} \rightarrow \text{Answer} \]

The paper finds that Chain-of-Thought prompting helps most in reasoning-intensive categories such as multi-hop and analogy questions, with gains reported up to approximately \(10\%\) to \(15\%\) in some settings. However, the improvement is not uniform across all models and languages.

Important insight: Chain-of-Thought helps cultural reasoning, but it does not fully solve the problem of low-resource language gaps or culturally specific visual understanding.
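The difference between the two prompting styles can be illustrated with a minimal sketch. The prompt wording, the function names, and the `Answer: <letter>` convention are assumptions made for this article; the paper's actual prompts are not reproduced here. The image would be passed to the model separately through whatever interface the model provides.

```python
import re

LETTERS = "ABCD"


def format_options(options):
    """Render four options as lettered lines."""
    return "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))


def zero_shot_prompt(question, options):
    # Direct answer, no examples and no requested reasoning.
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option only."
    )


def cot_prompt(question, options):
    # Same content, but the model is asked to reason before answering.
    return (
        f"{question}\n{format_options(options)}\n"
        "Think step by step about the image and its cultural context, "
        "then finish with 'Answer: <letter>'."
    )


def parse_answer(response):
    """Return the chosen letter, or None if no answer can be found."""
    text = response.strip()
    if text in set(LETTERS):
        return text
    found = re.findall(r"Answer:\s*([A-D])\b", text)
    return found[-1] if found else None


options = ["<option 1>", "<option 2>", "<option 3>", "<option 4>"]
print(cot_prompt("<question about the image>", options))
print(parse_answer("B"))                                  # B
print(parse_answer("The attire suggests ... Answer: C"))  # C
print(parse_answer("I am not sure."))                     # None
```

Taking the last `Answer:` match is a deliberate choice in this sketch: a step-by-step response may mention candidate letters along the way, and only the final one should be scored.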

12. Technical View of the Benchmark

From a machine-learning perspective, DRISHTIKON can be understood as a multimodal multiple-choice evaluation dataset.

Each instance can be represented as:

\[ D_i = (I_i, Q_i^{(l)}, O_i, y_i, A_i, R_i, T_i) \]

where:

\(I_i\) is the image,
\(Q_i^{(l)}\) is the question in language \(l\),
\(O_i = \{o_1,o_2,o_3,o_4\}\) is the set of answer options,
\(y_i\) is the correct option,
\(A_i\) is the cultural attribute tag,
\(R_i\) is the region or state/UT tag, and
\(T_i\) is the question type.

A vision-language model must estimate:

\[ \hat{y}_i = \arg\max_{o_j \in O_i} P(o_j \mid I_i, Q_i^{(l)}) \]

The final accuracy is:

\[ \text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i) \]

This formulation shows why DRISHTIKON is useful. It allows accuracy to be sliced by language, region, cultural theme, model type, and question type.
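The formulation above can be sketched in code as follows. The `Instance` fields mirror the symbols in the tuple, with the language stored explicitly; the class, the function names, and the toy scores are all hypothetical and do not represent results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    image: str       # I_i, here just a placeholder identifier
    question: str    # Q_i^(l)
    language: str    # l
    options: tuple   # O_i
    answer: int      # y_i, as an index into options
    attribute: str   # A_i
    region: str      # R_i
    qtype: str       # T_i


def predict(scores):
    """argmax over per-option scores, standing in for P(o_j | I_i, Q_i)."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy_by(instances, predictions, field):
    """Accuracy grouped by one metadata field (language, region, ...)."""
    correct, total = defaultdict(int), defaultdict(int)
    for inst, pred in zip(instances, predictions):
        key = getattr(inst, field)
        total[key] += 1
        correct[key] += int(pred == inst.answer)
    return {key: correct[key] / total[key] for key in total}


opts = ("<o1>", "<o2>", "<o3>", "<o4>")
instances = [
    Instance("img0", "<q>", "en", opts, 0, "Cuisine", "Kerala", "General"),
    Instance("img0", "<q>", "hi", opts, 0, "Cuisine", "Kerala", "General"),
    Instance("img1", "<q>", "en", opts, 2, "Festivals", "Gujarat", "Analogy"),
    Instance("img1", "<q>", "hi", opts, 2, "Festivals", "Gujarat", "Analogy"),
]
# Toy model scores, invented for illustration only.
toy_scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.2, 0.6, 0.1],
]
predictions = [predict(s) for s in toy_scores]

print(predictions)                                      # [0, 1, 2, 2]
print(accuracy_by(instances, predictions, "language"))  # {'en': 1.0, 'hi': 0.5}
print(accuracy_by(instances, predictions, "qtype"))     # {'General': 0.5, 'Analogy': 1.0}
```

Passing a different field name, such as `"region"` or `"attribute"`, gives the other slices without changing the scoring code.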

13. Relevance for Saree, Textile, and Cultural Heritage Research

This paper is highly relevant for saree and textile research because sarees are not only visual products; they are cultural objects. A saree’s meaning may depend on region, weaving cluster, ritual context, community use, motif symbolism, language, and heritage association.

For example, a model trained only on product images may identify color or pattern, but it may not understand why a Kanjivaram saree, Paithani saree, Mekhela Chador, Bandhani, Patola, Kasavu, Banarasi brocade, or Baluchari design has specific cultural meaning.

DRISHTIKON Concept	Possible Saree / Textile Research Use
Multimodal benchmarking	Evaluate models using both saree images and textile descriptions.
Multilingual questions	Test saree knowledge in Hindi, Telugu, Tamil, Kannada, Bengali, Gujarati, Marathi, Malayalam, and other languages.
Cultural attribute tags	Create textile categories such as weave, motif, region, ritual use, pallu, border, and craft cluster.
State-wise coverage	Build region-wise saree provenance datasets across Indian weaving clusters.
Reasoning-based questions	Ask deeper questions such as why a motif, border, or drape style belongs to a particular tradition.
Chain-of-Thought evaluation	Check whether models can explain textile classification rather than only predict a label.

For a saree provenance classification project, DRISHTIKON suggests an important direction: evaluation should not be limited to image classification accuracy. A stronger benchmark could ask whether the model understands the relationship between image features, regional craft identity, local terminology, and cultural meaning.
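One way such a benchmark item might be generated is sketched below. Everything here is hypothetical: the `TEXTILE_ORIGINS` records use widely documented associations purely as examples, and `provenance_mcq` is an illustrative helper, not part of DRISHTIKON.

```python
import random

# Illustrative records only; a real project would need expert-verified data.
TEXTILE_ORIGINS = {
    "Kanjivaram": "Tamil Nadu",
    "Paithani": "Maharashtra",
    "Patola": "Gujarat",
    "Kasavu": "Kerala",
    "Baluchari": "West Bengal",
    "Mekhela Chador": "Assam",
}


def provenance_mcq(name, origins, rng):
    """Build a four-option provenance question for one textile tradition."""
    correct = origins[name]
    others = sorted({state for state in origins.values() if state != correct})
    options = rng.sample(others, 3) + [correct]
    rng.shuffle(options)
    return {
        "question": f"Which state is the {name} tradition most closely associated with?",
        "options": options,
        "answer": options.index(correct),
    }


mcq = provenance_mcq("Paithani", TEXTILE_ORIGINS, random.Random(7))
print(mcq["question"])
print(mcq["options"][mcq["answer"]])  # Maharashtra
```

Randomly sampled distractors are weaker than the curated distractors described in Section 5. A serious benchmark would choose them by hand, for example by preferring neighbouring weaving regions or commonly confused traditions.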

14. Limitations and Future Scope

The paper is ambitious and important, but it also acknowledges limitations. India’s cultural diversity is extremely large, so even a benchmark covering 15 languages and all states and union territories cannot capture every dialect, local practice, community tradition, or regional nuance.

Another limitation is that the dataset uses curated image-text pairs. This allows controlled evaluation, but real-world cultural understanding is often messier. Images may be ambiguous, mixed, poorly labeled, or used in changing social contexts.

The paper also shows that many models still struggle with abstract analogy and multi-hop reasoning. This suggests that cultural AI needs better reasoning frameworks, better multilingual representation, and more balanced regional data.

Limitation	Possible Future Direction
Incomplete cultural coverage	Expand to more dialects, local practices, oral traditions, and community-specific knowledge.
Curated image-text setting	Test on real-world images, social media, e-commerce listings, and archival materials.
MCQ-only format	Add open-ended answering and explanation-based evaluation.
Language imbalance	Create more data for low-resource Indian languages.
Reasoning weakness	Develop culturally grounded reasoning datasets and fine-tuning methods.
Image URL dependence	Ensure long-term accessibility and licensing clarity for cultural image resources.

15. Simple Summary

DRISHTIKON is a multimodal and multilingual benchmark created to test whether AI models understand Indian culture. It contains culturally grounded image-question pairs across 15 languages and all Indian states and union territories.

The dataset begins with 2,126 English MCQs, adds 2,160 reasoning-augmented MCQs, translates them into 14 Indian languages, and produces 64,288 question-image-language triples. Each item includes an image, a question, four answer options, one correct answer, and metadata such as cultural attribute, region, language, and question type.

The paper evaluates many vision-language models and finds that current models still have major gaps. GPT-4o-mini performs strongly, compact models such as SmolVLM and InternVL3-1B are surprisingly competitive, and the Indian-origin Maya model shows promise. However, performance remains uneven across languages, regions, and reasoning types.

For saree and textile research, the paper is important because it shows how cultural understanding can be benchmarked in a multimodal way. A future saree AI system should not only identify images but also understand regional identity, textile terminology, craft heritage, and cultural context.

16. General Disclaimer

This article is an educational explanation of the research paper “DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture.” It is intended for conceptual understanding, academic discussion, and research learning. Some technical details have been simplified for readability. Cultural interpretation should always be treated with care, and AI-based cultural understanding should support, not replace, community knowledge, expert scholarship, and lived cultural experience.


My Research Notes

Saturday, 6 June 2026

Understanding the Paper: Drishtikon