
OVQA: A Dataset for Visual Question Answering and Multimodal Research in Odia Language

Overview

OVQA is the first multimodal dataset specifically designed for visual question answering (VQA), visual question elicitation (VQE), and multimodal research in the low-resource Odia language. It is built from 6,149 unique images drawn from the Visual Genome dataset, annotated with a total of 27,809 parallel English-Odia question-answer pairs, and each question is semantically matched to the visual content of its image. Baseline experiments on the VQA and VQE tasks demonstrate the dataset's potential. OVQA is a valuable resource for advancing multimodal research in Odia, and its construction approach can be extended to other low-resource languages.


Description

Statistics of the OVQA Dataset

Item                            Count
Number of Images                 6,149
Number of Questions             27,809
Number of Answers               27,809
Number of Wh-Questions          26,939
Number of Counting Questions        70
Other Question Types               800
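
The question-type breakdown is consistent with the reported totals: 26,939 wh-questions + 70 counting questions + 800 other questions = 27,809, matching the number of questions (and answers). A minimal Python sanity check, using only the counts from the table above:

# Sanity check: question-type counts should sum to the reported total.
# All numbers are taken from the statistics table above.
total_questions = 27_809
breakdown = {
    "wh_questions": 26_939,
    "counting_questions": 70,
    "other_questions": 800,
}
assert sum(breakdown.values()) == total_questions
print(f"{sum(breakdown.values()):,} questions across {len(breakdown)} question types")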


Availability

The OVQA dataset is available at Lindat: http://hdl.handle.net/11234/1-5820.

Additionally, the OdiaVQA dataset, prepared in an instruction-set format for multimodal LLM training, is available on Hugging Face: https://huggingface.co/datasets/odiagenmllm/odia_vqa_en_odi_set.
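
For quick experimentation, the Hugging Face copy can be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch: the repository identifier comes from the link above, but the split names and column layout are assumptions and should be checked against the dataset card.

from datasets import load_dataset

# Load the instruction-formatted OdiaVQA data from the Hugging Face Hub.
# The split and column names are not documented here; inspect the returned
# DatasetDict and the dataset card for the exact schema.
dataset = load_dataset("odiagenmllm/odia_vqa_en_odi_set")

print(dataset)                         # available splits and their sizes
first_split = next(iter(dataset.values()))
print(first_split[0])                  # one instruction-formatted example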


Acknowledgment

The work on this project was supported by the grant CZ.02.01.01/00/23_020/0008518 of the Ministry of Education of the Czech Republic.

How to cite

@inproceedings{parida2025ovqa,
  title  = {{OVQA: A Dataset for Visual Question Answering and Multimodal Research in Odia Language}},
  author = {Parida, Shantipriya and Sahoo, Shashikanta and Sekhar, Sambit and Sahoo, Kalyanamalini and Kotwal, Ketan and Khosla, Sonal and Dash, Satya Ranjan and Bose, Aneesh and Kohli, Guneet Singh and Lenka, Smruti Smita and Bojar, Ondřej},
  year   = {2025},
  note   = {Accepted at the IndoNLP Workshop at COLING 2025}
}