Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

This study conducts a thorough evaluation of Gemini Pro's efficacy in commonsense reasoning tasks, employing a diverse array of datasets that span both language-based and multimodal scenarios.

The official repository contains datasets, model descriptions, the full set of prompts used in experiments, and corresponding experimental results.

Datasets

We experiment with 12 datasets related to different types of commonsense reasoning, which include 11 language-based datasets and one multimodal dataset. For evaluation purposes, we utilize the validation set corresponding to each task. The following figure provides an overview of the datasets, as well as example questions.

All datasets used for experiments can be downloaded in the "./datasets" folder except VCR (Visual Commonsense Reasoning), which can be accessed through here.

Models

We consider four popular LLMs for language-based dataset evaluation, including the opensource model Llama-2-70b-chat, as well as the closed-source models Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo. Specifically, we query Gemini through Google Vertex AI, the GPT models through the OpenAI API, and Llama2 through DeepInfra.

For the multimodal dataset, we consider GPT-4V (gpt-4-vision-preview in API) and Gemini Pro Vision (gemini-pro-vision in API) in our experiments.

For all models, we apply greedy decoding (i.e., temperature = 0) for response generation.

Prompts

We evaluate language-based datasets using two prompting settings: (1) zero-shot standard prompting, which measures the models' inherent commonsense capabilities in linguistic contexts, and (2) few-shot chain-of-thought (CoT) prompting, designed to observe potential improvements in the models' performance. For the multimodal dataset, we employ zero-shot standard prompting to assess the genuine end-to-end visual commonsense reasoning abilities of multimodal large language models. The full set of prompts is available in the "./results" folder, where each .csv file contains 0shot_SP and 5shot_CoT, representing (1) and (2) respectively. In the multimodal VCR dataset, only 0shot_SP is used, in accordance with our setup.

Experimental Results

The experimental results for each dataset can be found in "./results".

Please refer to our full paper for more details.

Key Findings

(1) Overall, Gemini Pro’s performance is comparable to that of GPT-3.5 Turbo, demonstrating marginally better average results across 11 language datasets (1.4% higher accuracy), though it lags behind GPT-4 Turbo by an average of 8.2% in accuracy. Moreover, Gemini Pro Vision exhibits lower performance than GPT-4V on the multimodal dataset, except for temporal-related questions.

(2) Approximately 65.8% of Gemini Pro’s reasoning processes are evaluated as logically sound and contextually relevant, indicating its potential for effective application in various domains.

(3) Gemini Pro encounters significant challenges in temporal and social commonsense reasoning, indicating key areas for further development.

(4) Our manual error analysis reveals that Gemini Pro often misunderstands provided contextual information, accounting for 30.2% of its total errors. Furthermore, Gemini Pro Vision struggles particularly with identifying emotional stimuli in images, especially those involving human entities, which constitutes 32.6% of its total errors.

Citation

If you find this work helpful, please consider citing as follows:

@article{wang2023gemini,
  title={Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models},
  author={Wang, Yuqing and Zhao, Yun},
  journal={arXiv preprint arXiv:2312.17661},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
datasets		datasets
image_sources		image_sources
results		results
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Datasets

Models

Prompts

Experimental Results

Key Findings

Citation

About

Releases

Packages

License

EternityYW/Gemini-Commonsense-Evaluation

Folders and files

Latest commit

History

Repository files navigation

Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

Datasets

Models

Prompts

Experimental Results

Key Findings

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages