This study conducts a thorough evaluation of Gemini Pro's efficacy in commonsense reasoning tasks, employing a diverse array of datasets that span both language-based and multimodal scenarios.
The official repository contains datasets, model descriptions, the full set of prompts used in experiments, and corresponding experimental results.
We experiment with 12 datasets related to different types of commonsense reasoning, which include 11 language-based datasets and one multimodal dataset. For evaluation purposes, we utilize the validation set corresponding to each task. The following figure provides an overview of the datasets, as well as example questions.
All datasets used in the experiments are available in the "./datasets" folder, except VCR (Visual Commonsense Reasoning), which is not included there and must be obtained separately.
We consider four popular LLMs for language-based dataset evaluation: the open-source model Llama-2-70b-chat and the closed-source models Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo. Specifically, we query Gemini through Google Vertex AI, the GPT models through the OpenAI API, and Llama-2 through DeepInfra.
For the multimodal dataset, we consider GPT-4V (gpt-4-vision-preview in the API) and Gemini Pro Vision (gemini-pro-vision in the API) in our experiments.
For all models, we apply greedy decoding (i.e., temperature = 0) for response generation.
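To illustrate why temperature = 0 is treated as greedy decoding, here is a minimal sketch in plain Python. It is not the code used in our experiments, and how each provider handles a temperature of exactly 0 internally is an assumption here. The logits are made-up values, and `softmax_with_temperature` and `greedy_pick` are hypothetical helper names. The sketch shows that as the temperature shrinks, the sampling distribution concentrates on the highest-scoring token, which is the token greedy decoding selects.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for T = 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the largest logit: the limit of sampling as temperature -> 0."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up scores for three candidate tokens (illustration only).
toy_logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

print("greedy choice:", greedy_pick(toy_logits))
```

At small temperatures almost all probability mass falls on index 0, the same index that `greedy_pick` returns.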
We evaluate the language-based datasets under two prompting settings: (1) zero-shot standard prompting, which measures the models' inherent commonsense capabilities in linguistic contexts, and (2) few-shot chain-of-thought (CoT) prompting, which is designed to reveal potential improvements in the models' performance. For the multimodal dataset, we employ zero-shot standard prompting to assess the genuine end-to-end visual commonsense reasoning abilities of multimodal large language models. The full set of prompts is available in the "./results" folder, where each .csv file contains 0shot_SP and 5shot_CoT, corresponding to settings (1) and (2) respectively. For the multimodal VCR dataset, only 0shot_SP is used, in accordance with our setup.
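A minimal sketch of reading the prompts for one setting is given below. It assumes that 0shot_SP and 5shot_CoT are column headers in each .csv file; the exact file layout is not specified in this README, so treat that as an assumption. `load_prompts` is a hypothetical helper, and the in-memory CSV is a made-up stand-in for a real file under "./results".

```python
import csv
import io

# Setting names as they appear in this README; treating them as CSV
# column headers is an assumption about the file layout.
SETTINGS = ("0shot_SP", "5shot_CoT")


def load_prompts(csv_file, setting):
    """Return the non-empty prompts stored under `setting` in an open CSV file."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or setting not in reader.fieldnames:
        # For example, a VCR file where only 0shot_SP is present.
        return []
    return [row[setting] for row in reader if row.get(setting)]


# Toy in-memory file standing in for a real results CSV (contents are made up).
toy_csv = (
    "0shot_SP,5shot_CoT\n"
    "Q: toy question A,Q: toy question A (with exemplars)\n"
    "Q: toy question B,Q: toy question B (with exemplars)\n"
)

sp_prompts = load_prompts(io.StringIO(toy_csv), "0shot_SP")
cot_prompts = load_prompts(io.StringIO(toy_csv), "5shot_CoT")
print(len(sp_prompts), len(cot_prompts))
```

To read a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points to one of the .csv files.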
The experimental results for each dataset can be found in "./results".
Please refer to our full paper for more details.
(1) Overall, Gemini Pro’s performance is comparable to that of GPT-3.5 Turbo, demonstrating marginally better average results across 11 language datasets (1.4% higher accuracy), though it lags behind GPT-4 Turbo by an average of 8.2% in accuracy. Moreover, Gemini Pro Vision exhibits lower performance than GPT-4V on the multimodal dataset, except for temporal-related questions.
(2) Approximately 65.8% of Gemini Pro’s reasoning processes are evaluated as logically sound and contextually relevant, indicating its potential for effective application in various domains.
(3) Gemini Pro encounters significant challenges in temporal and social commonsense reasoning, indicating key areas for further development.
(4) Our manual error analysis reveals that Gemini Pro often misunderstands provided contextual information, accounting for 30.2% of its total errors. Furthermore, Gemini Pro Vision struggles particularly with identifying emotional stimuli in images, especially those involving human entities, which constitutes 32.6% of its total errors.
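As a rough illustration of how an average accuracy over several datasets, as in finding (1), can be computed, the sketch below scores predictions against gold answers and takes an unweighted mean over datasets. The unweighted mean and the case-insensitive exact match are assumptions made for this example (see the full paper for the actual evaluation protocol), and all answers and dataset names below are made-up toy values, not experimental results.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers (case-insensitive)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


def macro_average(per_dataset_accuracy):
    """Unweighted mean of per-dataset accuracies (an assumed averaging scheme)."""
    if not per_dataset_accuracy:
        raise ValueError("need at least one dataset")
    return sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)


# Toy answers for two hypothetical datasets (illustration only).
toy_scores = {
    "toy_dataset_1": accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]),
    "toy_dataset_2": accuracy(["yes", "no"], ["no", "no"]),
}
print(toy_scores, macro_average(toy_scores))
```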
If you find this work helpful, please consider citing as follows:
@article{wang2023gemini,
title={Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models},
author={Wang, Yuqing and Zhao, Yun},
journal={arXiv preprint arXiv:2312.17661},
year={2023}
}